Ticket 7356

Summary: Compute nodes flapping between responding and not responding
Product: Slurm
Component: slurmd
Version: 18.08.7
Version Fixed: 18.08.7
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Reporter: Hjalti Sveinsson <hjalti.sveinsson>
Assignee: Alejandro Sanchez <alex>
CC: alex
Hardware: Linux
OS: Linux
Linux Distro: RHEL
Site: deCODE

Description Hjalti Sveinsson 2019-07-04 09:13:26 MDT
Hi, 

we just started re-using some machines that had been put in the DOWN state while we were moving them between datacenters. They were given new IP addresses, and now they seem to be flapping in and out: from responding to not responding and back within a matter of minutes, even though they never lost connectivity.

A grep from slurmctld.log:

[2019-07-04T14:47:31.627] Node lhpc-0279 now responding
[2019-07-04T14:49:07.920] error: Nodes lhpc-0279 not responding
[2019-07-04T14:54:07.801] error: Nodes lhpc-0279 not responding
[2019-07-04T14:54:16.399] requeue job JobId=5497427 due to failure of node lhpc-0279
[2019-07-04T14:54:16.400] requeue job JobId=5780346 due to failure of node lhpc-0279
[2019-07-04T14:54:16.400] requeue job JobId=5780404 due to failure of node lhpc-0279
[2019-07-04T14:54:16.400] requeue job JobId=5780474 due to failure of node lhpc-0279
[2019-07-04T14:54:16.401] error: Nodes lhpc-0279 not responding, setting DOWN
[2019-07-04T14:54:16.424] Node lhpc-0279 now responding
[2019-07-04T14:54:16.425] node_did_resp: node lhpc-0279 returned to service
[2019-07-04T14:54:16.445] sched: Allocate JobId=5497551 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:16.449] sched: Allocate JobId=5722961 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:16.451] sched: Allocate JobId=5722966 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:16.455] sched: Allocate JobId=5722970 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:19.185] sched: Allocate JobId=5722986 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:19.188] sched: Allocate JobId=5497555 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:19.191] sched: Allocate JobId=5497559 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:19.193] sched: Allocate JobId=5497563 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:30.010] sched: Allocate JobId=5497567 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:30.014] sched: Allocate JobId=5497571 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:30.285] sched: Allocate JobId=5497595 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:30.351] sched: Allocate JobId=5497599 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:30.452] sched: Allocate JobId=5497603 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:54:30.516] sched: Allocate JobId=5780524 NodeList=lhpc-0279 #CPUs=4 Partition=rhel72
[2019-07-04T14:55:54.818] Node lhpc-0295 now responding
[2019-07-04T14:57:39.379] Node lhpc-0279 now responding
[2019-07-04T15:00:55.810] Node lhpc-0295 now responding
[2019-07-04T15:02:37.243] Node lhpc-0279 now responding
[2019-07-04T15:08:01.249] Node lhpc-0279 now responding
[2019-07-04T15:09:18.228] sched: Allocate JobId=5497539 NodeList=lhpc-0295 #CPUs=4 Partition=rhel72
[2019-07-04T15:10:03.597] backfill: Started JobId=5497375 in rhel72 on lhpc-0295
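
Since the nodes came back with new IP addresses, one thing worth checking in a situation like this is whether slurmctld is still holding a stale address for them. A minimal sketch, using the node name from the log above (<new-ip> is a placeholder, not a value from this ticket):

    # Show the address and state slurmctld currently has for the node
    scontrol show node lhpc-0279 | grep -iE 'nodeaddr|state'

    # If the address is stale, point the controller at the new IP and re-check
    scontrol update NodeName=lhpc-0279 NodeAddr=<new-ip>
    sinfo -n lhpc-0279 -o "%N %T"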
Comment 1 Hjalti Sveinsson 2019-07-05 04:08:05 MDT
I have raised the severity of this issue since it is causing all of the jobs on these nodes to fail, and these nodes are important: they serve a partition whose nodes have a special setup (OS, packages, etc.).
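
For context on the controller-side behavior in the log above: whether a non-responding node is set DOWN, whether its batch jobs are requeued, and whether the node returns to service automatically are governed by a few slurm.conf parameters. An illustrative fragment with example values (not deCODE's actual configuration):

    # slurm.conf (example values only)
    SlurmdTimeout=300   # seconds without a slurmd response before the node is set DOWN
    ReturnToService=1   # a node set DOWN for being non-responsive may return to service once it responds again
    JobRequeue=1        # requeue batch jobs on node failure (matches the "requeue job" lines above)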
Comment 2 Alejandro Sanchez 2019-07-05 04:58:42 MDT
Hi Hjalti,

Can you attach your config files and slurmctld.log?

Do all the nodes in the cluster have the same configuration?

Thanks.
Comment 3 Hjalti Sveinsson 2019-07-05 05:23:31 MDT
Hi,

I just changed the Ethernet adapter policy in Cisco UCS Manager on these nodes to the Linux policy, and they have stopped showing up as not responding.

I made the change at 10:10 this morning and have seen no problem so far. I will update this ticket after an hour or so; if I see no further issue I will consider it fixed, and then it has nothing to do with Slurm.
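
A quick node-side way to confirm the adapter policy change took effect is to watch the interface counters for drops while the node is under load; a minimal sketch (the interface name eno1 and <slurmctld-host> are placeholders):

    # Packet and error counters for the node's interface
    ip -s link show eno1
    ethtool -S eno1 | grep -iE 'drop|err'

    # Sustained reachability check toward the controller
    ping -c 60 <slurmctld-host>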
Comment 4 Alejandro Sanchez 2019-07-05 05:27:25 MDT
Ok, thanks for your feedback.
Comment 5 Alejandro Sanchez 2019-07-08 03:03:27 MDT
Any further issues? Can we close this out? Thanks.
Comment 6 Alejandro Sanchez 2019-07-15 04:12:02 MDT
Hjalti, I'm lowering the severity of this. Please let us know if the policy change solved the problem and we can close this out. Thanks.
Comment 7 Hjalti Sveinsson 2019-07-17 04:07:16 MDT
This issue can be closed, as it was fixed with the adapter policy in Cisco UCS Manager. Thank you.