Ticket 15647

Summary: Nodes fail with "not responding"
Product: Slurm Reporter: Xing Huang <x.huang>
Component: Other Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact    
Version: 22.05.3   
Hardware: Linux   
OS: Linux   
Site: WA St. Louis

Description Xing Huang 2022-12-16 12:20:26 MST
[root@mgt ~]# grep "not responding" /var/log/messages
Dec 11 16:26:02 mgt slurmctld[2711177]: slurmctld: error: Nodes node24 not responding, setting DOWN
Dec 11 16:27:08 mgt slurmctld[2711177]: slurmctld: error: Nodes node26 not responding
Dec 11 16:32:08 mgt slurmctld[2711177]: slurmctld: error: Nodes node24 not responding
Dec 11 19:47:08 mgt slurmctld[2711177]: slurmctld: error: Nodes gpu08 not responding
Dec 11 19:52:08 mgt slurmctld[2711177]: slurmctld: error: Nodes gpu08 not responding
Dec 12 09:17:10 mgt slurmctld[2711177]: slurmctld: error: Nodes gpu08 not responding
Dec 12 11:22:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node17 not responding
Dec 12 11:22:13 mgt slurmctld[3142209]: slurmctld: error: Nodes node17 not responding, setting DOWN
Dec 12 12:02:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node17 not responding
Dec 12 12:02:13 mgt slurmctld[3142209]: slurmctld: error: Nodes node17 not responding, setting DOWN
Dec 12 12:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node16 not responding
Dec 12 12:37:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node16 not responding
Dec 12 12:42:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node15 not responding
Dec 12 12:47:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node15 not responding
Dec 12 13:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu01 not responding
Dec 12 13:37:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu01 not responding
Dec 12 17:12:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node27 not responding
Dec 12 17:15:47 mgt slurmctld[3142209]: slurmctld: error: Nodes node27 not responding, setting DOWN
Dec 12 18:52:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu08 not responding
Dec 12 18:57:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu08 not responding
Dec 12 21:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node29 not responding
Dec 12 21:52:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node28 not responding
Dec 12 21:57:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node28 not responding
Dec 13 08:37:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu06 not responding
Dec 13 08:42:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu06 not responding
Dec 13 10:07:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu04 not responding
Dec 13 10:10:48 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu08 not responding, setting DOWN
Dec 13 10:12:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu[04,06],node27 not responding
Dec 13 10:27:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
Dec 13 10:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
Dec 13 11:17:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
Dec 13 11:22:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
Dec 13 14:42:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
Dec 13 14:47:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
Dec 13 19:47:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
Dec 13 20:02:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu03 not responding
Dec 13 20:07:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu03 not responding
Dec 13 22:02:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node27 not responding
Dec 13 22:07:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node27 not responding
Dec 13 22:08:19 mgt slurmctld[3142209]: slurmctld: error: Nodes node27 not responding, setting DOWN
Dec 14 00:12:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu08 not responding
Dec 14 00:17:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu08 not responding
Dec 14 09:41:39 mgt slurmctld[3142209]: slurmctld: error: Nodes node[15,27] not responding, setting DOWN
Dec 14 09:42:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu04 not responding
Dec 14 10:52:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu07 not responding
Dec 14 10:53:19 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu07 not responding, setting DOWN
Dec 14 11:37:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu04 not responding
Dec 14 12:12:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node30 not responding
Dec 14 12:27:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu03 not responding
Dec 14 12:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu03 not responding
Dec 14 12:42:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu01,node15 not responding
Dec 14 12:44:59 mgt slurmctld[3142209]: slurmctld: error: Nodes node15 not responding, setting DOWN
Dec 14 12:47:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu01 not responding
Dec 14 13:12:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node25 not responding
Dec 14 13:13:19 mgt slurmctld[3142209]: slurmctld: error: Nodes node25 not responding, setting DOWN
Dec 14 15:37:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu[03-04] not responding
Dec 14 15:52:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu04 not responding
Dec 14 15:57:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu04 not responding
Dec 14 16:36:02 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu06 not responding, setting DOWN
Dec 14 17:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu03 not responding
Dec 14 17:42:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu02 not responding
Dec 14 18:37:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node05 not responding
Dec 14 18:39:22 mgt slurmctld[3142209]: slurmctld: error: Nodes node05 not responding, setting DOWN
Dec 14 21:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
Dec 14 21:56:03 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu07 not responding, setting DOWN
Dec 14 22:22:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
Dec 14 22:22:43 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding, setting DOWN
Dec 15 10:06:34 mgt slurmctld[3142209]: slurmctld: error: Nodes node25 not responding, setting DOWN
Dec 15 10:07:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node[05,15,30] not responding
Dec 15 11:17:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu03 not responding
Dec 15 11:22:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu03 not responding
Dec 15 11:52:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node30 not responding
Dec 15 14:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu08 not responding
Dec 15 14:37:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu08 not responding
Dec 15 17:12:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu07 not responding
Dec 15 17:17:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu07 not responding
Dec 15 17:22:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node30 not responding
Dec 15 17:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node30 not responding
Dec 15 19:07:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu06 not responding
Dec 15 19:09:10 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu06 not responding, setting DOWN
Dec 16 07:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node17 not responding
Dec 16 07:37:09 mgt slurmctld[3142209]: slurmctld: error: Nodes node17 not responding
Dec 16 09:07:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu07,node17 not responding
Dec 16 09:09:10 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu07,node17 not responding, setting DOWN
Dec 16 10:07:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu02 not responding
Dec 16 11:32:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu01 not responding
Dec 16 12:12:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu08 not responding
Dec 16 12:22:09 mgt slurmctld[3142209]: slurmctld: error: Nodes gpu05 not responding
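
For context, the nodes currently marked DOWN and the recorded reasons can also be listed directly (illustrative commands; gpu08 is just an example node name):

[root@mgt ~]# sinfo -R                                          # down/drained nodes with reasons
[root@mgt ~]# scontrol show node gpu08 | grep -iE 'State|Reason'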

I checked two jobs, one on gpu01 and one on gpu02, for additional details.

=======================================================================
This is job 276364, which died on gpu01

[2022-12-16T10:14:41.376] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 276364
[2022-12-16T10:14:41.376] task/affinity: batch_bind: job 276364 CPU input mask for node: 0x0F0000F0
[2022-12-16T10:14:41.376] task/affinity: batch_bind: job 276364 CPU final HW mask for node: 0x0F0000F0
[2022-12-16T10:14:41.438] [276364.extern] task/cgroup: _memcg_initialize: job: alloc=73728MB mem.limit=73728MB memsw.limit=73728MB job_swappiness=18446744073709551614
[2022-12-16T10:14:41.438] [276364.extern] task/cgroup: _memcg_initialize: step: alloc=73728MB mem.limit=73728MB memsw.limit=73728MB job_swappiness=18446744073709551614
[2022-12-16T10:14:41.451] Launching batch job 276364 for UID 1966782
[2022-12-16T10:14:41.464] [276364.batch] task/cgroup: _memcg_initialize: job: alloc=73728MB mem.limit=73728MB memsw.limit=73728MB job_swappiness=18446744073709551614
[2022-12-16T10:14:41.464] [276364.batch] task/cgroup: _memcg_initialize: step: alloc=73728MB mem.limit=73728MB memsw.limit=73728MB job_swappiness=18446744073709551614
[2022-12-16T11:30:30.159] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job276364/slurm_script

===========================================================================
This is job 276361, which died on gpu02

[2022-12-16T10:07:45.190] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job276361/slurm_script
[2022-12-16T10:07:45.190] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job268202/slurm_script
[2022-12-16T10:07:45.191] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job268203/slurm_script
[2022-12-16T10:07:45.191] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job275631/slurm_script
[2022-12-16T10:07:45.221] CPU frequency setting not configured for this node
[2022-12-16T10:07:45.224] slurmd version 22.05.3 started
[2022-12-16T10:07:45.236] slurmd started on Fri, 16 Dec 2022 10:07:45 -0600
[2022-12-16T10:07:45.236] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job276361/slurm_script
[2022-12-16T10:07:45.237] CPUs=32 Boards=1 Sockets=2 Cores=16 Threads=1 Memory=772422 TmpDisk=237936 Uptime=29 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2022-12-16T10:07:45.237] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job276361/slurm_script
[2022-12-16T10:07:45.247] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job276361/slurm_script
[2022-12-16T10:07:45.247] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job276361/slurm_script
[2022-12-16T10:07:45.247] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job276361/slurm_script
[2022-12-16T10:07:45.247] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job276361/slurm_script
[2022-12-16T10:07:49.796] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job276361/slurm_script


Below are the corresponding messages from slurmctld.log:

[2022-12-16T10:07:44.877] Batch JobId=276361 missing from batch node gpu02 (not found BatchStartTime after startup)
[2022-12-16T10:07:44.877] _job_complete: JobId=276361 WTERMSIG 126
[2022-12-16T10:07:44.877] _job_complete: JobId=276361 cancelled by node failure
[2022-12-16T10:07:44.877] _job_complete: JobId=276361 done
[2022-12-16T10:07:44.877] Node gpu02 now responding

[2022-12-16T11:30:30.050] Batch JobId=276364 missing from batch node gpu01 (not found BatchStartTime after startup)
[2022-12-16T11:30:30.050] _job_complete: JobId=276364 WTERMSIG 126
[2022-12-16T11:30:30.050] _job_complete: JobId=276364 cancelled by node failure
[2022-12-16T11:30:30.050] _job_complete: JobId=276364 done
[2022-12-16T11:30:30.051] Node gpu01 now responding

What is the issue with these jobs and the node failures?

Best,
Xing
Comment 1 Jason Booth 2022-12-16 13:10:56 MST
Please attach your slurm.conf and the slurmd.log from that node.
Comment 2 Jason Booth 2022-12-16 14:12:51 MST
While we wait on the slurm.conf, it is worth mentioning that the controller periodically connects to the slurmds over a TCP connection. If it cannot make a connection within the SlurmdTimeout period, the slurmd will be considered not responding and the node will be set DOWN.


You may want to consider increasing SlurmdTimeout. If your site currently uses a value of 300, a value of 500-600 may be worth trying.
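
For illustration, a minimal sketch of the relevant slurm.conf change (the values are examples only, assuming a current setting of 300):

# slurm.conf on the controller (and kept in sync on the compute nodes)
SlurmdTimeout=600        # seconds without a slurmd response before the node is set DOWN

followed by a reconfigure so the running controller should pick up the new value:

[root@mgt ~]# scontrol reconfigure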

Normally when we see these types of messages, the nodes are busy doing other work, such as copying large amounts of data to the site storage solution, or are simply running CPU-intensive workloads.
Comment 3 Xing Huang 2022-12-16 14:19:31 MST
(In reply to Jason Booth from comment #2)
> While we wait on the slurm.conf, it is worth mentioning that the controller
> periodically connects to the slurmds over a TCP connection. If it cannot
> make a connection within the SlurmdTimeout period, the slurmd will be
> considered not responding and the node will be set DOWN.
> 
> You may want to consider increasing SlurmdTimeout. If your site currently
> uses a value of 300, a value of 500-600 may be worth trying.
> 
> Normally when we see these types of messages, the nodes are busy doing other
> work, such as copying large amounts of data to the site storage solution, or
> are simply running CPU-intensive workloads.


That is a good suggestion. At this point, there is no more information in slurmd.log than what I have shown you.
I will first bump this value up in slurm.conf and see if it helps; I will verify the active setting as shown below.
Can I update you in a few days after I make the change?
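
For reference, the active setting can be checked before and after the change with (illustrative command):

[root@mgt ~]# scontrol show config | grep -i SlurmdTimeout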
Comment 4 Jason Booth 2022-12-16 14:26:02 MST
> Can I update you in a few days after I make the change?

Yes, for now I will proceed to downgrade the severity.
Comment 5 Jason Booth 2023-01-23 14:41:43 MST
I am closing this out. Please feel free to re-open; however, I would consider the bump in SlurmdTimeout to be sufficient.