Ticket 17458

Summary: We sometimes see slurmstepd hang, keeping the job in CG indefinitely
Product: Slurm Reporter: devops <richard>
Component: slurmd Assignee: Nate Rini <nate>
Status: RESOLVED TIMEDOUT
Severity: 4 - Minor Issue
Priority: --- CC: nate
Version: 23.02.2   
Hardware: Linux   
OS: Linux   
Site: Stability AI

Description devops@stability.ai 2023-08-18 02:56:40 MDT
this is what we see in ps auxf

\_ /opt/slurm/sbin/slurmstepd spank epilog

the backtrace says

#0  0x0000147fc698f392 in __libc_read (fd=0, buf=0x7ffe840b1adc, nbytes=4) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00005636ff9350a9 in read (__nbytes=4, __buf=0x7ffe840b1adc, __fd=0) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2  _read_slurmd_conf_lite (fd=0) at slurmstepd.c:315
#3  0x00005636ff935e4b in _handle_spank_mode (argv=<optimized out>, argc=3) at slurmstepd.c:445
#4  _process_cmdline (argv=<optimized out>, argc=<optimized out>) at slurmstepd.c:485
#5  main (argc=<optimized out>, argv=<optimized out>) at slurmstepd.c:128
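Frame #0 shows the step daemon blocked in a 4-byte read() on fd 0. A hedged sketch of how such a backtrace and a core can be captured the next time it hangs — the pgrep pattern and output path are examples only, not something from this ticket:

```shell
# Find the hung step daemon. The [g] trick stops the pattern from
# matching this script's own command line.
pid=$(pgrep -f 'slurmstepd spank epilo[g]' | head -n1)
if [ -n "$pid" ] && command -v gdb >/dev/null; then
    gdb -batch -p "$pid" -ex 'thread apply all bt'   # live backtrace
    gcore -o /tmp/slurmstepd-hang "$pid"             # core for later analysis
fi
```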
Comment 1 devops@stability.ai 2023-08-18 02:57:51 MDT
we are using a SPANK plugin (NVIDIA Pyxis)
Comment 2 Nate Rini 2023-08-18 08:25:19 MDT
(In reply to devops@stability.ai from comment #0)
> #2  _read_slurmd_conf_lite (fd=0) at slurmstepd.c:315
> #3  0x00005636ff935e4b in _handle_spank_mode (argv=<optimized out>, argc=3)
> at slurmstepd.c:445
> #4  _process_cmdline (argv=<optimized out>, argc=<optimized out>) at
> slurmstepd.c:485
> #5  main (argc=<optimized out>, argv=<optimized out>) at slurmstepd.c:128

This backtrace suggests that slurmd is not sending the (cached) slurm.conf, which should not be delayed by an epilog or a prolog.

How often does this happen?

Please provide slurmd logs, at least at debug2 level, from when this happens. Please ask if you need instructions for setting up or retrieving the logs.
Comment 3 devops@stability.ai 2023-08-18 08:30:27 MDT
it is not often, but it blocks the job in CG and I need to manually ssh into the nodes to reboot them

I saw this happen on one node out of 8; 7 nodes closed fine but the 8th hung

please let me know how to generate debug2 on the workers

I will probably need to enable cluster-wide and hope it will happen soon
Comment 4 Nate Rini 2023-08-18 08:47:29 MDT
(In reply to devops@stability.ai from comment #3)
> it is not often, but it blocks the job in CG and I need to manually ssh into
> the nodes to reboot them
> 
> I saw this happen on one node out of 8; 7 nodes closed fine but the 8th hung
> 
> please let me know how to generate debug2 on the workers
> 
> I will probably need to enable cluster-wide and hope it will happen soon

Set the following in slurm.conf on the node:
> SlurmdDebug=debug2
> SlurmdLogFile=/path/to/place/logs

restart slurmd and start a test job.

If slurm.conf is on a shared filesystem, then make a new slurm.conf locally:
> include /path/to/slurm.conf
> SlurmdDebug=debug2
> SlurmdLogFile=/path/to/place/logs

and then start slurmd manually using "-f". Here is an example:
> slurmd -D -f /path/to/local/slurm.conf

Once testing is complete, send SIGINT (ctrl-C in terminal) and start slurmd normally in systemd.
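For the shared-filesystem case, the local override can be generated in place — the paths below are placeholders, not this site's actual locations:

```shell
# Build a node-local override conf that includes the shared slurm.conf
# and raises slurmd logging to debug2 (paths are placeholders).
cat > /tmp/slurm-debug.conf <<'EOF'
include /etc/slurm/slurm.conf
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurm/slurmd-debug.log
EOF

# Then run slurmd in the foreground against it (Ctrl-C to stop when done):
#   slurmd -D -f /tmp/slurm-debug.conf
```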
Comment 5 devops@stability.ai 2023-08-18 08:49:52 MDT
hm, the problem is I cannot reproduce the issue on demand. I need a way to catch it across all nodes of the cluster whenever it happens next

could take days, weeks
Comment 6 Nate Rini 2023-08-18 09:17:45 MDT
(In reply to devops@stability.ai from comment #5)
> hm, the problem is I cannot reproduce the issue on demand. I need a way to
> catch it across all nodes of the cluster whenever it happens next

In comment #0, there is a backtrace. Was a core dump taken, and do you still have it?

> could take days, weeks

Reducing ticket severity due to the rarity of this issue. We can always increase it later if it starts happening more.
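Since the hang cannot be reproduced on demand, one option is a periodic watcher that flags jobs stuck in CG and grabs backtraces from their nodes. A minimal sketch, assuming squeue works from the watching host and passwordless ssh to the compute nodes; the node names and pgrep pattern are hypothetical:

```shell
# Print the nodelist of every job currently in completing (CG) state.
# Input lines: "JOBID STATE NODELIST", as produced by:
#   squeue --states=CG --noheader -o "%A %t %N"
cg_nodelists() {
    awk '$2 == "CG" { print $3 }'
}

# Example cron-driven use (commented out; single-node jobs assumed --
# expand multi-node lists with "scontrol show hostnames" first):
#   squeue --states=CG --noheader -o "%A %t %N" | cg_nodelists | \
#   while read -r node; do
#       ssh "$node" 'pgrep -f "slurmstepd spank epilo[g]" | \
#           xargs -r -n1 gdb -batch -ex "thread apply all bt" -p'
#   done

# Stubbed demonstration of the filter:
printf '101 CG gpu-07\n102 R gpu-01\n' | cg_nodelists   # prints: gpu-07
```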
Comment 7 devops@stability.ai 2023-08-18 09:21:06 MDT
I do not have it anymore but when it happens again I will try to move the node to FAIL so we can inspect it properly
Comment 8 Nate Rini 2023-08-24 14:48:55 MDT
(In reply to devops@stability.ai from comment #7)
> I do not have it anymore but when it happens again I will try to move the
> node to FAIL so we can inspect it properly

I'm going to mark this ticket as timed out while we wait for the issue to happen again. Once it does, please just reply to the ticket and it will re-open automatically.