Ticket 17458

Summary: We sometimes see slurmstepd hang, keeping the job in CG indefinitely
Product: Slurm Reporter: devops <richard>
Component: slurmd Assignee: Nate Rini <nate>
Status: RESOLVED TIMEDOUT
Severity: 4 - Minor Issue
Priority: --- CC: nate
Version: 23.02.2   
Hardware: Linux   
OS: Linux   
Site: Stability AI

Description devops@stability.ai 2023-08-18 02:56:40 MDT
this is what we see in ps auxf

\_ /opt/slurm/sbin/slurmstepd spank epilog

the backtrace says

#0  0x0000147fc698f392 in __libc_read (fd=0, buf=0x7ffe840b1adc, nbytes=4) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00005636ff9350a9 in read (__nbytes=4, __buf=0x7ffe840b1adc, __fd=0) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2  _read_slurmd_conf_lite (fd=0) at slurmstepd.c:315
#3  0x00005636ff935e4b in _handle_spank_mode (argv=<optimized out>, argc=3) at slurmstepd.c:445
#4  _process_cmdline (argv=<optimized out>, argc=<optimized out>) at slurmstepd.c:485
#5  main (argc=<optimized out>, argv=<optimized out>) at slurmstepd.c:128
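Frame #0 shows the step daemon blocked in a 4-byte read() on fd 0. A hedged sketch of how such a backtrace and a core can be captured the next time it hangs — the pgrep pattern and output path are examples only, not something from this ticket:

```shell
# Find the hung step daemon. The [g] trick stops the pattern from
# matching this script's own command line.
pid=$(pgrep -f 'slurmstepd spank epilo[g]' | head -n1)
if [ -n "$pid" ] && command -v gdb >/dev/null; then
    gdb -batch -p "$pid" -ex 'thread apply all bt'   # live backtrace
    gcore -o /tmp/slurmstepd-hang "$pid"             # core for later analysis
fi
```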
Comment 1 devops@stability.ai 2023-08-18 02:57:51 MDT
we are using a SPANK plugin (NVIDIA Pyxis)
Comment 2 Nate Rini 2023-08-18 08:25:19 MDT
(In reply to devops@stability.ai from comment #0)
> #2  _read_slurmd_conf_lite (fd=0) at slurmstepd.c:315
> #3  0x00005636ff935e4b in _handle_spank_mode (argv=<optimized out>, argc=3)
> at slurmstepd.c:445
> #4  _process_cmdline (argv=<optimized out>, argc=<optimized out>) at
> slurmstepd.c:485
> #5  main (argc=<optimized out>, argv=<optimized out>) at slurmstepd.c:128

This backtrace suggests that slurmd is not sending the (cached) slurm.conf, which should not be delayed by an epilog or a prolog.

How often does this happen?

Please provide slurmd logs, at least at debug2 level, from when this happens. Please ask if you need instructions for setting up or retrieving the logs.
Comment 3 devops@stability.ai 2023-08-18 08:30:27 MDT
it is not often, but it blocks the job in CG and I need to manually ssh into the nodes to reboot them

I saw this happen on one node out of 8; 7 nodes closed fine but the 8th hung

please let me know how to generate debug2 on the workers

I will probably need to enable cluster-wide and hope it will happen soon
Comment 4 Nate Rini 2023-08-18 08:47:29 MDT
(In reply to devops@stability.ai from comment #3)
> it is not often, but it blocks the job in CG and I need to manually ssh into
> the nodes to reboot them
> 
> I saw this happen on one node out of 8; 7 nodes closed fine but the 8th hung
> 
> please let me know how to generate debug2 on the workers
> 
> I will probably need to enable cluster-wide and hope it will happen soon

Set the following in slurm.conf on the node:
> SlurmdDebug=debug2
> SlurmdLogFile=/path/to/place/logs

restart slurmd and start a test job.

If slurm.conf is on a shared filesystem, then make a new slurm.conf locally:
> include /path/to/slurm.conf
> SlurmdDebug=debug2
> SlurmdLogFile=/path/to/place/logs

and then start slurmd manually using "-f". Here is an example:
> slurmd -D -f /path/to/local/slurm.conf

Once testing is complete, send SIGINT (ctrl-C in terminal) and start slurmd normally in systemd.
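For the shared-filesystem case, the local override can be generated in place — the paths below are placeholders, not this site's actual locations:

```shell
# Build a node-local override conf that includes the shared slurm.conf
# and raises slurmd logging to debug2 (paths are placeholders).
cat > /tmp/slurm-debug.conf <<'EOF'
include /etc/slurm/slurm.conf
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurm/slurmd-debug.log
EOF

# Then run slurmd in the foreground against it (Ctrl-C to stop when done):
#   slurmd -D -f /tmp/slurm-debug.conf
```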
Comment 5 devops@stability.ai 2023-08-18 08:49:52 MDT
hm, the problem is I cannot reproduce the issue on demand. I need a way to catch it across all nodes of the cluster whenever it happens next

could take days, weeks
Comment 6 Nate Rini 2023-08-18 09:17:45 MDT
(In reply to devops@stability.ai from comment #5)
> hm, the problem is I cannot reproduce the issue on demand. I need a way to
> catch it across all nodes of the cluster whenever it happens next

In comment #0, there is a backtrace. Was a core dump taken, and do you still have it?

> could take days, weeks

Reducing ticket severity due to the rarity of this issue. We can always increase it later if it starts happening more.
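Since the hang cannot be reproduced on demand, one option is a periodic watcher that flags jobs stuck in CG and grabs backtraces from their nodes. A minimal sketch, assuming squeue works from the watching host and passwordless ssh to the compute nodes; the node names and pgrep pattern are hypothetical:

```shell
# Print the nodelist of every job currently in completing (CG) state.
# Input lines: "JOBID STATE NODELIST", as produced by:
#   squeue --states=CG --noheader -o "%A %t %N"
cg_nodelists() {
    awk '$2 == "CG" { print $3 }'
}

# Example cron-driven use (commented out; single-node jobs assumed --
# expand multi-node lists with "scontrol show hostnames" first):
#   squeue --states=CG --noheader -o "%A %t %N" | cg_nodelists | \
#   while read -r node; do
#       ssh "$node" 'pgrep -f "slurmstepd spank epilo[g]" | \
#           xargs -r -n1 gdb -batch -ex "thread apply all bt" -p'
#   done

# Stubbed demonstration of the filter:
printf '101 CG gpu-07\n102 R gpu-01\n' | cg_nodelists   # prints: gpu-07
```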
Comment 7 devops@stability.ai 2023-08-18 09:21:06 MDT
I do not have it anymore but when it happens again I will try to move the node to FAIL so we can inspect it properly
Comment 8 Nate Rini 2023-08-24 14:48:55 MDT
(In reply to devops@stability.ai from comment #7)
> I do not have it anymore but when it happens again I will try to move the
> node to FAIL so we can inspect it properly

I'm going to mark this ticket as timed out while we wait for the issue to happen again. Once it does, please just reply to the ticket and it will re-open automatically.