Ticket 16126 - Shared node getting pummelled with jobs occasionally has a batch "missing from batch node" due to prolog issue
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 24.11.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
Blocks: 18483 20114
Reported: 2023-02-25 01:51 MST by Chris Samuel (NERSC)
Modified: 2024-10-01 12:04 MDT
CC: 8 users

Site: NERSC
Version Fixed: 24.11.0rc1


Attachments
NERSC node prolog script for Perlmutter (1001 bytes, application/x-sh)
2023-02-28 18:27 MST, Chris Samuel (NERSC)
backtrace of parent slurmd whose child is stuck in __lll_lock_wait_private (7.61 KB, text/plain)
2023-03-27 11:31 MDT, James Botts
backtrace of child slurmd deadlocked in __lll_lock_wait_private (7.32 KB, text/plain)
2023-03-27 11:31 MDT, James Botts

Description Chris Samuel (NERSC) 2023-02-25 01:51:25 MST
Hi there,

On Perlmutter we occasionally see a CPU node that is being used for shared jobs have a whole slew of them fail with messages like these (picking one job as an example):

[2023-02-24T23:57:03.905] sched/backfill: _job_runnable_now: BACKFILL: JobId=5690995_3643(5694684) started in other partition during bf yield
[2023-02-25T01:36:42.761] Batch JobId=5690995_3643(5694684) missing from batch node nid004127 (not found BatchStartTime after startup), Requeuing job
[2023-02-25T01:36:42.761] _job_complete: JobId=5690995_3643(5694684) WTERMSIG 126
[2023-02-25T01:36:42.761] _job_complete: JobId=5690995_3643(5694684) cancelled by node failure
[2023-02-25T01:36:42.761] _job_complete: requeue JobId=5690995_3643(5694684) due to node failure
[2023-02-25T01:36:42.768] _job_complete: JobId=5690995_3643(5694684) done
[2023-02-25T01:36:42.791] agent(batch_launch): removed pending request for cancelled JobId=5694684
[2023-02-25T01:43:54.269] Resending TERMINATE_JOB request JobId=5690995_3643(5694684) Nodelist=nid004127
[2023-02-25T01:50:55.304] Resending TERMINATE_JOB request JobId=5690995_3643(5694684) Nodelist=nid004127
[2023-02-25T01:58:02.900] Resending TERMINATE_JOB request JobId=5690995_3643(5694684) Nodelist=nid004127
[2023-02-25T02:05:04.209] Resending TERMINATE_JOB request JobId=5690995_3643(5694684) Nodelist=nid004127

That job (with its many friends) then shows up stuck completing with:

JobId=5694684 EndTime=Unknown CompletingTime=19413-07:35:26 Nodes(COMPLETING)=nid004127 

Grabbing a core file of the running slurmd on the node with gcore, I noticed that the thread trying to terminate these jobs is stuck here:

Thread 96 (Thread 0x14ce6b0f0700 (LWP 98972)):
#0  0x000014ce78cd1a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000041214a in _wait_for_job_running_prolog (job_id=5694690) at req.c:6027
#2  0x00000000004154a4 in _rpc_terminate_job (msg=msg@entry=0x14cab800cba0) at req.c:5310
#3  0x000000000041d838 in slurmd_req (msg=msg@entry=0x14cab800cba0) at req.c:336
#4  0x000000000040e07e in _service_connection (arg=<optimized out>) at slurmd.c:620
#5  0x000014ce78cca6ea in start_thread () from /lib64/libpthread.so.0
#6  0x000014ce7847fa8f in clone () from /lib64/libc.so.6

This appears to be because the prologs for the running jobs are stuck in poll() on the prolog's output file descriptor, for example:

Thread 44 (Thread 0x14cd2a8e8700 (LWP 165018)):
#0  0x000014ce784731e9 in poll () from /lib64/libc.so.6
#1  0x000014ce79c4c2a2 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x14cd2a8e7768) at /usr/include/bits/poll2.h:46
#2  run_command (args=args@entry=0x14cd2a8e7900) at run_command.c:298
#3  0x000014ce728bdba2 in _run_subpath_command (x=<optimized out>, arg=0x14cd2a8e7900) at prep_script_slurmd.c:122
#4  0x000014ce79c209eb in list_for_each_max (l=l@entry=0x14cd5c01b720, max=max@entry=0x14cd2a8e785c, f=f@entry=0x14ce728bdb81 <_run_subpath_command>, arg=arg@entry=0x14cd2a8e7900, break_on_fail=break_on_fail@entry=1, write_lock=write_lock@entry=1) at list.c:613
#5  0x000014ce79c20ad3 in list_for_each (l=l@entry=0x14cd5c01b720, f=f@entry=0x14ce728bdb81 <_run_subpath_command>, arg=arg@entry=0x14cd2a8e7900) at list.c:580
#6  0x000014ce728be90a in slurmd_script (job_env=<optimized out>, cred=0x14cd5c014b90, is_epilog=is_epilog@entry=false) at prep_script_slurmd.c:205
#7  0x000014ce728bd9b1 in prep_p_prolog (job_env=<optimized out>, cred=<optimized out>) at prep_script.c:126
#8  0x000014ce79c39d19 in prep_g_prolog (job_env=job_env@entry=0x14cd2a8e7b80, cred=cred@entry=0x14cd5c014b90) at prep.c:221
#9  0x00000000004125cc in _run_prolog (job_env=job_env@entry=0x14cd2a8e7b80, cred=0x14cd5c014b90, remove_running=remove_running@entry=false) at req.c:5767
#10 0x00000000004199cb in _rpc_prolog (msg=msg@entry=0x14cd5c001c10) at req.c:2314
#11 0x000000000041bf8d in slurmd_req (msg=msg@entry=0x14cd5c001c10) at req.c:299
#12 0x000000000040e07e in _service_connection (arg=<optimized out>) at slurmd.c:620
#13 0x000014ce78cca6ea in start_thread () from /lib64/libpthread.so.0
#14 0x000014ce7847fa8f in clone () from /lib64/libc.so.6

What I saw with ps on the node (I'm afraid I've lost it from my scrollback, so you'll need to take my word on this) is that there were a lot of defunct prologs (zombie processes); Slurm was never getting around to waiting on them so they could be reaped.

We've only seen this on shared nodes, and only when a heap of processes are launched at the same time, so I'm wondering if this is some sort of race condition?

I've been looking at the code around line 298 of src/common/run_command.c where the poll() call is but I can't see anything obvious.

The only thing I had noticed was that the code asks for these notifications:

fds.events = POLLIN | POLLHUP | POLLRDHUP;

but before it checks them it does:

                        if (i == 0) {
                                continue;

so if poll() returns 0 but sets POLLHUP it'll never notice.

I'm not sure if that's likely (or even possible) though; with a simple test I cobbled up based on the code at http://www.greenend.org.uk/rjk/tech/poll.c, it appears that poll() on a closed pipe returns 1 and a read on it then returns 0. But I could be missing something (or perhaps the way slurmd works it does return 0).

All the best,
Chris
Comment 1 Chris Samuel (NERSC) 2023-02-25 17:13:04 MST
Hi there,

This has happened again today and strace appears to confirm that poll is returning 0, meaning the code just goes back around the loop again.

Here you can see a bunch of threads in that state:

70396 poll([{fd=400, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70427 poll([{fd=154, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70341 <... restart_syscall resumed>)    = 0
70341 poll([{fd=362, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70429 <... restart_syscall resumed>)    = 0
70429 poll([{fd=392, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70204 <... restart_syscall resumed>)    = 0
70204 poll([{fd=257, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70205 <... restart_syscall resumed>)    = 0
70205 poll([{fd=288, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70208 <... restart_syscall resumed>)    = 0
70208 poll([{fd=276, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70407 <... restart_syscall resumed>)    = 0
70407 poll([{fd=422, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70244 <... restart_syscall resumed>)    = 0
70244 poll([{fd=189, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70349 <... restart_syscall resumed>)    = 0
70349 poll([{fd=239, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>
70223 <... restart_syscall resumed>)    = 0
70223 poll([{fd=271, events=POLLIN|POLLHUP|POLLRDHUP}], 1, 500 <unfinished ...>


Unfortunately I don't think I can use strace to catch the value of that structure after poll() returns, so I tried using gdb to set a breakpoint immediately after the poll() call; if I'm reading this right, revents gives no indication that the process has gone.

Thread 16 "slurmd" hit Breakpoint 1, run_command (args=args@entry=0x151e446b0900) at run_command.c:299
299	run_command.c: No such file or directory.
(gdb) print i
$1 = <optimized out>
(gdb) print fds->revents
$2 = 0
(gdb) print new_wait
$3 = <optimized out>
(gdb) print fds->events
$4 = 8209



Any ideas?

All the best,
Chris
Comment 2 Chris Samuel (NERSC) 2023-02-26 00:01:40 MST
One final update for today: looking at one node's slurmd log, I see it trying to launch ~80 jobs at the same time, and most (~50) end up in this state, but the others launch OK. For them I do see logs complaining that:

[2023-02-25T19:33:32.207] Warning: slurmstepd startup took 6 sec, possible file system problem or full memory

Now these are diskless nodes, but we do tricks on boot to ensure that critical programs (like the various Slurm RPMs) are copied up into the in-memory overlayfs from the squashfs, and we also set MemSpecLimit=27298 to account for that and for the GPFS page pool on the node.

All the best,
Chris
Comment 3 Chris Samuel (NERSC) 2023-02-26 11:59:49 MST
Hi there,

I forgot to mention I did some more experimentation with poll() and can confirm that with a process forked but not outputting anything we see poll() return 0, and then when it's gone away it returns 1 and sets POLLHUP.

So I cannot understand how we get into a position where it's returning 0 for a defunct process, unless there's some brief window in which some non-thread-safe behaviour is occurring? Pure speculation though.

For the moment I've set up a procedure with our 24x7 SRE folks to restart slurmd on compute nodes when they see this strange situation (along with recording it via our ticketing system so we can track it).

I did look at the 23.02 code and it looks like this has been dramatically reworked since 22.05.

All the best,
Chris
Comment 5 Marshall Garey 2023-02-28 17:38:31 MST
Hi Chris,

Thanks for your detailed research. It was very helpful.

poll() will return 0 for a defunct process as long as its stdout file descriptor is still open.

Example prolog script that creates a defunct process:

#!/bin/bash
sleep 30 &

This will create a defunct prolog process that still has its stdout file descriptor open. After 30 seconds the sleep command exits, poll() returns 1, and run_command() exits the while(1) loop. Then it reaps the process.



With a prolog script that closes its stdout handle and then becomes a zombie, we exit the while(1) poll loop and call run_command_waitpid_timeout(), which waits on the process and reaps it right away. For example:

#!/bin/bash
log="/path/to/log/file"
exec >> $log 2>&1
sleep 30 &

If a bunch of prolog scripts are stuck this way, with descendants holding their descriptors open and never exiting, slurmd could wait indefinitely. You can set PrologEpilogTimeout to bound that wait: when the timeout is reached, run_command() kills the child's process group.


We could call waitpid(cpid, &status, WNOHANG) to check for a defunct process when poll() returns 0, instead of polling indefinitely until the defunct child's stdout descriptor is finally closed. I am exploring this idea as a patch.
Comment 6 Chris Samuel (NERSC) 2023-02-28 18:23:25 MST
Hi Marshall,

Thanks so much for that, very much appreciated!

I'll attach our node prolog in case that helps shed any light, but I'm not seeing anything obvious in it that might cause this (and from what I can remember of looking at a node with ps --forest, I wasn't seeing any child processes of the prologs hanging around).

The PrologEpilogTimeout option is a good idea; can't believe I missed that. Somehow I'd got it into my head from the source that it was the WaitTime setting that governed this.

All the best,
Chris
Comment 7 Chris Samuel (NERSC) 2023-02-28 18:27:19 MST
Created attachment 29095 [details]
NERSC node prolog script for Perlmutter
Comment 8 Marshall Garey 2023-03-02 11:24:02 MST
Hi Chris,

I also don't see anything obvious in your prolog script that would lead to a defunct process; however, you did mention that you saw a lot of defunct prolog processes, so I'd like to pursue this angle. Do you have a test system on which you can reliably reproduce this, or a single node on Perlmutter? If so, would you be willing to try a patch?
Comment 9 Chris Samuel (NERSC) 2023-03-02 17:53:43 MST
(In reply to Marshall Garey from comment #8)

> Hi Chris,

Hi Marshall,

> I also don't see anything obvious in your prolog script that would lead to a
> defunct process; however, you did mention that you saw a lot of defunct
> prolog processes, so I'd like to pursue this angle. Do you have a test
> system on which you can reliably reproduce this, or a single node on
> Perlmutter? If so, would you be willing to try a patch?

Sadly, no to both of those questions about reproducing it: I've been trying to reproduce it on a test system without luck, and it only seems to happen occasionally on Perlmutter (and hasn't since the weekend). I'm going to try to load up a node on one of the test systems outside of Slurm and then pummel it with a couple of thousand jobs.

If I can reproduce then yes, definitely interested in trying a patch!

All the best,
Chris
Comment 14 Marshall Garey 2023-03-03 15:31:18 MST
Hi Chris,

I just made a private patch public - it's 2205 v3 attachment 29142 [details]. Would you be able to test this?
Comment 16 Chris Samuel (NERSC) 2023-03-03 19:09:42 MST
(In reply to Marshall Garey from comment #14)

> Hi Chris,

Hi Marshall,

> I just made a private patch public - it's 2205 v3 attachment 29142 [details]
> [details]. Would you be able to test this?

Thanks so much for that. I will take a look and try to get it onto a test system this weekend so I can check it out; if it's OK it'll go to Perlmutter on Wednesday, as I've still not been able to trigger this on the test system.

All the best,
Chris
Comment 17 Chris Samuel (NERSC) 2023-03-07 15:13:41 MST
(In reply to Chris Samuel (NERSC) from comment #16)

> Thanks so much for that, I will take a look and try and get that onto a test
> system this weekend so I can check it out, and then if OK it'll go to
> Perlmutter on Wednesday as I've still not been able to trigger this on the
> test system.

Sorry for forgetting to update this yesterday, the testing went well and so we've got new Slurm RPMs which will go on with this patch tomorrow (and next Wednesday on to Cori).

We will also change the PrologEpilogTimeout setting to 10 minutes as a backstop for this on Perlmutter tomorrow.
Comment 18 Marshall Garey 2023-03-14 10:09:15 MDT
Hi Chris,

Have the patches been applied in production? Have you seen this issue happen again with the patches applied?
Comment 19 Chris Samuel (NERSC) 2023-03-22 17:19:31 MDT
(In reply to Marshall Garey from comment #18)

> Hi Chris,

Hiya!

> Have the patches been applied in production? Have you seen this issue happen
> again with the patches applied?

We have, but we did see this again. This time it appears that the prolog timed out, but the attempt to kill the slurmd process failed for reasons that are unclear.

perlmutter:nid004171:~ # fgrep -e 'killing pg' -e 'slurmd start' -e 72462 journalctl.log 
Mar 20 23:12:27 nid004171 slurmd[110853]: slurmd: slurmd started on Mon, 20 Mar 2023 23:12:27 +0000
Mar 20 23:12:27 nid004171 slurmd[110853]: slurmd started on Mon, 20 Mar 2023 23:12:27 +0000
Mar 22 16:02:33 nid004171 slurmd[110853]: slurmd: error: prolog: timeout after 600000 ms: killing pgid 72462
Mar 22 16:02:33 nid004171 slurmd[110853]: error: prolog: timeout after 600000 ms: killing pgid 72462
Mar 22 21:30:11 nid004171 systemd[1]: slurmd.service: Unit process 72462 (slurmd) remains running after unit stopped.
Mar 22 21:30:12 nid004171 systemd[1]: slurmd.service: Unit process 72462 (slurmd) remains running after unit stopped.
Mar 22 22:33:14 nid004171 slurmd[199302]: slurmd: slurmd started on Wed, 22 Mar 2023 22:33:14 +0000
Mar 22 22:33:14 nid004171 slurmd[199302]: slurmd started on Wed, 22 Mar 2023 22:33:14 +0000


You can see there that it did say it was going to kill the process, but it failed: when our SREs tried to restart slurmd, that one process refused to go away.

I did use gcore to grab a core file of it, and it looks like it was wedged waiting for a lock, despite being the only thread in its process!

perlmutter:nid004171:~ # gdb -batch -ex 'thread apply all bt' /usr/sbin/slurmd core.72462

warning: Can't open file /run/nscd/db1HLr62 (deleted) during file-backed mapping note processing
[New LWP 72462]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmd'.
#0  0x00001491677beb6e in __lll_lock_wait_private () from /lib64/libc.so.6

Thread 1 (Thread 0x1490b32f2700 (LWP 72462)):
#0  0x00001491677beb6e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00001491677f8841 in __tz_convert () from /lib64/libc.so.6
#2  0x00001491690a55c6 in _xiso8601timecat (buf=buf@entry=0x1490b32ef3e0, msec=msec@entry=true) at xstring.c:256
#3  0x0000149168fe4190 in vxstrfmt (fmt=0x1491690bb066 "M] %s%s%s", ap=ap@entry=0x1490b32ef588) at log.c:981
#4  0x0000149168fe4931 in xlogfmtcat (dst=dst@entry=0x1490b32ef698, fmt=fmt@entry=0x1491690bb064 "[%M] %s%s%s") at log.c:1125
#5  0x0000149168fe4cef in _log_msg (level=level@entry=LOG_LEVEL_ERROR, sched=sched@entry=false, spank=spank@entry=false, fmt=fmt@entry=0x149162384889 "%s: open failed for %s: %s", args=args@entry=0x1490b32ef738) at log.c:1284
#6  0x0000149168fe5ef3 in error (fmt=fmt@entry=0x149162384889 "%s: open failed for %s: %s") at log.c:1387
#7  0x0000149162383bb1 in container_p_join (job_id=<optimized out>, uid=<optimized out>) at job_container_tmpfs.c:752
#8  0x0000000000420204 in container_g_join (job_id=job_id@entry=6326436, uid=0) at job_container_plugin.c:210
#9  0x000014916217d3f3 in _run_spank_job_script (job_id=6326436, env=0x1490280087b0, mode=0x14916217d9fc "prolog") at prep_script_slurmd.c:387
#10 slurmd_script (job_env=0x1490b32f1b80, cred=0x149028011860, is_epilog=is_epilog@entry=false) at prep_script_slurmd.c:175
#11 0x000014916217c9b1 in prep_p_prolog (job_env=<optimized out>, cred=<optimized out>) at prep_script.c:126
#12 0x0000149168ffb09e in prep_g_prolog (job_env=job_env@entry=0x1490b32f1b80, cred=cred@entry=0x149028011860) at prep.c:221
#13 0x00000000004125cc in _run_prolog (job_env=job_env@entry=0x1490b32f1b80, cred=0x149028011860, remove_running=remove_running@entry=false) at req.c:5767
#14 0x00000000004199cb in _rpc_prolog (msg=msg@entry=0x1490280113c0) at req.c:2314
#15 0x000000000041bf8d in slurmd_req (msg=msg@entry=0x1490280113c0) at req.c:299
#16 0x000000000040e07e in _service_connection (arg=<optimized out>) at slurmd.c:620
#17 0x00001491680896ea in start_thread () from /lib64/libpthread.so.0
#18 0x000014916783ea6f in clone () from /lib64/libc.so.6


I can see from the code that _kill_pg() is meant to kill -9 it, but I'm not sure it did, since the process did go away when I ran kill -9 on it manually.

All the best,
Chris
Comment 22 James Botts 2023-03-27 11:31:04 MDT
Created attachment 29536 [details]
backtrace of parent slurmd whose child is stuck in __lll_lock_wait_private
Comment 23 James Botts 2023-03-27 11:31:45 MDT
Created attachment 29537 [details]
backtrace of child slurmd deadlocked in __lll_lock_wait_private
Comment 24 James Botts 2023-03-27 11:33:28 MDT
Hi - I've attached two backtraces of slurmd on a perlmutter login node that has the same sort of issues and backtraces given previously.

We have the parent slurmd and a child and the child is stuck.

We see this sort of thing occasionally on perlmutter shared compute nodes as well.
Comment 25 Marshall Garey 2023-03-28 13:11:37 MDT
Created attachment 29561 [details]
22.05 backport of commit f91b546119

Thanks for the additional information. This deadlock actually does not touch any part of the code that was changed in the previous patch that you applied. This patch (v4) will avoid calling container_g_join() for job_container/tmpfs in the spot that is deadlocking. The container_g_join() fails and calls error() which is what is deadlocking. So, in theory this patch will avoid this deadlock. I am still trying to reproduce this and identify the fundamental issue. I have a theory but have not proven it yet.

Please remove the previous patch (attachment 29142 [details] 2205 v3) and apply this patch instead.

If you see a stuck slurmd with lots of defunct prolog processes, apply attachment 29142 [details] 2205 v3 and see if that combined with v4 prevents slurmd's from getting stuck.

Will you let us know how it goes in a few days, or earlier if you run into this issue again?
Comment 26 Chris Samuel (NERSC) 2023-03-28 15:25:12 MDT
Hi Marshall,

(In reply to Marshall Garey from comment #25)

> Will you let us know how it goes in a few days, or earlier if you run into
> this issue again?

Sorry been in meetings most of the day. I'm not sure we've got the time to get this in and tested before tomorrow's maintenance, but will see what I can do.

All the best,
Chris
Comment 28 Chris Samuel (NERSC) 2023-04-10 23:40:39 MDT
Hi Marshall,

This is going on to Perlmutter in the current maintenance.

All the best,
Chris
Comment 29 Marshall Garey 2023-05-05 13:44:17 MDT
Hey Chris,

Did this patch go live on perlmutter?

Also, I have a suspicion of what the problem is fundamentally and this patch just works around it, but I have failed to reproduce it so far.
Comment 30 Marshall Garey 2023-05-22 14:58:24 MDT
Hey Chris, just checking if the current patch (22.05 backport of commit f91b546119) has solved this issue for you.
Comment 33 Marshall Garey 2023-06-15 15:09:26 MDT
Hi Chris,

Just another ping if backporting commit f91b546119 to 22.05 solved this issue for you.

I believe the fundamental cause is the unsafe use of logging in a child after a fork in a multi-threaded program (slurmd in this case). I'm working on minimizing these calls (which is what the patch does), and am also looking at reworking how we run these scripts to totally eliminate unsafe calls in a child after a fork.
Comment 34 Chris Samuel (NERSC) 2023-07-03 08:57:38 MDT
Hi Marshall,

Sorry for not seeing your previous prods on this. :-(

Yes, that did indeed help us, thank you!

I came back to this just now having seen:

commit 5a3f79271b8f584f390714fa89861b4e9e093097
Author: Marshall Garey <marshall@schedmd.com>
Date:   Fri Mar 3 13:59:17 2023 -0700

    run_command() - Avoid infinitely waiting for a defunct process
    
    Reap defunct processes right away instead of waiting for the stdout file
    descriptor of the script process to be closed. A defunct process could hold
    a stdout file descriptor open - for example, if it had spawned children
    processes, those children would still be holding the stdout file descriptor
    open. This would cause poll() to return 0 and run_command() could
    potentially wait indefinitely in the while loop calling poll().
    
    Bug 16126

which I found looking through the NEWS file for 23.02.3. In 23.02.2 we're seeing jobs trip the unkillable step script due to getting blocked, and in some cases I can see slurmstepd trying to poll a zombie process for accounting info. Does that sound related to that change?

All the best,
Chris
Comment 35 Marshall Garey 2023-07-03 09:02:19 MDT
(In reply to Chris Samuel (NERSC) from comment #34)
> Hi Marshall,
> 
> Sorry for not seeing your previous prods on this. :-(
> 
> Yes, that did indeed help us, thank you!

Good to hear, thanks for the update. I am working on something for 23.11 to reduce and remove a lot more unsafe behavior in order to avoid more deadlocks like the one you saw.


> I came back to this just now having seen:
> 
> commit 5a3f79271b8f584f390714fa89861b4e9e093097
> Author: Marshall Garey <marshall@schedmd.com>
> Date:   Fri Mar 3 13:59:17 2023 -0700
> 
>     run_command() - Avoid infinitely waiting for a defunct process
>     
>     Reap defunct processes right away instead of waiting for the stdout file
>     descriptor of the script process to be closed. A defunct process could
> hold
>     a stdout file descriptor open - for example, if it had spawned children
>     processes, those children would still be holding the stdout file
> descriptor
>     open. This would cause poll() to return 0 and run_command() could
>     potentially wait indefinitely in the while loop calling poll().
>     
>     Bug 16126
> 
> which I found looking through the NEWS file for 23.02.3 as in 23.02.2 we're
> seeing jobs trip the unkillable step script due to getting blocked - and in
> some cases I can see slurmstepd trying to poll a zombie job for accounting
> info. Does that sound related to that change?
> 
> All the best,
> Chris

This is unlikely related. run_command() is an internal function used to launch various scripts such as the various prologs/epilogs, but not the actual tasks launched by srun. Unkillable step timeout happens when tasks launched by srun (or processes spawned by those tasks) are unable to be killed.
Comment 36 Chris Samuel (NERSC) 2023-07-03 09:17:40 MDT
(In reply to Marshall Garey from comment #35)

[...]
> Good to hear, thanks for the update. I am working on something for 23.11 to
> reduce and remove a lot more unsafe behavior in order to avoid more
> deadlocks like the one you saw.

No worries, thank you!

[...]
> This is unlikely related. run_command() is an internal function used to
> launch various scripts such as the various prologs/epilogs, but not the
> actual tasks launched by srun. Unkillable step timeout happens when tasks
> launched by srun (or processes spawned by those tasks) are unable to be
> killed.

Ah OK, of course, thank you! [insert facepalm emoji here]

I've been going through gathering info and will see what I can pull together for another bug report or two.

All the best,
Chris
Comment 40 Marshall Garey 2023-10-25 11:21:14 MDT
I have a prototype for a "script launcher" that avoids the non-async-signal-safe function calls after a fork. However, there are still a lot of things to consider and potentially a lot more places to look at converting to using this method. We are so close to releasing 23.11 that we are not going to include this code change in 23.11, so I'm targeting it for 24.08.

Have you seen any issues similar to comment 19?

https://bugs.schedmd.com/show_bug.cgi?id=16126#c19
Comment 43 Marshall Garey 2024-02-27 20:18:10 MST
Since it has been a long time and there are lots of comments on this bug, I'd like to summarize the issue once more. The bug is calling functions that are not async-signal-safe in a child after a fork() in a multi-threaded program. Slurm does this all over the place, and this is why your prolog script deadlocked.

We are working on fixing this in the next major release of Slurm, or at least fixing it in the large majority of places.
Comment 58 Marshall Garey 2024-04-19 09:45:24 MDT
We unfortunately just missed the release window for 24.05, so we are now targeting 24.11. We have some preliminary commits:

* 7fe7a64d31 (HEAD -> master, origin/master, origin/HEAD) run_command - Add a wrapper function for execv
* 7bca6e36ab run_command - Remove a non-async-signal-safe function from a child
* ca71d1ccb5 run_command - Move code for child before exec to a separate function
* d7dbcbc214 run_command - Remove unused field turnoff_output
* c8ed431be1 run_command - Move child code to a function
* 4710a4690f slurmscriptd - Simplify arguments to _run_script
* dab509a794 slurmscriptd - Move code to change our process name to a function
* 48dee393c0 Add slurmstepd path to {un}pack_slurmd_conf_lite
* b0d73816c1 Move run_command init/fini from the datawarp plugin to slurmctld
* cc170fd414 Fix bad check in fd_close
* 577bcba0b2 Change static function _close_fd to be an extern and move it to fd.c/h
* 7551b2fcbe burst_buffer.lua.example - remove prefix "burst_buffer.lua" from logs
* 2767e660d1 burst_buffer.lua.example - remove log in pools function

One of these commits does actually fix an extremely rare case of a non-async-signal-safe function:

> * 7bca6e36ab run_command - Remove a non-async-signal-safe function from a child


We have a set of patches in review which make it really easy for us to fix many or most of the places in Slurm with non-async-signal-safe functions after a fork. My hope is that we can get this upstream very soon after 24.05 is branched and master becomes the 24.11 development branch, and then we can continue to convert additional places to use the new safe method.
Comment 59 Chris Samuel (NERSC) 2024-04-23 16:33:34 MDT
Hi Marshall,

Thanks so much for the update, best of luck with this!

All the best,
Chris
Comment 75 Marshall Garey 2024-06-19 18:05:30 MDT
Hey Chris! I'm happy to report that we have pushed quite a few commits to fix *many*, but not all, of the places with unsafe calls after a fork.

Commits 7ec113a211..da922b364c
Commit 9536c65cac
Commit 19fdb20025
Commits 0196c5aa4d..c222efc277

The remaining unsafe places in Slurm that fork() outside of run_command.c are trickier to fix. We are still working to fix as many as possible before 24.11 is released. Because those changes will most likely be part of another, larger project we're doing for 24.11, I'm going to close this ticket as fixed and track the additional work in the internal ticket for that project.

This has required some big changes, and we'll see even bigger changes in 24.11, so we really appreciate your patience.

Closing as fixed in 24.11.
Comment 77 Ben Roberts 2024-10-01 12:04:59 MDT
*** Ticket 20777 has been marked as a duplicate of this ticket. ***