Ticket 16063

Summary: Slurmstepd crash
Product: Slurm
Reporter: CSC sysadmins <csc-slurm-tickets>
Component: slurmstepd
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
QA Contact: Tim Wickberg <tim>
Severity: 4 - Minor Issue
Priority: ---
CC: tim
Version: 22.05.8
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5103
https://bugs.schedmd.com/show_bug.cgi?id=16273
Site: CSC - IT Center for Science
Version Fixed: 23.02.1 23.11pre0
Attachments: Current configuration
Valgrind log from the stepd crash
test.c
Dumped gstacks with 2sec interval until it crashed
v3 (22.05)

Description CSC sysadmins 2023-02-17 06:26:18 MST
Hi,

Here is a simple way to get slurmstepd to crash:

# srun -t 0:01 -p test -N1 --pty $SHELL

Open another terminal and run:

# killall -SIGSTOP srun

After a while slurmstepd crashes and the node becomes unavailable for other use.

(We should have a support contract as "CSC - IT Center for Science, Finland"; I'll check with our managers what's going on.)


Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `slurmstepd: [2020446.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fffbec55b44 in pthread_cancel () from /usr/lib64/libpthread.so.0
[Current thread is 1 (Thread 0x7fffbb388700 (LWP 2372381))]
Missing separate debuginfos, use: yum debuginfo-install slurm-slurmd-22.05.8-1_issue_660.el8.x86_64
(gdb) thread apply all bt

Thread 3 (Thread 0x7fffbfeed740 (LWP 2372299)):
#0  0x00007fffbe40cac1 in poll () from /usr/lib64/libc.so.6
#1  0x00007fffbfa2583b in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fffffffe5b8) at /usr/include/bits/poll2.h:46
#2  slurm_recv_timeout (fd=fd@entry=25, buffer=buffer@entry=0x7fffffffe624 "", size=size@entry=4, flags=flags@entry=0, timeout=timeout@entry=100000) at slurm_protocol_socket.c:403
#3  0x00007fffbfa25b76 in slurm_msg_recvfrom_timeout (fd=fd@entry=25, pbuf=pbuf@entry=0x7fffffffe660, lenp=lenp@entry=0x7fffffffe668, flags=flags@entry=0, tmout=tmout@entry=100000) at slurm_protocol_socket.c:133
#4  0x00007fffbf9e9019 in slurm_receive_msg (fd=fd@entry=25, msg=msg@entry=0x7fffffffe720, timeout=100000, timeout@entry=0) at slurm_protocol_api.c:1201
#5  0x00007fffbf9eb31f in slurm_send_recv_msg (fd=fd@entry=25, req=req@entry=0x7fffffffe910, resp=resp@entry=0x7fffffffe720, timeout=timeout@entry=0) at slurm_protocol_api.c:2313
#6  0x00007fffbf9eb33d in _send_and_recv_msg (fd=25, req=req@entry=0x7fffffffe910, resp=resp@entry=0x7fffffffe720, timeout=timeout@entry=0) at slurm_protocol_api.c:2332
#7  0x00007fffbf9ebd1f in slurm_send_recv_rc_msg_only_one (req=req@entry=0x7fffffffe910, rc=rc@entry=0x7fffffffe8ec, timeout=timeout@entry=0) at slurm_protocol_api.c:2844
#8  0x0000000000413b90 in _send_srun_resp_msg (nnodes=<optimized out>, resp_msg=0x7fffffffe910) at mgr.c:247
#9  _send_exit_msg (job=job@entry=0x6854d0, tid=<optimized out>, n=n@entry=1, status=status@entry=0) at mgr.c:655
#10 0x00000000004159fe in stepd_send_pending_exit_msgs (job=job@entry=0x6854d0) at mgr.c:2169
#11 0x0000000000415e3c in job_manager (job=job@entry=0x6854d0) at mgr.c:1512
#12 0x00000000004111d5 in main (argc=<optimized out>, argv=<optimized out>) at slurmstepd.c:193

Thread 2 (Thread 0x7fff32dd3700 (LWP 2372307)):
#0  0x00007fffbe40cac1 in poll () from /usr/lib64/libc.so.6
#1  0x000000000041a336 in poll (__timeout=-1, __nfds=1, __fds=0x7fff32dd2ec4) at /usr/include/bits/poll2.h:46
#2  _window_manager (arg=0x790960) at io.c:812
#3  0x00007fffbec4e1cf in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007fffbe321dd3 in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7fffbb388700 (LWP 2372381)):
#0  0x00007fffbec55b44 in pthread_cancel () from /usr/lib64/libpthread.so.0
#1  0x0000000000411e3d in _kill_thr (args=<optimized out>) at mgr.c:2385
#2  0x00007fffbec4e1cf in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fffbe321dd3 in clone () from /usr/lib64/libc.so.6
Comment 1 CSC sysadmins 2023-02-17 06:40:12 MST
Created attachment 28906 [details]
Current configuration
Comment 2 Jason Booth 2023-02-17 08:48:03 MST
> (We should have a support contract as "CSC - IT Center for Science, Finland"; I'll check with our managers what's going on.)

Thanks for letting us know. We have corrected the issue on our end, so you should not encounter issues while logging bugs in the future.
Comment 7 Marcin Stolarek 2023-02-21 06:28:54 MST
Tommi,

I can't reproduce the segfault, but I see the node getting drained with the "Kill task failed" reason. I assume srun and slurmd are running on different hosts; is the network between them a standard IP-over-Ethernet connection?

Looking at the code, to reach the line where slurmstepd segfaults we need to spend more than 300s in a write call from slurmstepd to srun. The goal of this write is to inform srun about the I/O connection shutdown (the job reached its time limit). For some reason, even when srun is stopped, this write completes immediately in my case (I'm working on VMs, maybe that's the reason).

The main thing is still the same: we fail to communicate with srun since it's stopped, so the whole kill procedure fails and the node gets drained.
I'll check whether we can improve that, but it may be really difficult to differentiate between communication with a stopped process (which still holds a socket connected to the port) and a genuine network issue.
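
For illustration only (a standalone editorial sketch, not the actual Slurm eio code; the helper name, buffer, and timeout values are invented): once srun is SIGSTOPped it presumably stops draining its end of the I/O connection, so after the kernel socket buffers fill a plain blocking write() can hang indefinitely. Bounding the write with poll() and a deadline is one way to avoid an unbounded wait:

/*
 * Minimal sketch: the "peer" end of a socketpair is never read from,
 * mimicking a stopped srun, so the sender's buffers eventually fill and
 * the poll()-guarded write reports ETIMEDOUT instead of hanging.
 */
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Attempt one write of "len" bytes to "fd", giving up after "timeout_ms". */
static ssize_t write_with_deadline(int fd, const void *buf, size_t len,
                                   int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc == 0) {              /* no buffer space freed up in time */
        errno = ETIMEDOUT;
        return -1;
    }
    if (rc < 0)
        return -1;
    return write(fd, buf, len);
}

int main(void)
{
    int sv[2];
    char chunk[4096];

    /* Local stand-in for the slurmstepd <-> srun connection; sv[1] is
     * deliberately never read, like a SIGSTOPped peer. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
        return 1;

    memset(chunk, 'x', sizeof(chunk));
    for (;;) {
        if (write_with_deadline(sv[0], chunk, sizeof(chunk), 2000) < 0) {
            perror("write_with_deadline");  /* ETIMEDOUT once buffers fill */
            break;
        }
    }
    return 0;
}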

cheers,
Marcin
Comment 8 CSC sysadmins 2023-02-21 08:12:39 MST
> I can't reproduce the segfault, but I see the node getting drained with the
> "Kill task failed" reason. I assume srun and slurmd are running on different
> hosts; is the network between them a standard IP-over-Ethernet connection?

Hi,

That's correct, login node <-> slurmd communication goes over IPoIB.
I tried on another, quite similar cluster and I cannot reproduce the segfault there (the biggest difference that comes to mind is the lack of the memory cgroup).
 
> Looking at the code, to reach the line where slurmstepd segfaults we need to
> spend more than 300s in a write call from slurmstepd to srun. The goal of
> this write is to inform srun about the I/O connection shutdown (the job
> reached its time limit). For some reason, even when srun is stopped, this
> write completes immediately in my case (I'm working on VMs, maybe that's the reason).

Yep, it takes ~5 minutes until the core is generated:

[2023-02-21T16:56:52.306] [2030167.0] error: *** STEP 2030167.0 ON c1102 CANCELLED AT 2023-02-21T16:56:52 DUE TO TIME LIMIT ***
[2023-02-21T16:59:36.058] error: run_command: health_check poll timeout @ 60000 msec
[2023-02-21T17:00:52.000] [2030167.0] error: *** STEP 2030167.0 STEPD TERMINATED ON c1102 AT 2023-02-21T17:00:51 DUE TO JOB NOT ENDING WITH SIGNALS ***

-rw-------   1 root  root  8110080 Feb 21 17:01 core.700073

BR,
Tommi
Comment 9 Marcin Stolarek 2023-02-22 06:10:10 MST
Tommi,

>That's correct, login node <-> slurmd communication goes over IPoIB [...]
So it looks like the mentioned write [1] hangs when the backend is IPoIB.

I modified the code to trigger the segfaulting pthread_cancel (from comment 0), but it worked fine for me. The cancel operation depends on the target thread reaching a pthread cancellation point, which may also behave differently on your backend.
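
For background, a self-contained editorial example (not Slurm code; the thread function and loop counts are invented): with the default deferred cancellation mode, pthread_cancel() only takes effect once the target thread reaches a cancellation point, so a thread that never hits one is not torn down:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Busy worker with no implicit cancellation points; it only becomes
 * cancellable at the explicit pthread_testcancel() call. */
static void *spin(void *arg)
{
    volatile unsigned long i = 0;

    (void) arg;
    for (;;) {
        i++;
        if (!(i % (1UL << 28)))
            pthread_testcancel();   /* explicit cancellation point */
    }
}

int main(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, spin, NULL))
        return 1;
    sleep(1);
    pthread_cancel(tid);    /* honoured only at the next cancellation point */
    pthread_join(tid, NULL);
    puts("worker cancelled");
    return 0;
}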

Just to clarify: are we in agreement that the node getting drained is the desired behavior, and that the only issue is that we run into a segmentation fault?

cheers,
Marcin 
[1]https://github.com/SchedMD/slurm/blob/slurm-22-05-8-1/src/common/eio.c#L229
Comment 10 CSC sysadmins 2023-02-23 06:43:37 MST
> Just to clarify: are we in agreement that the node getting drained is the
> desired behavior, and that the only issue is that we run into a segmentation fault?

Well, it's quite annoying that users can easily drain the cluster without even realizing it. Suspending an "srun -N200" script has quite a large impact, and IMO the drain reason is wrong: slurmd has already killed all user tasks on the compute node, so it could resume and accept jobs.

BR,
Tommi
Comment 11 CSC sysadmins 2023-02-24 02:00:27 MST
Created attachment 29022 [details]
Valgrind log from the stepd crash
Comment 13 Marcin Stolarek 2023-03-02 05:21:45 MST
Tommi,

I have a draft patch for the slurmstepd segfault you faced, but since I can't reproduce it I have no way to actually verify whether it works.

Do you have an environment where you can apply it and test? The patch hasn't passed our QA yet, but you can apply it on a single node.

Let me know.

cheers,
Marcin
Comment 15 CSC sysadmins 2023-03-02 05:57:32 MST
> Do you have an environment where you can apply it and test? The patch hasn't
> passed our QA yet, but you can apply it on a single node.

Hi,

Yes, I can try it on a single node.

Best Regards,
Tommi
Comment 20 Tim Wickberg 2023-03-02 14:42:23 MST
Created attachment 29121 [details]
test.c

Hey Tommi -

Do you mind compiling and running this on one of the afflicted compute nodes? While I can see that what we're doing here is non-standard, I'm just curious whether this simple test program reproduces the problem on your systems.

I'm guessing there's something different about your libc (specifically its pthread implementation) that's surfacing this issue now. Can you share some details about which distro is on this system?

thanks,
- Tim
Comment 22 CSC sysadmins 2023-03-03 01:05:07 MST
> Do you mind compiling and running this on one of the afflicted compute
> nodes? While I can see that what we're doing here is non-standard, I'm just
> curious whether this simple test program reproduces the problem on your systems.
> 
> I'm guessing there's something different about your libc (specifically its
> pthread implementation) that's surfacing this issue now. Can you share some
> details about which distro is on this system?

Hi,

This test does not crash on the login node (similar RHEL 8.6 to the compute nodes), but run as a job it crashes:


Reading symbols from ./slurm_bug16063_test...(no debugging symbols found)...done.
(gdb) r
Starting program: /users/tervotom/slurm_bug16063_test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffbf6e9700 (LWP 3544630)]
thread
[Thread 0x7fffbf6e9700 (LWP 3544630) exited]

Thread 1 "slurm_bug16063_" received signal SIGSEGV, Segmentation fault.
0x00007fffbfabeb44 in pthread_cancel () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fffbfabeb44 in pthread_cancel () from /lib64/libpthread.so.0
#1  0x00000000004007c1 in main ()
(gdb) quit
A debugging session is active.

        Inferior 1 [process 3544626] will be killed.

Quit anyway? (y or n) y
[tervotom@c3101 ~]$ rpm -q glibc
glibc-2.28-189.5.el8_6.x86_64
[tervotom@c3101 ~]$ uname -a
Linux c3101.mahti.csc.fi 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Fri Oct 7 12:35:10 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
[tervotom@c3101 ~]$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.6"
Comment 23 CSC sysadmins 2023-03-03 01:10:36 MST
Found a difference: jobs have a larger stack limit than login sessions:
stack size              (kbytes, -s) 1048576

After decreasing it to the RHEL default, the test does not crash anymore:

[tervotom@c3101 ~]$ ulimit -s 8192
[tervotom@c3101 ~]$ ./slurm_bug16063_test 
thread
[tervotom@c3101 ~]$ echo $?
0
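
Putting comments 20-23 together, a hypothetical reconstruction of such a reproducer (the real test.c, attachment 29121, is not shown here; the thread name and the sleep are invented) is a program that cancels a detached thread after it has already exited, which is undefined behaviour. A plausible explanation for the ulimit dependence is that glibc keeps exited threads' stacks in a small cache: with the job's 1 GB stack limit the stack (and the struct pthread that lives on it) is likely unmapped as soon as the thread exits, so the stale handle points at unmapped memory and pthread_cancel() faults, while with the default 8 MB stack it stays cached and the call happens to "work".

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void) arg;
    printf("thread\n");
    return NULL;                    /* exits almost immediately */
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    if (pthread_create(&tid, &attr, worker, NULL))
        return 1;

    sleep(1);                       /* let the detached thread finish */

    /* Undefined behaviour on purpose: the thread is already gone. */
    pthread_cancel(tid);
    return 0;
}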
Comment 27 Marcin Stolarek 2023-03-06 04:33:10 MST
Tommi,

I haven't shared the patch with you yet because we had an internal discussion on the issue and changed the approach for the potential patch. I'm testing the new version and will hand it over to you when it's ready.

cheers,
Marcin
Comment 28 Marcin Stolarek 2023-03-06 06:46:33 MST
Tommi,

Besides the segfault, I'd like to understand why the I/O thread hung.
Could you please reproduce the issue and run pstack on the slurmstepd process while the job is in the CG/completing state and again once it has completed?
If you can, run it every few seconds (in a loop?) and attach the log to the bug report.

cheers,
Marcin
Comment 29 CSC sysadmins 2023-03-06 07:41:04 MST
Created attachment 29153 [details]
Dumped gstacks with 2sec interval until it crashed
Comment 31 Marcin Stolarek 2023-03-09 06:06:06 MST
Tommi,

If you're able to perform a test, could you please apply attachment 29143 [details] on a single node and retest?

I expect that this time slurmstepd will continue to run in your environment for 500s (5 * MessageTimeout) after the job times out or is cancelled. For that time the job should remain in CG state; when it's done, the node shouldn't drain.

cheers,
Marcin
Comment 32 CSC sysadmins 2023-03-09 06:23:11 MST
> If you're able to perform a test, could you please apply attachment 29143
> [details] on a single node and retest?

Hi Marcin/Tim,

The patch seems to be for 23.02 and does not apply cleanly to 22.05 (*job seems to have been renamed to *step?):

slurm-22.05.8]$ patch -p1 --dry-run < /tmp/bug16063_2302_v3.patch 
checking file src/slurmd/slurmstepd/io.c
Hunk #1 FAILED at 1189.
Hunk #2 succeeded at 1471 with fuzz 2 (offset -11 lines).
1 out of 2 hunks FAILED
checking file src/slurmd/slurmstepd/mgr.c
Hunk #1 succeeded at 2375 (offset 26 lines).
Hunk #2 FAILED at 2358.
1 out of 2 hunks FAILED
checking file src/slurmd/slurmstepd/slurmstepd_job.h
Hunk #1 succeeded at 209 (offset -20 lines).

BR,
Tommi
Comment 33 Marcin Stolarek 2023-03-09 06:59:34 MST
My fault, I'll rebase it and share it with you.
Comment 36 Marcin Stolarek 2023-03-10 01:37:38 MST
Created attachment 29257 [details]
v3 (22.05)

Tommi,

Please find the rebased patch attached. Testing on the 22.05 code base I see that the node still gets drained. This changed in Slurm 23.02; however, I'll need to study further whether it is an unintentional behavior change.

cheers,
Marcin
Comment 37 CSC sysadmins 2023-03-10 02:30:29 MST
> Please find the rebased patch attached. Testing on the 22.05 code base I
> see that the node still gets drained. This changed in Slurm 23.02;
> however, I'll need to study further whether it is an unintentional behavior
> change.

Hi,

I did a quick test and got similar results; the node drained.

[2023-03-10T11:01:42.008] update_node: node c1101 reason set to: Kill task failed
[2023-03-10T11:01:42.008] update_node: node c1101 state set to DRAINING

Best Regards,
Tommi
Comment 45 Marcin Stolarek 2023-03-21 08:58:14 MDT
Tommi,

Looking at it longer, I think that draining the node is actually correct here. There are no leftover processes, but the slurmstepd teardown takes longer than the configured UnkillableStepTimeout (180s in your setup). This comes from the fact that slurmstepd cannot communicate with the stopped srun, and the timeouts for this communication are based on MessageTimeout (100s in your setup).

If you increase UnkillableStepTimeout to a value higher than 500s (the mentioned 5 * MessageTimeout) and run with the patch, the node should go back to idle and slurmstepd shouldn't crash.
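
For reference, an illustrative slurm.conf excerpt satisfying that relation (the values are examples matching the timeouts mentioned above, not a tuned recommendation):

MessageTimeout=100
UnkillableStepTimeout=600   # higher than 5 * MessageTimeout (500s here)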

cheers,
Marcin
Comment 49 Marcin Stolarek 2023-03-27 01:04:07 MDT
Tommi,

Commit 8c2e171aff [1] was merged to the Slurm 23.02 branch and will be part of the Slurm 23.02.1 release.

We'd like to keep the bug open since we've found a similar code pattern in another place too.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm
Comment 60 Marcin Stolarek 2023-06-09 03:51:41 MDT
Tommi,

I'm closing the case now. As mentioned before, the issue you faced should already be fixed in Slurm 23.02.1 by 8c2e171af.

We kept the ticket open for a while since we had an internal discussion on potential improvements to similar code patterns. Having considered other options, we decided to keep those places unchanged for now.

cheers,
Marcin