Hi,

Here is a simple way to get slurmstepd to crash:

# srun -t 0:01 -p test -N1 --pty $SHELL

Open another terminal and:

# killall -SIGSTOP srun

After a while the stepd crashes and the node is unavailable for other use.

(We should have a support contract, "CSC - IT Center for Science, Finland" - I'll check with the managers what's going on.)

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `slurmstepd: [2020446.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fffbec55b44 in pthread_cancel () from /usr/lib64/libpthread.so.0
[Current thread is 1 (Thread 0x7fffbb388700 (LWP 2372381))]
Missing separate debuginfos, use: yum debuginfo-install slurm-slurmd-22.05.8-1_issue_660.el8.x86_64
(gdb) thread apply all bt

Thread 3 (Thread 0x7fffbfeed740 (LWP 2372299)):
#0  0x00007fffbe40cac1 in poll () from /usr/lib64/libc.so.6
#1  0x00007fffbfa2583b in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fffffffe5b8) at /usr/include/bits/poll2.h:46
#2  slurm_recv_timeout (fd=fd@entry=25, buffer=buffer@entry=0x7fffffffe624 "", size=size@entry=4, flags=flags@entry=0, timeout=timeout@entry=100000) at slurm_protocol_socket.c:403
#3  0x00007fffbfa25b76 in slurm_msg_recvfrom_timeout (fd=fd@entry=25, pbuf=pbuf@entry=0x7fffffffe660, lenp=lenp@entry=0x7fffffffe668, flags=flags@entry=0, tmout=tmout@entry=100000) at slurm_protocol_socket.c:133
#4  0x00007fffbf9e9019 in slurm_receive_msg (fd=fd@entry=25, msg=msg@entry=0x7fffffffe720, timeout=100000, timeout@entry=0) at slurm_protocol_api.c:1201
#5  0x00007fffbf9eb31f in slurm_send_recv_msg (fd=fd@entry=25, req=req@entry=0x7fffffffe910, resp=resp@entry=0x7fffffffe720, timeout=timeout@entry=0) at slurm_protocol_api.c:2313
#6  0x00007fffbf9eb33d in _send_and_recv_msg (fd=25, req=req@entry=0x7fffffffe910, resp=resp@entry=0x7fffffffe720, timeout=timeout@entry=0) at slurm_protocol_api.c:2332
#7  0x00007fffbf9ebd1f in slurm_send_recv_rc_msg_only_one (req=req@entry=0x7fffffffe910, rc=rc@entry=0x7fffffffe8ec, timeout=timeout@entry=0) at slurm_protocol_api.c:2844
#8  0x0000000000413b90 in _send_srun_resp_msg (nnodes=<optimized out>, resp_msg=0x7fffffffe910) at mgr.c:247
#9  _send_exit_msg (job=job@entry=0x6854d0, tid=<optimized out>, n=n@entry=1, status=status@entry=0) at mgr.c:655
#10 0x00000000004159fe in stepd_send_pending_exit_msgs (job=job@entry=0x6854d0) at mgr.c:2169
#11 0x0000000000415e3c in job_manager (job=job@entry=0x6854d0) at mgr.c:1512
#12 0x00000000004111d5 in main (argc=<optimized out>, argv=<optimized out>) at slurmstepd.c:193

Thread 2 (Thread 0x7fff32dd3700 (LWP 2372307)):
#0  0x00007fffbe40cac1 in poll () from /usr/lib64/libc.so.6
#1  0x000000000041a336 in poll (__timeout=-1, __nfds=1, __fds=0x7fff32dd2ec4) at /usr/include/bits/poll2.h:46
#2  _window_manager (arg=0x790960) at io.c:812
#3  0x00007fffbec4e1cf in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007fffbe321dd3 in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7fffbb388700 (LWP 2372381)):
#0  0x00007fffbec55b44 in pthread_cancel () from /usr/lib64/libpthread.so.0
#1  0x0000000000411e3d in _kill_thr (args=<optimized out>) at mgr.c:2385
#2  0x00007fffbec4e1cf in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fffbe321dd3 in clone () from /usr/lib64/libc.so.6
Created attachment 28906 [details] Current configuration
> (We should have a support contract, "CSC - IT Center for Science, Finland" -
> I'll check with the managers what's going on.)

Thanks for letting us know. We have corrected the issue on our end, so you should not run into problems while logging bugs in the future.
Tommi,

I can't reproduce the segfault, but I do see the node getting drained with the "Kill task failed" reason. I assume srun and slurmd are running on different hosts - is the network between them a standard IP over Ethernet connection?

Looking at the code, to reach the line where slurmstepd segfaults we need to spend more than 300s in a write call from slurmstepd to srun. The goal of this write is to inform srun about the I/O connection shutdown (the job reached its time limit). For some reason, even if srun is stopped, this write returns immediately in my case (I'm working on VMs, maybe that's the reason).

The main picture is still the same: we fail to communicate with srun, since it is stopped, and the whole kill procedure fails - the node gets drained. I'll check the details to see if we can improve that, but it may be really difficult to differentiate between communication with a stopped process (which still holds the socket connected to the port) and a network issue.

cheers,
Marcin
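To illustrate the generic mechanism (a minimal sketch, not Slurm code): once the peer stops draining its end of the connection, the sender's write() blocks as soon as the kernel socket buffers fill up, which is roughly what a SIGSTOPped srun looks like to slurmstepd. The program below uses a local socketpair as a stand-in and intentionally hangs after a few hundred KB, so interrupt it with Ctrl-C:

/* Sketch: writing to a peer that never reads blocks once the
 * socket buffers are full. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    char chunk[4096];
    size_t total = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("socketpair");
        return 1;
    }
    memset(chunk, 'x', sizeof(chunk));

    /* Nobody ever reads from sv[1] - like a stopped srun. */
    for (;;) {
        ssize_t n = write(sv[0], chunk, sizeof(chunk));
        if (n < 0) {
            perror("write");
            break;
        }
        total += (size_t)n;
        printf("wrote %zu bytes so far\n", total);
        /* Once the socket buffer is full this write() blocks forever -
         * the sender is now stuck, just as slurmstepd gets stuck
         * notifying a stopped srun. */
    }
    return 0;
}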
> I can't reproduce the segfault, but I do see the node getting drained with
> the "Kill task failed" reason. I assume srun and slurmd are running on
> different hosts - is the network between them a standard IP over Ethernet
> connection?

Hi,

That's correct, the login node <-> slurmd communication goes over IPoIB.

I tried on another cluster which is quite similar, and there I cannot reproduce the segfault (the biggest difference that comes to my mind is the lack of the memory cgroup).

> Looking at the code, to reach the line where slurmstepd segfaults we need to
> spend more than 300s in a write call from slurmstepd to srun. The goal of
> this write is to inform srun about the I/O connection shutdown (the job
> reached its time limit). For some reason, even if srun is stopped, this
> write returns immediately in my case (I'm working on VMs, maybe that's the
> reason).

Yep, it takes ~5 minutes until the core is generated:

[2023-02-21T16:56:52.306] [2030167.0] error: *** STEP 2030167.0 ON c1102 CANCELLED AT 2023-02-21T16:56:52 DUE TO TIME LIMIT ***
[2023-02-21T16:59:36.058] error: run_command: health_check poll timeout @ 60000 msec
[2023-02-21T17:00:52.000] [2030167.0] error: *** STEP 2030167.0 STEPD TERMINATED ON c1102 AT 2023-02-21T17:00:51 DUE TO JOB NOT ENDING WITH SIGNALS ***

-rw------- 1 root root 8110080 Feb 21 17:01 core.700073

BR,
Tommi
Tommi,

> That's correct, the login node <-> slurmd communication goes over IPoIB

[...] so it looks like the mentioned write[1] hangs when the backend is IPoIB.

I modified the code to trigger the segfaulting pthread_cancel (from comment 0), but it worked fine for me. The cancel operation depends on the thread reaching a pthread cancellation point, which may also behave differently on your backend.

Just to clarify - are we in agreement that the node getting drained is the desired behavior, and the only issue is that we run into the segmentation fault?

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-22-05-8-1/src/common/eio.c#L229
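As background on cancellation points (a minimal sketch, not Slurm code, assuming the default deferred cancelability): pthread_cancel only takes effect once the target thread reaches a cancellation point such as poll(), so a thread blocked there is cancelled promptly, while a thread spinning in pure computation would never be.

/* Sketch: deferred cancellation acts at a cancellation point. */
#include <pthread.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

static void *blocked_in_poll(void *arg)
{
    (void)arg;
    /* poll() with no fds and an infinite timeout blocks forever,
     * but it is a cancellation point. */
    poll(NULL, 0, -1);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    void *ret;

    pthread_create(&tid, NULL, blocked_in_poll, NULL);
    sleep(1);             /* let the thread reach poll() */
    pthread_cancel(tid);  /* honored at the poll() cancellation point */
    pthread_join(tid, &ret);
    printf("cancelled: %s\n", ret == PTHREAD_CANCELED ? "yes" : "no");
    return 0;
}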
> Just to clarify - are we in agreement that the node getting drained is the
> desired behavior, and the only issue is that we run into the segmentation
> fault?

Well, it's quite annoying that users can easily drain the cluster without even realizing it. Suspending an "srun -N200" script has quite a large impact, and IMO the drain reason is wrong: slurmd has already killed all user tasks on the compute node, so it could resume and accept jobs.

BR,
Tommi
Created attachment 29022 [details] Valgrind log from the stepd crash
Tommi,

I have a draft patch for the slurmstepd segfault you faced, but since I can't reproduce it, I don't have a way to actually verify that it works. Do you have an environment where you can apply it and test? The patch hasn't gone through our QA yet, but you can apply it on a single node.

Let me know.

cheers,
Marcin
> Do you have an environment where you can apply it and test? The patch hasn't
> gone through our QA yet, but you can apply it on a single node.

Hi,

Yes, I can try it on a single node.

Best Regards,
Tommi
Created attachment 29121 [details]
test.c

Hey Tommi -

Do you mind compiling and running this on one of the afflicted compute nodes? While what we're doing here is non-standard, I'm curious whether this simple test program reproduces the problem on your systems.

I'm guessing there's something different about your libc (specifically the pthread implementation) that's surfacing this issue now - can you share some details on what distro is on this system?

thanks,
- Tim
> Do you mind compiling and running this on one of the afflicted compute
> nodes? While what we're doing here is non-standard, I'm curious whether this
> simple test program reproduces the problem on your systems.
>
> I'm guessing there's something different about your libc (specifically the
> pthread implementation) that's surfacing this issue now - can you share some
> details on what distro is on this system?

Hi,

This test does not crash on the login node (similar RHEL 8.6 as the compute nodes), but run as a job it crashes:

Reading symbols from ./slurm_bug16063_test...(no debugging symbols found)...done.
(gdb) r
Starting program: /users/tervotom/slurm_bug16063_test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffbf6e9700 (LWP 3544630)]
thread
[Thread 0x7fffbf6e9700 (LWP 3544630) exited]

Thread 1 "slurm_bug16063_" received signal SIGSEGV, Segmentation fault.
0x00007fffbfabeb44 in pthread_cancel () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fffbfabeb44 in pthread_cancel () from /lib64/libpthread.so.0
#1  0x00000000004007c1 in main ()
(gdb) quit
A debugging session is active.
        Inferior 1 [process 3544626] will be killed.
Quit anyway? (y or n) y

[tervotom@c3101 ~]$ rpm -q glibc
glibc-2.28-189.5.el8_6.x86_64
[tervotom@c3101 ~]$ uname -a
Linux c3101.mahti.csc.fi 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Fri Oct 7 12:35:10 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
[tervotom@c3101 ~]$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.6"
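For readers without access to the attachment: test.c is not reproduced here, but judging from the gdb session above (a short-lived thread prints "thread" and exits, then main calls pthread_cancel on it), a minimal program in the same spirit might look roughly like the sketch below. This is an assumption based on the backtrace, not the actual attachment, and whether it crashes depends on the environment (see the stack-size observation in the next comment).

/* Hypothetical reconstruction (assumption, not the attached test.c):
 * main() calls pthread_cancel() on a thread that has already exited,
 * matching the gdb output above. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    printf("thread\n");   /* the "thread" line seen in the gdb session */
    return NULL;          /* thread exits before main cancels it */
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, worker, NULL);
    sleep(1);             /* give the thread time to exit */
    pthread_cancel(tid);  /* crashed here inside a job on the affected nodes */
    pthread_join(tid, NULL);
    return 0;
}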
Found a difference - jobs have a larger stack size limit than login sessions:

stack size              (kbytes, -s) 1048576

After decreasing it to the RHEL default, the test does not crash anymore:

[tervotom@c3101 ~]$ ulimit -s 8192
[tervotom@c3101 ~]$ ./slurm_bug16063_test
thread
[tervotom@c3101 ~]$ echo $?
0
Tommi,

I haven't shared the patch with you yet because we had an internal discussion on the issue and changed our approach to the potential fix. I'm testing the new version and will hand it over to you when it's ready.

cheers,
Marcin
Tommi,

Besides the segfault, I'd like to understand why the I/O thread hung. Could you please reproduce the issue and run pstack on the slurmstepd process while the job is in the CG/completing state and again after it has completed? If you can, run it every few seconds (in a loop?) and attach the log to the bug report.

cheers,
Marcin
Created attachment 29153 [details]
Dumped gstacks at 2-second intervals until it crashed
Tommi,

If you're able to perform a test, could you please apply attachment 29143 [details] on a single node and retest? This time I expect that slurmstepd will continue to run in your environment for 500s (5 * MessageTimeout) after the job times out or is cancelled. For that time the job should remain in the CG state; when it's done, the node shouldn't drain.

cheers,
Marcin
> If you're able to perform a test, could you please apply attachment 29143
> [details] on a single node and retest?

Hi Marcin/Tim,

The patch seems to be for 23.02, and it does not apply cleanly to 22.05 (*job seems to have been renamed to *step?):

slurm-22.05.8]$ patch -p1 --dry-run < /tmp/bug16063_2302_v3.patch
checking file src/slurmd/slurmstepd/io.c
Hunk #1 FAILED at 1189.
Hunk #2 succeeded at 1471 with fuzz 2 (offset -11 lines).
1 out of 2 hunks FAILED
checking file src/slurmd/slurmstepd/mgr.c
Hunk #1 succeeded at 2375 (offset 26 lines).
Hunk #2 FAILED at 2358.
1 out of 2 hunks FAILED
checking file src/slurmd/slurmstepd/slurmstepd_job.h
Hunk #1 succeeded at 209 (offset -20 lines).

BR,
Tommi
My fault - I'll rebase it and share it with you.
Created attachment 29257 [details]
v3 (22.05)

Tommi,

Please find the rebased patch in the attachment. Testing on the 22.05 code base I see that the node still gets drained. This changes in Slurm 23.02; however, I'll need to study further whether that isn't an unintentional behavior change.

cheers,
Marcin
> Please find the rebased patch in the attachment. Testing on the 22.05 code
> base I see that the node still gets drained. This changes in Slurm 23.02;
> however, I'll need to study further whether that isn't an unintentional
> behavior change.

Hi,

I did a quick test and got similar results - the node drained:

[2023-03-10T11:01:42.008] update_node: node c1101 reason set to: Kill task failed
[2023-03-10T11:01:42.008] update_node: node c1101 state set to DRAINING

Best Regards,
Tommi
Tommi,

Looking at it longer, I think that the fact that the nodes are drained is correct. There are no leftover processes, but the slurmstepd tear-down takes more than the configured UnkillableStepTimeout (180s in your setup). This comes from the fact that slurmstepd cannot communicate with the stopped srun, and the timeouts for this communication are based on MessageTimeout (100s in your setup).

If you increase UnkillableStepTimeout to a value higher than 500s (the mentioned 5 * MessageTimeout) while running with the patch, the node should go back to idle and slurmstepd shouldn't crash.

cheers,
Marcin
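For illustration, a minimal slurm.conf excerpt along those lines (the 100s value comes from this ticket's configuration; 600 is just one example of "higher than 5 * MessageTimeout", not a recommendation from the patch itself):

# Sketch - values based on this ticket's setup, 600 is only an example
MessageTimeout=100
# slurmstepd may spend up to 5 * MessageTimeout = 500s trying to reach the
# stopped srun, so give the step tear-down more headroom than that:
UnkillableStepTimeout=600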
Tommi,

Commit 8c2e171aff[1] was merged to the Slurm 23.02 branch and will be part of the Slurm 23.02.1 release.

We'd like to keep the bug open, since we've found a similar code pattern in another place too.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm
Tommi,

I'm closing the case now. As mentioned before, the issue you faced should already be fixed in Slurm 23.02.1 by 8c2e171af. We kept the ticket open for a while since we had an internal discussion on potential improvements to similar code patterns. Having considered the other options, we decided to keep those places unchanged for now.

cheers,
Marcin