Summary: | Slurmstepd crash | ||
---|---|---|---|
Product: | Slurm | Reporter: | CSC sysadmins <csc-slurm-tickets> |
Component: | slurmstepd | Assignee: | Marcin Stolarek <cinek> |
Status: | RESOLVED FIXED | QA Contact: | Tim Wickberg <tim> |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | tim |
Version: | 22.05.8 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5103, https://bugs.schedmd.com/show_bug.cgi?id=16273 | |
Site: | CSC - IT Center for Science | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | ---
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | | CLE Version: |
Version Fixed: | 23.02.1 23.11pre0 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: | Current configuration; Valgrind log from the stepd crash; test.c; Dumped gstacks with 2sec interval until it crashed; v3 (22.05) | |
Description
CSC sysadmins
2023-02-17 06:26:18 MST
Created attachment 28906 [details]
Current configuration
> (We should have a support contract "CSC - It center for science Finland", I'll check what's going on with managers)
Thanks for letting us know. We have corrected the issue on our end, so you should not encounter issues while logging bugs in the future.
Tommi,

I can't reproduce the segfault, but I see the node getting drained with the "Kill task failed" reason. I assume srun and slurmd are running on different hosts; is the network between them a standard IP over Ethernet connection?

Looking at the code, to reach the line where slurmstepd segfaults we need to spend more than 300 s in a write call from slurmstepd to srun. The goal of this "write" is to inform srun about the I/O connection shutdown (the job reached its time limit). For some reason, even if srun is stopped, this happens immediately in my case (I'm working on VMs, maybe that's the reason). The main thing is still the same: we fail to communicate with srun, since it's stopped, and the whole kill procedure fails, so the node gets drained. I'll check the details to see if we can improve that, but it may be really difficult to differentiate between communication with a stopped process (which still holds the socket connected to the port) and a network issue.

cheers,
Marcin

> I can't reproduce the segfault, but I see the node getting drained with the
> "Kill task failed" reason. I assume srun and slurmd are running on different
> hosts; is the network between them a standard IP over Ethernet connection?

Hi,

That's correct, login node <-> slurmd communication goes over IPoIB. I tried on another, quite similar cluster and I cannot reproduce the segfault there (the biggest difference that comes to mind is the lack of the memory cgroup).

> Looking at the code, to reach the line where slurmstepd segfaults we need to
> spend more than 300 s in a write call from slurmstepd to srun. The goal of
> this "write" is to inform srun about the I/O connection shutdown (the job
> reached its time limit). For some reason, even if srun is stopped, this
> happens immediately in my case (I'm working on VMs, maybe that's the reason).

Yep, it takes ~5 minutes until the core is generated:

[2023-02-21T16:56:52.306] [2030167.0] error: *** STEP 2030167.0 ON c1102 CANCELLED AT 2023-02-21T16:56:52 DUE TO TIME LIMIT ***
[2023-02-21T16:59:36.058] error: run_command: health_check poll timeout @ 60000 msec
[2023-02-21T17:00:52.000] [2030167.0] error: *** STEP 2030167.0 STEPD TERMINATED ON c1102 AT 2023-02-21T17:00:51 DUE TO JOB NOT ENDING WITH SIGNALS ***

-rw------- 1 root root 8110080 Feb 21 17:01 core.700073

BR,
Tommi

Tommi,

> That's correct, login node <-> slurmd communication goes over IPoIB

[...] so it looks like the mentioned write[1] hangs when the backend is IPoIB. I modified the code to trigger the segfaulting pthread_cancel (from comment 0), but it worked fine for me. The cancel operation depends on reaching a pthread cancellation point, which may also behave differently on your backend.

Just to clarify - we're in agreement that the node getting drained is desired behavior, and the only issue is that we run into a segmentation fault?

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-22-05-8-1/src/common/eio.c#L229

> Just to clarify - we're in agreement that the node getting drained is
> desired behavior, and the only issue is that we run into a segmentation fault?
Well, it's quite annoying that users can easily drain the cluster without even realizing it. Suspending an "srun -N200" script has quite a big impact, and IMO the drain reason is wrong: slurmd has killed all user tasks on the compute node, so it could resume and accept jobs.
BR,
Tommi
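For illustration only - this is not Slurm's code, and all names here are invented - the following standalone sketch shows the pattern Marcin describes above: an I/O thread blocked in write() toward a peer that never drains the connection, cancelled from the main thread once a timeout expires. write() is a POSIX cancellation point, so the cancel only takes effect once the thread is actually parked in that call.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int pipefd[2];

/* Writer that eventually blocks because nobody reads the other end,
 * standing in for slurmstepd writing to a stopped srun. */
static void *io_thread(void *arg)
{
    char buf[4096];

    memset(buf, 'x', sizeof(buf));
    for (;;)
        if (write(pipefd[1], buf, sizeof(buf)) < 0)
            break;
    return NULL;
}

int main(void)
{
    pthread_t tid;

    if (pipe(pipefd))
        return 1;
    pthread_create(&tid, NULL, io_thread, NULL);

    sleep(2); /* stand-in for the ~300 s the stepd spends waiting */

    /* Cancel the thread stuck in write() (a cancellation point). */
    pthread_cancel(tid);
    pthread_join(tid, NULL);
    puts("io thread cancelled while blocked in write()");
    return 0;
}
```

Built with `gcc -pthread`, this exits cleanly on a typical system; the issue discussed in this ticket is that the equivalent pthread_cancel() inside slurmstepd segfaults in Tommi's environment.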
Created attachment 29022 [details]
Valgrind log from the stepd crash
Tommi,

I have a draft patch for the slurmstepd segfault you faced, but since I can't reproduce it I don't have a chance to actually verify that it works. Do you have an environment where you can apply it and test? The patch didn't pass our QA, but you can apply it on a single node.

Let me know.

cheers,
Marcin

> Do you have an environment where you can apply it and test? The patch didn't
> pass our QA, but you can apply it on a single node.
Hi,
Yes, I can try it on a single node.
Best Regards,
Tommi
Created attachment 29121 [details]
test.c
Hey Tommi -
Do you mind compiling and running this on one of the afflicted compute nodes? While I can see that what we're doing here is non-standard, I'm just curious if this simple test script reproduces the problem on your systems.
I'm guessing there's something different about your libc (specifically pthread implementation) that's surfacing this issue now - can you share some details on what distro is on this system?
thanks,
- Tim
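The test.c attachment itself is not inlined in this thread. Judging from the gdb session in the next comment (a short-lived thread that prints "thread" and exits, followed by a crash inside pthread_cancel() called from main()), a reproducer of roughly this shape seems to be what is meant; treat the sketch below as a guess, not the actual attachment:

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    printf("thread\n");
    return NULL;          /* the thread exits almost immediately */
}

int main(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, worker, NULL))
        return 1;
    sleep(1);             /* let the thread exit before cancelling */
    pthread_cancel(tid);  /* segfaults here on the affected nodes */
    pthread_join(tid, NULL);
    return 0;
}
```

Cancelling a thread that has already exited but has not yet been joined is legal yet unusual, which is presumably the "non-standard" part Tim refers to.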
> Do you mind compiling and running this on one of the afflicted compute
> nodes? While I can see that what we're doing here is non-standard, I'm just
> curious if this simple test script reproduces the problem on your systems.
>
> I'm guessing there's something different about your libc (specifically
> pthread implementation) that's surfacing this issue now - can you share some
> details on what distro is on this system?

Hi,

This test does not crash on the login node (similar RHEL 8.6 as the compute nodes), but as a job it crashes:

Reading symbols from ./slurm_bug16063_test...(no debugging symbols found)...done.
(gdb) r
Starting program: /users/tervotom/slurm_bug16063_test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffbf6e9700 (LWP 3544630)]
thread
[Thread 0x7fffbf6e9700 (LWP 3544630) exited]

Thread 1 "slurm_bug16063_" received signal SIGSEGV, Segmentation fault.
0x00007fffbfabeb44 in pthread_cancel () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fffbfabeb44 in pthread_cancel () from /lib64/libpthread.so.0
#1  0x00000000004007c1 in main ()
(gdb) quit
A debugging session is active.
Inferior 1 [process 3544626] will be killed.
Quit anyway? (y or n) y

[tervotom@c3101 ~]$ rpm -q glibc
glibc-2.28-189.5.el8_6.x86_64
[tervotom@c3101 ~]$ uname -a
Linux c3101.mahti.csc.fi 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Fri Oct 7 12:35:10 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
[tervotom@c3101 ~]$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.6"

Found a difference: jobs have a larger stack limit than logins:

stack size (kbytes, -s) 1048576

After decreasing it to the RHEL default the test does not crash anymore:

[tervotom@c3101 ~]$ ulimit -s 8192
[tervotom@c3101 ~]$ ./slurm_bug16063_test
thread
[tervotom@c3101 ~]$ echo $?
0

Tommi,

I didn't share the patch with you since we had an internal discussion on the issue. We changed the approach to the potential patch; I'm testing the new version and will hand it over to you when it's ready.

cheers,
Marcin

Tommi,

Besides the segfault, I'd like to understand why the I/O thread hung. Could you please reproduce the issue and call pstack on the slurmstepd process while the job is in the CG/completing state and after it has completed? If you can, call it every few seconds (in a loop?) and attach the log to the bug report.

cheers,
Marcin

Created attachment 29153 [details]
Dumped gstacks with 2sec interval until it crashed
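As a small aside on the stack-limit observation above: the following sketch (not part of the original thread) simply prints RLIMIT_STACK, the programmatic equivalent of `ulimit -s`, and can be run inside a job step to confirm the ~1 GiB limit that correlates with the crash versus the 8 MiB login-node default.

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl))
        return 1;
    /* RLIM_INFINITY prints as a very large number. */
    printf("RLIMIT_STACK soft=%llu hard=%llu bytes\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}
```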
Tommi,

If you're able to perform a test, could you please apply attachment 29143 [details] on a single node and retest?

I expect that this time slurmstepd will continue to run in your environment for 500 s (5 * MessageTimeout) after the job times out or is cancelled. For that time the job should remain in the CG state; once it's done, the node shouldn't drain.
cheers,
Marcin
> If you're able to perform a test, could you please apply attachment 29143
> [details] on a single node and retest?

Hi Marcin/Tim,

The patch seems to be for 23.02 and does not apply cleanly to 22.05 (*job seems to have been renamed to *step?):

slurm-22.05.8]$ patch -p1 --dry-run < /tmp/bug16063_2302_v3.patch
checking file src/slurmd/slurmstepd/io.c
Hunk #1 FAILED at 1189.
Hunk #2 succeeded at 1471 with fuzz 2 (offset -11 lines).
1 out of 2 hunks FAILED
checking file src/slurmd/slurmstepd/mgr.c
Hunk #1 succeeded at 2375 (offset 26 lines).
Hunk #2 FAILED at 2358.
1 out of 2 hunks FAILED
checking file src/slurmd/slurmstepd/slurmstepd_job.h
Hunk #1 succeeded at 209 (offset -20 lines).

BR,
Tommi

My fault, I'll rebase that and share it with you.

Created attachment 29257 [details]
v3 (22.05)
Tommi,

Please find the rebased patch in the attachment. Testing on the 22.05 code base, I see that the node still gets drained. This changed in Slurm 23.02; however, I'll need to study further whether it isn't an unintentional behavior change.
cheers,
Marcin
> Please find the rebased patch in the attachment. Testing on the 22.05 code
> base, I see that the node still gets drained. This changed in Slurm 23.02;
> however, I'll need to study further whether it isn't an unintentional
> behavior change.
Hi,
I did a quick test and got similar results; the node was drained.
[2023-03-10T11:01:42.008] update_node: node c1101 reason set to: Kill task failed
[2023-03-10T11:01:42.008] update_node: node c1101 state set to DRAINING
Best Regards,
Tommi
Tommi,

Looking at it longer, I think that the fact that the nodes are drained is correct. There are no leftover processes, but the slurmstepd teardown takes longer than the configured UnkillableStepTimeout (180 s in your setup). This comes from the fact that slurmstepd cannot communicate with the stopped srun, and the timeouts for this communication are based on MessageTimeout (100 s in your setup). If you increase UnkillableStepTimeout to a value higher than 500 s (the mentioned 5 * MessageTimeout) while running with the patch, the node should get back to idle and slurmstepd shouldn't crash.

cheers,
Marcin

Tommi,

Commit 8c2e171aff[1] was merged to the Slurm 23.02 branch and will be part of the Slurm 23.02.1 release. We'd like to keep the bug open since we've found a similar code pattern in another place too.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm

Tommi,

I'm closing the case now; as mentioned before, the issue you faced should already be fixed in Slurm 23.02.1 by 8c2e171af. We kept the ticket open for a while since we had an internal discussion on potential improvements in similar code patterns. Having considered other options, we decided to keep those places unchanged for now.

cheers,
Marcin
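For reference, a sketch of the timeout relationship Marcin describes above, expressed as slurm.conf settings; the values are illustrative only, taken from the figures mentioned in this thread (MessageTimeout of 100 s, a teardown of up to 5 * MessageTimeout):

```
# Sketch only - tune to the site's needs.
# With the patch, slurmstepd's teardown can wait up to 5 * MessageTimeout
# for the stopped srun, so UnkillableStepTimeout must exceed that window
# or the node is drained with "Kill task failed".
MessageTimeout=100
UnkillableStepTimeout=600   # > 5 * 100 s
```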