Good morning, Slurm Gurus! We are contacting you about a new Slurm behavior that we started seeing after upgrading to version 21.08.8-2. Since the upgrade we have been finding unusual messages in the system logs on our head/login node:

Aug 19 12:15:53 <login_node> srun: *** Error in `/usr/local/slurm/bin/srun': corrupted size vs. prev_size: 0x000000000187d460 ***
Aug 19 12:15:59 <login_node> srun: *** Error in `/usr/local/slurm/bin/srun': double free or corruption (fasttop): 0x0000000000c4d320 ***

Nothing else has changed in our configuration; we only upgraded Slurm, and then these started appearing in the logs. We have been unable to reproduce the errors intentionally, but they seem to occur at the end of interactive sessions and are always followed by a log message of this type:

Aug 19 12:15:59 <login_node> systemd-logind[1059]: Removed session 85704.

We upgraded Slurm on 2022-08-08, and since then the above messages have appeared in the system log 1938 times (and counting).

We compared srun.c from version 20.11.7 with srun.c from version 21.08.8-2 and noticed some (extensive?) refactoring of the code between the two versions. We also built a separate Slurm install tree without optimizations and ran some srun tests under valgrind to try to get more information about the error, without much success. Hence we are reaching out to the gurus for help.

Please let us know what additional information we can provide to aid in resolving this issue. Thank you!
Would you elaborate further on your upgrade? Is the entire cluster on 21.08.8-2 (slurmctld, slurmdbd, slurmd, client commands, and submit hosts)?

The build you are on contains the fixes for a number of security vulnerabilities patched earlier this year. These have been assigned CVE-2022-29500, CVE-2022-29501, and CVE-2022-29502.

https://groups.google.com/g/slurm-users/c/eBoNtkYDE6A/m/WnKwFbXcEAAJ

With regard to the error you are seeing, can you provide steps and a reproducer? For example, are you running srun from a login node, or inside a salloc or other job allocation?

Please also attach your slurm.conf for review.
Created attachment 26414 [details] slurm.conf
(In reply to Jason Booth from comment #1)
> Would you elaborate further on your upgrade? Is the entire cluster on
> 21.08.8-2 (slurmctld, slurmdbd, slurmd, client commands and submit hosts)?

Yes, we always upgrade all of the Slurm components together; everything is at version 21.08.8-2.

> The build you are on contains the fixes for a number of security
> vulnerabilities patched earlier this year.
>
> These have been assigned CVE-2022-29500, CVE-2022-29501, and CVE-2022-29502.
>
> https://groups.google.com/g/slurm-users/c/eBoNtkYDE6A/m/WnKwFbXcEAAJ

Yes, and we applied the SchedMD security megapatch to our previous version, 20.11.7, and ran with that for several weeks without this issue appearing.

> With regard to the error you are seeing, can you provide steps and a
> reproducer?

Unfortunately, no, we are not able to reproduce the error at will, but for some reason it seems to have no problem occurring on its own.

> For example, are you running srun from a login node or inside a salloc or
> other job allocation?

Yes, this error involves running srun on the login node only. It does not seem to occur when srun is called from within MPI jobs; at least, we do not see any srun errors in the compute node log files. Also, we have seen the error occur both when running srun standalone and inside salloc.

> Please also attach your slurm.conf for review.

Please see the attached slurm.conf. Thank you!
Update: in case it helps, we managed to find some core dumps and checked two of them for consistency. They both had the same backtrace output from gdb:

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/slurm-21.08/slurm-21.08.8-2/bin/srun...done.
[New LWP 110474]
[New LWP 64620]
[New LWP 110487]
[New LWP 110486]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/slurm/bin/srun --cpus-per-task=4 --mem=16g --pty --preserve-env --x1'.
Program terminated with signal 6, Aborted.
#0  0x00002ba9790d0387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64 lz4-1.8.3-1.el7.x86_64
(gdb) bt
#0  0x00002ba9790d0387 in raise () from /lib64/libc.so.6
#1  0x00002ba9790d1a78 in abort () from /lib64/libc.so.6
#2  0x00002ba979112f67 in __libc_message () from /lib64/libc.so.6
#3  0x00002ba97911b329 in _int_free () from /lib64/libc.so.6
#4  0x00002ba9788f2530 in slurm_xfree (item=item@entry=0x7ffdd97e75b8) at xmalloc.c:213
#5  0x00002ba9787e24d0 in _free_io_buf (ptr=<optimized out>) at step_io.c:981
#6  0x00002ba9788352c2 in list_destroy (l=<optimized out>) at list.c:194
#7  0x00002ba9787e4987 in client_io_handler_destroy (cio=<optimized out>) at step_io.c:1224
#8  0x00002ba9787e8a8d in slurm_step_launch_wait_finish (ctx=0x8768b0) at step_launch.c:810
#9  0x00002ba97820a2cd in launch_p_step_wait (job=0x87a300, got_alloc=<optimized out>, opt_local=0x421ea0 <opt>) at launch_slurm.c:896
#10 0x000000000040ce67 in launch_g_step_wait (job=job@entry=0x87a300, got_alloc=got_alloc@entry=false, opt_local=opt_local@entry=0x421ea0 <opt>) at launch.c:704
#11 0x0000000000408beb in _launch_one_app (data=<optimized out>) at srun.c:286
#12 0x000000000040a2e7 in _launch_app (got_alloc=false, srun_job_list=0x0, job=0x87a300) at srun.c:587
#13 srun (ac=8, av=<optimized out>) at srun.c:217
#14 0x000000000040a78d in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17
Thanks for the updated information regarding the core dumps; that was really helpful. This looks like a duplicate of bug 13418, which unfortunately is private, but here is the commit that fixed it:

https://github.com/SchedMD/slurm/commit/613da29e7c

One reproducer was to press Ctrl-C while in srun. Do your users signal srun with SIGSTOP or SIGTERM while it is running? The fix is in 22.05.3, though you can always apply that commit locally if you want.
That's great that the core dump information was useful! We tested starting an interactive session and then pressing Ctrl-C to see if it would reproduce the error. Unfortunately, it did not, nor did it create a core dump that we could locate. It did, however, result in this message on the terminal when exiting the interactive session:

srun: error: cn4291: task 0: Exited with exit code 130

Do you still think this is the same as bug 13418? Thank you!
I do think it is the same, because the backtrace is identical. Do you have a test system on which you can cherry-pick the commit and run the test? The patch does not change the Slurm API, so you'll just need to replace the user commands with the new binaries (the patch doesn't touch the daemons).
Okay, thanks. It sounds like you have identified this as a known issue and are confident that the referenced patch is the fix, so we can either apply the patch to our current version or upgrade to version 22.05.3 to resolve the issue. We will probably apply the patch to the current version for now and do the upgrade a little further down the road. Does that sound like a reasonable approach to you? Thanks again!
Yes, that sounds reasonable. Would you like me to keep this bug open until you're able to patch and test it, or should I close it and you can re-open it if you still see issues?
If you don't mind too much, could we keep the ticket open for now? We will get you an update as soon as we get the patch in. Thank you!
No problem, I'll keep the ticket open and wait for your update.
I'm changing this to a sev-4 while I wait for your update.
Hi, Marshall. We have been running the patched version for the past week without any further occurrences of the errors in the log file, and without any core dumps. It seems that fixed the bug and we are now bug-free! :-) We can probably go ahead and close the ticket now. Thank you all for your help, and stay safe!
Great! Closing as a duplicate of bug 13418.

*** This ticket has been marked as a duplicate of ticket 13418 ***