Ticket 14804 - New srun errors
Summary: New srun errors
Status: RESOLVED DUPLICATE of ticket 13418
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 21.08.8
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
 
Reported: 2022-08-22 09:32 MDT by rl303f
Modified: 2022-09-12 11:39 MDT
CC: 2 users

See Also:
Site: NIH
Linux Distro: RHEL


Attachments
slurm.conf (5.07 KB, application/x-zip-compressed)
2022-08-22 10:26 MDT, rl303f

Description rl303f 2022-08-22 09:32:08 MDT
Good morning, Slurm Gurus! 

We are contacting you to inquire about a new slurm behavior that we just
started seeing upon upgrading to version 21.08.8-2.  Since the upgrade we
began finding unusual messages in the system logs on our head/login node:

Aug 19 12:15:53 <login_node> srun: *** Error in `/usr/local/slurm/bin/srun': corrupted size vs. prev_size: 0x000000000187d460 ***
Aug 19 12:15:59 <login_node> srun: *** Error in `/usr/local/slurm/bin/srun': double free or corruption (fasttop): 0x0000000000c4d320 ***

Nothing else changed in our configuration; we upgraded slurm and then
these messages started appearing in the logs.  We have been unable to
reproduce the errors intentionally, but they seem to occur when interactive
sessions end and are always followed by a log message of this type:

Aug 19 12:15:59 <login_node> systemd-logind[1059]: Removed session 85704.

We upgraded slurm on 2022-08-08 and since that time the above messages have
appeared in the system log 1938 times (and counting).
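A minimal sketch of how such messages can be counted, assuming they land in a flat syslog file (the file path and hostname below are illustrative, not our real ones; the sample lines are copied from the log excerpt above):

```shell
# Simulated excerpt of the system log; the grep pattern matches the
# "srun: *** Error in" lines quoted earlier in this ticket.
cat > /tmp/srun_err_sample.log <<'EOF'
Aug 19 12:15:53 login01 srun: *** Error in `/usr/local/slurm/bin/srun': corrupted size vs. prev_size: 0x000000000187d460 ***
Aug 19 12:15:59 login01 srun: *** Error in `/usr/local/slurm/bin/srun': double free or corruption (fasttop): 0x0000000000c4d320 ***
Aug 19 12:15:59 login01 systemd-logind[1059]: Removed session 85704.
EOF
grep -c 'srun: \*\*\* Error in' /tmp/srun_err_sample.log   # prints 2 for this excerpt
```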

We compared the srun.c file for version 20.11.7 with the srun.c for version
21.08.8-2 and noticed some (extensive?) refactoring of the code between the
two versions.

We also built a separate slurm install tree without optimizations and ran
some srun tests under valgrind in an attempt to get more information about
the error, without much success.  Hence we are reaching out to the gurus
for help.

Please let us know what additional helpful information we can provide to aid
in resolving this issue.

Thank you!
Comment 1 Jason Booth 2022-08-22 10:07:24 MDT
Would you elaborate further on your upgrade? Is the entire cluster on 21.08.8-2 (slurmctld, slurmdbd, slurmd, client commands and submit hosts)?

The build you are on contains the fixes for a number of security vulnerabilities patched earlier this year.

These have been assigned CVE-2022-29500,
CVE-2022-29501, and CVE-2022-29502.


https://groups.google.com/g/slurm-users/c/eBoNtkYDE6A/m/WnKwFbXcEAAJ


With regard to the error you are seeing, can you provide steps and a reproducer?

For example, are you running srun from a login node or inside a salloc or other job allocation? Please also attach your slurm.conf for review.
Comment 2 rl303f 2022-08-22 10:26:55 MDT
Created attachment 26414 [details]
slurm.conf
Comment 3 rl303f 2022-08-22 10:27:11 MDT
(In reply to Jason Booth from comment #1)
> Would you elaborate further on your upgrade? Is the entire cluster on
> 21.08.8-2 (slurmctld, slurmdbd, slurmd, client commands and submit hosts)?
> 

Yes, we always upgrade all of the slurm components together; everything is
at version 21.08.8-2. 

> The build you are on contains the fixes for a number of security
> vulnerabilities patched earlier this year.
> 
> These have been assigned CVE-2022-29500,
> CVE-2022-29501, and CVE-2022-29502.
> 
> 
> https://groups.google.com/g/slurm-users/c/eBoNtkYDE6A/m/WnKwFbXcEAAJ
> 

Yes, and we applied the SchedMD security megapatch to our previous version
20.11.7 and ran with that for several weeks without this issue appearing.

> 
> With regard to the error you are seeing, can you provide steps and a
> reproducer?
> 

Unfortunately, no, we are not able to reproduce the error at will, though
it seems to have no trouble occurring on its own.

> For example, are you running srun from a login node or inside a salloc or
> other job allocation? 

Yes, this error involves running srun on the login node only.  It does not
seem to occur when srun is called from within MPI jobs; at least we do not
see any srun errors in the compute node log files.

Also, we have seen the error occur both when running srun standalone and inside salloc.

> Please also attach your slurm.conf for review.

Please see attached slurm.conf.

Thank you!
Comment 4 rl303f 2022-08-22 14:21:27 MDT
Update: in case it helps, we managed to find some core dumps and checked
two of them for consistency.  They both seemed to have the same backtrace
output from gdb:

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/slurm-21.08/slurm-21.08.8-2/bin/srun...done.
[New LWP 110474]
[New LWP 64620]
[New LWP 110487]
[New LWP 110486]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/slurm/bin/srun --cpus-per-task=4 --mem=16g --pty --preserve-env --x1'.
Program terminated with signal 6, Aborted.
#0  0x00002ba9790d0387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64 lz4-1.8.3-1.el7.x86_64
(gdb)
(gdb)
(gdb) bt
#0  0x00002ba9790d0387 in raise () from /lib64/libc.so.6
#1  0x00002ba9790d1a78 in abort () from /lib64/libc.so.6
#2  0x00002ba979112f67 in __libc_message () from /lib64/libc.so.6
#3  0x00002ba97911b329 in _int_free () from /lib64/libc.so.6
#4  0x00002ba9788f2530 in slurm_xfree (item=item@entry=0x7ffdd97e75b8) at xmalloc.c:213
#5  0x00002ba9787e24d0 in _free_io_buf (ptr=<optimized out>) at step_io.c:981
#6  0x00002ba9788352c2 in list_destroy (l=<optimized out>) at list.c:194
#7  0x00002ba9787e4987 in client_io_handler_destroy (cio=<optimized out>) at step_io.c:1224
#8  0x00002ba9787e8a8d in slurm_step_launch_wait_finish (ctx=0x8768b0) at step_launch.c:810
#9  0x00002ba97820a2cd in launch_p_step_wait (job=0x87a300, got_alloc=<optimized out>, opt_local=0x421ea0 <opt>) at launch_slurm.c:896
#10 0x000000000040ce67 in launch_g_step_wait (job=job@entry=0x87a300, got_alloc=got_alloc@entry=false, opt_local=opt_local@entry=0x421ea0 <opt>) at launch.c:704
#11 0x0000000000408beb in _launch_one_app (data=<optimized out>) at srun.c:286
#12 0x000000000040a2e7 in _launch_app (got_alloc=false, srun_job_list=0x0, job=0x87a300) at srun.c:587
#13 srun (ac=8, av=<optimized out>) at srun.c:217
#14 0x000000000040a78d in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17
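The abort in frames #0 through #4 is glibc's heap-consistency check firing inside free().  As a minimal illustration of that failure class (a sketch only, not Slurm code), a program that frees the same pointer twice typically dies the same way, with the allocator printing a diagnostic and raising SIGABRT:

```shell
# Write, compile, and run a tiny C program that frees the same pointer
# twice; modern glibc detects this and aborts, producing the same
# "double free or corruption" style of message seen in this ticket.
cat > /tmp/double_free_demo.c <<'EOF'
#include <stdlib.h>
int main(void) { char *p = malloc(16); free(p); free(p); return 0; }
EOF
cc -o /tmp/double_free_demo /tmp/double_free_demo.c
status=0
/tmp/double_free_demo 2>/dev/null || status=$?
echo "demo exit status: $status"   # nonzero: the allocator aborted the process
```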
Comment 5 Marshall Garey 2022-08-22 14:36:08 MDT
Thanks for the updated information regarding the core dumps - that was really helpful. This looks like a duplicate of bug 13418, which unfortunately is private, but here's the commit that fixed it:

https://github.com/SchedMD/slurm/commit/613da29e7c

A reproducer was to press ctrl-c while in srun. Do your users signal srun with SIGSTOP or SIGTERM while it is running?

The fix is in 22.05.3, though you can always apply that commit locally if you want.
Comment 6 rl303f 2022-08-23 12:09:13 MDT
That's great that the core dump information was useful!

We tried starting an interactive session and then pressing ctrl-c to see
if it would reproduce the error.  Unfortunately, it did not, nor did it
create a core dump that we could locate.  It did, however, print this
message to the terminal when the interactive session exited:

srun: error: cn4291: task 0: Exited with exit code 130

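(As an aside, exit code 130 follows the usual shell convention of 128 + signal number: SIGINT is signal 2, so a task terminated by ctrl-c reports 128 + 2 = 130.  A quick standalone way to see the convention, unrelated to Slurm itself:)

```shell
# A child shell sends itself SIGINT (signal 2) and dies from it; the
# parent shell then reports exit status 128 + 2 = 130.
rc=0
sh -c 'kill -INT $$' || rc=$?
echo "exit status: $rc"   # prints "exit status: 130"
```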
Do you still think this is the same as bug 13418?

Thank you!
Comment 7 Marshall Garey 2022-08-23 12:15:48 MDT
I do think it is the same because the backtrace is identical. Do you have a test system in which you can cherry pick the commit and run the test? The patch changes the Slurm API, so you'll just need to replace the user commands with the new binaries (the patch doesn't touch the daemons).
Comment 8 rl303f 2022-08-23 14:32:58 MDT
Okay, thanks.  It sounds like you were able to identify this as a known issue
and are confident that the referenced patch is the fix.  So we can either apply
the patch to our current version or upgrade to version 22.05.3 to resolve the
issue.

We will probably go ahead and patch the current version for now and then do the
upgrade a little more down the road.

Does that sound like a reasonable approach to you?

Thanks again!
Comment 9 Marshall Garey 2022-08-23 15:48:18 MDT
Yes, that sounds reasonable.

Would you like me to keep this bug open until you're able to patch and test it, or should I close it and you can re-open it if you still see issues?
Comment 10 rl303f 2022-08-24 14:41:45 MDT
If you don't mind too much, could we keep the ticket open for now?

We will get you an update as soon as we get the patch in.

Thank you!
Comment 11 Marshall Garey 2022-08-24 14:44:52 MDT
No problem, I'll keep the ticket open and wait for your update.
Comment 12 Marshall Garey 2022-09-06 11:22:21 MDT
I'm changing this to a sev-4 while I wait for your update.
Comment 13 rl303f 2022-09-12 11:31:57 MDT
Hi, Marshall.

We have been running the patched version for the past week without any further
occurrences of errors in the log file nor core dumps.  It seems like that fixed
the bug and we are now bug-free!  :-) 

We can probably go ahead and close the ticket now.

Thank you all for your help and stay safe!
Comment 14 Marshall Garey 2022-09-12 11:39:33 MDT
Great! Closing as a duplicate of bug 13418.

*** This ticket has been marked as a duplicate of ticket 13418 ***