Created attachment 8939 [details] Slurm3 verbose slurmctld output
Hello,

Today we noticed that both of the slurmctld servers that serve one of our clusters had stopped. Upon trying to restart the slurmctld service on both servers, we were met with segfaults from slurmctld shortly before it tried to schedule jobs. We have not upgraded Slurm or the controllers recently, so this is a bit concerning. We have tried reinstalling all Slurm packages, but the problem persists. The only log message we get when the segfault occurs is:

Jan 16 12:36:13 slurm3 kernel: srvcn[1919]: segfault at 58 ip 00000000004aab12 sp 00007f1a8aabe8b0 error 4 in slurmctld[400000+df000]

Attached you'll find the output of /usr/sbin/slurmctld -D -vvv on slurm3. At this point we would just like to know how we can get the slurmctld process started without it segfaulting.
I'll need to see the backtrace from the coredump with gdb.

$ gdb path/to/the/slurmctld/binary path/to/the/core

Then inside of gdb:

(gdb) thread apply all bt
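If the crash didn't leave a core file, a minimal sketch for capturing one from a foreground run (assuming a typical Linux setup; the slurmctld path is a placeholder and the daemon line is commented out since it requires a Slurm install):

```shell
# Raise the core-size limit in the shell that will run the daemon, so a crash
# leaves a core file in the working directory (or wherever
# /proc/sys/kernel/core_pattern points).
ulimit -c unlimited
ulimit -c                      # confirm the new limit; should print "unlimited"
# /usr/sbin/slurmctld -D -vvv  # then reproduce the crash
```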
Trying to get GDB installed on the host currently. As another data point, we just rebuilt the other controller server from scratch, and it is still showing the same behavior of segfaulting shortly after being started:

Jan 16 13:28:18 slurm4 kernel: srvcn[2015]: segfault at 58 ip 00000000004aab12 sp 00007f1295ba98b0 error 4 in slurmctld[400000+df000]
Jan 16 13:28:18 slurm4 systemd: slurmctld.service: main process exited, code=killed, status=11/SEGV
Jan 16 13:28:18 slurm4 systemd: Unit slurmctld.service entered failed state.
Jan 16 13:28:18 slurm4 systemd: slurmctld.service failed.
Unfortunately, I have no idea how to fix the segfault without a backtrace; the log file doesn't tell me where it is. Most likely, you've hit a segfault that has already been fixed. Here's a list from our NEWS file of segfault fixes that went into 17.11 after 17.11.7:

17.11.10:
 -- Fix srun segfault caused by invalid memory reads on the env. (6230489d7ad, Danny Auble, 2018-08-21)
 -- Fix segfault on job arrays when starting controller without dbd up. (eab9f4052c9, Brian Christiansen, 2018-08-22)

17.11.9-2:
 -- Fix invalid read (segfault) when sorting multi-partition jobs. (21d2ab6ed16, Danny Auble, 2018-08-10)

17.11.9:
 -- Fix segfault in slurmctld when a job's node bitmap is NULL during a (fef07a40972, Dominik Bartkiewicz, 2018-07-27)

17.11.8:
 -- Fix potential segfault when closing the mpi/pmi2 plugin. (d10854d99d6, Danny Auble, 2018-07-09)
It's possible the segfault could be fixed by upgrading to the latest version of 17.11, though I can't say for sure without a backtrace. I'm happy to take a look at the backtrace as soon as you have it.
Is your slurmdbd up? At least one of the segfaults was caused by starting slurmctld without the slurmdbd already started.
Created attachment 8945 [details] slurmctld gdb output
I have attached the gdb output for the core file produced by slurmctld -D -vvv.
Slurmdbd is up and working correctly; it also serves as the slurmdbd for our other cluster, Summit, which has not had any issues.
Created attachment 8946 [details] Avoid the segfault

This looks like a duplicate of a few other bugs that reported a segfault in the same place. My guess is that it was fixed by commit fef07a40972. However, this patch should get you past the segfault and allow the slurmctld to continue running. Can you let us know if it works?

I also advise upgrading to the latest version of 17.11 (and removing the patch I've provided) so you can avoid the cause of this segfault in the future.
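The actual patch is the attachment above. As a rough illustration only (the function and the crashing line come from the backtraces in this thread; the exact guard placement and wording here are my sketch, not the literal attachment), the change is a guard of this shape in _step_dealloc_lps():

```diff
--- a/src/slurmctld/step_mgr.c
+++ b/src/slurmctld/step_mgr.c
@@ in _step_dealloc_lps() @@
+	/* hypothetical guard: bail out instead of dereferencing NULL */
+	if (!job_resrcs_ptr)
+		return;
 	i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
```

The idea is simply to skip the step deallocation when a job record has no job_resrcs attached, rather than crashing on the NULL dereference.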
That patch did not alleviate our issues.
I assume you applied it to slurmctld and restarted the slurmctld? Using the new slurmctld binary and the new coredump, can you upload the output of the following from gdb:

(gdb) thread apply all bt
(gdb) bt full
(gdb) frame 1
(gdb) p job_resrcs_ptr
(gdb) p *job_resrcs_ptr
Sorry, I messed up in that last post. Instead of "frame 1" I want "thread 1" - just make sure you're in the thread that has _step_dealloc_lps. I really want it from frame 0.
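Putting the correction together, the full session to capture would be:

```
(gdb) thread apply all bt
(gdb) bt full
(gdb) thread 1      # switch to whichever thread shows _step_dealloc_lps in its bt
(gdb) frame 0
(gdb) p job_resrcs_ptr
(gdb) p *job_resrcs_ptr
```

(The "thread 1" here is a placeholder; use the number of the thread whose backtrace contains _step_dealloc_lps.)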
Created attachment 8947 [details] slurmctld still segfaulting on latest 17.11.12
Just posted the gdb output you asked for, but for slurm-17.11.12: the patch failed, and we wanted to see how the most recent version fared as well, since you mentioned that there were many changes resolving conditions that caused segfaults. It appears, however, that the latest version does not fix our issue.
I think I see what happened. The commit that prevents the segfault from occurring doesn't actually stop it from happening once you're already in that situation (a job_ptr with job_resrcs == NULL). So you'll need to apply my patch on top of whatever 17.11 you're on (which I can see isn't applied in that backtrace), wait a while to ensure that the jobs in that situation are flushed out of the queue, and then you can safely remove my patch.

If it still segfaults with my patch, which you implied it did, please upload the output of the following from gdb, since it may be segfaulting in a different place:

thread apply all bt
bt full
Created attachment 8949 [details] slurmctld gdb output after adding patch to 17.11.12
Still had slurmctld segfault, attached the new gdb output
John, it looks like that binary doesn't have the patch, since the segfault is on the same line number, in the same file and function. My patch pushed that line back by 2 lines - it would be line 2081.

(gdb) bt full
#0 _step_dealloc_lps (step_ptr=0x1251740) at step_mgr.c:2081

Are you certain that you started slurmctld with the new binary (with the patch)? Or that the build didn't fail? Or that you aren't using the wrong coredump?
Sorry, it would be line 2083 with my patch. It's already line 2081
This may seem a silly question, but did you recompile after applying the patch?
Created attachment 8950 [details] compressed directory being used to build packages from WITH THE PATCH

Just so we are all on the same page: I applied the patch after decompressing the source. I then repackaged it as a .tar.gz file and used the following command to build the RPMs:

rpmbuild -tb slurm-17.11.12.tar.bz2 --with lua --with mysql

This produced packages without issue. After the packages were built, we removed the old packages from our central repo, copied over these new packages, and then rebuilt the repo.

We then went on to both controllers and did a yum remove of ALL slurm packages on the nodes. Afterwards we issued a yum clean all and then installed the 17.11.12 packages once again.
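The patch-and-repack step can be sketched like this (demonstrated on a dummy tree so it is self-contained; real use would start from the SchedMD tarball and apply the attached patch with `patch -p1 < avoid_segfault.patch` instead of the stand-in sed edit):

```shell
set -e
work=$(mktemp -d) && cd "$work"

# Stand-in for the pristine upstream tarball.
mkdir -p slurm-17.11.12/src/slurmctld
echo "original line" > slurm-17.11.12/src/slurmctld/step_mgr.c
tar czf pristine.tar.gz slurm-17.11.12
rm -rf slurm-17.11.12

# 1. Unpack; 2. apply the patch (stand-in edit here);
# 3. repack under the exact name that rpmbuild -tb will be given.
tar xzf pristine.tar.gz
(cd slurm-17.11.12 && sed -i 's/original/patched/' src/slurmctld/step_mgr.c)
tar czf slurm-17.11.12.tar.gz slurm-17.11.12

# Verify the patched file really is inside the new tarball:
tar xzOf slurm-17.11.12.tar.gz slurm-17.11.12/src/slurmctld/step_mgr.c
# prints: patched line

# 4. Build from the repacked tarball:
# rpmbuild -tb slurm-17.11.12.tar.gz --with lua --with mysql
```

The verification step matters: building from a stale or differently named tarball silently produces unpatched binaries.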
That process looks like it should work. A couple last things to try, then. Can you start slurmctld manually instead of through systemd?

./path/to/slurmctld -D

and (assuming it segfaults) see if the backtrace is the same or different?

The thing confusing me is that your source clearly has the patch, but the backtrace from the coredump implies it is segfaulting on this line:

if (!job_resrcs_ptr)

which is impossible. It's probably segfaulting on this line:

i_first = bit_ffs(job_resrcs_ptr->node_bitmap);

A colleague of mine reached out to you via email.
I found the issue with the way I was building the RPMs. After fixing that, I am now running into problems building the packages with the patch you gave me:

gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o statistics.o statistics.c
gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o step_mgr.o step_mgr.c
step_mgr.c: In function '_step_dealloc_lps':
step_mgr.c:2122:10: warning: format '%u' expects argument of type 'unsigned int', but argument 6 has type 'const char *' [-Wformat=]
   cpus_alloc, job_node_inx);
          ^
step_mgr.c:2122:10: warning: too many arguments for format [-Wformat-extra-args]
step_mgr.c: In function 'build_extern_step':
step_mgr.c:4706:2: error: expected declaration or statement at end of input
  jobacct_storage_g_step_start(acct_db_conn, step_ptr);
  ^
gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o trigger_mgr.o trigger_mgr.c
step_mgr.c:4706:2: warning: control reaches end of non-void function [-Wreturn-type]
  jobacct_storage_g_step_start(acct_db_conn, step_ptr);
  ^
make[3]: *** [step_mgr.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[3]: Leaving directory `/root/rpmbuild/BUILD/slurm-17.11.12-2/src/slurmctld'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/rpmbuild/BUILD/slurm-17.11.12-2/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/slurm-17.11.12-2'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.jYhpBG (%build)
Created attachment 8951 [details] re-upload of patch for 17.11.12 - avoid segfault

Just as a sanity check, I rebuilt my copy of slurm 17.11.2 with the patch in place and am re-uploading it here.
*17.11.12, not 17.11.2 The typos are strong with me tonight.
Yep, caught that I forgot to close a ), typos abound. After recompiling the packages and deploying them things are looking stable now. I am going to keep an eye on it and will update if the controller daemon segfaults again.
Moving this to sev 3 for now since the patch is now applied.
I'm glad it's working and that I'm not insane. I'll keep this open for a few days to make sure the system is stable.
Does SchedMD publish (or have on hand) a recommended patching procedure that we can refer to in our internal documentation, so we aren't second-guessing ourselves when a situation like this arises in the future?

~jonathon
No, we don't have any specific recommended procedure for patching the code. From one of my colleagues:

"SchedMD maintains the open source project only. We do not roll binaries, so it is expected that sites know how to apply patches and compile code. We do not have a procedure since each site is different, e.g. some roll rpms, some create spec files, and others just compile from source."

As a Slurm developer, I always build from source, but I realize that most sites do not. I found this tutorial on maintaining patch files for rpm packages very helpful (as a novice with rpm's), but your mileage may vary:

https://cromwell-intl.com/open-source/rpm-patch.html

Every site being different, we can't recommend one specific procedure; some sites even maintain a list of local patches that aren't in the official github repository.

I recommend against patching the source by hand. Instead, always use the patch file(s) that we upload. That way, if it doesn't compile, it's our fault. :)
Would you say that any patch we receive from SchedMD can be added to the SchedMD-distributed spec file with a line like:

PatchN: avoid_segfault.patch
I guess that just registers the patch as existing; maybe then we'd do:

%prep
#[...]
%patchN -p1
You'd want to put both of those things in the slurm.spec file:

PatchN: avoid_segfault.patch

%prep
#[...]
%patchN -p1

(And of course you also need to have the actual patch file in the correct directory - for me, ~/rpmbuild/SOURCES, alongside the Slurm tarball.) Then you should be able to build and install the packages.

Is the system still stable?
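As a fuller sketch of how those pieces fit together (the patch name and number are illustrative, and the %setup line stands in for whatever the stock slurm.spec already uses):

```
Patch0: avoid_segfault.patch

%prep
%setup -q
%patch0 -p1
```

The ordering matters: the %patchN macro must come after %setup, since %setup is what unpacks the source tree that the patch is applied to.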
It is! Thanks for your help. What release should we expect to see this patch in? ~jonathon
The patch I gave you isn't in a release. Although we never reproduced the exact situation that caused the segfault you hit, we haven't yet seen any customers hit this segfault on the newest versions of 17.11 (I think the latest version where someone reported it is 17.11.7). So we believe it's resolved by the other commits in 17.11 that fix the situations that eventually caused these segfaults. If you do remove the patch and hit the same segfault again, please upload a backtrace from gdb so we can take another look.
I'll close this as resolved/infogiven. You can respond to re-open this issue. We especially want to know if you do hit this segfault again.