Created attachment 8939 [details] Slurm3 verbose slurmctld output
Hello,

Today we noticed that both of the slurmctld servers that serve one of our clusters had stopped. Upon trying to restart the slurmctld service on both servers, we were met with segfaults from slurmctld shortly before it tried to schedule jobs. We have not upgraded Slurm or the controllers recently, so this is a bit concerning. We have tried reinstalling all Slurm packages, but the problem persists. The only log message we get when the segfault occurs is:

Jan 16 12:36:13 slurm3 kernel: srvcn[1919]: segfault at 58 ip 00000000004aab12 sp 00007f1a8aabe8b0 error 4 in slurmctld[400000+df000]

Attached you'll find the output of /usr/sbin/slurmctld -D -vvv on slurm3. At this point we would just like to know how we can get the slurmctld process started without it segfaulting.
I'll need to see the backtrace from the coredump with gdb.

$ gdb path/to/the/slurmctld/binary path/to/the/core

Then inside of gdb:

(gdb) thread apply all bt
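If the crash didn't leave a core file, a minimal sketch for capturing one from a foreground run (assuming a typical Linux setup; the slurmctld path is a placeholder and the daemon line is commented out since it requires a Slurm install):

```shell
# Raise the core-size limit in the shell that will run the daemon, so a crash
# leaves a core file in the working directory (or wherever
# /proc/sys/kernel/core_pattern points).
ulimit -c unlimited
ulimit -c                      # confirm the new limit; should print "unlimited"
# /usr/sbin/slurmctld -D -vvv  # then reproduce the crash
```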
Trying to get GDB installed on the host currently. As another data point, we just rebuilt the other controller server from scratch, and it is still showing the same behavior of segfaulting shortly after being started:

Jan 16 13:28:18 slurm4 kernel: srvcn[2015]: segfault at 58 ip 00000000004aab12 sp 00007f1295ba98b0 error 4 in slurmctld[400000+df000]
Jan 16 13:28:18 slurm4 systemd: slurmctld.service: main process exited, code=killed, status=11/SEGV
Jan 16 13:28:18 slurm4 systemd: Unit slurmctld.service entered failed state.
Jan 16 13:28:18 slurm4 systemd: slurmctld.service failed.
Unfortunately, I have no idea how to fix the segfault without a backtrace; the log file doesn't tell me where it is. Most likely, you've hit a segfault that has already been fixed. Here's a list from our NEWS file of segfault fixes that went into 17.11 after 17.11.7:

17.11.10:
 -- Fix srun segfault caused by invalid memory reads on the env. (6230489d7ad, Danny Auble, 2018-08-21)
 -- Fix segfault on job arrays when starting controller without dbd up. (eab9f4052c9, Brian Christiansen, 2018-08-22)

17.11.9-2:
 -- Fix invalid read (segfault) when sorting multi-partition jobs. (21d2ab6ed16, Danny Auble, 2018-08-10)

17.11.9:
 -- Fix segfault in slurmctld when a job's node bitmap is NULL during a (fef07a40972, Dominik Bartkiewicz, 2018-07-27)

17.11.8:
 -- Fix potential segfault when closing the mpi/pmi2 plugin. (d10854d99d6, Danny Auble, 2018-07-09)
It's possible the segfault could be fixed by upgrading to the latest version of 17.11, though I can't say for sure without a backtrace. I'm happy to take a look at the backtrace as soon as you have it.
Is your slurmdbd up? At least one of the segfaults was caused by starting slurmctld without the slurmdbd already started.
Created attachment 8945 [details] slurmctld gdb output
I have attached the gdb output for the core file produced by slurmctld -D -vvv.
Slurmdbd is up and working correctly; it also serves as the slurmdbd for our other cluster, Summit, which has not had any issues.
Created attachment 8946 [details] Avoid the segfault

This looks like a duplicate of a few other bugs that reported a segfault in the same place. My guess is that it was fixed by commit fef07a40972. However, this patch should get you past the segfault and allow the slurmctld to continue running. Can you let us know if it works?

I also advise upgrading to the latest version of 17.11 (and removing the patch I've provided) so you can avoid the cause of this segfault in the future.
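The actual patch is the attachment above. As a rough illustration only (the function and the crashing line come from the backtraces in this thread; the exact guard placement and wording here are my sketch, not the literal attachment), the change is a guard of this shape in _step_dealloc_lps():

```diff
--- a/src/slurmctld/step_mgr.c
+++ b/src/slurmctld/step_mgr.c
@@ in _step_dealloc_lps() @@
+	/* hypothetical guard: bail out instead of dereferencing NULL */
+	if (!job_resrcs_ptr)
+		return;
 	i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
```

The idea is simply to skip the step deallocation when a job record has no job_resrcs attached, rather than crashing on the NULL dereference.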
That patch did not alleviate our issues.
I assume you applied it to slurmctld and restarted the slurmctld? Using the new slurmctld binary and the new coredump, can you upload the output of the following from gdb:

(gdb) thread apply all bt
(gdb) bt full
(gdb) frame 1
(gdb) p job_resrcs_ptr
(gdb) p *job_resrcs_ptr
Sorry, I messed up in that last post. Instead of "frame 1" I want "thread 1" - just make sure you're in the thread that has _step_dealloc_lps. I really want it from frame 0.
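Putting the correction together, the full session to capture would be:

```
(gdb) thread apply all bt
(gdb) bt full
(gdb) thread 1      # switch to whichever thread shows _step_dealloc_lps in its bt
(gdb) frame 0
(gdb) p job_resrcs_ptr
(gdb) p *job_resrcs_ptr
```

(The "thread 1" here is a placeholder; use the number of the thread whose backtrace contains _step_dealloc_lps.)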
Created attachment 8947 [details] slurmctld still segfaulting on latest 17.11.12
Just posted the gdb output you asked for, but for slurm-17.11.12: the patch failed, and we wanted to see how the most recent version fared as well, since you mentioned that there were many changes resolving conditions that caused segfaults. It appears, however, that the latest version does not fix our issue.
I think I see what happened. The commit that prevents the segfault from occurring doesn't actually stop it from happening once you're already in that situation (a job_ptr with job_resrcs == NULL). So you'll need to apply my patch on top of whatever 17.11 you're on (which I can see isn't applied in that backtrace), wait a while to ensure that the jobs in that situation are flushed out of the queue, and then you can safely remove my patch.

If it still segfaults with my patch, which you implied it did, please upload the output of the following from gdb, since it may be segfaulting in a different place:

thread apply all bt
bt full
Created attachment 8949 [details] slurmctld gdb output after adding patch to 17.11.12
Still had slurmctld segfault, attached the new gdb output
John, it looks like that binary doesn't have the patch, since the segfault is on the same line number, in the same file and function. My patch pushed that line back by 2 lines - it would be line 2081.

(gdb) bt full
#0 _step_dealloc_lps (step_ptr=0x1251740) at step_mgr.c:2081

Are you certain that you started slurmctld with the new binary (with the patch)? Or that the build didn't fail? Or that you aren't using the wrong coredump?
Sorry, it would be line 2083 with my patch. It's already line 2081
This may seem a silly question, but did you recompile after applying the patch?
Created attachment 8950 [details] compressed directory being used to build packages from WITH THE PATCH

Just so we are all on the same page: I applied the patch after decompressing the source. I then repackaged it as a .tar.gz file and used the following command to build the RPMs:

rpmbuild -tb slurm-17.11.12.tar.bz2 --with lua --with mysql

This produced packages without issue. After the packages were built, we removed the old packages from our central repo, copied over these new packages, and then rebuilt the repo.

We then went on to both controllers and did a yum remove of ALL slurm packages on the nodes. Afterwards we issued a yum clean all and then installed the 17.11.12 packages once again.
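The patch-and-repack step can be sketched like this (demonstrated on a dummy tree so it is self-contained; real use would start from the SchedMD tarball and apply the attached patch with `patch -p1 < avoid_segfault.patch` instead of the stand-in sed edit):

```shell
set -e
work=$(mktemp -d) && cd "$work"

# Stand-in for the pristine upstream tarball.
mkdir -p slurm-17.11.12/src/slurmctld
echo "original line" > slurm-17.11.12/src/slurmctld/step_mgr.c
tar czf pristine.tar.gz slurm-17.11.12
rm -rf slurm-17.11.12

# 1. Unpack; 2. apply the patch (stand-in edit here);
# 3. repack under the exact name that rpmbuild -tb will be given.
tar xzf pristine.tar.gz
(cd slurm-17.11.12 && sed -i 's/original/patched/' src/slurmctld/step_mgr.c)
tar czf slurm-17.11.12.tar.gz slurm-17.11.12

# Verify the patched file really is inside the new tarball:
tar xzOf slurm-17.11.12.tar.gz slurm-17.11.12/src/slurmctld/step_mgr.c
# prints: patched line

# 4. Build from the repacked tarball:
# rpmbuild -tb slurm-17.11.12.tar.gz --with lua --with mysql
```

The verification step matters: building from a stale or differently named tarball silently produces unpatched binaries.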
That process looks like it should work. A couple last things to try, then. Can you start slurmctld manually instead of through systemd?

./path/to/slurmctld -D

and (assuming it segfaults) see if the backtrace is the same or different?

The thing confusing me is that your source clearly has the patch, but the backtrace from the coredump implies it is segfaulting on this line:

if (!job_resrcs_ptr)

which is impossible. It's probably segfaulting on this line:

i_first = bit_ffs(job_resrcs_ptr->node_bitmap);

A colleague of mine reached out to you via email.
I found the issue with the way I was building the RPMs. After fixing that, I am now running into problems building the packages with the patch you gave me:

gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o statistics.o statistics.c
gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o step_mgr.o step_mgr.c
step_mgr.c: In function '_step_dealloc_lps':
step_mgr.c:2122:10: warning: format '%u' expects argument of type 'unsigned int', but argument 6 has type 'const char *' [-Wformat=]
   cpus_alloc, job_node_inx);
          ^
step_mgr.c:2122:10: warning: too many arguments for format [-Wformat-extra-args]
step_mgr.c: In function 'build_extern_step':
step_mgr.c:4706:2: error: expected declaration or statement at end of input
  jobacct_storage_g_step_start(acct_db_conn, step_ptr);
  ^
gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o trigger_mgr.o trigger_mgr.c
step_mgr.c:4706:2: warning: control reaches end of non-void function [-Wreturn-type]
  jobacct_storage_g_step_start(acct_db_conn, step_ptr);
  ^
make[3]: *** [step_mgr.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[3]: Leaving directory `/root/rpmbuild/BUILD/slurm-17.11.12-2/src/slurmctld'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/rpmbuild/BUILD/slurm-17.11.12-2/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/slurm-17.11.12-2'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.jYhpBG (%build)
Created attachment 8951 [details] re-upload of patch for 17.11.12 - avoid segfault

Just as a sanity check, I rebuilt my copy of slurm 17.11.2 with the patch in place and am re-uploading it here.
*17.11.12, not 17.11.2 The typos are strong with me tonight.
Yep, caught that I forgot to close a ), typos abound. After recompiling the packages and deploying them things are looking stable now. I am going to keep an eye on it and will update if the controller daemon segfaults again.
Moving this to sev 3 for now since the patch is now applied.
I'm glad it's working and that I'm not insane. I'll keep this open for a few days to make sure the system is stable.
Does SchedMD publish (or have on hand) a recommended patching procedure that we can refer to in our internal documentation, so we aren't second-guessing ourselves when a situation like this arises in the future?

~jonathon
No, we don't have any specific recommended procedure for patching the code. From one of my colleagues:

"SchedMD maintains the open source project only. We do not roll binaries, so it is expected that sites know how to apply patches and compile code. We do not have a procedure since each site is different, e.g. some roll rpms, some create spec files, and others just compile from source."

As a Slurm developer, I always build from source, but I realize that most sites do not. I found this tutorial on maintaining patch files for rpm packages very helpful (as a novice with rpm's), but your mileage may vary:

https://cromwell-intl.com/open-source/rpm-patch.html

Every site being different, we can't recommend one specific procedure; some sites even maintain a list of local patches that aren't in the official github repository.

I recommend against patching the source by hand. Instead, always use the patch file(s) that we upload. That way, if it doesn't compile, it's our fault. :)
Would you say that any patch we receive from SchedMD can be added to the SchedMD-distributed spec file with a line like:

PatchN: avoid_segfault.patch
I guess that just registers the patch as existing; maybe then we'd do:

%prep
#[...]
%patchN -p1
You'd want to put both of those things in the slurm.spec file:

PatchN: avoid_segfault.patch

%prep
#[...]
%patchN -p1

(And of course you also need to have the actual patch file in the correct directory - for me, ~/rpmbuild/SOURCES, alongside the Slurm tarball.) Then you should be able to build and install the packages.

Is the system still stable?
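As a fuller sketch of how those pieces fit together (the patch name and number are illustrative, and the %setup line stands in for whatever the stock slurm.spec already uses):

```
Patch0: avoid_segfault.patch

%prep
%setup -q
%patch0 -p1
```

The ordering matters: the %patchN macro must come after %setup, since %setup is what unpacks the source tree that the patch is applied to.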
It is! Thanks for your help. What release should we expect to see this patch in? ~jonathon
The patch I gave you isn't in a release. Although we never reproduced the exact situation that caused the segfault you hit, we haven't yet seen any customers hit this segfault on the newest versions of 17.11 (I think the latest version where someone reported it is 17.11.7). So we believe it's resolved by the other commits in 17.11 that fix the situations that eventually caused these segfaults. If you do remove the patch and hit the same segfault again, please upload a backtrace from gdb so we can take another look.
I'll close this as resolved/infogiven. You can respond to re-open this issue. We especially want to know if you do hit this segfault again.