Ticket 5447

Summary: SlurmCtlD segfaults on startup
Product: Slurm
Reporter: Steve Ford <fordste5>
Component: slurmctld
Assignee: Marshall Garey <marshall>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5438
Site: MSU
Version Fixed: 17.11.9, 18.08.0-pre2
Attachments: gdb thread apply all bt
Slurm Config File
Avoid segfault if job_resrcs_ptr is NULL
slurmctld log from 7-18
Prevent job_resrcs from being overwritten for multi-partition job submissions

Description Steve Ford 2018-07-18 12:43:49 MDT
Created attachment 7342
gdb thread apply all bt

Our slurmctld daemon is currently segfaulting shortly after being started. There were no major changes in the configuration recently. Any ideas?
Comment 1 Steve Ford 2018-07-18 12:44:48 MDT
Created attachment 7343
Slurm Config File
Comment 2 Marshall Garey 2018-07-18 12:53:52 MDT
Thanks for the backtrace. Can you just get the backtrace of the single thread, too? I'm not seeing the one that segfaulted.

(gdb) bt
Comment 3 Marshall Garey 2018-07-18 12:55:14 MDT
Never mind, it's thread one. I got it.
Comment 5 Marshall Garey 2018-07-18 12:57:11 MDT
Are you on 17.11.7 without any local patches?
Comment 7 Steve Ford 2018-07-18 13:01:37 MDT
Yes. 17.11.7, no patches.
Comment 8 Marshall Garey 2018-07-18 13:02:06 MDT
Created attachment 7344
Avoid segfault if job_resrcs_ptr is NULL

This is an exact duplicate of bug 5438, which also came in this week. Can you apply this patch (which I got from that bug, but added in an error message)? And can you start the slurmctld again?
Comment 9 Steve Ford 2018-07-18 13:49:50 MDT
I applied the patch. The slurmctld service is no longer crashing.

I see that error message in the log now:

error: _step_dealloc_lps: job_resrcs_ptr is NULL for job 6038
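
The general shape of a NULL guard that logs an error instead of dereferencing the pointer can be sketched in C as follows. This is an illustrative sketch only, not the contents of attachment 7344: the struct layouts, the cpus_used field, and the function signature and return values are hypothetical stand-ins for Slurm's internal types, and only the log message format is taken from the output above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for Slurm's internal structures. */
struct job_resources {
    uint32_t cpus_used;               /* CPUs currently charged to steps */
};

struct job_record {
    uint32_t job_id;
    struct job_resources *job_resrcs; /* may be NULL if job state is inconsistent */
};

/*
 * Give a step's CPUs back to its job.
 * Returns 0 on success, or -1 if the job has no resource record, in which
 * case an error is logged and nothing is dereferenced.
 */
int step_dealloc_lps(struct job_record *job, uint32_t step_cpus)
{
    assert(job != NULL); /* the caller must pass a valid job */

    if (job->job_resrcs == NULL) {
        fprintf(stderr,
                "error: _step_dealloc_lps: job_resrcs_ptr is NULL for job %" PRIu32 "\n",
                job->job_id);
        return -1;
    }

    /* Clamp so the unsigned counter cannot wrap below zero. */
    if (step_cpus > job->job_resrcs->cpus_used)
        step_cpus = job->job_resrcs->cpus_used;
    job->job_resrcs->cpus_used -= step_cpus;
    return 0;
}
```

A guard like this only stops the crash. It does not explain why the pointer was NULL in the first place, which is why the logged job IDs are useful for tracking down the underlying cause.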
Comment 10 Marshall Garey 2018-07-18 13:52:47 MDT
Good. Assuming the system is stabilized, is it alright if we move this down to a sev-3 or sev-4?

Also, could you upload a slurmctld log file from today, including after you applied the patch and saw that error message I added in?
Comment 11 Steve Ford 2018-07-18 14:03:04 MDT
Changing this to sev 3 or 4 is fine.
Comment 12 Steve Ford 2018-07-18 14:04:27 MDT
Created attachment 7346
slurmctld log from 7-18
Comment 13 Marshall Garey 2018-07-18 16:03:08 MDT
Thanks. I'll take a look through it. Since 5438 was filed first, I'm going to be doing more commenting on that bug. Feel free to CC yourself on it. I don't want to close this as a duplicate of that one just yet. On 5438, I reported that I accidentally ran into this myself today, so hopefully it'll be easier to debug now.
Comment 14 Marshall Garey 2018-07-26 17:06:41 MDT
Created attachment 7434
Prevent job_resrcs from being overwritten for multi-partition job submissions

Can you apply this patch and restart the slurmctld and see if it fixes the issue? We've been able to reliably reproduce this segfault, and this patch fixes it for us. It hasn't been committed yet, but we think it will be soon.

It's possible it fixes other issues you've been seeing, too.
Comment 15 Marshall Garey 2018-08-01 10:46:45 MDT
Have you seen this segfault come up again?

By the way, that patch has been committed and will be in 17.11.9.

https://github.com/SchedMD/slurm/commit/fef07a40972
Comment 16 Steve Ford 2018-08-02 08:31:33 MDT
Marshall,

We have not seen this segfault since we applied the patch. I'm glad to hear it will make the next release. Is there somewhere I can view all the bugfixes that are planned for the 17.11.9 release?

Thanks,
Steve
Comment 17 Marshall Garey 2018-08-02 08:43:23 MDT
Bug fixes that have already been committed for the next release can be found in the NEWS file on GitHub. Here it is for the 17.11 branch:

https://github.com/SchedMD/slurm/blob/slurm-17.11/NEWS

There's no way to view bug fixes that haven't been committed yet (things we're still working on) besides looking at Bugzilla, but of course not all the tickets are publicly viewable.

I'm glad it fixed it for you. I'm closing this as resolved/duplicate of 5452, since Dominik is the one who did the patch.

*** This ticket has been marked as a duplicate of ticket 5452 ***