Created attachment 7343 [details]
Slurm Config File
Thanks for the backtrace. Can you just get the backtrace of the single thread, too? I'm not seeing the one that segfaulted.
(gdb) bt

Never mind, it's thread one. I got it. Are you on 17.11.7 without any local patches?

Yes. 17.11.7, no patches.

Created attachment 7344 [details]
Avoid segfault if job_resrcs_ptr is NULL

This is an exact duplicate of bug 5438, which also came in this week. Can you apply this patch (which I got from that bug, but with an error message added)? And can you start the slurmctld again?

I applied the patch. The slurmctld service is no longer crashing. I see that error message in the log now:
error: _step_dealloc_lps: job_resrcs_ptr is NULL for job 6038

Good. Assuming the system is stabilized, is it alright if we move this down to a sev-3 or sev-4? Also, could you upload a slurmctld log file from today, covering the period after you applied the patch and saw the error message I added?

Changing this to sev 3 or 4 is fine.

Created attachment 7346 [details]
slurmctld log from 7-18
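
For readers following along, the kind of guard described for attachment 7344 (skip the deallocation and log an error when job_resrcs_ptr is NULL) can be sketched roughly as below. This is not the actual patch or Slurm's real code: the struct layouts, the cpus_used field, and the function name step_dealloc_sketch are hypothetical stand-ins, and only the error message text follows the log line quoted above.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, heavily simplified stand-ins for slurmctld's structures. */
struct job_resources {
    uint32_t cpus_used;               /* assumed field, for illustration only */
};

struct job_record {
    uint32_t job_id;
    struct job_resources *job_resrcs; /* NULL in the failure case */
};

/*
 * Release `cpus` CPUs held by a step of this job.
 * Returns 0 on success, or -1 if the job has no resources structure.
 */
static int step_dealloc_sketch(struct job_record *job_ptr, uint32_t cpus)
{
    struct job_resources *job_resrcs_ptr = job_ptr->job_resrcs;

    /* The guard: log and bail out instead of dereferencing NULL. */
    if (job_resrcs_ptr == NULL) {
        fprintf(stderr, "error: %s: job_resrcs_ptr is NULL for job %" PRIu32 "\n",
                __func__, job_ptr->job_id);
        return -1;
    }

    /* Clamp at zero so the unsigned counter cannot wrap around. */
    if (job_resrcs_ptr->cpus_used >= cpus)
        job_resrcs_ptr->cpus_used -= cpus;
    else
        job_resrcs_ptr->cpus_used = 0;

    return 0;
}
```

A guard like this only keeps the daemon from crashing. It does not explain why the pointer was NULL in the first place, which is why the log output is still worth collecting.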
Thanks. I'll take a look through it. Since 5438 was filed first, I'm going to be doing more commenting on that bug. Feel free to CC yourself on it. I don't want to close this as a duplicate of that one just yet.

On 5438, I reported that I accidentally ran into this myself today, so hopefully it'll be easier to debug now.

Created attachment 7434 [details]
Prevent job_resrcs from being overwritten for multi-partition job submissions
Can you apply this patch and restart the slurmctld and see if it fixes the issue? We've been able to reliably reproduce this segfault, and this patch fixes it for us. It hasn't been committed yet, but we think it will be soon.
It's possible it fixes other issues you've been seeing, too.
Have you seen this segfault come up again?

By the way, that patch has been committed and will be in 17.11.9.
https://github.com/SchedMD/slurm/commit/fef07a40972

Marshall,
We have not seen this segfault since we applied the patch. I'm glad to hear it will make the next release. Is there somewhere I can view all the bug fixes that are planned for the 17.11.9 release?
Thanks,
Steve

Bug fixes that have already been committed for the next release can be found in the NEWS file on GitHub. Here it is for the 17.11 branch:
https://github.com/SchedMD/slurm/blob/slurm-17.11/NEWS
There's no way to view bug fixes that haven't been committed yet (things we're still working on) besides looking at Bugzilla, but of course not all the tickets are publicly viewable.

I'm glad it fixed it for you. I'm closing this as resolved/duplicate of 5452, since Dominik is the one who did the patch.

*** This ticket has been marked as a duplicate of ticket 5452 ***

Created attachment 7342 [details]
gdb thread apply all bt
Our slurmctld daemon is currently segfaulting shortly after being started. There were no major changes in the configuration recently. Any ideas?