We are currently experiencing an issue on our Grace cluster where slurmctld is crashing within a few minutes of startup. We are unable to get it to start and remain up. We have held all pending jobs, but the issue persists.

[Tue Jun 18 09:19:07 2019] srvcn[34708]: segfault at 58 ip 00000000004b46da sp 00007f15a7ffea00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 09:58:14 2019] srvcn[40266]: segfault at 58 ip 00000000004b46da sp 00007f0b590f8a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 09:59:14 2019] srvcn[642]: segfault at 58 ip 00000000004b46da sp 00007fc4111eca00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:01:24 2019] srvcn[1730]: segfault at 58 ip 00000000004b46da sp 00007f8d43af9a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:04:42 2019] srvcn[3169]: segfault at 58 ip 00000000004b46da sp 00007fed470e7a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:05:42 2019] srvcn[4050]: segfault at 58 ip 00000000004b46da sp 00007f46b24e3a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:06:48 2019] srvcn[4234]: segfault at 58 ip 00000000004b46da sp 00007f18756fea00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:12:15 2019] srvcn[6461]: segfault at 58 ip 00000000004b46da sp 00007fb3bfafaa00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:14:24 2019] srvcn[7415]: segfault at 58 ip 00000000004b46da sp 00007fdd609e7a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:15:24 2019] srvcn[8458]: segfault at 58 ip 00000000004b46da sp 00007f0cc51e9a00 error 4 in slurmctld[400000+e8000]

Thanks,
Kaylea
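P.S. For completeness, we held the pending jobs with a quick one-liner along these lines (our own ad hoc step, noted here only for reference):

    squeue -h -t PENDING -o %i | xargs -r -n1 scontrol hold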
Hi,

Do you have a core file from this crash? If so, can you generate a backtrace? e.g.:

    gdb -ex 't a a bt' -batch slurmctld <corefile>

Dominik
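P.S. If no core file is being written, a rough sketch of things worth checking (assuming slurmctld is started by systemd; adjust for your setup):

    ulimit -c unlimited                  # if slurmctld is started by hand from a shell
    cat /proc/sys/kernel/core_pattern    # confirm where core files are written
    systemctl edit slurmctld             # add a [Service] section with LimitCORE=infinity
    systemctl restart slurmctld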
Created attachment 10631 [details] backtrace of corefile from one of the segfaults
Hi,

I think this is a duplicate of bug 6837; check bug 6837 comment 5. We added this fix, which prevents such a situation:

https://github.com/SchedMD/slurm/commit/70d12f070908c33

Dominik
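P.S. If it helps, the commit can usually be pulled down as a patch directly from GitHub and applied to your source tree before rebuilding (sketch only, assuming the .patch URL suffix resolves for this commit; adapt to your build workflow):

    curl -LO https://github.com/SchedMD/slurm/commit/70d12f070908c33.patch
    patch -p1 < 70d12f070908c33.patch    # run from the top of the slurm source tree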
I have installed the patched RPMs on our controller node (still at 18.08.5), but we are still getting segfaults with a nearly identical backtrace.

-Tyler
Created attachment 10634 [details] new backtrace after patch from commit mentioned in comment 3
Hi,

Did you apply the patch from bug 6837 comment 5 or 70d12f070908c33? To clear your slurmctld state you need to apply the patch from bug 6837 comment 5.

Could you also send me the output from:

    gdb slurmctld <corefile>
    t 1
    f 0
    p job_resrcs_ptr
    p *job_resrcs_ptr

Dominik
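P.S. If it is more convenient, the same thing can be run non-interactively in one go, e.g.:

    gdb -batch -ex 't 1' -ex 'f 0' -ex 'p job_resrcs_ptr' -ex 'p *job_resrcs_ptr' slurmctld <corefile>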
Ah, I only applied the one from the GitHub commit. I assumed they were the same. I'm rebuilding with the patch from bug 6837 comment 5 now and will report back once it is installed.

-Tyler
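P.S. For the record, our rebuild is roughly the following (our local workflow, not an official procedure; the patch file name is just what we saved it as):

    tar xjf slurm-18.08.5.tar.bz2
    cd slurm-18.08.5 && patch -p1 < ../bug6837_comment5.patch && cd ..
    tar cjf slurm-18.08.5.tar.bz2 slurm-18.08.5
    rpmbuild -ta slurm-18.08.5.tar.bz2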
Right, that seems to have fixed things, as you predicted. Would you still like that last set of gdb output? We are back online now. Thank you very much for your help.

-Tyler
Tyler - I am dropping this down to a sev 3 since you are back online. Dominik's requests for the gdb info are no longer needed; those were just to verify that the patches had been applied correctly.
Hi,

Glad to hear that all is back to normal. With 70d12f070908c33 this bug shouldn't occur anymore. Let me know if you have any additional questions or problems; otherwise I will close this bug.

Dominik
Hi,

I'm going to go ahead and close this bug. If you have any questions, feel free to reopen it.

Dominik

*** This ticket has been marked as a duplicate of ticket 6837 ***