Ticket 7253

Summary: slurmctld repeatedly segfaulting
Product: Slurm Reporter: Kaylea Nelson <kaylea.nelson>
Component: Scheduling Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bart, eric.peskin, jay.kubeck, jbooth, tyler.trafford
Version: 18.08.5   
Hardware: Linux   
OS: Linux   
Site: Yale
Attachments: backtrace of corefile from one of the segfaults
new backtrace after patch from commit mentioned in comment 3

Description Kaylea Nelson 2019-06-18 09:37:48 MDT
We are currently experiencing an issue on our Grace cluster where slurmctld crashes within a few minutes of startup. We are unable to get it to start and stay up. We have held all pending jobs, but the issue persists.

[Tue Jun 18 09:19:07 2019] srvcn[34708]: segfault at 58 ip 00000000004b46da sp 00007f15a7ffea00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 09:58:14 2019] srvcn[40266]: segfault at 58 ip 00000000004b46da sp 00007f0b590f8a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 09:59:14 2019] srvcn[642]: segfault at 58 ip 00000000004b46da sp 00007fc4111eca00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:01:24 2019] srvcn[1730]: segfault at 58 ip 00000000004b46da sp 00007f8d43af9a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:04:42 2019] srvcn[3169]: segfault at 58 ip 00000000004b46da sp 00007fed470e7a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:05:42 2019] srvcn[4050]: segfault at 58 ip 00000000004b46da sp 00007f46b24e3a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:06:48 2019] srvcn[4234]: segfault at 58 ip 00000000004b46da sp 00007f18756fea00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:12:15 2019] srvcn[6461]: segfault at 58 ip 00000000004b46da sp 00007fb3bfafaa00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:14:24 2019] srvcn[7415]: segfault at 58 ip 00000000004b46da sp 00007fdd609e7a00 error 4 in slurmctld[400000+e8000]
[Tue Jun 18 10:15:24 2019] srvcn[8458]: segfault at 58 ip 00000000004b46da sp 00007f0cc51e9a00 error 4 in slurmctld[400000+e8000]

Thanks,
Kaylea
Comment 1 Dominik Bartkiewicz 2019-06-18 09:46:39 MDT
Hi

Do you have a core file from this crash?

If yes, can you generate a backtrace?
e.g.:
gdb -ex 't a a bt' -batch slurmctld <corefile>
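
If no core files are being written, you may need to allow them first. A rough sketch, assuming slurmctld runs under systemd (unit name and paths may differ on your install):

mkdir -p /etc/systemd/system/slurmctld.service.d
cat > /etc/systemd/system/slurmctld.service.d/core.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
systemctl daemon-reload
systemctl restart slurmctld

Redirecting the full output to a file also makes it easier to attach here:

gdb -ex 't a a bt full' -batch slurmctld <corefile> > bt_full.txt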

Dominik
Comment 2 Tyler Trafford 2019-06-18 09:52:40 MDT
Created attachment 10631 [details]
backtrace of corefile from one of the segfaults
Comment 3 Dominik Bartkiewicz 2019-06-18 09:59:42 MDT
Hi

I think this is a duplicate of bug 6837.
Check bug 6837 comment 5.

We added this fix, which prevents this situation:
https://github.com/SchedMD/slurm/commit/70d12f070908c33
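
If it helps, GitHub can serve that commit as a plain patch, which should apply to an 18.08 source tree before rebuilding. A rough example (paths are illustrative):

wget https://github.com/SchedMD/slurm/commit/70d12f070908c33.patch
cd slurm-18.08.5
patch -p1 < ../70d12f070908c33.patch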

Dominik
Comment 5 Tyler Trafford 2019-06-18 11:45:59 MDT
I have installed the patched RPMs on our controller node (still at 18.08.5), but we are still getting segfaults with a nearly identical backtrace.

-Tyler
Comment 6 Tyler Trafford 2019-06-18 11:52:42 MDT
Created attachment 10634 [details]
new backtrace after patch from commit mentioned in comment 3
Comment 7 Dominik Bartkiewicz 2019-06-18 11:59:23 MDT
Hi

Did you apply the patch from bug 6837 comment 5 or the one from 70d12f070908c33?
To clear your existing slurmctld state you need to apply the patch from bug 6837 comment 5.

Could you also send me the output from:
gdb slurmctld <corefile>

t 1
f 0
p job_resrcs_ptr
p *job_resrcs_ptr
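
(t 1 selects thread 1 and f 0 its innermost frame; the two print commands will show whether job_resrcs_ptr is NULL and, if not, what it contains. A NULL pointer there would be consistent with the faulting address 0x58 in your dmesg output.)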

Dominik
Comment 8 Tyler Trafford 2019-06-18 12:13:27 MDT
Ah, I only applied the one from the GitHub commit.  I assumed they were the same.  I'm rebuilding with the patch from bug 6837 comment 5 now.  I'll reply back when I have that installed.

-Tyler
Comment 9 Tyler Trafford 2019-06-18 12:26:28 MDT
Right, that seems to have fixed things, as you predicted.  Would you still like that last gdb info?  We are back online now.

Thank you very much for your help.

-Tyler
Comment 10 Jason Booth 2019-06-18 12:57:43 MDT
Tyler - I am dropping this down to a sev 3 since you are back online. Dominik's request for the gdb info is no longer needed; it was just to verify that the patches had been applied correctly.
Comment 11 Dominik Bartkiewicz 2019-06-19 04:11:47 MDT
Hi

Glad to hear that all is back to normal.
With 70d12f070908c33 this bug shouldn't occur anymore.
Let me know if you have any additional questions or problems; otherwise I will close this bug.

Dominik
Comment 12 Dominik Bartkiewicz 2019-06-25 04:30:37 MDT
Hi

I'm going to go ahead and close this bug. If you have any questions, feel free to reopen it.

Dominik

*** This ticket has been marked as a duplicate of ticket 6837 ***