Ticket 4484

Summary: slurmd segfault in stepd_completion/free_buf/slurm_xfree
Product: Slurm Reporter: David Gloe <david.gloe>
Component: slurmd Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 17.11.0   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4491
Site: CRAY
Cray Sites: Cray Internal
Version Fixed: 17.11.1
Attachments: slurmd backtrace from gdb
slurmd log

Description David Gloe 2017-12-07 11:36:28 MST
Created attachment 5693
slurmd backtrace from gdb

We've just experienced a slurmd segfault on 17.11.0 in stepd_completion. I'll attach the backtrace and slurmd log. This same segfault happened on two nodes.

Looks like the segfault for this node happened at 2017-12-07 12:17:21.
Comment 1 David Gloe 2017-12-07 11:36:55 MST
Created attachment 5694
slurmd log
Comment 2 Tim Wickberg 2017-12-07 12:02:08 MST
Just a friendly reminder to attach the backtrace when you get a minute... the logs are a good start but that'd help speed up the fix.
Comment 3 David Gloe 2017-12-07 12:06:28 MST
(In reply to Tim Wickberg from comment #2)
> Just a friendly reminder to attach the backtrace when you get a minute...
> the logs are a good start but that'd help speed up the fix.

The backtrace is already attached, at https://bugs.schedmd.com/attachment.cgi?id=5693
Comment 4 Tim Wickberg 2017-12-07 13:00:02 MST
Ah, sorry, my fault. Missed that on the first comment.

Felip - can you work through this on Friday?
Comment 11 Felip Moll 2017-12-19 02:06:22 MST
Hi David,

A quick update: we have identified the problem and have a patch pending review and commit. The fix will be released officially as soon as possible.

Thanks
Felip M
Comment 13 Felip Moll 2017-12-20 02:58:35 MST
The fix for this issue is committed in 973ac2017280246ce0c7741c6d9e25b41d903c9f.

It will be available in 17.11.1 and later.

Thanks for reporting,
Felip M
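
Editor's note: comment 13 states that the fix ships in 17.11.1 and later. As a minimal sketch, a site could check whether an installed release is at or past that version with something like the following. The helper names (`parse_version`, `includes_fix`) are hypothetical and not part of Slurm, and the sketch assumes a plain dotted numeric version string such as the one reported by `slurmd -V`, with no pre-release or packaging suffix. It only compares against the fixed release; it says nothing about whether releases older than 17.11.0 are affected.

```python
# Hypothetical helpers for illustration only; not part of Slurm.

# First release that contains the fix, per comment 13 of this ticket.
FIXED_IN = (17, 11, 1)


def parse_version(version):
    """Convert a dotted numeric version such as '17.11.1' to a tuple of ints.

    Assumes no pre-release or packaging suffix; int() raises ValueError
    if a component is not purely numeric.
    """
    return tuple(int(part) for part in version.strip().split("."))


def includes_fix(version):
    """Return True if the version is 17.11.1 or later."""
    return parse_version(version) >= FIXED_IN


if __name__ == "__main__":
    for candidate in ("17.11.0", "17.11.1", "17.11.2"):
        status = "includes fix" if includes_fix(candidate) else "predates fix"
        print(candidate, status)
```

Comparing integer tuples rather than raw strings avoids lexical ordering mistakes, for example treating "17.2.0" as newer than "17.11.0".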