Ticket 5487

Summary: slurmctld segfault (_step_dealloc_lps)
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: slurmctld    Assignee: Marshall Garey <marshall>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.8   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7641
Site: Stanford
Attachments: GDB output "thread apply all bt full"

Description Kilian Cavalotti 2018-07-26 10:42:11 MDT
Created attachment 7423 [details]
GDB output "thread apply all bt full"

Another one:

srvcn[12894]: segfault at 58 ip 00000000004aabf2 sp 00007f39eeed0ec0 error 4 in slurmctld[400000+df000]

# addr2line -e /usr/sbin/slurmctld 00000000004aabf2
/root/rpmbuild/BUILD/slurm-17.11.8/src/slurmctld/step_mgr.c:2081

(gdb) bt
#0  _step_dealloc_lps (step_ptr=0x4883b70) at step_mgr.c:2081
#1  post_job_step (step_ptr=step_ptr@entry=0x4883b70) at step_mgr.c:4652
#2  0x00000000004ab1e3 in _post_job_step (step_ptr=0x4883b70) at step_mgr.c:266
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x4883080, step_ptr=step_ptr@entry=0x4883b70) at step_mgr.c:307
#4  0x00000000004ab261 in delete_step_records (job_ptr=job_ptr@entry=0x4883080) at step_mgr.c:336
#5  0x0000000000463fa5 in cleanup_completing (job_ptr=job_ptr@entry=0x4883080) at job_scheduler.c:4731
#6  0x000000000046e51f in make_node_idle (node_ptr=0x226c0d8, job_ptr=job_ptr@entry=0x4883080) at node_mgr.c:3837
#7  0x000000000044ca23 in job_epilog_complete (job_id=<optimized out>, node_name=0x7f39c8018ab0 "sh-06-32", return_code=0) at job_mgr.c:14650
#8  0x0000000000486d98 in _slurm_rpc_epilog_complete (running_composite=true, run_scheduler=0x7f39eeed193f, msg=0x7f39c8018ae0) at proc_req.c:2229
#9  _slurm_rpc_comp_msg_list (comp_msg=comp_msg@entry=0x7f39c80181d0, run_scheduler=run_scheduler@entry=0x7f39eeed193f, msg_list_in=0x7f38f40f33c0, start_tv=start_tv@entry=0x7f39eeed1940, timeout=timeout@entry=2000000) at proc_req.c:6705
#10 0x00000000004864cf in _slurm_rpc_comp_msg_list (comp_msg=comp_msg@entry=0x7f39c8008a80, run_scheduler=run_scheduler@entry=0x7f39eeed193f, msg_list_in=0x7f38f40e9c80, start_tv=start_tv@entry=0x7f39eeed1940, timeout=2000000) at proc_req.c:6668
#11 0x0000000000487317 in _slurm_rpc_composite_msg (msg=msg@entry=0x7f39eeed1e50) at proc_req.c:6583
#12 0x000000000048f28d in slurmctld_req (msg=msg@entry=0x7f39eeed1e50, arg=arg@entry=0x7f399c001410) at proc_req.c:579
#13 0x0000000000424df8 in _service_connection (arg=0x7f399c001410) at controller.c:1125
#14 0x00007f3a00e8ee25 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f3a00bb8bad in clone () from /lib64/libc.so.6
Comment 1 Marshall Garey 2018-07-26 10:46:33 MDT
This looks like a duplicate of bug 5438. Are you running with the patch provided in bug 5438? If not, please apply that patch and see if it solves the segfault. We didn't include that in 17.11.8, because we're still trying to understand how it got into that state.
Comment 2 Kilian Cavalotti 2018-07-26 10:48:53 MDT
(In reply to Marshall Garey from comment #1)
> This looks like a duplicate of bug 5438. Are you running with the patch
> provided in bug 5438? If not, please apply that patch and see if it solves
> the segfault. We didn't include that in 17.11.8, because we're still trying
> to understand how it got into that state.

Oh, I assumed that the patch had been included in 17.11.8. We're running stock 17.11.8 now, so I'll apply the patch from 5438.

Thanks!
-- 
Kilian
Comment 3 Marshall Garey 2018-07-26 10:55:18 MDT
No worries. I'm marking this as a duplicate.

*** This ticket has been marked as a duplicate of ticket 5438 ***
Comment 4 Kilian Cavalotti 2018-07-26 11:06:37 MDT
(In reply to Marshall Garey from comment #3)
> No worries. I'm marking this as a duplicate.
> 
> *** This bug has been marked as a duplicate of bug 5438 ***

Thanks! I re-applied the patch from 5438 on our 17.11.8 installation.

On a side note, what about including that patch in future releases anyway, even without a complete understanding of how the job structure got corrupted? 

I mean logging an error or even generating an exception via an assert would be better than a segfault, wouldn't it?
Comment 5 Marshall Garey 2018-07-26 11:08:35 MDT
(In reply to Kilian Cavalotti from comment #4)
> (In reply to Marshall Garey from comment #3)
> > No worries. I'm marking this as a duplicate.
> > 
> > *** This bug has been marked as a duplicate of bug 5438 ***
> 
> Thanks! I re-applied the patch from 5438 on our 17.11.8 installation.
> 
> On a side note, what about including that patch in future releases anyway,
> even without a complete understanding of how the job structure got
> corrupted? 
> 
> I mean logging an error or even generating an exception via an assert would
> be better than a segfault, wouldn't it?

Yes, it would, and we are thinking about including it in future releases if we don't get the actual problem fixed.