Ticket 9606

Summary: slurmctld segfault purging jobs
Product: Slurm Reporter: Matt Ezell <ezellma>
Component: slurmctld Assignee: Brian Christiansen <brian>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 19.05.5   
Hardware: Linux   
OS: Linux   
Site: NOAA
NOAA Site: ORNL
Version Fixed: 20.02.4

Description Matt Ezell 2020-08-18 07:22:11 MDT
We are running 19.05 with https://bugs.schedmd.com/attachment.cgi?id=13854 from https://bugs.schedmd.com/show_bug.cgi?id=8584. We have not yet pulled in the commit that Brian recommended, but I'm not sure whether its absence would cause this problem.

We have seen 2 segfaults with backtraces like:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000440091 in _remove_job_hash (job_entry=job_entry@entry=0x7f6a40000c00, type=type@entry=JOB_HASH_JOB) at job_mgr.c:2940
2940            while ((job_pptr != NULL) && (*job_pptr != NULL) &&
[Current thread is 1 (Thread 0x7f6a7cd54b80 (LWP 6670))]
Missing separate debuginfos, use: zypper install slurm-slurmctld-debuginfo-19.05.5-1.x86_64
(gdb) bt
#0  0x0000000000440091 in _remove_job_hash (job_entry=job_entry@entry=0x7f6a40000c00, type=type@entry=JOB_HASH_JOB) at job_mgr.c:2940
#1  0x000000000044216d in _delete_job_common (job_ptr=0x7f6a40000c00) at job_mgr.c:9343
#2  0x0000000000456371 in _list_delete_job (job_entry=<optimized out>) at job_mgr.c:9366
#3  0x00007f6a7c7fe1c7 in list_delete_all (l=0x6f82c0, f=f@entry=0x446d1a <_list_find_job_old>, key=key@entry=0x4c6acd) at list.c:420
#4  0x000000000044f1eb in purge_old_job () at job_mgr.c:10935
#5  0x0000000000431aa5 in _slurmctld_background (no_data=0x0) at controller.c:2162
#6  main (argc=<optimized out>, argv=<optimized out>) at controller.c:762
(gdb) p job_pptr
$1 = (struct job_record **) 0x3757724871417987

In the other backtrace:
(gdb) p job_pptr
$1 = (struct job_record **) 0x2400000174

Both pointers are invalid: neither can be dereferenced.
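
For readers unfamiliar with this kind of loop, here is a minimal sketch of a pointer-to-pointer walk over a hash bucket chain, which is the shape the faulting line at job_mgr.c:2940 suggests. It is not Slurm's source: `struct rec`, `table`, `hash_add`, and `hash_remove` are hypothetical names, and the layout is an assumption made for illustration. It shows one plausible way a walker can end up holding an address like those above: each step sets the walker to the address of the current record's next field, so if a record in the chain is stale or overwritten, the walker inherits garbage and the next read of `*pp` faults.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_SIZE 8

/* Hypothetical, simplified record; NOT Slurm's struct job_record. */
struct rec {
    unsigned id;
    struct rec *next;   /* next record in the same hash bucket */
};

static struct rec *table[HASH_SIZE];

/* Push a record onto the front of its bucket chain. */
static void hash_add(struct rec *r)
{
    assert(r != NULL);
    struct rec **slot = &table[r->id % HASH_SIZE];
    r->next = *slot;
    *slot = r;
}

/* Unlink `target` from its bucket. Returns 1 if found, 0 otherwise. */
static int hash_remove(struct rec *target)
{
    struct rec **pp = &table[target->id % HASH_SIZE];
    struct rec *cur;

    /*
     * Same loop shape as the faulting line: *pp is read on every pass.
     * If an earlier record was freed or overwritten, `cur` is garbage,
     * so `pp = &cur->next` is garbage too and the next `*pp` crashes.
     */
    while ((pp != NULL) && (*pp != NULL) && ((cur = *pp) != target))
        pp = &cur->next;

    if (pp == NULL || *pp == NULL)
        return 0;           /* reached end of chain without a match */

    *pp = target->next;     /* splice the target out of the chain */
    target->next = NULL;
    return 1;
}
```

With intact records the walk either finds the target or stops at the NULL that ends the chain. The NULL checks only guard against a clean end of chain; they cannot detect a non-NULL pointer into freed or overwritten memory, which is consistent with the values printed for job_pptr.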
Comment 1 Brian Christiansen 2020-08-18 09:52:10 MDT
Hey Matt,

Ya, that's the same backtrace that is related to Bug 9383 and the patch in Bug 8584 Comment 49. I can open the bug up if you would like.

Thanks,
Brian
Comment 2 Matt Ezell 2020-08-18 09:56:22 MDT
(In reply to Brian Christiansen from comment #1)

Thanks. I'll build a version with that patch as well and report back in a couple of days.
Comment 3 Matt Ezell 2020-08-27 06:51:31 MDT
No issues since we moved to the patched version. Thanks!

(Note: it won't let me mark this as a duplicate of 9383 since I don't have access to that bug, so I'm just marking this Resolved/Fixed.)