Ticket 1346

Summary: slurmctld crashes when scontrol is run on members of a job array.
Product: Slurm Reporter: John Hanks <john.hanks>
Component: slurmctld Assignee: Brian Christiansen <brian>
Status: RESOLVED FIXED QA Contact:
Severity: 2 - High Impact    
Priority: --- CC: brian, da
Version: 14.11.2   
Hardware: Linux   
OS: Linux   
Site: KAUST Slinky Site: ---
Version Fixed: 14.11.3 15.08.0pre2

Description John Hanks 2015-01-01 00:08:08 MST
We are seeing reproducible slurmctld crashes when modifying job array elements with scontrol update.

To reproduce:

1. Submit a job array.
2. Modify its elements with something like:
  for i in {1..50}; do 
    scontrol update jobid=3095958_$i reservation=kapferc_9
  done
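
The steps above can be sketched as a small script for use on a test cluster. This is only an illustration, not a verified reproducer: the helper names run and update_array_members and the DRY_RUN switch are hypothetical, and only the scontrol update form is taken from the report. With DRY_RUN=1 the commands are printed instead of executed, so the loop can be checked before it touches a live slurmctld.

```shell
#!/bin/sh
# Sketch only: the helper names and DRY_RUN are hypothetical, not part of Slurm.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Issue one "scontrol update" per array element, as in the report.
# $1 = array job ID, $2 = number of elements, $3 = reservation name
update_array_members() {
    um_i=1
    while [ "$um_i" -le "$2" ]; do
        run scontrol update "jobid=${1}_${um_i}" "reservation=$3"
        um_i=$((um_i + 1))
    done
}
```

For step 1, something like jobid=$(sbatch --parsable --array=1-50 --wrap="sleep 600") followed by update_array_members "${jobid%%;*}" 50 kapferc_9 would cover both steps. The sleep payload and the array size are assumptions, and the reservation must already exist.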

The backtrace from the core dump is:

Program terminated with signal 6, Aborted.
#0  0x0000003558832635 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6_5.4.x86_64 libgcc-4.4.7-4.el6.x86_64 munge-libs-0.5.10-1.el6.x86_64 sssd-client-1.11.6-30.el6.x86_64
(gdb) bt
#0  0x0000003558832635 in raise () from /lib64/libc.so.6
#1  0x0000003558833e15 in abort () from /lib64/libc.so.6
#2  0x0000003558870547 in __libc_message () from /lib64/libc.so.6
#3  0x0000003558875e76 in malloc_printerr () from /lib64/libc.so.6
#4  0x00000035588789b3 in _int_free () from /lib64/libc.so.6
#5  0x00000000004a622d in slurm_xfree (item=0x2b867e99d470, file=<value optimized out>, line=<value optimized out>, 
    func=<value optimized out>) at xmalloc.c:238
#6  0x00000000004692fa in select_nodes (job_ptr=<value optimized out>, test_only=false, select_node_bitmap=<value optimized out>, 
    err_msg=0x0) at node_scheduler.c:1814
#7  0x000000000045ba4b in schedule (job_limit=100) at job_scheduler.c:1274
#8  0x0000000000432558 in _slurmctld_background (no_data=<value optimized out>) at controller.c:1635
#9  0x000000000043507f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:561

Comment 1 Brian Christiansen 2015-01-02 06:04:38 MST
This is fixed in the following commit:
https://github.com/SchedMD/slurm/commit/db98d6242db48b44cfe54d05e54ae01dcf596796

Please re-open if this doesn't fix it for you.

Thanks,
Brian