Ticket 1346 - slurmctld crashes when scontrol is run on members of a job array.
Summary: slurmctld crashes when scontrol is run on members of a job array.
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.2
Hardware: Linux Linux
Severity: 2 - High Impact
Assignee: Brian Christiansen
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-01-01 00:08 MST by John Hanks
Modified: 2015-01-02 06:04 MST

See Also:
Site: KAUST
Version Fixed: 14.11.3 15.08.0pre2


Description John Hanks 2015-01-01 00:08:08 MST
We are seeing reproducible crashes when modifying job array elements with scontrol update. 

To reproduce:

1. Submit a job array.
2. Modify its elements with something like:
  for i in {1..50}; do 
    scontrol update jobid=3095958_$i reservation=kapferc_9
  done
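
For anyone trying to reproduce this on a test system, the steps above can be sketched as a small dry-run helper. This is only a sketch: the function name array_update_cmds is hypothetical, the sbatch line in the comment is an assumed way to submit the array (the report does not say how the original array was submitted), and the job ID and reservation name are the ones from this report. It prints the scontrol commands instead of running them, so nothing touches slurmctld until the output is piped to a shell.

```shell
#!/bin/sh
# Step 1 (assumption, not from the report): submit an array first, e.g.
#   sbatch --array=1-50 --wrap "sleep 600"
# and note the job ID it prints.

# Step 2: print one "scontrol update" command per array element.
# Arguments: <array job id> <first index> <last index> <reservation>
# The body runs in a subshell so its variables do not leak to the caller.
array_update_cmds() (
    job_id=$1
    i=$2
    last=$3
    reservation=$4
    while [ "$i" -le "$last" ]; do
        printf 'scontrol update jobid=%s_%s reservation=%s\n' \
            "$job_id" "$i" "$reservation"
        i=$((i + 1))
    done
)

# Dry run with the values from this report.
array_update_cmds 3095958 1 50 kapferc_9
```

To actually issue the updates on a test controller, pipe the output to sh, for example: array_update_cmds 3095958 1 50 kapferc_9 | sh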

Backtrace from core dump is:

Program terminated with signal 6, Aborted.
#0  0x0000003558832635 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6_5.4.x86_64 libgcc-4.4.7-4.el6.x86_64 munge-libs-0.5.10-1.el6.x86_64 sssd-client-1.11.6-30.el6.x86_64
(gdb) bt
#0  0x0000003558832635 in raise () from /lib64/libc.so.6
#1  0x0000003558833e15 in abort () from /lib64/libc.so.6
#2  0x0000003558870547 in __libc_message () from /lib64/libc.so.6
#3  0x0000003558875e76 in malloc_printerr () from /lib64/libc.so.6
#4  0x00000035588789b3 in _int_free () from /lib64/libc.so.6
#5  0x00000000004a622d in slurm_xfree (item=0x2b867e99d470, file=<value optimized out>, line=<value optimized out>, 
    func=<value optimized out>) at xmalloc.c:238
#6  0x00000000004692fa in select_nodes (job_ptr=<value optimized out>, test_only=false, select_node_bitmap=<value optimized out>, 
    err_msg=0x0) at node_scheduler.c:1814
#7  0x000000000045ba4b in schedule (job_limit=100) at job_scheduler.c:1274
#8  0x0000000000432558 in _slurmctld_background (no_data=<value optimized out>) at controller.c:1635
#9  0x000000000043507f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:561
Comment 1 Brian Christiansen 2015-01-02 06:04:38 MST
This is fixed in the following commit:
https://github.com/SchedMD/slurm/commit/db98d6242db48b44cfe54d05e54ae01dcf596796

Please re-open if it doesn't solve it for you.

Thanks,
Brian