Ticket 526

Summary: Slurmctld Crashes: Cannot Access memory using select plugin in
Product: Slurm
Reporter: Jason Sollom <jasons>
Component: slurmctld
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 2 - High Impact
CC: da, david.gloe
Version: 14.03.x
Hardware: Linux
OS: Linux
Site: CRAY
Version Fixed: 14.03.0pre6
Attachments: partial slurmctld log

Description Jason Sollom 2013-11-18 02:48:29 MST
Slurmctld crashed twice over the weekend on galaxy.

The Cray Slurm code was last synced with commit 6b210cc8cd802dd37dd21e8003a6276706ee9988.

Here's the full backtrace.
(gdb) bt full
#0  0x00007fbffe940b35 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fbffe942111 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007fbffe9399f0 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3  0x00000000004df90e in bit_ffs (b=0x0) at bitstring.c:408
        bit = 0
        value = -1
        __PRETTY_FUNCTION__ = "bit_ffs"
#4  0x00007fbff99bf680 in _build_row_bitmaps (p_ptr=0x8c0758, job_ptr=0x925d78) at select_cons_res.c:640
        i = 0
        j = 2
        num_jobs = 36
        size = 2092
        x = 2
        jstart = 0x99cf38
        this_row = 0x7fbff31f0dc0
        orig_row = 0x9a4a08
        tmpjobs = 0x8c04d8
        job = 0x4d8be6 <debug3+137>
#5  0x00007fbff99c15dd in _rm_job_from_res (part_record_ptr=0x8c0758, node_usage=0x8ca838, job_ptr=0x925d78,
    action=0) at select_cons_res.c:1249
        p_ptr = 0x8c0758
        job = 0x7fbfe4020e08
        node_ptr = 0x89d518
        first_bit = 52
        last_bit = 52
        i = 5
        n = 1
        gres_list = 0x0
#6  0x00007fbff99c3975 in select_p_job_fini (job_ptr=0x925d78) at select_cons_res.c:2130
No locals.
#7  0x00007fbffd0f12d7 in other_job_fini (job_ptr=0x925d78) at other_select.c:372
No locals.
#8  0x00007fbffd0ef77d in _job_fini (args=0x925d78) at select_cray.c:682
        jobinfo = 0x0
        job_ptr = 0x925d78
        job_id = 118976
        node_list = 0x0
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
        job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK}
#9  0x00007fbffec8c7b6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007fbffe9e7c5d in clone () from /lib64/libc.so.6
No symbol table info available.
#11 0x0000000000000000 in ?? ()
No symbol table info available.

I have copied the slurmctld binary as well as the *.so's into /home/users/jasons/slurmctld_crash_bug/.

Looking more in depth at this in gdb, we start out with

***Frame 0***
(gdb) frame 0
#0  0x00000000004df912 in bit_ffs (b=0x43614e496a454972) at bitstring.c:408
(gdb) p *b
Cannot access memory at address 0x43614e496a454972

***Frame 1***
(gdb) frame 1
#1  0x00007f76648d5680 in _build_row_bitmaps (p_ptr=0x7f7668024be8, job_ptr=0xa47918) at select_cons_res.c:640

This is the code:
jstart[x] = bit_ffs(tmpjobs[x]->node_bitmap);

(gdb) p tmpjobs[x]->node_bitmap
$26 = (bitstr_t *) 0x43614e496a454972       <<< Same inaccessible address as above

Which comes from line 638:
tmpjobs[x] = p_ptr->row[i].job_list[j];

(gdb) p p_ptr->row[i].job_list[j]
$27 = (struct job_resources *) 0x0  <<< NULL, not a good sign

***Frame 2***
(gdb) frame 2
#2  0x00007f76648d75dd in _rm_job_from_res (part_record_ptr=0x7f7668024be8, node_usage=0x7f7668026a38, job_ptr=0xa47918, action=0) at select_cons_res.c:1249

Which is this line in select_cons_res.c
/* job was found and removed, so refresh the bitmaps */
_build_row_bitmaps(p_ptr, job_ptr);

Just to reiterate the frame 1 findings, below we see this is NULL, which I believe is incorrect:

(gdb) p p_ptr->row[0].job_list[0]
$21 = (struct job_resources *) 0x0
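The failure path traced above can be sketched in miniature. The following is a hypothetical simplification (the types, function names, and the NULL guard are mine, not the actual Slurm source): a NULL or corrupted entry in job_list hands bit_ffs() a bad node_bitmap pointer, and bit_ffs() asserts. A guard in the copy loop would skip the hole instead of crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified types; the real struct job_resources in
 * Slurm has many more fields. */
typedef unsigned long bitstr_t;

struct job_resources {
    bitstr_t *node_bitmap;
};

/* Stand-in for bit_ffs(): like the Slurm version, it asserts that the
 * bitstring pointer is non-NULL -- the assert that fires in frame #3
 * of the backtrace. */
static int first_set_bit(const bitstr_t *b)
{
    assert(b != NULL);
    for (int i = 0; i < 64; i++)
        if (*b & (1UL << i))
            return i;
    return -1;
}

/* Sketch of the loop around select_cons_res.c:638-640: walking the
 * row's job list and recording each job's first node.  Guarding
 * against NULL entries avoids handing first_set_bit() a NULL (or
 * dangling) node_bitmap. */
int collect_job_starts(struct job_resources **job_list, int num_jobs,
                       int *jstart)
{
    int x = 0;
    for (int j = 0; j < num_jobs; j++) {
        if (job_list[j] == NULL || job_list[j]->node_bitmap == NULL)
            continue;       /* skip holes instead of crashing */
        jstart[x++] = first_set_bit(job_list[j]->node_bitmap);
    }
    return x;               /* number of usable entries */
}
```

Note the guard only masks the symptom; the underlying corruption that put a garbage pointer into the list would still need to be found.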
Comment 1 Jason Sollom 2013-11-18 02:50:24 MST
The slurmctld core file and slurmctld log are also in the directory galaxy:/home/users/jasons/slurmctld_crash_bug/.
Comment 2 Danny Auble 2013-11-18 08:38:25 MST
I would expect p_ptr->row[i].job_list[j] to be NULL here; on the line above the seg fault it gets explicitly set to NULL. But memory corruption does appear to be part of this issue, which makes it much trickier.  This doesn't appear related to previous memory corruption issues.

According to the core file, it looks like there were 5 jobs in the system at this time.

(gdb) print *tmpjobs[x]
$2 = {core_bitmap = 0x414141445177413a, core_bitmap_used = 0x4b383757584a354a, cpu_array_cnt = 1869898049, cpu_array_value = 0x6c71593436545853, cpu_array_reps = 0x496b497a666e614b, cpus = 0x5573516b39587a4e, cpus_used = 0x2f2b654549696e35, cores_per_socket = 0x44784965505a6732, memory_allocated = 0x6d66344f62776a63, memory_used = 0x445933477573316e, nhosts = 1701458008, node_bitmap = 0x43614e496a454972, node_req = 1312966765, nodes = 0x7849795369575663 <Address 0x7849795369575663 out of bounds>, ncpus = 1702384208, sock_core_rep_count = 0x4ccd0100003a3d, sockets_per_node = 0x3}

This is obviously bad memory.

(gdb) print x
$3 = 0
(gdb) print i
$4 = 0
(gdb) print j
$5 = 0
(gdb) print p_ptr->row[i].num_jobs
$7 = 5

The rest of the list looks OK, though:

(gdb) print *p_ptr->row[i].job_list[j+1]
$10 = {core_bitmap = 0x9eaa70, core_bitmap_used = 0x8b0440, cpu_array_cnt = 9, cpu_array_value = 0x88f408, cpu_array_reps = 0x8aaa08, cpus = 0x8a9b68, cpus_used = 0x9a7748, cores_per_socket = 0x9019f8, memory_allocated = 0xa4c078, memory_used = 0x953e38, nhosts = 32, node_bitmap = 0x93e3d0, node_req = 1, nodes = 0x9f7b78 "nid00[215-237,352-358,360-361]", ncpus = 524, sock_core_rep_count = 0x91e688, sockets_per_node = 0x8bca58}

(gdb) print *p_ptr->row[i].job_list[j+2]
$11 = {core_bitmap = 0x93e320, core_bitmap_used = 0x8cc7d0, cpu_array_cnt = 1, cpu_array_value = 0x86f628, cpu_array_reps = 0x8cc698, cpus = 0x8cc658, cpus_used = 0x9a2e48, cores_per_socket = 0x93e2e8, memory_allocated = 0x8c5348, memory_used = 0xa48cd8, nhosts = 19, node_bitmap = 0xa48ca0, node_req = 1, nodes = 0x89ee48 "nid00[217-235]", ncpus = 152, sock_core_rep_count = 0x8cc778, sockets_per_node = 0x8d1078}

(gdb) print *p_ptr->row[i].job_list[j+3]
$12 = {core_bitmap = 0x7f7650017580, core_bitmap_used = 0x7f7650014e10, cpu_array_cnt = 1, cpu_array_value = 0x7f7650017ea8, cpu_array_reps = 0x7f7650014e58, cpus = 0x7f7650014e38, cpus_used = 0x7f76500a9808, cores_per_socket = 0x7f7650017548, memory_allocated = 0x7f7650015108, memory_used = 0x7f76500a9828, nhosts = 1, node_bitmap = 0x7f76500c7780, node_req = 1, nodes = 0x7f7650069f58 "nid00236", ncpus = 23, sock_core_rep_count = 0x7f7650017568, sockets_per_node = 0x7f76500a9848}

(gdb) print *p_ptr->row[i].job_list[j+4]
$13 = {core_bitmap = 0x7f76500adb70, core_bitmap_used = 0x7f76500adb90, cpu_array_cnt = 1, cpu_array_value = 0x7f76500c70e8, cpu_array_reps = 0x7f7650017998, cpus = 0x7f7650017978, cpus_used = 0x7f765000c818, cores_per_socket = 0x7f7650013ec8, memory_allocated = 0x7f76500adc58, memory_used = 0x7f7650014dd8, nhosts = 1, node_bitmap = 0x7f7650058690, node_req = 1, nodes = 0x7f76500563a8 "nid00330", ncpus = 9, sock_core_rep_count = 0x7f7650018c18, sockets_per_node = 0x7f76500189c8}
(gdb) print *p_ptr->row[i].job_list[j+5]
Cannot access memory at address 0x0

This job was job 117991.
$14 = {account = 0x0, alias_list = 0x0, alloc_node = 0x8f3a08 "galaxy", alloc_resp_port = 0, alloc_sid = 17379, array_job_id = 0, array_task_id = 4294967294, assoc_id = 0, assoc_ptr = 0x0, batch_flag = 1, batch_host = 0x8b1298 "nid00346", check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 222, cr_enabled = 1, db_index = 2361, derived_ec = 0, details = 0x96b088, direct_set_prio = 0, end_time = 1384631159, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x97e848 "", gres_req = 0x8c5ce8 "", gres_used = 0x0, group_id = 11121, job_id = 117991, job_next = 0x0, job_resrcs = 0xa11a78, job_state = 32772, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 0x0, magic = 4038539564, name = 0x940038 "IMB-crayxt.msg22", network = 0x0, next_step_id = 1, nodes = 0x8cf0b8 "nid00[346-348,351-355,357-358,360-361]", node_addr = 0x92f3f8, node_bitmap = 0xa11ba0, node_bitmap_cg = 0x8a92c0, node_cnt = 12, nodes_completing = 0x0, other_port = 0, partition = 0x8bff98 "workq", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x7f7668023188, pre_sus_time = 0, preempt_time = 0, priority = 4294899834, priority_array = 0x0, prio_factors = 0x96b1e8, profile = 0, qos_id = 1, qos_ptr = 0x864758, restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, resv_flags = 0, requid = 1356, resp_host = 0x0, select_jobinfo = 0x8cc438, spank_job_env = 0x0, spank_job_env_size = 0, start_time = 1384630680, state_desc = 0x0, state_reason = 0, step_list = 0x7f764c20ffd8, suspend_time = 0, time_last_active = 1384631159, time_limit = 60, time_min = 0, tot_sus_time = 0, total_cpus = 222, total_nodes = 12, user_id = 1356, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 
0, best_switch = true, wait4switch_start = 0}

(gdb) print *job_ptr->job_resrcs
$15 = {core_bitmap = 0x8b0ef0, core_bitmap_used = 0x92f690, cpu_array_cnt = 5, cpu_array_value = 0x92f758, cpu_array_reps = 0x92f6f8, cpus = 0x92f6c8, cpus_used = 0x92f668, cores_per_socket = 0xa11b18, memory_allocated = 0x8fbb38, memory_used = 0x8b1818, nhosts = 12, node_bitmap = 0xa11a40, node_req = 1, nodes = 0x8cd0a8 "nid00[346-348,351-355,357-358,360-361]", ncpus = 222, sock_core_rep_count = 0x8b0eb8, sockets_per_node = 0x92f638}
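One incidental observation, not stated in the ticket itself: the garbage values in the overwritten struct above (e.g. node_bitmap = 0x43614e496a454972, core_bitmap_used = 0x4b383757584a354a) consist entirely of printable ASCII bytes. Pointer-sized fields full of printable bytes usually mean string or base64-style text was written through a stale pointer into the struct after it was freed. A quick check (hypothetical helper, not Slurm code):

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Return 1 if every byte of a 64-bit value is printable ASCII.  A
 * "pointer" that passes this test is almost certainly text data that
 * overwrote freed memory, not a real heap address. */
int all_bytes_printable(uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        unsigned char c = (v >> (8 * i)) & 0xff;
        if (!isprint(c))
            return 0;
    }
    return 1;
}
```

Running it over the corrupted fields versus a healthy heap pointer like 0x9eaa70 cleanly separates the two.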
Comment 3 Danny Auble 2013-11-18 08:44:51 MST
Created attachment 516 [details]
partial slurmctld log

This is the last 100 lines of the slurmctld log file.

It appears job 117991 was canceled by scancel.  Jason, do you know if that was the case or not?

[2013-11-16T13:45:59.311] Calling NHC for jobid 117991 and apid 0 on nodes nid00[346-348,351-355,357-358,360-361](346-348,351-355,357-358,360-361)
[2013-11-16T13:46:00.791] completing job 117991 status 0
[2013-11-16T13:46:00.791] _slurm_rpc_complete_batch_script JobId=117991: Job/step already completing or completed 
[2013-11-16T13:46:00.885] _run_nhc jobid 117991 and apid 0 completed took: usec=1575365

This makes it appear the NHC for the reservation started before the RPC for the batch script complete made it in.  That seems backwards and is probably the reason for the memory corruption (at least this is my guess).

There must be a path where this is plausible.  I'll see if I can reproduce.
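If that guess is right, the corruption window is two cleanup paths finalizing the same job concurrently. A minimal sketch of the defensive pattern (hypothetical names and structure, not the Slurm code): a per-job "finalized" flag checked and set under a mutex, so only one of two racing threads performs the removal.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical job-cleanup state guarded by a mutex. */
struct job_cleanup {
    pthread_mutex_t lock;
    bool finalized;         /* set once the job has been removed */
    int remove_count;       /* how many times removal actually ran */
};

/* Both the NHC-completion path and the batch-script-complete RPC
 * would call this; the flag guarantees the removal runs exactly once
 * no matter which path wins the race. */
void finish_job(struct job_cleanup *jc)
{
    pthread_mutex_lock(&jc->lock);
    if (!jc->finalized) {
        jc->finalized = true;
        jc->remove_count++;     /* stands in for _rm_job_from_res() */
    }
    pthread_mutex_unlock(&jc->lock);
}

static void *finish_thread(void *arg)
{
    finish_job((struct job_cleanup *)arg);
    return NULL;
}

/* Race two cleanup threads and report how many removals ran. */
int race_two_cleanups(void)
{
    struct job_cleanup jc = { PTHREAD_MUTEX_INITIALIZER, false, 0 };
    pthread_t t1, t2;
    pthread_create(&t1, NULL, finish_thread, &jc);
    pthread_create(&t2, NULL, finish_thread, &jc);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return jc.remove_count;
}
```

Without the flag (or with the flag checked outside the lock), both paths can remove the job and the second removal operates on already-freed job_resources, which matches the garbage seen in tmpjobs[x].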
Comment 4 Jason Sollom 2013-11-19 01:39:28 MST
(In reply to Danny Auble from comment #2)
> I would expect p_ptr->row[i].job_list[j] to be NULL here on the line above
> the seg fault it gets explicitly set to NULL. But memory corruption does
> appear to be part of this issue though, making this a much tricker issue. 
> This doesn't appear related to previous memory corruption issues.
> 

This is an easy conclusion to jump to; however, I think it is incorrect.

(gdb) info frame
Stack level 2, frame at 0x7f766369ce40:
 rip = 0x7f76648d75dd in _rm_job_from_res (select_cons_res.c:1249); saved rip 0x7f76648d9975
 called by frame at 0x7f766369ce60, caller of frame at 0x7f766369cdc0
 source language c.
 Arglist at 0x7f766369ce30, args: part_record_ptr=0x7f7668024be8, node_usage=0x7f7668026a38, job_ptr=0xa47918, action=0
 Locals at 0x7f766369ce30, Previous frame's sp is 0x7f766369ce40
 Saved registers:
  rbx at 0x7f766369ce28, rbp at 0x7f766369ce30, rip at 0x7f766369ce38
(gdb) p n
$22 = 1
(gdb) p i
$23 = 5

Looking at i=5 means it was row[5].job_list[j] that was set to NULL, not row[0], which is the row in question.
Comment 5 Danny Auble 2013-11-19 02:27:50 MST
Interesting, in my findings this is what I see...

(gdb) print x
$3 = 0
(gdb) print i
$4 = 0
(gdb) print j
$5 = 0
(gdb) print p_ptr->row[i].num_jobs
$7 = 5

You appear to be in the wrong frame.  Frame 1 is where the print should be done, not frame 2.  I just ran gdb again to confirm.

While I believe there is memory corruption, I believe p_ptr->row[i].job_list[j] should be NULL.  I wouldn't expect it to be anything else.
Comment 6 Jason Sollom 2013-11-22 05:40:54 MST
This happened again on galaxy.  The backtrace looks the same, although the line number is now _job_fini (args=0xb75508) at select_cray.c:694 rather than
0x00007fbffd0ef77d in _job_fini (args=0x925d78) at select_cray.c:682.


***bt full***

#0  0x00007ff9a5c7db35 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ff9a5c7f111 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007ff9a5c769f0 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3  0x00000000004df8ce in bit_ffs (b=0x0) at bitstring.c:408
        bit = 0
        value = -1
        __PRETTY_FUNCTION__ = "bit_ffs"
#4  0x00007ff99cde3680 in _build_row_bitmaps (p_ptr=0x7ff9a0024fe8, job_ptr=0xb75508) at select_cons_res.c:640
        i = 0
        j = 6
        num_jobs = 19
        size = 2092
        x = 6
        jstart = 0xa10648
        this_row = 0x7ff999282db0
        orig_row = 0x8719f8
        tmpjobs = 0xa71488
        job = 0x4d8ba6 <debug3+137>
#5  0x00007ff99cde55dd in _rm_job_from_res (part_record_ptr=0x7ff9a0024fe8, node_usage=0x7ff9a0026dc8, job_ptr=0xb75508, action=0) at select_cons_res.c:1249
        p_ptr = 0x7ff9a0024fe8
        job = 0xb5f708
        node_ptr = 0x7ff9a0031de8
        first_bit = 105
        last_bit = 105
        i = 5
        n = 1
        gres_list = 0x0
#6  0x00007ff99cde7975 in select_p_job_fini (job_ptr=0xb75508) at select_cons_res.c:2130
No locals.
#7  0x00007ff9a45314bb in other_job_fini (job_ptr=0xb75508) at other_select.c:372
No locals.
#8  0x00007ff9a452f95f in _job_fini (args=0xb75508) at select_cray.c:694
        jobinfo = 0x0
        job_ptr = 0xb75508
        nhc_info = {apid = 0, exit_code = 1, jobid = 122940, nodelist = 0x0, step = false}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
        job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK}
#9  0x00007ff9a5fc97b6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007ff9a5d24c5d in clone () from /lib64/libc.so.6
No symbol table info available.
#11 0x0000000000000000 in ?? ()
No symbol table info available.

***Further analysis***

**Frame 3**
(gdb) frame 3
#3  0x00000000004df8ce in bit_ffs (b=0x0) at bitstring.c:408
408	bitstring.c: No such file or directory.
(gdb) p *b
Cannot access memory at address 0x0

**Frame 4**
#4  0x00007ff99cde3680 in _build_row_bitmaps (p_ptr=0x7ff9a0024fe8, job_ptr=0xb75508) at select_cons_res.c:640
640	select_cons_res.c: No such file or directory.

(gdb) p *tmpjobs[x]
$16 = {core_bitmap = 0x7ff9a5fbdfb8 <main_arena+312>, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, cores_per_socket = 0x0, 
  memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 38, sock_core_rep_count = 0x0, sockets_per_node = 0x0}

Notice node_bitmap is NULL.

It's hard to trace this back any further because the job_list entry gets nullified here:

p_ptr->row[i].job_list[j] = NULL;
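For reference, the nullify-then-remove bookkeeping that makes the trail go cold can be sketched generically (this is the general pattern for pointer-array removal, not the actual Slurm logic): once the slot is set to NULL and the list is compacted, the original entry and its position are gone, so the corrupted struct can no longer be inspected from the list.

```c
#include <assert.h>
#include <stddef.h>

/* Generic NULL-out-and-compact removal from a pointer array.  After
 * this runs, the removed entry is unreachable through the list, which
 * is why debugging has to happen before the rebuild. */
int remove_entry(void **list, int count, int idx)
{
    if (idx < 0 || idx >= count)
        return count;
    list[idx] = NULL;                   /* the nullification step */
    for (int k = idx; k < count - 1; k++)
        list[k] = list[k + 1];          /* compact so no hole remains */
    list[count - 1] = NULL;
    return count - 1;                   /* updated job count */
}
```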
Comment 7 Jason Sollom 2013-11-26 05:11:24 MST
This occurred after we synced up with SchedMD.  The latest SchedMD commit was 694e724c3308febfb0bdad970c0d1d19886d128f.

This may help narrow down the window in which to look for any errors.
Comment 8 David Gloe 2013-12-03 04:34:08 MST
Happened again today on galaxy. The last SchedMD commit was 529f4324 and it has the uid/gid patch applied.

(gdb) bt full
#0  0x00000000004dfd46 in bit_ffs (b=0x280) at bitstring.c:408
        bit = 0
        value = -1
        __PRETTY_FUNCTION__ = "bit_ffs"
#1  0x00007f1643398654 in _build_row_bitmaps (p_ptr=0x7f1644000c98, job_ptr=0x8af6d8) at select_cons_res.c:645
        i = 0
        j = 0
        num_jobs = 1
        size = 2092
        x = 0
        this_row = 0x7f1640848cf0
        orig_row = 0x7f1638000b58
        ss = 0x7f1638000d18
#2  0x00007f164339a3dd in _rm_job_from_res (part_record_ptr=0x7f1644000c98, node_usage=0x7f16440027a8, job_ptr=0x8af6d8, action=0) at select_cons_res.c:1241
        p_ptr = 0x7f1644000c98
        job = 0x8a97d8
        node_ptr = 0x7f1644168998
        first_bit = 24
        last_bit = 31
        i = 3
        n = 1
        gres_list = 0x0
#3  0x00007f164339c7e8 in select_p_job_fini (job_ptr=0x8af6d8) at select_cons_res.c:2138
No locals.
#4  0x00007f164aadb4bb in other_job_fini (job_ptr=0x8af6d8) at other_select.c:372
No locals.
#5  0x00007f164aad995f in _job_fini (args=0x8af6d8) at select_cray.c:705
        jobinfo = 0x0
        job_ptr = 0x8af6d8
        nhc_info = {apid = 0, exit_code = 1, jobid = 149578, nodelist = 0x0, step = false}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
        job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK}
#6  0x00007f164c6767b6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7  0x00007f164c3d1c5d in clone () from /lib64/libc.so.6
No symbol table info available.
#8  0x0000000000000000 in ?? ()
No symbol table info available.
Comment 9 Danny Auble 2013-12-05 05:58:43 MST
*** Ticket 476 has been marked as a duplicate of this ticket. ***
Comment 10 Danny Auble 2013-12-05 06:00:15 MST
I found and fixed an issue that may be related to this.

Patches...

c4511bdfb3f2caab2875cdc101b550d2509b5ce1
ad68eac67fb38e86da442ef36395ca4620bb59ee

Let me know if you see the issue after these patches.
Comment 11 Moe Jette 2013-12-19 08:17:17 MST
I'm closing this based upon Danny's patches. Please re-open as needed.
Comment 12 David Gloe 2014-02-05 04:46:30 MST
We saw this today on cst82 running slurm-14-03-0-0pre5.
The core file is in /var/spool/slurm/core on the sdb.

(gdb) bt full
#0  0x00007fc54cc91b35 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fc54cc93111 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007fc54cc8a9f0 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3  0x00000000004e17f0 in bit_ffs (b=0x7fc53001af40) at bitstring.c:451
        bit = 0
        value = -1
        __PRETTY_FUNCTION__ = "bit_ffs"
#4  0x00007fc548112654 in _build_row_bitmaps (p_ptr=0xa371b8, job_ptr=0xc9f698) at select_cons_res.c:645
        i = 0
        j = 7
        num_jobs = 58
        size = 82224
        x = 7
        this_row = 0x7fc545fcdcf0
        orig_row = 0x7fc53420f2d8
        ss = 0x7fc534036778
#5  0x00007fc5481143dd in _rm_job_from_res (part_record_ptr=0xa371b8, node_usage=0xa69048, job_ptr=0xc9f698, action=0) at select_cons_res.c:1241
        p_ptr = 0xa371b8
        job = 0x7fc53c001eb8
        node_ptr = 0xbbf3c8
        first_bit = 3371
        last_bit = 3371
        i = 5
        n = 1
        gres_list = 0x0
#6  0x00007fc5481167e8 in select_p_job_fini (job_ptr=0xc9f698) at select_cons_res.c:2138
No locals.
#7  0x00007fc54b846277 in other_job_fini (job_ptr=0xc9f698) at other_select.c:372
No locals.
#8  0x00007fc54b844925 in _job_fini (args=0xc9f698) at select_cray.c:711
        jobinfo = 0x0
        job_ptr = 0xc9f698
        nhc_info = {apid = 0, exit_code = 1, jobid = 591, nodelist = 0x0, step = false, user_id = 1356}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
        job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK}
#9  0x00007fc54cfdd7b6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007fc54cd38c5d in clone () from /lib64/libc.so.6
No symbol table info available.
#11 0x0000000000000000 in ?? ()
No symbol table info available.
Comment 13 Danny Auble 2014-02-06 04:27:22 MST
OK, this is fixed in 6d7e592c739ab7e89768e8ba0325587d5fdcee56.  Already checked into the master branch.

Let me know if you see this again, but I am fairly confident this should fix this issue.