Ticket 79

Summary: Slurm core dump on Sequoia
Product: Slurm Reporter: Don Lipari <lipari1>
Component: Bluegene select plugin    Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 1 - System not usable    
Priority: ---    
Version: 2.4.x   
Hardware: IBM BlueGene   
OS: Linux   
Site: LLNL
Machine Name: sequoia
Attachments: stack trace and last log entries
log from blocks going into error

Description Don Lipari 2012-07-10 03:18:44 MDT
Created attachment 82 [details]
stack trace and last log entries
Comment 1 Danny Auble 2012-07-10 03:42:47 MDT
Send the logs from when blocks
RMP10Jl073343488

RMP10Jl073343399
RMP10Jl073343274
RMP10Jl073343210

RMP09Jl224539188

were created and went into error.  The point where they went into error is probably the interesting portion.

The most interesting creation is the second group, which should all happen at the same time, followed by the first group.
Comment 2 Danny Auble 2012-07-10 04:44:33 MDT
Created attachment 83 [details]
log from blocks going into error

The log you sent previously was good.  I'll see what I can find.  I am attaching it here since it wasn't sent to the ticket before.
Comment 3 Don Lipari 2012-07-10 10:44:05 MDT
slurmctld dumped core again:

(gdb) bt full
#0  0x000000808fa4e6fc in .pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00000000100af304 in list_iterator_create (l=0x91) at list.c:702
        e = <value optimized out>
        i = 0x400a80dac60
        __PRETTY_FUNCTION__ = "list_iterator_create"
#2  0x00000400003f2290 in ba_sub_block_in_record_clear (bg_record=0x401501a4c08, step_ptr=0x401500f3c38) at block_allocator.c:1262
        bit = 148
        itr = 0x0
        ba_mp = 0x0
        jobinfo = 0x401500ef218
        tmp_char = 0x0
        tmp_char2 = 0x0
        tmp_char3 = 0x0
#3  0x00000400003cffcc in select_p_step_finish (step_ptr=0x401500f3c38) at select_bluegene.c:2097
        bg_record = 0x401501a4c08
        jobinfo = <value optimized out>
        rc = 0
        tmp_char = 0x0
#4  0x00000000100d06fc in select_g_step_finish (step_ptr=0x401500f3c38) at node_select.c:798
No locals.
#5  0x000000001008dfc0 in job_step_complete (job_id=<value optimized out>, step_id=<value optimized out>, uid=<value optimized out>, requeue=<value optimized out>, job_return_code=<value optimized out>) at step_mgr.c:526
        job_ptr = 0x400f000fef8
        step_ptr = <value optimized out>
        error_code = <value optimized out>
#6  0x0000000010073a50 in _slurm_rpc_step_complete (msg=0x400f0132198) at proc_req.c:2413
        error_code = 0
        rc = <value optimized out>
        rem = 0
        step_rc = 0
        tv1 = {tv_sec = 1341959305, tv_usec = 855611}
        tv2 = {tv_sec = 4402073113368, tv_usec = 4398885495200}
        tv_str = "\360\024\221\340\000\000\000\000\020#(8\000\000\004\000\062\001\342@"
        req = 0x400f00c53a8
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
        uid = 8830
        dump_job = false
#7  0x0000000010076658 in slurmctld_req (msg=0x400f0132198) at proc_req.c:369
No locals.
#8  0x000000001002e518 in _service_connection (arg=0x40110002c48) at controller.c:1017
        conn = 0x40110002c48
        msg = 0x400f0132198
#9  0x000000808fa4c2bc in .start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x000000808f94866c in .__clone () from /lib64/libc.so.6
No symbol table info available.

[2012-07-10T15:28:13] add_bg_record: asking for [3000x3033] 0 0 0 0 0 T,T,T,T
[2012-07-10T15:28:13] debug:  513 can start unassigned job 1911 at 4294967295 on seq[3000x3033]
[2012-07-10T15:28:13] sched: _slurm_rpc_allocate_resources JobId=1911 NodeList=(null) usec=44008
[2012-07-10T15:28:20] Nodeboard 'N00' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N01' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N02' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N03' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N04' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N05' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N06' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N07' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N08' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N09' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N10' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N11' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N12' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N13' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N14' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N15' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:25] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 1904.0 nodes 0-0 rc=0 uid=8830
Comment 4 Danny Auble 2012-07-10 11:55:12 MDT
Please save this core.  If this is related to the other core then there is memory corruption somewhere.  Do you know what is happening on the system?
Comment 5 Danny Auble 2012-07-11 04:50:52 MDT
Could you put the core from this last one and slurmctld in my home dir?  I can't seem to log onto seqsn.  The cores don't appear to be related at this point.  I have a handle on both but I want to verify some information.

Mostly the output of "print *bg_record" in the ba_sub_block_in_record_clear frame.
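For reference, a sketch of the gdb session being asked for here (binary and core paths are illustrative):

```
$ gdb /path/to/slurmctld /path/to/core   # paths illustrative
(gdb) bt full              # full backtrace, as in comment 3
(gdb) frame 2              # select the ba_sub_block_in_record_clear frame
(gdb) print *bg_record     # dump the suspect bg_record structure
```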
Comment 6 Danny Auble 2012-07-11 04:54:26 MDT
The log from 2012-07-10T15:19:02 to 2012-07-10T15:32:51 would be helpful as well.
Comment 7 Danny Auble 2012-07-11 10:37:26 MDT
Both of these issues are fixed in this ticket.

The first one is fixed here...
https://github.com/SchedMD/slurm/commit/0c371d3617d0e2ac44a1d98661b4c480f548c6ac

The second one is actually related to ticket 82, I believe.  But this will fix the issue if it ever shows up again.
https://github.com/SchedMD/slurm/commit/4731a11b6ba665d8910743a374bda9ddd7c1910e