Ticket 5973

Summary: backup slurmctld abort() in select/cray
Product: Slurm    Reporter: Doug Jacobsen <dmjacobsen>
Component: slurmctld    Assignee: Broderick Gardner <broderick>
Status: RESOLVED TIMEDOUT
Severity: 4 - Minor Issue
Version: 18.08.3
Hardware: Linux    OS: Linux
Site: NERSC

Description Doug Jacobsen 2018-11-02 11:17:47 MDT
Hello,

Owing to the continuing slurmctld abort()s caused by the pthread issue being discussed in another ticket, our backup is getting some exercise.

The backup is presently configured to be non-scheduling, though I'm thinking of changing that given the relative instability we are experiencing with the primary controller.

The sequence of events was that at 8:53am the primary ctld abort()ed.


#0  0x00007f460069cf67 in raise () from /lib64/libc.so.6
#1  0x00007f460069e33a in abort () from /lib64/libc.so.6
#2  0x00007f4600695d66 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f4600695e12 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f4600e98e31 in bit_test (b=<optimized out>, bit=<optimized out>) at bitstring.c:229
#5  0x00007f45ff81327d in _set_job_running (job_ptr=0x7f452c489680) at select_cray.c:938
#6  select_p_job_begin (job_ptr=0x7f452c489680) at select_cray.c:1920
#7  0x00007f4600ed6d9f in select_g_job_begin (job_ptr=job_ptr@entry=0x7f452c489680) at node_select.c:542
#8  0x000000000047af79 in select_nodes (job_ptr=<optimized out>, test_only=test_only@entry=false, select_node_bitmap=select_node_bitmap@entry=0x0, err_msg=err_msg@entry=0x7f44a56d59e0, submission=submission@entry=true) at node_scheduler.c:2870
#9  0x00000000004554ea in _select_nodes_parts (err_msg=0x7f44a56d59e0, select_node_bitmap=0x0, test_only=false, job_ptr=0x7f452c489680) at job_mgr.c:4683
#10 job_allocate (job_specs=job_specs@entry=0x7f452c02a9d0, immediate=immediate@entry=0, will_run=will_run@entry=0, resp=resp@entry=0x0, allocate=allocate@entry=1, submit_uid=submit_uid@entry=77419, job_pptr=job_pptr@entry=0x7f44a56d59d8,
    err_msg=err_msg@entry=0x7f44a56d59e0, protocol_version=8448) at job_mgr.c:4905
#11 0x000000000049214a in _slurm_rpc_allocate_resources (msg=msg@entry=0x7f44a56d5ec0) at proc_req.c:1648
#12 0x0000000000493703 in slurmctld_req (msg=msg@entry=0x7f44a56d5ec0, arg=arg@entry=0x7f45cc014cf0) at proc_req.c:328
#13 0x000000000042a541 in _service_connection (arg=0x7f45cc014cf0) at controller.c:1274
#14 0x00007f4600c19724 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f4600754e8d in clone () from /lib64/libc.so.6




(gdb) frame 5
#5  0x00007f45ff81327d in _set_job_running (job_ptr=0x7f452c489680) at select_cray.c:938
938	in select_cray.c
(gdb) print (jobinfo->blade_map[0])
$12 = 1111704645
(gdb) print (jobinfo->blade_map[1])
$13 = 0
(gdb) print nodeinfo->blade_id
$14 = 0
(gdb)



It would seem that _assert_bit_valid is failing because the length of the map is 0.  I have not taken the time to trace the code, but I imagine this is because the blade_map was not initialized in the case of a backup controller, or a non-scheduling backup controller (I haven't looked).
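For context, the abort pattern in frame #4 matches a bounds assert in a bitstring accessor: testing a bit at an index at or beyond the bitmap's recorded size trips the assertion and abort()s the daemon. A minimal sketch of that invariant (hypothetical names and layout, not Slurm's actual bitstring.c):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal bitstring sketch: a bit count plus a word array. */
typedef struct {
    size_t nbits;     /* number of valid bits; 0 if never initialized */
    uint64_t *words;  /* backing storage, nbits/64 words rounded up */
} bitstr_sketch_t;

/* Mirrors the kind of bounds check an _assert_bit_valid-style
 * macro enforces before touching the backing array. */
static int bit_index_valid(const bitstr_sketch_t *b, size_t bit)
{
    return bit < b->nbits;
}

static int bit_test_sketch(const bitstr_sketch_t *b, size_t bit)
{
    /* With nbits == 0 (an uninitialized map), this assert fires
     * and the process abort()s, matching the backtrace above. */
    assert(bit_index_valid(b, bit));
    return (int)((b->words[bit / 64] >> (bit % 64)) & 1);
}
```

Under this reading, a blade_map that was never sized on the backup (nbits == 0) makes any bit_test of blade_id 0 fail the bounds assert, which is consistent with the gdb output showing nodeinfo->blade_id == 0.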

Thanks,
Doug
Comment 1 Doug Jacobsen 2018-11-02 11:20:18 MDT
Sorry, it would seem that I failed to finish my timeline:

at 8:53 the primary crashed
at 8:55 the backup took over
at 9:06 the backup crashed
at 9:16 the primary was restored


Note that because of the frequent crashes of the primary, this instance of the backup has taken over several times.
Comment 2 Broderick Gardner 2018-11-19 17:02:46 MST
I am investigating this, thanks for the information. Did this happen only once, or is it happening whenever the backup controller takes over?
Comment 3 Doug Jacobsen 2018-11-19 17:23:44 MST
Just the one time; we usually don't fail over while in operations.
Comment 4 Broderick Gardner 2018-11-27 16:03:57 MST
Do you still have this core file? There is a global blade_cnt that might be 0, which means either that select_p_node_init was never called in the backup controller or it got the wrong blade count. So could you get the value of blade_cnt from the core?
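If the core is still around, the global can be read directly from it with gdb against the matching slurmctld binary (paths and core filename below are placeholders):

```
$ gdb /usr/sbin/slurmctld core.<pid>
(gdb) print blade_cnt
```

A value of 0 there would point at select_p_node_init never having run (or having run with no node data) on the backup.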
Comment 5 Broderick Gardner 2019-05-20 13:39:03 MDT
I've looked into this some more but have not been able to make progress. We need a reproduction to get more information and a core file. 

Timing out.

Thanks