Ticket 476

Summary: Slurmctld Crashes
Product: Slurm Reporter: Jason Sollom <jasons>
Component: Cray ALPS    Assignee: Danny Auble <da>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: da
Version: 14.03.x   
Hardware: Linux   
OS: Linux   
Site: CRAY

Description Jason Sollom 2013-10-22 05:39:07 MDT
On Galaxy, slurmctld crashed while I was using scontrol to mark nid00228 'down'.  I was marking it down because it was experiencing the socket timeouts of BZ 804155 and/or 804156.

> (gdb) bt full
> #0  0x00007fdb7f7c6b35 in raise () from /lib64/libc.so.6
> No symbol table info available.
> #1  0x00007fdb7f7c8111 in abort () from /lib64/libc.so.6
> No symbol table info available.
> #2  0x00007fdb7f7bf9f0 in __assert_fail () from /lib64/libc.so.6
> No symbol table info available.
> #3  0x00000000004de422 in bit_ffs (b=0x0) at bitstring.c:408
>         bit = 0
>         value = -1
>         __PRETTY_FUNCTION__ = "bit_ffs"
> #4  0x00007fdb7b3246df in _build_row_bitmaps (p_ptr=0x89ef28, job_ptr=0x9df8a8) at select_cons_res.c:646
>         i = 0
>         j = 0
>         num_jobs = 3
>         size = 2092
>         x = 0
>         jstart = 0x926a98
>         this_row = 0x7fdb6b6f5dc0
>         orig_row = 0xa38208
>         tmpjobs = 0xa70e38
>         job = 0x4d76fa <debug3+137>
> #5  0x00007fdb7b32663c in _rm_job_from_res (part_record_ptr=0x89ef28, node_usage=0x89ae18, job_ptr=0x9df8a8,
>     action=0) at select_cons_res.c:1255
>         p_ptr = 0x89ef28
>         job = 0x7fdb64033a98
>         node_ptr = 0x8a1d38
>         first_bit = 32
>         last_bit = 32
>         i = 5
>         n = 1
>         gres_list = 0x0
> #6  0x00007fdb7b3289d4 in select_p_job_fini (job_ptr=0x9df8a8) at select_cons_res.c:2136
> No locals.
> #7  0x00007fdb7ebcc25c in other_job_fini (job_ptr=0x9df8a8) at other_select.c:362
> No locals.
> #8  0x00007fdb7ebca6c5 in _job_fini (args=0x9df8a8) at select_cray.c:662
>         jobinfo = 0x0
> ---Type <return> to continue, or q <return> to quit---
>         job_ptr = 0x9df8a8
>         job_id = 10146
>         node_list = 0x0
>         job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
>         job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK}
> #9  0x00007fdb7fb127b6 in start_thread () from /lib64/libpthread.so.0
> No symbol table info available.
> #10 0x00007fdb7f86dc5d in clone () from /lib64/libc.so.6
> No symbol table info available.
> #11 0x0000000000000000 in ?? ()
> No symbol table info available.

> (gdb) print *job_ptr
> $1 = {account = 0x0, alias_list = 0x0, alloc_node = 0x8c7288 "galaxy", alloc_resp_port = 0,
>   alloc_sid = 11780, array_job_id = 0, array_task_id = 65534, assoc_id = 0, assoc_ptr = 0x0, batch_flag = 1,
>   batch_host = 0x9d6948 "nid00228", check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0,
>   cpu_cnt = 8, cr_enabled = 1, db_index = 0, derived_ec = 0, details = 0xa45738, direct_set_prio = 0,
>   end_time = 1382459167, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0,
>   gres_alloc = 0x971118 "", gres_req = 0x9f3ee8 "", gres_used = 0x0, group_id = 1013, job_id = 10146,
>   job_next = 0x0, job_resrcs = 0x7fdb64033a98, job_state = 1, kill_on_node_fail = 1, licenses = 0x0,
>   license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0,
>   limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0,
>   mail_type = 0, mail_user = 0x0, magic = 4038539564, name = 0xaf2368 "ior", network = 0x0,
>   next_step_id = 1, nodes = 0x7fdb64009608 "nid00228", node_addr = 0x9daf58, node_bitmap = 0x7fdb64011670,
>   node_bitmap_cg = 0x99f4a0, node_cnt = 1, nodes_completing = 0x0, other_port = 0,
>   partition = 0xbd3cb8 "workq", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x898228,
>   pre_sus_time = 0, preempt_time = 0, priority = 4294894013, priority_array = 0x0, prio_factors = 0x930f08,
>   profile = 0, qos_id = 0, qos_ptr = 0x0, restart_cnt = 2, resize_time = 0, resv_id = 0, resv_name = 0x0,
>   resv_ptr = 0x0, resv_flags = 0, requid = 4294967295, resp_host = 0x0, select_jobinfo = 0x9224b8,
>   spank_job_env = 0x0, spank_job_env_size = 0, start_time = 1382455567, state_desc = 0x0, state_reason = 0,
>   step_list = 0x9c98d8, suspend_time = 0, time_last_active = 1382455567, time_limit = 60, time_min = 0,
>   tot_sus_time = 0, total_cpus = 8, total_nodes = 1, user_id = 23421, wait_all_nodes = 0, warn_flags = 0,
>   warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true,
>   wait4switch_start = 0}
Comment 1 Jason Sollom 2013-10-22 05:40:00 MDT
Here are some salient bits.  The bitstring passed to bit_ffs() ends up NULL.

(gdb) frame 3
#3  0x00000000004de422 in bit_ffs (b=0x0) at bitstring.c:408

(gdb) frame 4
#4  0x00007fdb7b3246df in _build_row_bitmaps (p_ptr=0x89ef28, job_ptr=0x9df8a8) at select_cons_res.c:646

(gdb) p x
$1 = 0
(gdb) p tmpjobs[x]
$2 = (struct job_resources *) 0xaf23f8

(gdb) p tmpjobs[x]->node_bitmap
$3 = (bitstr_t *) 0x0

(gdb) p p_ptr->row[i].job_list[j]
$4 = (struct job_resources *) 0x0    <-- the job_resources struct is NULL

(gdb) p i
$5 = 0

(gdb) p j
$6 = 0

(gdb) p *p_ptr
$7 = {next = 0x0, num_rows = 4, part_ptr = 0x898228, row = 0x8c6e48}

Another interesting thing I saw: in frame 5, job_ptr still shows nid00228.

(gdb) frame 5
#5  0x00007fdb7b32663c in _rm_job_from_res (part_record_ptr=0x89ef28, node_usage=0x89ae18, job_ptr=0x9df8a8, action=0) at select_cons_res.c:1255
1255	in select_cons_res.c

(gdb) p *job_ptr
$8 = {account = 0x0, alias_list = 0x0, alloc_node = 0x8c7288 "galaxy", alloc_resp_port = 0, alloc_sid = 11780, array_job_id = 0, array_task_id = 65534, assoc_id = 0, assoc_ptr = 0x0, batch_flag = 1, 
  batch_host = 0x9d6948 "nid00228", check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 8, cr_enabled = 1, db_index = 0, derived_ec = 0, details = 0xa45738, direct_set_prio = 0, 
  end_time = 1382459167, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x971118 "", gres_req = 0x9f3ee8 "", gres_used = 0x0, group_id = 1013, job_id = 10146, job_next = 0x0, 
  job_resrcs = 0x7fdb64033a98, job_state = 1, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, 
  limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 0x0, magic = 4038539564, name = 0xaf2368 "ior", network = 0x0, next_step_id = 1, 
  nodes = 0x7fdb64009608 "nid00228", node_addr = 0x9daf58, node_bitmap = 0x7fdb64011670, node_bitmap_cg = 0x99f4a0, node_cnt = 1, nodes_completing = 0x0, other_port = 0, partition = 0xbd3cb8 "workq", 
  part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x898228, pre_sus_time = 0, preempt_time = 0, priority = 4294894013, priority_array = 0x0, prio_factors = 0x930f08, profile = 0, qos_id = 0, 
  qos_ptr = 0x0, restart_cnt = 2, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, resv_flags = 0, requid = 4294967295, resp_host = 0x0, select_jobinfo = 0x9224b8, spank_job_env = 0x0, 
  spank_job_env_size = 0, start_time = 1382455567, state_desc = 0x0, state_reason = 0, step_list = 0x9c98d8, suspend_time = 0, time_last_active = 1382455567, time_limit = 60, time_min = 0, tot_sus_time = 0, 
  total_cpus = 8, total_nodes = 1, user_id = 23421, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}


However, the slurmctld log, once slurmctld was restarted after the crash, shows it trying to clean up this job 10146 on node nid00226, not nid00228.

Snippets from slurmctld.log:

[2013-10-22T11:45:47.482] Resending TERMINATE_JOB request JobId=10146 Nodelist=nid00226
[2013-10-22T11:45:47.484] debug:  sched: Running job scheduler
[2013-10-22T11:46:36.000] debug:  backfill: no jobs to backfill
[2013-10-22T11:46:47.492] Resending TERMINATE_JOB request JobId=10145 Nodelist=nid00230
[2013-10-22T11:46:47.494] Resending TERMINATE_JOB request JobId=10146 Nodelist=nid00226
[2013-10-22T11:46:47.494] debug:  sched: Running job scheduler
[2013-10-22T11:46:50.613] debug:  Spawning ping agent for nid00[304,333-337,339-358,360-363]
[2013-10-22T11:46:50.613] debug:  Spawning registration agent for nid00[196-245] 50 hosts
[2013-10-22T11:46:50.651] debug:  Killing job 10146 on node nid00226
[2013-10-22T11:46:50.651] debug:  Killing job 10146 on node nid00226
[2013-10-22T11:46:50.654] debug:  Killing job 10145 on node nid00230
[2013-10-22T11:46:50.654] debug:  Killing job 10145 on node nid00230
[2013-10-22T11:47:10.465] debug:  aeld sync event
[2013-10-22T11:47:10.466] debug:  _start_session: Created aeld session fd 7
[2013-10-22T11:47:34.000] debug:  backfill: no jobs to backfill
[2013-10-22T11:47:47.621] Resending TERMINATE_JOB request JobId=10145 Nodelist=nid00230
[2013-10-22T11:47:47.624] Resending TERMINATE_JOB request JobId=10146 Nodelist=nid00226

There are 70 of these Resending TERMINATE_JOB requests.

sdb:/var/spool/slurm # grep "Resending TERMINATE_JOB request JobId=10146" slurmctld.log|wc -l
70
Comment 2 Jason Sollom 2013-10-22 05:40:43 MDT
I was able to hit this again.  Ironically, both times slurmctld crashed while I was marking nid00228 down.

Whenever I received the following error while running jobs, I would mark the suspect node down and continue.

> srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
> srun: error: Task launch for 10190.0 failed on node nid00228: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete

> galaxy:/home/users/pavek # scontrol update node=nid00228 state=down
> galaxy:/home/users/pavek # sinfo
> slurm_load_partitions: Unable to contact slurm controller (connect failure)

New core is attached.
Comment 3 Jason Sollom 2013-10-22 05:44:22 MDT
Due to limits on file sizes, the slurmctld log, and the two core files can be found at this link.

http://bugzilla.us.cray.com/show_bug.cgi?id=804210
Comment 4 Danny Auble 2013-11-12 08:24:30 MST
Jason, I think this might have been related to the packing of a NULL jobinfo in the switch plugin.  If you haven't seen this since you fixed the code, I am voting to close this.
Comment 5 Danny Auble 2013-12-05 05:58:43 MST
Pretty sure this is the same bug as ticket 526.

*** This ticket has been marked as a duplicate of ticket 526 ***