| Summary: | Slurmctld Crashes | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Sollom <jasons> |
| Component: | Cray ALPS | Assignee: | Danny Auble <da> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | da |
| Version: | 14.03.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Jason Sollom
2013-10-22 05:39:07 MDT
Here are some salient bits from the backtrace. The bit_string passed to bit_ffs() is NULL in the end.
(gdb) frame 3
#3 0x00000000004de422 in bit_ffs (b=0x0) at bitstring.c:408
(gdb) frame 4
#4 0x00007fdb7b3246df in _build_row_bitmaps (p_ptr=0x89ef28, job_ptr=0x9df8a8) at select_cons_res.c:646
(gdb) p x
$1 = 0
(gdb) p tmpjobs[x]
$2 = (struct job_resources *) 0xaf23f8
(gdb) p tmpjobs[x]->node_bitmap
$3 = (bitstr_t *) 0x0
(gdb) p p_ptr->row[i].job_list[j]
$4 = (struct job_resources *) 0x0 << The job_resources struct is NULL >>
(gdb) p i
$5 = 0
(gdb) p j
$6 = 0
(gdb) p *p_ptr
$7 = {next = 0x0, num_rows = 4, part_ptr = 0x898228, row = 0x8c6e48}
Another interesting thing I saw: in frame 5, the job_ptr shows the job on nid00228.
(gdb) frame 5
#5 0x00007fdb7b32663c in _rm_job_from_res (part_record_ptr=0x89ef28, node_usage=0x89ae18, job_ptr=0x9df8a8, action=0) at select_cons_res.c:1255
1255 in select_cons_res.c
(gdb) p *job_ptr
$8 = {account = 0x0, alias_list = 0x0, alloc_node = 0x8c7288 "galaxy", alloc_resp_port = 0, alloc_sid = 11780, array_job_id = 0, array_task_id = 65534, assoc_id = 0, assoc_ptr = 0x0, batch_flag = 1,
batch_host = 0x9d6948 "nid00228", check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 8, cr_enabled = 1, db_index = 0, derived_ec = 0, details = 0xa45738, direct_set_prio = 0,
end_time = 1382459167, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x971118 "", gres_req = 0x9f3ee8 "", gres_used = 0x0, group_id = 1013, job_id = 10146, job_next = 0x0,
job_resrcs = 0x7fdb64033a98, job_state = 1, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0,
limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 0x0, magic = 4038539564, name = 0xaf2368 "ior", network = 0x0, next_step_id = 1,
nodes = 0x7fdb64009608 "nid00228", node_addr = 0x9daf58, node_bitmap = 0x7fdb64011670, node_bitmap_cg = 0x99f4a0, node_cnt = 1, nodes_completing = 0x0, other_port = 0, partition = 0xbd3cb8 "workq",
part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x898228, pre_sus_time = 0, preempt_time = 0, priority = 4294894013, priority_array = 0x0, prio_factors = 0x930f08, profile = 0, qos_id = 0,
qos_ptr = 0x0, restart_cnt = 2, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, resv_flags = 0, requid = 4294967295, resp_host = 0x0, select_jobinfo = 0x9224b8, spank_job_env = 0x0,
spank_job_env_size = 0, start_time = 1382455567, state_desc = 0x0, state_reason = 0, step_list = 0x9c98d8, suspend_time = 0, time_last_active = 1382455567, time_limit = 60, time_min = 0, tot_sus_time = 0,
total_cpus = 8, total_nodes = 1, user_id = 23421, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
However, once slurmctld was restarted after the crash, its log showed it trying to clean up this job 10146 on node nid00226, not nid00228.
Snippets from slurmctld.log:
[2013-10-22T11:45:47.482] Resending TERMINATE_JOB request JobId=10146 Nodelist=nid00226
[2013-10-22T11:45:47.484] debug: sched: Running job scheduler
[2013-10-22T11:46:36.000] debug: backfill: no jobs to backfill
[2013-10-22T11:46:47.492] Resending TERMINATE_JOB request JobId=10145 Nodelist=nid00230
[2013-10-22T11:46:47.494] Resending TERMINATE_JOB request JobId=10146 Nodelist=nid00226
[2013-10-22T11:46:47.494] debug: sched: Running job scheduler
[2013-10-22T11:46:50.613] debug: Spawning ping agent for nid00[304,333-337,339-358,360-363]
[2013-10-22T11:46:50.613] debug: Spawning registration agent for nid00[196-245] 50 hosts
[2013-10-22T11:46:50.651] debug: Killing job 10146 on node nid00226
[2013-10-22T11:46:50.651] debug: Killing job 10146 on node nid00226
[2013-10-22T11:46:50.654] debug: Killing job 10145 on node nid00230
[2013-10-22T11:46:50.654] debug: Killing job 10145 on node nid00230
[2013-10-22T11:47:10.465] debug: aeld sync event
[2013-10-22T11:47:10.466] debug: _start_session: Created aeld session fd 7
[2013-10-22T11:47:34.000] debug: backfill: no jobs to backfill
[2013-10-22T11:47:47.621] Resending TERMINATE_JOB request JobId=10145 Nodelist=nid00230
[2013-10-22T11:47:47.624] Resending TERMINATE_JOB request JobId=10146 Nodelist=nid00226
There are 70 of these Resending TERMINATE_JOB requests.
sdb:/var/spool/slurm # grep "Resending TERMINATE_JOB request JobId=10146" slurmctld.log|wc -l
70
I was able to hit this again. Ironically, both times slurmctld crashed while I was marking nid00228 down. Whenever I received the following error while running jobs, I would mark the suspect node down and continue on:
> srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
> srun: error: Task launch for 10190.0 failed on node nid00228: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> galaxy:/home/users/pavek # scontrol update node=nid00228 state=down
> galaxy:/home/users/pavek # sinfo
> slurm_load_partitions: Unable to contact slurm controller (connect failure)
New core is attached. Due to limits on file sizes, the slurmctld log and the two core files can be found at this link:
http://bugzilla.us.cray.com/show_bug.cgi?id=804210
Jason, I think this might have been related to the packing of a NULL jobinfo in the switch plugin. If you haven't seen this since you fixed the code, I am voting to close this.
Pretty sure this is the same bug as 526.
*** This ticket has been marked as a duplicate of ticket 526 ***