Hello,

At 13:44 today slurmctld aborted on edison in the backfill cycle. I'm attaching the logs and some preliminary analysis of the core file. Note that edison is still running 16.05.4 and can't be upgraded to 16.05.5 until we resolve bug #3154. We were able to restart slurmctld, so the problem does not appear to be ongoing, but there are some hallmarks of earlier crashes (e.g., bug #3082). I'm concerned it might recur, so I would appreciate your help.

(gdb) bt
#0  0x00007f90ebe20875 in raise () from /lib64/libc.so.6
#1  0x00007f90ebe21e51 in abort () from /lib64/libc.so.6
#2  0x00007f90ebe19740 in __assert_fail () from /lib64/libc.so.6
#3  0x00000000005131c8 in bit_overlap (b1=0x7f902c35eb00, b2=0x0) at bitstring.c:808
#4  0x00007f90e63a5652 in _will_run_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, job_node_req=64000, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cons_res.c:1882
#5  0x00007f90e63a643d in select_p_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cons_res.c:2248
#6  0x00007f90eb622f26 in other_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at other_select.c:277
#7  0x00007f90eb6210fa in select_p_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cray.c:1920
#8  0x000000000052ef1f in select_g_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at node_select.c:588
#9  0x00007f90e5f91fd8 in _try_sched (job_ptr=0x7f8fdc513070, avail_bitmap=0x7f90e5f8cd60, min_nodes=2, max_nodes=2, req_nodes=2, exc_core_bitmap=0x0) at backfill.c:417
#10 0x00007f90e5f9521f in _attempt_backfill () at backfill.c:1479
#11 0x00007f90e5f92fbe in backfill_agent (args=0x0) at backfill.c:764
#12 0x00007f90ec171806 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f90ebecc9bd in clone () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
(gdb) ...
(gdb) up
#3  0x00000000005131c8 in bit_overlap (b1=0x7f902c35eb00, b2=0x0) at bitstring.c:808
808	bitstring.c: No such file or directory.
(gdb) print b1
$1 = (bitstr_t *) 0x7f902c35eb00
(gdb) print b2
$2 = (bitstr_t *) 0x0
(gdb) ...
(gdb) print tmp_job_ptr
$3 = (struct job_record *) 0x7f90a0388820
(gdb) print tmp_job_ptr->node_bitmap
$4 = (bitstr_t *) 0x0
(gdb) set print pretty on
(gdb) print *tmp_job_ptr
$5 = {
  account = 0x7f90a00aede0 "m1248",
  alias_list = 0x0,
  alloc_node = 0x7f90a0388d50 "edison11",
  alloc_resp_port = 0,
  alloc_sid = 25574,
  array_job_id = 2366862,
  array_task_id = 75,
  array_recs = 0x0,
  assoc_id = 2849,
  assoc_ptr = 0x7f90e0ae12f0,
  batch_flag = 2,
  batch_host = 0x7f90a0466d30 "nid06105",
  bit_flags = 0,
  burst_buffer = 0x0,
  check_job = 0x0,
  ckpt_interval = 0,
  ckpt_time = 0,
  comment = 0x0,
  cpu_cnt = 0,
  billable_tres = 2,
  cr_enabled = 1,
  deadline = 0,
  db_index = 0,
  derived_ec = 0,
  details = 0x7f90a0388b60,
  direct_set_prio = 0,
  end_time = 1476737065,
  end_time_exp = 0,
  epilog_running = false,
  exit_code = 0,
  front_end_ptr = 0x0,
  gres = 0x7f90a073ed20 "craynetwork:0",
  gres_list = 0x0,
  gres_alloc = 0x7f90a0466ec0 "craynetwork:4",
  gres_req = 0x7f90a0466e60 "craynetwork:0",
  gres_used = 0x0,
  group_id = 62064,
  job_id = 2366942,
  job_next = 0x0,
  job_array_next_j = 0x7f90a0627ad0,
  job_array_next_t = 0x0,
  job_resrcs = 0x7f90a0425020,
  job_state = 4,
  kill_on_node_fail = 1,
  licenses = 0x7f90a073ed50 "project:1",
  license_list = 0x7f90e34a8520,
  limit_set = {
    qos = 0,
    time = 0,
    tres = 0x7f90a01a8140
  },
  mail_type = 0,
  mail_user = 0x0,
  magic = 4038539564,
  mcs_label = 0x0,
  name = 0x7f90a0388d80 "my_job",
  network = 0x0,
  next_step_id = 0,
  nodes = 0x0,
  node_addr = 0x7f90a0466ce0,
  node_bitmap = 0x0,
  node_bitmap_cg = 0x0,
  node_cnt = 0,
  node_cnt_wag = 1,
  nodes_completing = 0x0,
  other_port = 0,
  partition = 0x7f90a073edb0 "shared",
  part_ptr_list = 0x0,
  part_nodes_missing = false,
  part_ptr = 0x7f902c48ac30,
  power_flags = 0 '\000',
  pre_sus_time = 0,
  preempt_time = 0,
  preempt_in_progress = false,
  priority = 57242,
  priority_array = 0x0,
  prio_factors = 0x7f90a0388ce0,
  profile = 0,
  qos_id = 1,
  qos_ptr = 0x7f90e00048a0,
  reboot = 0 '\000',
  restart_cnt = 1,
  resize_time = 0,
  resv_id = 0,
  resv_name = 0x0,
  resv_ptr = 0x0,
  requid = 62064,
  resp_host = 0x0,
  sched_nodes = 0x0,
  select_jobinfo = 0x7f90a073edd0,
  spank_job_env = 0x7f90a059b6b0,
  spank_job_env_size = 1,
  start_protocol_ver = 7680,
  start_time = 1476737065,
  state_desc = 0x0,
  state_reason = 34,
  state_reason_prev = 34,
  step_list = 0x7f90e3035950,
  suspend_time = 0,
  time_last_active = 1476737021,
  time_limit = 2880,
  time_min = 0,
  tot_sus_time = 0,
  total_cpus = 0,
  total_nodes = 1,
  tres_req_cnt = 0x7f90a073ebe0,
  tres_req_str = 0x7f90a059b710 "1=1,2=1300,4=1",
  tres_fmt_req_str = 0x7f90a059b740 "cpu=1,mem=1300,node=1",
  tres_alloc_cnt = 0x7f90a0424dc0,
  tres_alloc_str = 0x7f90a0424d40 "1=2,2=2600,4=1",
  tres_fmt_alloc_str = 0x7f90a0466e00 "cpu=2,mem=2600M,node=1",
  user_id = 62064,
  wait_all_nodes = 0,
  warn_flags = 0,
  warn_signal = 0,
  warn_time = 0,
  wckey = 0x0,
  req_switch = 0,
  wait4switch = 0,
  best_switch = true,
  wait4switch_start = 0
}
(gdb)

-Doug
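For context on frame #3: the crash is an assertion failure inside bit_overlap(), not a stray dereference. A minimal sketch of the pattern, assuming bitstring.c validates both arguments up front (an assumption inferred from the backtrace, not the verbatim source):

    #include <assert.h>
    #include <stdint.h>

    typedef int64_t bitstr_t;

    /* Sketch only: with b1 = 0x7f902c35eb00 and b2 = 0x0 as printed
     * above, the second assert fails, producing the __assert_fail ->
     * abort -> raise frames #0-#2 of the backtrace. */
    int bit_overlap(bitstr_t *b1, bitstr_t *b2)
    {
        int count = 0;
        assert(b1 != NULL);
        assert(b2 != NULL);    /* the assert that fired */
        /* ... count the bits set in both bitmaps ... */
        return count;
    }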
Created attachment 3594 [details]
thread apply all bt output
Created attachment 3595 [details]
slurmctld log
Doug, do you know if there were any memory issues on the node? I am wondering if we ran out.

The assert happened on

    bit_or(bitmap, orig_map);

where orig_map was NULL.

    orig_map = bit_copy(bitmap);

where bitmap=0x7f902c35eb00.

bit_copy (meat):

    new = bit_alloc(newsize_bits);
    if (new)
        memcpy(&new[BITSTR_OVERHEAD], &b[BITSTR_OVERHEAD], len);

If bit_alloc returned NULL here, though, I would expect a nice OOM message in the log, as that is what the code is supposed to do inside bit_alloc. As that doesn't exist, I am not sure how it got to be NULL. Other variables look good, or I would expect memory corruption.

Could you send 'bt full' on the thread provided in comment 0?

It appears there are a lot of threads waiting, presumably because backfill had just given up the locks and they are now going through them. Nothing that appears too concerning.
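For reference, the expected bit_alloc behavior described above, as a minimal C sketch. This is an approximation under the assumption that bit_alloc logs before returning NULL; it is not the verbatim bitstring.c source:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    typedef int64_t bitstr_t;
    #define BITSTR_OVERHEAD 2

    /* Sketch only: bit_alloc() is expected to log an out-of-memory
     * error before returning NULL, which is why a silent NULL here
     * (no OOM message in the slurmctld log) is hard to explain. */
    bitstr_t *bit_alloc(size_t nbits)
    {
        size_t words = BITSTR_OVERHEAD + (nbits + 63) / 64;
        bitstr_t *new_bits = calloc(words, sizeof(bitstr_t));
        if (!new_bits)
            fprintf(stderr, "bit_alloc: malloc failed\n");
        return new_bits;
    }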
(gdb) bt full
#0  0x00007f90ebe20875 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f90ebe21e51 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f90ebe19740 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3  0x00000000005131c8 in bit_overlap (b1=0x7f902c35eb00, b2=0x0) at bitstring.c:808
        count = 0
        bit = 0
        bit_cnt = 32656
        word_size = 32
        __PRETTY_FUNCTION__ = "bit_overlap"
#4  0x00007f90e63a5652 in _will_run_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, job_node_req=64000, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cons_res.c:1882
        first_job_ptr = 0x0
        next_job_ptr = 0x0
        overlap = 0
        last_job_ptr = 0x0
        rm_job_cnt = 0
        time_window = 0
        more_jobs = true
        future_part = 0x7f902c4ad9e0
        future_usage = 0x7f902cbbe850
        tmp_job_ptr = 0x7f90a0388820
        cr_job_list = 0x2cc8f30
        job_iterator = 0x8c3d20
        preemptee_iterator = 0x5804d651
        orig_map = 0x7f902cb19280
        action = 0
        rc = -1
        now = 1476737066
        tmp_cr_type = 50
        qos_preemptor = false
#5  0x00007f90e63a643d in select_p_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cons_res.c:2248
        rc = 22
        job_node_req = 64000
        debug_cpu_bind = false
        debug_check = true
#6  0x00007f90eb622f26 in other_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at other_select.c:277
No locals.
#7  0x00007f90eb6210fa in select_p_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cray.c:1920
        jobinfo = 0x7f8fdc60e6c0
#8  0x000000000052ef1f in select_g_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at node_select.c:588
No locals.
#9  0x00007f90e5f91fd8 in _try_sched (job_ptr=0x7f8fdc513070, avail_bitmap=0x7f90e5f8cd60, min_nodes=2, max_nodes=2, req_nodes=2, exc_core_bitmap=0x0) at backfill.c:417
        orig_shared = 0
        now = 1476737066
        str = "\340\312\370\345\220\177\000\000\212\023Q\000\000\000\000\000\267\300\376W\000\000\000\000\333\356O\000\322\025\000\000\311\363d\000\000\000\000\000h\315\370\345\220\177\000\000\000\353\065,\220\177\000\000p0Q\177\000\000 \313\370\345\220\177\000\000\000\000\000\000\000\000\000\000 \313\370\345\220\177\000\000\300\216\f,\220\177\000\000\300\216\f,"
        low_bitmap = 0x0
        tmp_bitmap = 0x7f902ca3aed0
        rc = 0
        has_xor = false
        feat_cnt = 0
        detail_ptr = 0x7f8fdc34e660
        preemptee_candidates = 0x0
        preemptee_job_list = 0x0
        feat_iter = 0x64f3c9
        feat_ptr = 0x7f90e5f8ca70
        __func__ = "_try_sched"
#10 0x00007f90e5f9521f in _attempt_backfill () at backfill.c:1479
        tv1 = {tv_sec = 1476737066, tv_usec = 132024}
        tv2 = {tv_sec = 1476737065, tv_usec = 432606}
        tv_str = "usec=2001634\000\000\000\000\000\000\000"
        delta_t = 2001634
        filter_root = true
        job_queue = 0x7f90e3e62840
        job_queue_rec = 0x0
        qos_ptr = 0x7f90e00054b0
        bb = 1
        i = -436679056
        j = 33
        node_space_recs = 44
        mcs_select = 0
        job_ptr = 0x7f8fdc513070
        part_ptr = 0x7f902c48a420
        bf_part_ptr = 0x0
        end_time = 1476953400
        end_reserve = 1477039800
        deadline_time_limit = 0
        time_limit = 1620
        comp_time_limit = 1620
        orig_time_limit = 1620
        part_time_limit = 5760
        min_nodes = 2
        max_nodes = 2
        req_nodes = 2
        active_bitmap = 0x0
        avail_bitmap = 0x7f902c35eb00
        exc_core_bitmap = 0x0
        resv_bitmap = 0x7f902c0c8ec0
        now = 1476737065
        sched_start = 1476737066
        later_start = 1476870600
        start_res = 1476856200
        resv_end = 0
        window_end = 1477082641
        orig_sched_start = 1476737041
        orig_start_time = 0
        node_space = 0x7f90382f6020
        bf_time1 = {tv_sec = 1476737041, tv_usec = 177590}
        bf_time2 = {tv_sec = 1476736861, tv_usec = 163944}
        rc = 0
        job_test_count = 1
        test_time_count = 0
        pend_time = 423794
        uid = 0x7f902c6099e0
        nuser = 47
        bf_parts = 0
        bf_part_jobs = 0x0
        njobs = 0x7f902cbbe040
        already_counted = true
        reject_array_job_id = 0
        reject_array_part = 0x0
        job_start_cnt = 0
        start_time = 1476942600
        config_update = 1476687614
        part_update = 1476736860
        start_tv = {tv_sec = 1476737066, tv_usec = 132024}
        test_array_job_id = 2239675
        test_array_count = 16
        acct_max_nodes = 4294967295
        wait_reason = 44
        job_no_reserve = 0
        resv_overlap = false
        save_share_res = 3 '\003'
        save_whole_node = 0 '\000'
        test_fini = -1
        __func__ = "_attempt_backfill"
        bf_max_user_msg = true
#11 0x00007f90e5f92fbe in backfill_agent (args=0x0) at backfill.c:764
        now = 1476737041
        wait_time = 180
        all_locks = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK}
        load_config = false
        short_sleep = false
        last_backfill_time = 1476736861
        __func__ = "backfill_agent"
#12 0x00007f90ec171806 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#13 0x00007f90ebecc9bd in clone () from /lib64/libc.so.6
No symbol table info available.
#14 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb)
Hello,

I don't see any evidence of the OOM killer running on the Slurm controller node. Just as an additional data point.

-Doug
Ah, I was looking at the wrong bitmap operation:

    overlap = bit_overlap(bitmap,
                          tmp_job_ptr->node_bitmap);

Could you send

    print *tmp_job_ptr

from frame 3?
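A sketch of the call-site shape implied by frame #4. The loop details here are illustrative assumptions, not the verbatim select_cons_res.c code:

    #include <stdint.h>
    #include <stddef.h>

    typedef int64_t bitstr_t;
    struct job_record { bitstr_t *node_bitmap; /* ... */ };

    extern int bit_overlap(bitstr_t *b1, bitstr_t *b2);  /* asserts b2 != NULL */

    /* Illustrative only: _will_run_test() walks job records and asks how
     * many of each job's allocated nodes overlap the candidate bitmap.
     * A job whose node_bitmap was already freed (node_bitmap = 0x0 in
     * the dump in comment 0) passes NULL as the second argument. */
    static void will_run_overlap_check(struct job_record **jobs,
                                       size_t njobs, bitstr_t *bitmap)
    {
        for (size_t i = 0; i < njobs; i++) {
            struct job_record *tmp_job_ptr = jobs[i];
            int overlap = bit_overlap(bitmap, tmp_job_ptr->node_bitmap);
            (void) overlap;  /* ... simulate job end, retry placement ... */
        }
    }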
See comment 0, I think I posted it there.
I see it Doug, sorry for missing it beforehand. Could you also print something like

    print *(select_jobinfo_t *)(job_ptr->select_jobinfo->data)

Hopefully in that frame it will give you something.
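Why the cast is needed, sketched under assumed layouts: select_jobinfo holds plugin-private data behind a void pointer, so gdb cannot interpret it without an explicit cast. The field names below are hypothetical illustrations, not the actual Slurm definitions:

    #include <stdint.h>

    /* Hypothetical layouts for illustration only. The generic job record
     * stores a wrapper whose void *data is meaningful only to the select
     * plugin (select/cray here), hence the explicit cast in gdb. */
    typedef struct {
        uint16_t cleaning;   /* hypothetical field: non-zero while NHC runs */
    } select_jobinfo_t;

    typedef struct {
        void *data;          /* plugin-private select_jobinfo_t */
        uint32_t plugin_id;
    } dynamic_plugin_data_t;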
I am not holding my breath on that print; I am guessing you will get an incomplete type.

I am fairly sure the job is in a cleaning state (NHC is running). I can easily see in the log that it is printing all those "has zero end_time" messages (fixed in 16.05.5). My guess is that during the requeue of this job the node_bitmap was freed, so the job shouldn't be in this check. We are discussing how to fix this correctly.

I wouldn't expect this to happen very often, but please let us know if it happens again. I'll let you know when we have a fix for this. I don't think we need much more from you now.
Created attachment 3597 [details]
Patch to prevent abort

Doug,

It's going to take some time to fully understand how the job got into this state and caused the abort, but this patch will at least prevent this abort from happening again. I'd like to propose that you use this patch for now, and we'll study the logs and code more on Tuesday. I'm hoping that we can provide you with something better then.
I now understand what happened, and the patch that you have is probably the best solution, although I'll comment it better in what I commit. Here are some key bits of the log with comments:

Job allocated resources:

    [2016-10-17T13:43:38.897] sched: Allocate JobID=2366862_75(2366942) NodeList=nid06105 #CPUs=2 Partition=shared

Job launch fails, apparently due to some problem with nid06105:

    [2016-10-17T13:43:41.368] job_complete: JobID=2366862_75(2366942) State=0x1 NodeCnt=1 WEXITSTATUS 0
    [2016-10-17T13:43:41.368] job_complete: requeue JobID=2366862_75(2366942) State=0x8000 NodeCnt=1 per user/system request
    [2016-10-17T13:43:41.368] job_complete: JobID=2366862_75(2366942) State=0x8000 NodeCnt=1 done

Job gets requeued, clearing end_time and node_bitmap (as desired):

    [2016-10-17T13:44:22.824] Requeuing JobID=2366862_75(2366942) State=0x0 NodeCnt=0

Many messages of this sort are generated by the backfill scheduler. These are due to NHC still running and will go away in Slurm version 16.05.5:

    [2016-10-17T13:44:23.447] error: Job 2366942 has zero end_time

Job is cancelled, setting its end_time to the current time:

    [2016-10-17T13:44:25.800] _job_signal: of pending JobID=2366862_75(2366942) State=0x4 NodeCnt=0 successful

It's the job cancel that leaves the job in this state:

    NHC is running
    end_time is non-zero
    node_bitmap is NULL

The logic aborts with this combination. Adding the check for a NULL node_bitmap fixes this.
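For reference, a minimal sketch of the kind of NULL-node_bitmap guard described above. This is an approximation of the idea, not the committed diff (see the commit linked below):

    #include <stdint.h>
    #include <stddef.h>

    typedef int64_t bitstr_t;
    struct job_record { bitstr_t *node_bitmap; };

    extern int bit_overlap(bitstr_t *b1, bitstr_t *b2);

    /* Sketch only: treat a job whose node_bitmap was cleared by a requeue
     * as having no overlap, instead of passing NULL into bit_overlap()
     * and tripping its assert. */
    static int overlap_or_zero(bitstr_t *bitmap, struct job_record *job)
    {
        if (job->node_bitmap == NULL)
            return 0;    /* requeued: no allocated nodes */
        return bit_overlap(bitmap, job->node_bitmap);
    }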
The patch that is attached is logically equivalent to what I have committed to Slurm version 16.05.6: https://github.com/SchedMD/slurm/commit/222746a65f7bd60e26330d8ed40248e54169c2d5