Hello,

At 13:44 today slurmctld aborted on edison in the backfill cycle. I'm attaching the logs and some preliminary analysis of the core file. Note that edison is still running 16.05.4 and can't be upgraded to 16.05.5 until we resolve bug #3154. We were able to restart slurmctld, so the problem does not appear to be ongoing, but there are some hallmarks of earlier crashes (e.g., bug #3082). I'm concerned it might recur, so I would appreciate your help.

(gdb) bt
#0  0x00007f90ebe20875 in raise () from /lib64/libc.so.6
#1  0x00007f90ebe21e51 in abort () from /lib64/libc.so.6
#2  0x00007f90ebe19740 in __assert_fail () from /lib64/libc.so.6
#3  0x00000000005131c8 in bit_overlap (b1=0x7f902c35eb00, b2=0x0) at bitstring.c:808
#4  0x00007f90e63a5652 in _will_run_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, job_node_req=64000, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cons_res.c:1882
#5  0x00007f90e63a643d in select_p_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cons_res.c:2248
#6  0x00007f90eb622f26 in other_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at other_select.c:277
#7  0x00007f90eb6210fa in select_p_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cray.c:1920
#8  0x000000000052ef1f in select_g_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at node_select.c:588
#9  0x00007f90e5f91fd8 in _try_sched (job_ptr=0x7f8fdc513070, avail_bitmap=0x7f90e5f8cd60, min_nodes=2, max_nodes=2, req_nodes=2, exc_core_bitmap=0x0) at backfill.c:417
#10 0x00007f90e5f9521f in _attempt_backfill () at backfill.c:1479
#11 0x00007f90e5f92fbe in backfill_agent (args=0x0) at backfill.c:764
#12 0x00007f90ec171806 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f90ebecc9bd in clone () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
(gdb) ...
(gdb) up
#3  0x00000000005131c8 in bit_overlap (b1=0x7f902c35eb00, b2=0x0) at bitstring.c:808
808	bitstring.c: No such file or directory.
(gdb) print b1
$1 = (bitstr_t *) 0x7f902c35eb00
(gdb) print b2
$2 = (bitstr_t *) 0x0
(gdb) ...
(gdb) print tmp_job_ptr
$3 = (struct job_record *) 0x7f90a0388820
(gdb) print tmp_job_ptr->node_bitmap
$4 = (bitstr_t *) 0x0
(gdb) set print pretty on
(gdb) print *tmp_job_ptr
$5 = {
  account = 0x7f90a00aede0 "m1248",
  alias_list = 0x0,
  alloc_node = 0x7f90a0388d50 "edison11",
  alloc_resp_port = 0,
  alloc_sid = 25574,
  array_job_id = 2366862,
  array_task_id = 75,
  array_recs = 0x0,
  assoc_id = 2849,
  assoc_ptr = 0x7f90e0ae12f0,
  batch_flag = 2,
  batch_host = 0x7f90a0466d30 "nid06105",
  bit_flags = 0,
  burst_buffer = 0x0,
  check_job = 0x0,
  ckpt_interval = 0,
  ckpt_time = 0,
  comment = 0x0,
  cpu_cnt = 0,
  billable_tres = 2,
  cr_enabled = 1,
  deadline = 0,
  db_index = 0,
  derived_ec = 0,
  details = 0x7f90a0388b60,
  direct_set_prio = 0,
  end_time = 1476737065,
  end_time_exp = 0,
  epilog_running = false,
  exit_code = 0,
  front_end_ptr = 0x0,
  gres = 0x7f90a073ed20 "craynetwork:0",
  gres_list = 0x0,
  gres_alloc = 0x7f90a0466ec0 "craynetwork:4",
  gres_req = 0x7f90a0466e60 "craynetwork:0",
  gres_used = 0x0,
  group_id = 62064,
  job_id = 2366942,
  job_next = 0x0,
  job_array_next_j = 0x7f90a0627ad0,
  job_array_next_t = 0x0,
  job_resrcs = 0x7f90a0425020,
  job_state = 4,
  kill_on_node_fail = 1,
  licenses = 0x7f90a073ed50 "project:1",
  license_list = 0x7f90e34a8520,
  limit_set = {
    qos = 0,
    time = 0,
    tres = 0x7f90a01a8140
  },
  mail_type = 0,
  mail_user = 0x0,
  magic = 4038539564,
  mcs_label = 0x0,
  name = 0x7f90a0388d80 "my_job",
  network = 0x0,
  next_step_id = 0,
  nodes = 0x0,
  node_addr = 0x7f90a0466ce0,
  node_bitmap = 0x0,
  node_bitmap_cg = 0x0,
  node_cnt = 0,
  node_cnt_wag = 1,
  nodes_completing = 0x0,
  other_port = 0,
  partition = 0x7f90a073edb0 "shared",
  part_ptr_list = 0x0,
  part_nodes_missing = false,
  part_ptr = 0x7f902c48ac30,
  power_flags = 0 '\000',
  pre_sus_time = 0,
  preempt_time = 0,
  preempt_in_progress = false,
  priority = 57242,
  priority_array = 0x0,
  prio_factors = 0x7f90a0388ce0,
  profile = 0,
  qos_id = 1,
  qos_ptr = 0x7f90e00048a0,
  reboot = 0 '\000',
  restart_cnt = 1,
  resize_time = 0,
  resv_id = 0,
  resv_name = 0x0,
  resv_ptr = 0x0,
  requid = 62064,
  resp_host = 0x0,
  sched_nodes = 0x0,
  select_jobinfo = 0x7f90a073edd0,
  spank_job_env = 0x7f90a059b6b0,
  spank_job_env_size = 1,
  start_protocol_ver = 7680,
  start_time = 1476737065,
  state_desc = 0x0,
  state_reason = 34,
  state_reason_prev = 34,
  step_list = 0x7f90e3035950,
  suspend_time = 0,
  time_last_active = 1476737021,
  time_limit = 2880,
  time_min = 0,
  tot_sus_time = 0,
  total_cpus = 0,
  total_nodes = 1,
  tres_req_cnt = 0x7f90a073ebe0,
  tres_req_str = 0x7f90a059b710 "1=1,2=1300,4=1",
  tres_fmt_req_str = 0x7f90a059b740 "cpu=1,mem=1300,node=1",
  tres_alloc_cnt = 0x7f90a0424dc0,
  tres_alloc_str = 0x7f90a0424d40 "1=2,2=2600,4=1",
  tres_fmt_alloc_str = 0x7f90a0466e00 "cpu=2,mem=2600M,node=1",
  user_id = 62064,
  wait_all_nodes = 0,
  warn_flags = 0,
  warn_signal = 0,
  warn_time = 0,
  wckey = 0x0,
  req_switch = 0,
  wait4switch = 0,
  best_switch = true,
  wait4switch_start = 0
}
(gdb)

-Doug
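For context on frame #3: the crash is an assertion failure inside bit_overlap(), not a stray dereference. A minimal sketch of the pattern, assuming bitstring.c validates both arguments up front (an assumption inferred from the backtrace, not the verbatim source):

    #include <assert.h>
    #include <stdint.h>

    typedef int64_t bitstr_t;

    /* Sketch only: with b1 = 0x7f902c35eb00 and b2 = 0x0 as printed
     * above, the second assert fails, producing the __assert_fail ->
     * abort -> raise frames #0-#2 of the backtrace. */
    int bit_overlap(bitstr_t *b1, bitstr_t *b2)
    {
        int count = 0;
        assert(b1 != NULL);
        assert(b2 != NULL);    /* the assert that fired */
        /* ... count the bits set in both bitmaps ... */
        return count;
    }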
Created attachment 3594 [details]
thread apply all bt output
Created attachment 3595 [details]
slurmctld log
Doug, do you know if there were any memory issues on the node? I am wondering if we ran out.

The assert happened on

    bit_or(bitmap, orig_map);

where orig_map was NULL.

    orig_map = bit_copy(bitmap);

where bitmap=0x7f902c35eb00.

bit_copy (meat):

    new = bit_alloc(newsize_bits);
    if (new)
        memcpy(&new[BITSTR_OVERHEAD], &b[BITSTR_OVERHEAD], len);

If bit_alloc returned NULL here, though, I would expect a nice OOM message in the log, as that is what the code is supposed to do inside bit_alloc. As that doesn't exist, I am not sure how it got to be NULL. Other variables look good, or I would expect memory corruption.

Could you send 'bt full' on the thread provided in comment 0?

It appears there are a lot of threads waiting, presumably because backfill had just given up the locks and they are now going through them. Nothing that appears too concerning.
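For reference, the expected bit_alloc behavior described above, as a minimal C sketch. This is an approximation under the assumption that bit_alloc logs before returning NULL; it is not the verbatim bitstring.c source:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    typedef int64_t bitstr_t;
    #define BITSTR_OVERHEAD 2

    /* Sketch only: bit_alloc() is expected to log an out-of-memory
     * error before returning NULL, which is why a silent NULL here
     * (no OOM message in the slurmctld log) is hard to explain. */
    bitstr_t *bit_alloc(size_t nbits)
    {
        size_t words = BITSTR_OVERHEAD + (nbits + 63) / 64;
        bitstr_t *new_bits = calloc(words, sizeof(bitstr_t));
        if (!new_bits)
            fprintf(stderr, "bit_alloc: malloc failed\n");
        return new_bits;
    }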
(gdb) bt full
#0  0x00007f90ebe20875 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f90ebe21e51 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f90ebe19740 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3  0x00000000005131c8 in bit_overlap (b1=0x7f902c35eb00, b2=0x0) at bitstring.c:808
        count = 0
        bit = 0
        bit_cnt = 32656
        word_size = 32
        __PRETTY_FUNCTION__ = "bit_overlap"
#4  0x00007f90e63a5652 in _will_run_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, job_node_req=64000, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cons_res.c:1882
        first_job_ptr = 0x0
        next_job_ptr = 0x0
        overlap = 0
        last_job_ptr = 0x0
        rm_job_cnt = 0
        time_window = 0
        more_jobs = true
        future_part = 0x7f902c4ad9e0
        future_usage = 0x7f902cbbe850
        tmp_job_ptr = 0x7f90a0388820
        cr_job_list = 0x2cc8f30
        job_iterator = 0x8c3d20
        preemptee_iterator = 0x5804d651
        orig_map = 0x7f902cb19280
        action = 0
        rc = -1
        now = 1476737066
        tmp_cr_type = 50
        qos_preemptor = false
#5  0x00007f90e63a643d in select_p_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cons_res.c:2248
        rc = 22
        job_node_req = 64000
        debug_cpu_bind = false
        debug_check = true
#6  0x00007f90eb622f26 in other_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at other_select.c:277
No locals.
#7  0x00007f90eb6210fa in select_p_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at select_cray.c:1920
        jobinfo = 0x7f8fdc60e6c0
#8  0x000000000052ef1f in select_g_job_test (job_ptr=0x7f8fdc513070, bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at node_select.c:588
No locals.
#9  0x00007f90e5f91fd8 in _try_sched (job_ptr=0x7f8fdc513070, avail_bitmap=0x7f90e5f8cd60, min_nodes=2, max_nodes=2, req_nodes=2, exc_core_bitmap=0x0) at backfill.c:417
        orig_shared = 0
        now = 1476737066
        str = "\340\312\370\345\220\177\000\000\212\023Q\000\000\000\000\000\267\300\376W\000\000\000\000\333\356O\000\322\025\000\000\311\363d\000\000\000\000\000h\315\370\345\220\177\000\000\000\353\065,\220\177\000\000p0Q\177\000\000 \313\370\345\220\177\000\000\000\000\000\000\000\000\000\000 \313\370\345\220\177\000\000\300\216\f,\220\177\000\000\300\216\f,"
        low_bitmap = 0x0
        tmp_bitmap = 0x7f902ca3aed0
        rc = 0
        has_xor = false
        feat_cnt = 0
        detail_ptr = 0x7f8fdc34e660
        preemptee_candidates = 0x0
        preemptee_job_list = 0x0
        feat_iter = 0x64f3c9
        feat_ptr = 0x7f90e5f8ca70
        __func__ = "_try_sched"
#10 0x00007f90e5f9521f in _attempt_backfill () at backfill.c:1479
        tv1 = {tv_sec = 1476737066, tv_usec = 132024}
        tv2 = {tv_sec = 1476737065, tv_usec = 432606}
        tv_str = "usec=2001634\000\000\000\000\000\000\000"
        delta_t = 2001634
        filter_root = true
        job_queue = 0x7f90e3e62840
        job_queue_rec = 0x0
        qos_ptr = 0x7f90e00054b0
        bb = 1
        i = -436679056
        j = 33
        node_space_recs = 44
        mcs_select = 0
        job_ptr = 0x7f8fdc513070
        part_ptr = 0x7f902c48a420
        bf_part_ptr = 0x0
        end_time = 1476953400
        end_reserve = 1477039800
        deadline_time_limit = 0
        time_limit = 1620
        comp_time_limit = 1620
        orig_time_limit = 1620
        part_time_limit = 5760
        min_nodes = 2
        max_nodes = 2
        req_nodes = 2
        active_bitmap = 0x0
        avail_bitmap = 0x7f902c35eb00
        exc_core_bitmap = 0x0
        resv_bitmap = 0x7f902c0c8ec0
        now = 1476737065
        sched_start = 1476737066
        later_start = 1476870600
        start_res = 1476856200
        resv_end = 0
        window_end = 1477082641
        orig_sched_start = 1476737041
        orig_start_time = 0
        node_space = 0x7f90382f6020
        bf_time1 = {tv_sec = 1476737041, tv_usec = 177590}
        bf_time2 = {tv_sec = 1476736861, tv_usec = 163944}
        rc = 0
        job_test_count = 1
        test_time_count = 0
        pend_time = 423794
        uid = 0x7f902c6099e0
        nuser = 47
        bf_parts = 0
        bf_part_jobs = 0x0
        njobs = 0x7f902cbbe040
        already_counted = true
        reject_array_job_id = 0
        reject_array_part = 0x0
        job_start_cnt = 0
        start_time = 1476942600
        config_update = 1476687614
        part_update = 1476736860
        start_tv = {tv_sec = 1476737066, tv_usec = 132024}
        test_array_job_id = 2239675
        test_array_count = 16
        acct_max_nodes = 4294967295
        wait_reason = 44
        job_no_reserve = 0
        resv_overlap = false
        save_share_res = 3 '\003'
        save_whole_node = 0 '\000'
        test_fini = -1
        __func__ = "_attempt_backfill"
        bf_max_user_msg = true
#11 0x00007f90e5f92fbe in backfill_agent (args=0x0) at backfill.c:764
        now = 1476737041
        wait_time = 180
        all_locks = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK}
        load_config = false
        short_sleep = false
        last_backfill_time = 1476736861
        __func__ = "backfill_agent"
#12 0x00007f90ec171806 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#13 0x00007f90ebecc9bd in clone () from /lib64/libc.so.6
No symbol table info available.
#14 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb)
Hello,

I don't see any evidence of the OOM killer running on the Slurm controller node. Just as an additional data point.

-Doug
Ah, I was looking at the wrong bitmap operation:

    overlap = bit_overlap(bitmap,
                          tmp_job_ptr->node_bitmap);

Could you send

    print *tmp_job_ptr

from frame 3?
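A sketch of the call-site shape implied by frame #4. The loop details here are illustrative assumptions, not the verbatim select_cons_res.c code:

    #include <stdint.h>
    #include <stddef.h>

    typedef int64_t bitstr_t;
    struct job_record { bitstr_t *node_bitmap; /* ... */ };

    extern int bit_overlap(bitstr_t *b1, bitstr_t *b2);  /* asserts b2 != NULL */

    /* Illustrative only: _will_run_test() walks job records and asks how
     * many of each job's allocated nodes overlap the candidate bitmap.
     * A job whose node_bitmap was already freed (node_bitmap = 0x0 in
     * the dump in comment 0) passes NULL as the second argument. */
    static void will_run_overlap_check(struct job_record **jobs,
                                       size_t njobs, bitstr_t *bitmap)
    {
        for (size_t i = 0; i < njobs; i++) {
            struct job_record *tmp_job_ptr = jobs[i];
            int overlap = bit_overlap(bitmap, tmp_job_ptr->node_bitmap);
            (void) overlap;  /* ... simulate job end, retry placement ... */
        }
    }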
See comment 0, I think I posted it there.
I see it Doug, sorry for missing it beforehand. Could you also print something like

    print *(select_jobinfo_t *)(job_ptr->select_jobinfo->data)

Hopefully in that frame it will give you something.
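Why the cast is needed, sketched under assumed layouts: select_jobinfo holds plugin-private data behind a void pointer, so gdb cannot interpret it without an explicit cast. The field names below are hypothetical illustrations, not the actual Slurm definitions:

    #include <stdint.h>

    /* Hypothetical layouts for illustration only. The generic job record
     * stores a wrapper whose void *data is meaningful only to the select
     * plugin (select/cray here), hence the explicit cast in gdb. */
    typedef struct {
        uint16_t cleaning;   /* hypothetical field: non-zero while NHC runs */
    } select_jobinfo_t;

    typedef struct {
        void *data;          /* plugin-private select_jobinfo_t */
        uint32_t plugin_id;
    } dynamic_plugin_data_t;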
I am not holding my breath on that print; I am guessing you will get an incomplete type.

I am fairly sure the job is in a cleaning state (NHC is running). I can easily see in the log that it is printing all those "has zero end_time" messages (fixed in 16.05.5). My guess is that during the requeue of this job the node_bitmap was freed, so the job shouldn't be in this check. We are discussing how to fix this correctly.

I wouldn't expect this to happen very often, but please let us know if it happens again. I'll let you know when we have a fix for this. I don't think we need much more from you now.
Created attachment 3597 [details]
Patch to prevent abort

Doug,

It's going to take some time to fully understand how the job got into this state and caused the abort, but this patch will at least prevent this abort from happening again. I'd like to propose that you use this patch for now, and we'll study the logs and code more on Tuesday. I'm hoping that we can provide you with something better then.
I now understand what happened, and the patch that you have is probably the best solution, although I'll comment it better in what I commit. Here are some key bits of the log with comments:

Job allocated resources:

    [2016-10-17T13:43:38.897] sched: Allocate JobID=2366862_75(2366942) NodeList=nid06105 #CPUs=2 Partition=shared

Job launch fails, apparently due to some problem with nid06105:

    [2016-10-17T13:43:41.368] job_complete: JobID=2366862_75(2366942) State=0x1 NodeCnt=1 WEXITSTATUS 0
    [2016-10-17T13:43:41.368] job_complete: requeue JobID=2366862_75(2366942) State=0x8000 NodeCnt=1 per user/system request
    [2016-10-17T13:43:41.368] job_complete: JobID=2366862_75(2366942) State=0x8000 NodeCnt=1 done

Job gets requeued, clearing end_time and node_bitmap (as desired):

    [2016-10-17T13:44:22.824] Requeuing JobID=2366862_75(2366942) State=0x0 NodeCnt=0

Many messages of this sort are generated by the backfill scheduler. These are due to NHC still running and will go away in Slurm version 16.05.5:

    [2016-10-17T13:44:23.447] error: Job 2366942 has zero end_time

Job is cancelled, setting its end_time to the current time:

    [2016-10-17T13:44:25.800] _job_signal: of pending JobID=2366862_75(2366942) State=0x4 NodeCnt=0 successful

It's the job cancel that leaves the job in this state:

    NHC is running
    end_time is non-zero
    node_bitmap is NULL

The logic aborts with this combination. Adding the check for a NULL node_bitmap fixes this.
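For reference, a minimal sketch of the kind of NULL-node_bitmap guard described above. This is an approximation of the idea, not the committed diff (see the commit linked below):

    #include <stdint.h>
    #include <stddef.h>

    typedef int64_t bitstr_t;
    struct job_record { bitstr_t *node_bitmap; };

    extern int bit_overlap(bitstr_t *b1, bitstr_t *b2);

    /* Sketch only: treat a job whose node_bitmap was cleared by a requeue
     * as having no overlap, instead of passing NULL into bit_overlap()
     * and tripping its assert. */
    static int overlap_or_zero(bitstr_t *bitmap, struct job_record *job)
    {
        if (job->node_bitmap == NULL)
            return 0;    /* requeued: no allocated nodes */
        return bit_overlap(bitmap, job->node_bitmap);
    }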
The patch that is attached is logically equivalent to what I have committed to Slurm version 16.05.6: https://github.com/SchedMD/slurm/commit/222746a65f7bd60e26330d8ed40248e54169c2d5