Hello,

We had another slurmctld crash on edison this morning, now with 16.05.5 (updated Thursday morning, just before the 16.05.6 announcement). I'll attach the logs shortly. This looks to be related to bug 3185. Our 16.05.5 does include the patch provided in 3185, but we still got this crash. We have an unrelated system maintenance tomorrow, so I will have an opportunity to upgrade edison from 16.05.5 to 16.05.6.

-Doug

Program terminated with signal 6, Aborted.
#0  0x00007f7bc12e8875 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f7bc12e8875 in raise () from /lib64/libc.so.6
#1  0x00007f7bc12e9e51 in abort () from /lib64/libc.so.6
#2  0x00007f7bc12e1740 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000512bb0 in bit_ffs (b=0x0) at bitstring.c:484
#4  0x00007f7bbb869ecc in _build_row_bitmaps (p_ptr=0x7f7b5415c680, job_ptr=0x4080900) at select_cons_res.c:655
#5  0x00007f7bbb86bdfa in _rm_job_from_res (part_record_ptr=0x7f7b5412a510, node_usage=0x7f7b547f5f70, job_ptr=0x4080900, action=0) at select_cons_res.c:1273
#6  0x00007f7bbb86e984 in select_p_job_fini (job_ptr=0x4080900) at select_cons_res.c:2397
#7  0x00007f7bc0aeb11b in other_job_fini (job_ptr=0x4080900) at other_select.c:386
#8  0x00007f7bc0ae69b6 in _job_fini (args=0x4080900) at select_cray.c:1083
#9  0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f7bc13949bd in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) set print pretty
(gdb) up
#1  0x00007f7bc12e9e51 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x00007f7bc12e1740 in __assert_fail () from /lib64/libc.so.6
(gdb) up
#3  0x0000000000512bb0 in bit_ffs (b=0x0) at bitstring.c:484
484     bitstring.c: No such file or directory.
(gdb) up
#4  0x00007f7bbb869ecc in _build_row_bitmaps (p_ptr=0x7f7b5415c680, job_ptr=0x4080900) at select_cons_res.c:655
655     select_cons_res.c: No such file or directory.
(gdb) print *job_ptr $1 = { account = 0x2ddc440 "m1820", alias_list = 0x0, alloc_node = 0x46dee20 "edison12", alloc_resp_port = 0, alloc_sid = 4664, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 9101, assoc_ptr = 0x137dbc0, batch_flag = 1, batch_host = 0x7f7b5c112da0 "nid06138", bit_flags = 0, burst_buffer = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 0, billable_tres = 2, cr_enabled = 1, deadline = 0, db_index = 13617048, derived_ec = 0, details = 0x3bc28b0, direct_set_prio = 0, end_time = 1478015968, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, front_end_ptr = 0x0, gres = 0x2dbc700 "craynetwork:0", gres_list = 0x0, gres_alloc = 0x7f7b5c141f30 "craynetwork:4", ---Type <return> to continue, or q <return> to quit--- gres_req = 0x7f7b5c112ca0 "craynetwork:0", gres_used = 0x0, group_id = 70270, job_id = 2546456, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7f7b5c1fa8d0, job_state = 3, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set = { qos = 0, time = 0, tres = 0x45fa4f0 }, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x3be1170 "runjob.sh", network = 0x0, next_step_id = 0, nodes = 0x7f7b5c112b70 "nid06138", node_addr = 0x7f7b5c1e4ad0, node_bitmap = 0x7f7b5c19d190, node_bitmap_cg = 0x3f10580, node_cnt = 0, node_cnt_wag = 1, nodes_completing = 0x0, other_port = 0, partition = 0x2722060 "shared", part_ptr_list = 0x0, part_nodes_missing = false, ---Type <return> to continue, or q <return> to quit--- part_ptr = 0x7f7b544441b0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 56933, priority_array = 0x0, prio_factors = 0x4116860, profile = 0, qos_id = 1, qos_ptr = 0x8f71d0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x34e8e30, 
spank_job_env = 0x4973f30, spank_job_env_size = 1, start_protocol_ver = 7680, start_time = 1478001339, state_desc = 0x0, state_reason = 0, state_reason_prev = 0, step_list = 0x42a5ff0, suspend_time = 0, time_last_active = 1478015968, time_limit = 360, time_min = 0, tot_sus_time = 1, total_cpus = 2, ---Type <return> to continue, or q <return> to quit--- total_nodes = 1, tres_req_cnt = 0x39555a0, tres_req_str = 0x8bef40 "1=1,2=1300,4=1", tres_fmt_req_str = 0x3543b10 "cpu=1,mem=1300,node=1", tres_alloc_cnt = 0x7f7b5c171a30, tres_alloc_str = 0x7f7b5c112df0 "1=2,2=2600,4=1", tres_fmt_alloc_str = 0x7f7b5c141ed0 "cpu=2,mem=2600M,node=1", user_id = 70270, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0 } (gdb) thread apply all bt Thread 31 (Thread 0x7f7bb1edf700 (LWP 29557)): #0 0x00007f7bc138b4d6 in poll () from /lib64/libc.so.6 #1 0x000000000057fa8e in slurm_recv_timeout (fd=24, buffer=0x7f7bb1edebac "{\177<", size=4, flags=0, timeout=60000) at slurm_protocol_socket_implementation.c:360 #2 0x000000000057f517 in slurm_msg_recvfrom_timeout (fd=24, pbuf=0x7f7bb1edec18, lenp=0x7f7bb1edec10, flags=0, tmout=60000) at slurm_protocol_socket_implementation.c:160 #3 0x0000000000541d6a in slurm_receive_msgs (fd=24, steps=0, timeout=60000) at slurm_protocol_api.c:3275 #4 0x0000000000543894 in _send_and_recv_msgs (fd=24, req=0x7f7bb1edee50, timeout=60000) at slurm_protocol_api.c:4099 #5 0x0000000000543f7b in slurm_send_addr_recv_msgs (msg=0x7f7bb1edee50, name=0x267b7c0 "nid01793", timeout=60000) at slurm_protocol_api.c:4377 #6 0x0000000000502289 in _fwd_tree_thread (arg=0x7f7b5435bfe0) at forward.c:398 #7 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #8 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #9 0x0000000000000000 in ?? 
() Thread 30 (Thread 0x7f7bb14d5700 (LWP 28213)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb14d4e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x3892b70) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 29 (Thread 0x7f7bb23e4700 (LWP 28305)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb23e3e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x3d9adc0) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 28 (Thread 0x7f7ba02c2700 (LWP 31261)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bc0ae6802 in _wait_job_completed (job_id=2548627, job_ptr=0x7f7ba83a2780) at select_cray.c:1036 #3 0x00007f7bc0ae6959 in _job_fini (args=0x7f7ba83a2780) at select_cray.c:1071 ---Type <return> to continue, or q <return> to quit--- #4 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? () Thread 27 (Thread 0x7f7bb25e6700 (LWP 27013)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb25e5e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x3b19950) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 26 (Thread 0x7f7bb11d2700 (LWP 29538)): #0 0x00007f7bc163d66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000000000502fb6 in start_msg_tree (hl=0x4cdbf80, msg=0x7f7bb11d1e40, timeout=0) at forward.c:727 #2 0x0000000000543dac in slurm_send_recv_msgs (nodelist=0x340b720 "nid0[0245,0614,1234,1322,1718,1793,3671,3799,4119,4230,4301,4551,5352,5453,5622-5639,6115]", msg=0x7f7bb11d1e40, timeout=0, quiet=true) at slurm_protocol_api.c:4327 #3 0x000000000043ed05 in _thread_per_group_rpc (args=0x24c9540) at agent.c:902 #4 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? () Thread 25 (Thread 0x7f7bb28e9700 (LWP 29537)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc138e574 in usleep () from /lib64/libc.so.6 #2 0x000000000043e03c in _wdog (args=0x3cc43f0) at agent.c:586 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 24 (Thread 0x7f7ba2ded700 (LWP 27984)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7ba2dece70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x7f7b4c133aa0) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 23 (Thread 0x7f7bb30f0700 (LWP 28295)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb30efe70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x7f7b4c132d30) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 22 (Thread 0x7f7bb09c9700 (LWP 27445)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb09c8e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x7f7ba8176160) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 21 (Thread 0x7f7bb2fef700 (LWP 29536)): #0 0x00007f7bc1639fe5 in pthread_join () from /lib64/libpthread.so.0 #1 0x000000000043d57d in agent (args=0x7f7b8c180e10) at agent.c:338 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 20 (Thread 0x7f7bb39f9700 (LWP 28311)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb39f8e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x413fc30) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 19 (Thread 0x7f7bb85c6700 (LWP 20601)): ---Type <return> to continue, or q <return> to quit--- #0 0x00007f7bc1641527 in do_sigwait () from /lib64/libpthread.so.0 #1 0x00007f7bc16415cd in sigwait () from /lib64/libpthread.so.0 #2 0x00000000004459ba in _slurmctld_signal_hand (no_data=0x0) at controller.c:876 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 18 (Thread 0x7f7bbc297700 (LWP 20524)): #0 0x00007f7bc1639fe5 in pthread_join () from /lib64/libpthread.so.0 #1 0x00007f7bbc39cc6b in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:443 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 17 (Thread 0x7f7b83bfb700 (LWP 31355)): #0 0x00007f7bc138b4d6 in poll () from /lib64/libc.so.6 #1 0x000000000057fa8e in slurm_recv_timeout (fd=4, buffer=0x7f7b83bfad5c "", size=4, flags=0, timeout=60000) at slurm_protocol_socket_implementation.c:360 #2 0x000000000057f517 in slurm_msg_recvfrom_timeout (fd=4, pbuf=0x7f7b83bfadc0, lenp=0x7f7b83bfadb8, flags=0, tmout=60000) at slurm_protocol_socket_implementation.c:160 #3 0x0000000000541723 in slurm_receive_msg (fd=4, msg=0x26c0380, timeout=60000) at slurm_protocol_api.c:3096 #4 0x0000000000446461 in _service_connection (arg=0x45cc930) at controller.c:1104 #5 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #6 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #7 0x0000000000000000 in ?? () Thread 16 (Thread 0x7f7bb84c5700 (LWP 20602)): #0 0x00007f7bc163d9fc in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004d1ad5 in slurmctld_state_save (no_data=0x0) at state_save.c:205 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 15 (Thread 0x7f7bb1dde700 (LWP 27877)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb1ddde70) at select_cray.c:320 ---Type <return> to continue, or q <return> to quit--- #2 0x00007f7bc0ae6962 in _job_fini (args=0x3c74580) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 14 (Thread 0x7f7bb35f5700 (LWP 31350)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bc0ae6802 in _wait_job_completed (job_id=2548590, job_ptr=0x46b2570) at select_cray.c:1036 #3 0x00007f7bc0ae6959 in _job_fini (args=0x46b2570) at select_cray.c:1071 #4 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? () Thread 13 (Thread 0x7f7bb15d6700 (LWP 27216)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb15d5e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x498b590) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 12 (Thread 0x7f7bb89ca700 (LWP 20599)): #0 0x00007f7bc1639fe5 in pthread_join () from /lib64/libpthread.so.0 #1 0x00007f7bb8ad2c3e in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1425 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 11 (Thread 0x7f7b7d191700 (LWP 30950)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7b7d190e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x3ebaba0) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- #5 0x0000000000000000 in ?? 
() Thread 10 (Thread 0x7f7bb8fdb700 (LWP 20539)): #0 0x00007f7bc138b4d6 in poll () from /lib64/libc.so.6 #1 0x00007f7bc0ae506b in _aeld_event_loop (args=0x0) at select_cray.c:522 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 9 (Thread 0x7f7ba24e4700 (LWP 31144)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bc0ae6802 in _wait_job_completed (job_id=2548559, job_ptr=0x412cdb0) at select_cray.c:1036 #3 0x00007f7bc0ae6959 in _job_fini (args=0x412cdb0) at select_cray.c:1071 #4 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? () Thread 8 (Thread 0x7f7bbb455700 (LWP 20529)): #0 0x00007f7bc163d9fc in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f7bbb45a580 in _my_sleep (usec=180000000) at backfill.c:508 #2 0x00007f7bbb45b13f in backfill_agent (args=0x0) at backfill.c:798 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 7 (Thread 0x7f7bbbd85700 (LWP 20525)): #0 0x00007f7bc163d9fc in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000005b1615 in _agent (x=0x0) at slurmdbd_defs.c:2137 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? 
() Thread 6 (Thread 0x7f7bb86c7700 (LWP 20600)): #0 0x00007f7bc138dca3 in select () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- #1 0x0000000000446051 in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:1014 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 5 (Thread 0x7f7bbc398700 (LWP 20523)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bbc39cc49 in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:435 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 4 (Thread 0x7f7bb8acb700 (LWP 20598)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bb8ad2b09 in _decay_thread (no_data=0x0) at priority_multifactor.c:1371 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 3 (Thread 0x7f7bbc6a5700 (LWP 20522)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bbd12a4f4 in _lease_extender (args=0x0) at cookies.c:350 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 2 (Thread 0x7f7bc1bdb700 (LWP 20520)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc138e574 in usleep () from /lib64/libc.so.6 #2 0x0000000000447a73 in _slurmctld_background (no_data=0x0) at controller.c:1713 #3 0x0000000000445379 in main (argc=1, argv=0x7fff7e89cb98) at controller.c:605 ---Type <return> to continue, or q <return> to quit--- Thread 1 (Thread 0x7f7bb22e3700 (LWP 24099)): #0 0x00007f7bc12e8875 in raise () from /lib64/libc.so.6 #1 0x00007f7bc12e9e51 in abort () from /lib64/libc.so.6 #2 0x00007f7bc12e1740 in __assert_fail () from /lib64/libc.so.6 #3 0x0000000000512bb0 in bit_ffs (b=0x0) at bitstring.c:484 #4 0x00007f7bbb869ecc in _build_row_bitmaps (p_ptr=0x7f7b5415c680, job_ptr=0x4080900) at select_cons_res.c:655 #5 0x00007f7bbb86bdfa in _rm_job_from_res (part_record_ptr=0x7f7b5412a510, node_usage=0x7f7b547f5f70, job_ptr=0x4080900, action=0) at select_cons_res.c:1273 #6 0x00007f7bbb86e984 in select_p_job_fini (job_ptr=0x4080900) at select_cons_res.c:2397 #7 0x00007f7bc0aeb11b in other_job_fini (job_ptr=0x4080900) at other_select.c:386 #8 0x00007f7bc0ae69b6 in _job_fini (args=0x4080900) at select_cray.c:1083 #9 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #10 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #11 0x0000000000000000 in ?? () (gdb) (gdb)
Created attachment 3667 [details] slurmctld log leading up to the crash
Doug, is this log from 10/17?

I believe, though, that the problem may actually be related to Bug 3211. Commit 9c0a2f2bd9b9 might fix this.

From frame 4, could you send the output of

p *ss[x].tmpjobs
Created attachment 3668 [details] slurmctld log leading up to the crash
Oh shoot, I must have pulled down the wrong log. I'll check on that. I can't read bug #3211.

(gdb) p *ss[x].tmpjobs
$1 = {
  core_bitmap = 0x0,
  core_bitmap_used = 0x0,
  cpu_array_cnt = 1,
  cpu_array_value = 0x0,
  cpu_array_reps = 0x0,
  cpus = 0x0,
  cpus_used = 0x0,
  cores_per_socket = 0x0,
  memory_allocated = 0x0,
  memory_used = 0x0,
  nhosts = 1,
  node_bitmap = 0x0,
  node_req = 0,
  nodes = 0x0,
  ncpus = 1,
  sock_core_rep_count = 0x0,
  sockets_per_node = 0x0,
  whole_node = 0 '\000'
}
(gdb) print x
$2 = 435
(gdb)
Hmm, still only through the 17th :). Last 2 lines of both files:

[2016-10-17T14:43:40.601] debug: sched: already tested 402 jobs, breaking out
[2016-10-17T14:43:40.614] debug: sched: Running job scheduler

I've made bug 3211 readable now. Though that bug deals with gang scheduling, I believe it is related to this one as well. If you didn't have any jobs trying to complete on unresponsive nodes, then this might not fix the issue.
Created attachment 3669 [details] kept uploading the wrong path, stupid Documents folder
Third time's the charm!

Could you send me the slurmd log from nid03799?

I am sure you are aware that a host of nodes are running with a different slurm.conf according to the slurmctld, so I will only mention that once.
hmmm. The job_ptr indicates it was on nid06138, which is still up.

    next_step_id = 0,
    nodes = 0x7f7b5c112b70 "nid06138",
    node_addr = 0x7f7b5c1e4ad0,

nid02305:/var/tmp/slurm # sinfo --node=nid06138
PARTITION AVAIL  JOB_SIZE  TIMELIMIT   CPUS  S:C:T  NODES STATE NODELIST
system    up     1-infini  30:00       48    2:12:2     1 idle  nid06138
debug*    up     1-infini  30:00       0     0:0:0      0 n/a
regular   up     1-infini  4-00:00:00  0     0:0:0      0 n/a
regularx  up     1-infini  2-00:00:00  0     0:0:0      0 n/a
realtime  up     1-4       12:00:00    48    2:12:2     1 idle  nid06138
shared    up     1         2-00:00:00  48    2:12:2     1 idle  nid06138
special   up     1-infini  10-00:00:0  48    2:12:2     1 idle  nid06138
nid02305:/var/tmp/slurm #

boot:~ # ssh nid06138
Welcome to the initramfs
# /sbin/chroot /dsl uptime
 15:15pm up 12 days 7:03, 0 users, load average: 0.30, 0.30, 0.30
#

Guess I don't know how to use sacctmgr to see events:

nid01605:~ # sacctmgr show event where nodes=nid06138
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
nid01605:~ # sacctmgr show event where nodes=nid06138 cluster=edison
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
nid01605:~ #

but:

mysql> select * from edison_event_table where node_name="nid06138" order by time_start limit 5;
+------------+------------+-----------+---------------+----------------------------+------------+-------+--------------+
| time_start | time_end   | node_name | cluster_nodes | reason                     | reason_uid | state | tres         |
+------------+------------+-----------+---------------+----------------------------+------------+-------+--------------+
| 1452239802 | 1452239956 | nid06138  |               | Not responding             |          0 | 34849 | 1=48,2=64523 |
| 1452239956 | 1452240176 | nid06138  |               | Not responding             |          0 |  2049 | 1=48,2=64523 |
| 1453482650 | 1453573226 | nid06138  |               | Not responding             |          0 |  2049 | 1=48,2=64523 |
| 1453573226 | 1453574719 | nid06138  |               | Node unexpectedly rebooted |          0 |     1 | 1=48,2=64523 |
| 1453918943 | 1453919123 | nid06138  |               | Not responding             |          0 |  2049 | 1=48,2=64523 |
+------------+------------+-----------+---------------+----------------------------+------------+-------+--------------+
5 rows in set (0.00 sec)

mysql>

The last "event" on this node was completed (time_end non-zero). The slurmctld log indicates it was actively starting jobs on that node immediately before the crash:

[2016-11-01T09:02:08.109] sched: Allocate JobID=2548645 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.112] sched: Allocate JobID=2548646 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.115] sched: Allocate JobID=2548647 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.155] sched: Allocate JobID=2548648 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.157] sched: Allocate JobID=2548649 NodeList=nid06138 #CPUs=2 Partition=shared

That user has lots of jobs, though not many super-short ones, and nid06138 was definitely active for them.

edique01:~ # sacct -u heikki --format=job,start,end,elapsed,exit,nodelist -X
       JobID               Start                 End    Elapsed ExitCode        NodeList
------------ ------------------- ------------------- ---------- -------- ---------------
2544790      2016-10-31T23:33:32 2016-11-01T01:33:00   01:59:28      0:0        nid06135
2544791      2016-10-31T23:33:32 2016-11-01T04:30:49   04:57:17      0:0        nid06132
...
2548729      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548730      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548731      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548732      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548733      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548734      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548735      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06105
2548736      2016-11-01T09:02:08 2016-11-01T09:22:21   00:20:13      0:0        nid06105
2548737      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06105
2548818      2016-11-01T09:54:21 2016-11-01T10:04:41   00:10:20      0:0        nid06143
2548820      2016-11-01T09:54:21 2016-11-01T10:04:15   00:09:54      0:0        nid06142
2548822      2016-11-01T09:54:21 2016-11-01T10:04:20   00:09:59      0:0        nid06142
2548823      2016-11-01T09:54:21 2016-11-01T10:05:57   00:11:36      0:0        nid06130
2548824      2016-11-01T09:54:21 2016-11-01T10:05:51   00:11:30      0:0        nid06130
2548825      2016-11-01T09:54:21 2016-11-01T10:05:20   00:10:59      0:0        nid06130
2548826      2016-11-01T09:54:21 2016-11-01T10:05:20   00:10:59      0:0        nid06130
2548827      2016-11-01T09:54:21 2016-11-01T10:04:49   00:10:28      0:0        nid06092
2548828      2016-11-01T09:54:21 2016-11-01T10:04:51   00:10:30      0:0        nid06092
2548829      2016-11-01T09:54:21 2016-11-01T10:04:46   00:10:25      0:0        nid06092
2548830      2016-11-01T09:54:21 2016-11-01T10:04:41   00:10:20      0:0        nid06092
2548831      2016-11-01T09:54:21 2016-11-01T10:04:41   00:10:20      0:0        nid06092
2548832      2016-11-01T09:54:21 2016-11-01T10:04:42   00:10:21      0:0        nid06092
...
A little bit of weird missing data here from a few hours after the crash:

2551128      2016-11-01T11:49:32 2016-11-01T11:59:10   00:09:38      0:0        nid06116
2551129      2016-11-01T11:49:32 2016-11-01T11:59:23   00:09:51      0:0        nid06116
2551130      2016-11-01T11:49:32 2016-11-01T11:59:27   00:09:55      0:0        nid06116
2551131      2016-11-01T11:49:32 2016-11-01T11:59:26   00:09:54      0:0        nid06116
2551132      2016-11-01T11:49:32 2016-11-01T11:59:32   00:10:00      0:0        nid06116
2551133      2016-11-01T11:49:32 2016-11-01T11:59:17   00:09:45      0:0        nid06116
2551387      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551388      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551389      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551390      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551391      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551392      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551393      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551394      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551395      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551396      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551397      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551660      2016-11-01T12:24:35 2016-11-01T12:33:38   00:09:03      0:0        nid06116
2551661      2016-11-01T12:24:35 2016-11-01T12:33:55   00:09:20      0:0        nid06142
2551662      2016-11-01T12:24:35 2016-11-01T12:35:27   00:10:52      0:0        nid06138
2551663      2016-11-01T12:24:35 2016-11-01T12:35:24   00:10:49      0:0        nid06138
...

... and so on.
Ah, a start filter on that sacctmgr show event. That node did crash immediately before. Sadly the slurmd logs are in memory, and the node is gone:

boot:~ # ssh nid03799
ssh: connect to host nid03799 port 22: Network is unreachable
boot:~ # sinfo --node=nid03799
PARTITION AVAIL  JOB_SIZE  TIMELIMIT   CPUS  S:C:T  NODES STATE NODELIST
system    up     1-infini  30:00       48    2:12:2     1 down* nid03799
debug*    up     1-infini  30:00       48    2:12:2     1 down* nid03799
regular   up     1-infini  4-00:00:00  48    2:12:2     1 down* nid03799
regularx  up     1-infini  2-00:00:00  48    2:12:2     1 down* nid03799
realtime  up     1-4       12:00:00    48    2:12:2     1 down* nid03799
shared    up     1         2-00:00:00  0     0:0:0      0 n/a
special   up     1-infini  10-00:00:0  48    2:12:2     1 down* nid03799
boot:~ # sacctmgr show event nodes=nid03799
sacctmgr: error: Problem talking to the database: Network is unreachable
boot:~ # ssh ctl1 sacctmgr show event nodes=nid03799
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
    edison        nid03799 2016-10-30T02:47:37 2016-11-01T09:01:18  DOWN*                 Not responding    root(0)
    edison        nid03799 2016-11-01T09:29:04             Unknown  DOWN*                 Not responding    root(0)
boot:~ #

I've just implemented rotation and transfer of slurmd logs on cori, but that only happens at job completion, which is unlikely to happen if the node crashes mid-job.

Regarding the mismatch of slurm.conf: yes, this happens because we don't SIGHUP slurmd between jobs. That's now happening on cori as part of log rotation; edison will get it when we move to CLE6.

It looks like the node actually went down on 10/30, and hopefully a job wasn't stuck completing for that long! The database, at least, believes there wasn't one:

nid01605:~ # sacct --node=nid03799
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
nid01605:~ #
Sorry, to clarify my statement: the node didn't actually crash immediately before this, but it appears there was a state change in sinfo (for reasons I didn't dig into).
(Written before comment 8)

The job_ptr you printed out in the beginning isn't involved in the issue though ;). The bit_ffs is dying on ss[x].tmpjobs->node_bitmap, which is "in theory" pointing to a job that was running on nid03799; I wouldn't expect nid06138 to be involved at this point.

You can use

sacct -Nnid03799 -S9:00 -E10:00 --format=jobid,nodelist,state,start,end

which should hopefully give you what you need.

(Written afterwards)

Could you see if the sacct above gives you anything? I am sad about the logs; that would have been the easiest way to figure things out. squeue might have a hung job sitting around, or there may be a note in the slurmctld log after the fact, though it might be hard to figure out. But this does seem to sound like bug 3211 at this point (the fix for which is in 16.05.6).
Our console logs indicate that nid03799 crashed on 10/30 at 2:39am:

2016-10-30T02:39:41.896526-07:00 c3-2c2s5n3  [<ffffffff8124a3e4>] intel_idle+0xc4/0x150
2016-10-30T02:39:41.896554-07:00 c3-2c2s5n3  [<ffffffff813555a8>] cpuidle_idle_call+0x128/0x220
2016-10-30T02:39:41.896561-07:00 c3-2c2s5n3  [<ffffffff810020f6>] cpu_idle+0x76/0xd0
2016-10-30T02:39:41.896586-07:00 c3-2c2s5n3  [<ffffffff8141304d>] start_secondary+0x1dc/0x26f
2016-10-30T02:39:41.896598-07:00 c3-2c2s5n3 Stop NMI detected on CPU 0

sacct doesn't show anything on the node at that time:

nid01605:~ # sacct --node=nid03799 --start=2016-10-30 -X --format=job,start,end,state
       JobID               Start                 End      State
------------ ------------------- ------------------- ----------
2161624      2016-10-29T22:46:00 2016-10-30T00:29:02  COMPLETED
2485097      2016-10-30T02:22:06 2016-10-30T02:23:42     FAILED
2518692      2016-10-30T01:00:05 2016-10-30T01:45:43  COMPLETED
2518812      2016-10-30T01:47:19 2016-10-30T02:06:50  COMPLETED
nid01605:~ #

I don't see any evidence the node was allocated to a job after 2485097 before it crashed.

bzgrep 3799 slurmctld.log-20161031.bz2
...
...
[2016-10-30T02:22:31.932] debug: _slurm_rpc_job_alloc_info_lite JobId=2485097 NodeList=nid0[0296-0323,0328-0378,0393-0451,0456-0497,0502-0511,0516-0579,0584-0613,0615-0767,0772-0835,0840-0895,0897-0963,0968-1151,1156-1219,1224-1233,1255-1279,1281-1321,1587-1603,1608-1637,1649-1663,1666-1717,2536-2687,2692-2755,2760-3071,3076-3139,3144-3455,3460-3523,3528-3647,3694-3839,3844-3907,3912-4124,4126-4209,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6076] usec=2
[2016-10-30T02:23:05.688] debug: _slurm_rpc_job_alloc_info_lite JobId=2485097 NodeList=nid0[0296-0323,0328-0378,0393-0451,0456-0497,0502-0511,0516-0579,0584-0613,0615-0767,0772-0835,0840-0895,0897-0963,0968-1151,1156-1219,1224-1233,1255-1279,1281-1321,1587-1603,1608-1637,1649-1663,1666-1717,2536-2687,2692-2755,2760-3071,3076-3139,3144-3455,3460-3523,3528-3647,3694-3839,3844-3907,3912-4124,4126-4209,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6076] usec=1
[2016-10-30T02:23:25.461] debug: _slurm_rpc_job_alloc_info_lite JobId=2518665 NodeList=nid0[1248-1249] usec=1
[2016-10-30T02:25:12.554] debug: _slurm_rpc_job_alloc_info_lite JobId=2360338 NodeList=nid0[1248-1249] usec=858568
[2016-10-30T02:25:27.573] debug: _slurm_rpc_job_alloc_info_lite JobId=2518951 NodeList=nid00292 usec=641095
[2016-10-30T02:25:39.077] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=1
[2016-10-30T02:25:41.119] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=2
[2016-10-30T02:28:36.357] debug: _slurm_rpc_job_alloc_info_lite JobId=2360339 NodeList=nid0[1248-1249] usec=519370
[2016-10-30T02:29:19.824] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=2
[2016-10-30T02:29:24.127] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=1
[2016-10-30T02:32:34.602] debug: _slurm_rpc_job_alloc_info_lite JobId=2518952 NodeList=nid00292 usec=607561
[2016-10-30T02:32:58.390] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=1
[2016-10-30T02:33:02.855] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=1
[2016-10-30T02:38:05.341] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=2
[2016-10-30T02:38:12.040] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=2
[2016-10-30T02:39:32.328] debug: Note large processing time from _slurm_rpc_job_alloc_info_lite: usec=1247485 began=02:39:31.080
[2016-10-30T02:39:32.328] debug: _slurm_rpc_job_alloc_info_lite JobId=2518956 NodeList=nid000[08-23] usec=1247485
[2016-10-30T02:39:32.328] debug: _slurm_rpc_job_alloc_info_lite JobId=2518967 NodeList=nid00024 usec=31591
[2016-10-30T02:39:40.144] debug: _slurm_rpc_job_alloc_info_lite JobId=2518954 NodeList=nid0[0615-0767,0772-0835,0840-0895,0897-0963,0968-1127] usec=1
[2016-10-30T02:39:41.952] debug: _slurm_rpc_job_alloc_info_lite JobId=2518955 NodeList=nid0[5454-5759,5764-5827,5832-5861] usec=2
[2016-10-30T02:40:09.163] debug: _slurm_rpc_job_alloc_info_lite JobId=2518820 NodeList=nid00[296-323,328-383,388-431] usec=3
[2016-10-30T02:41:16.567] debug: _slurm_rpc_job_alloc_info_lite JobId=2518826 NodeList=nid0[5353-5436] usec=2
[2016-10-30T02:41:38.726] debug: _slurm_rpc_job_alloc_info_lite JobId=2517441 NodeList=nid00[498-501] usec=2
[2016-10-30T02:41:39.529] debug: _slurm_rpc_job_alloc_info_lite JobId=2518819 NodeList=nid0[4126-4209] usec=2
[2016-10-30T02:41:55.695] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=1
[2016-10-30T02:41:57.474] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=1
[2016-10-30T02:43:01.249] debug: _slurm_rpc_job_alloc_info_lite JobId=2516167 NodeList=nid0[5437-5443,5448] usec=899183
[2016-10-30T02:43:07.913] debug: _slurm_rpc_job_alloc_info_lite JobId=2518966 NodeList=nid0[4552-4607,4992-5335] usec=2
[2016-10-30T02:43:09.749] debug: _slurm_rpc_job_alloc_info_lite JobId=2161632 NodeList=nid0[2536-2687,2692-2755,2760-3071,3076-3139,3144-3238,3319-3455,3460-3523,3528-3647,5336-5351] usec=2
[2016-10-30T02:43:09.760] debug: _slurm_rpc_job_alloc_info_lite JobId=2161634 NodeList=nid0[1323-1347,1352-1535,1540-1603,1608-1647,3829-3839,3844-3907,3912-4118,4302-4550,5862-5951,5956-6045] usec=4968

nid02305:/var/tmp/slurm # bzgrep 3799 slurmctld.log-20161031.bz2
[2016-10-30T00:43:23.536] backfill test for JobID=2379803 Prio=73799 Partition=regular
[2016-10-30T00:46:48.402] backfill test for JobID=2379803 Prio=73799 Partition=regular
[2016-10-30T00:50:38.940] debug: reserved ports 63799 for step 2246906.72
[2016-10-30T02:00:56.224] Job 2512459 to start at 2016-10-31T07:05:00, end at 2016-10-31T19:05:00 on nid0[3790-3799]
[2016-10-30T02:34:24.605] error: Node nid03799 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-10-30T02:38:46.837] agent/is_node_resp: node:nid03799 RPC:REQUEST_PING : Socket timed out on send/recv operation
[2016-10-30T02:38:48.327] error: Nodes nid0[3239,3318,3799,3828] not responding
[2016-10-30T02:39:17.308] debug: Spawning ping agent for nid0[0008-0063,0072-0127,0136-0191,0200-0244,0246-0255,0264-0291,0293-0323,0328-0383,0388-0451,0456-0511,0516-0579,0584-0613,0615-0767,0772-0835,0840-0895,0897-0963,0968-1151,1156-1219,1224-1233,1235-1279,1281-1321,1323-1347,1352-1535,1540-1591,1593-1603,1608-1663,1666-1717,1719-1792,1794-1919,1924-1987,1992-2047,2050-2303,2308-2371,2376-2499,2504-2687,2692-2755,2760-3071,3076-3139,3144-3238,3240-3317,3319-3455,3460-3523,3528-3670,3672-3795,3799,3814-3827,3829-3831,3872-3907,3912-4223,4228-4229,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6104,6106-6114,6116-6121,6123-6129,6132,6134-6139,6141-6142]
[2016-10-30T02:40:17.594] Node nid03799 now responding
[2016-10-30T02:42:37.670] debug: Spawning ping agent for nid0[0008-0023,0025-0063,0072-0127,0136-0191,0200-0244,0246-0255,0264-0291,0293-0323,0328-0383,0388-0451,0456-0497,0502-0511,0516-0579,0584-0613,1124,1128-1151,1156-1219,1224-1233,1235-1279,1281-1321,1323-1347,1352-1535,1540-1591,1593-1603,1608-1663,1666-1717,1719-1792,1794-1919,1924-1987,1992-2047,2050-2303,2308-2371,2376-2499,2504-2687,2692-2755,2760-3012,3014-3071,3076-3139,3144-3238,3240-3317,3319-3455,3460-3523,3528-3670,3672-3795,3799,3814-3827,3829-3831,3854-3871,3912-4223,4228-4229,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6104,6106-6114,6116-6121,6123-6129,6132,6134-6139,6141]
...
...

It's a little odd that it keeps getting the "now responding" messages and then going down again. It looks like slurmdbd only recorded the initial down event 10 minutes after it crashed (which is about right, given that we allow nodes 600s to respond).
nid02305:/var/tmp/slurm # sacctmgr show event nodes=nid03799 start=2016-10-01
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
    edison        nid03799 2016-10-30T02:47:37 2016-11-01T09:01:18  DOWN*                 Not responding    root(0)
    edison        nid03799 2016-11-01T09:29:04             Unknown  DOWN*                 Not responding    root(0)
nid02305:/var/tmp/slurm #

So it seems unlikely a job was completing on this node at the time of the slurmctld crash.
Yeah, I don't think the incident at 2:39 was related (and perhaps this node isn't related altogether). According to your sacctmgr output the node came back up at 2016-11-01T09:01:18 and then went down again at 2016-11-01T09:29:04. That last down is the only time of interest. From the logs:

[2016-11-01T09:01:18.134] update_node: node nid03799 state set to IDLE

something appears to have made that node idle again. But going through the logs I can't find any evidence this node is related to the issue (meaning no jobs were running on it). I am not totally convinced this isn't related to bug 3211, though. In any case, it sounds like things are up and running and this only happened once. As you are going to put 16.05.6 on tomorrow, the problem might never show up again.
Well, I see nid03799 set to IDLE. What's concerning about that is that the node was fully down:

2016-10-30T02:39:41.896598-07:00 c3-2c2s5n3 Stop NMI detected on CPU 0

So it's unclear to me how slurmd could have responded at 09:01 this morning...
This looks like a scontrol thing, not anything from the node.
Odd. I wonder who/what/where is attempting to manipulate the node (cle52 didn't try to adjust the reason code nearly as much as cle6). We updated edison to Slurm 16.05.6 yesterday. I suppose we can wait and see if this issue recurs. -Doug
I've updated the title of the bug. The crash reoccurred in Slurm 16.05.6:

(gdb) bt
#0  0x00007fc03cdf2875 in raise () from /lib64/libc.so.6
#1  0x00007fc03cdf3e51 in abort () from /lib64/libc.so.6
#2  0x00007fc03cdeb740 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000512f14 in bit_ffs (b=0x0) at bitstring.c:484
#4  0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870, job_ptr=0x5c00ee0) at select_cons_res.c:655
#5  0x00007fc030e2adfa in _rm_job_from_res (part_record_ptr=0x26e59d0, node_usage=0x5391130, job_ptr=0x5c00ee0, action=0) at select_cons_res.c:1273
#6  0x00007fc030e2d984 in select_p_job_fini (job_ptr=0x5c00ee0) at select_cons_res.c:2397
#7  0x00007fc03c7ad11b in other_job_fini (job_ptr=0x5c00ee0) at other_select.c:386
#8  0x00007fc03c7a89b6 in _job_fini (args=0x5c00ee0) at select_cray.c:1083
#9  0x00007fc03d143806 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fc03ce9e9bd in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) set print pretty
(gdb) up
#1  0x00007fc03cdf3e51 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x00007fc03cdeb740 in __assert_fail () from /lib64/libc.so.6
(gdb) up
#3  0x0000000000512f14 in bit_ffs (b=0x0) at bitstring.c:484
484     bitstring.c: No such file or directory.
(gdb) up
#4  0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870, job_ptr=0x5c00ee0) at select_cons_res.c:655
655     select_cons_res.c: No such file or directory.
(gdb) up
#5  0x00007fc030e2adfa in _rm_job_from_res (part_record_ptr=0x26e59d0, node_usage=0x5391130, job_ptr=0x5c00ee0, action=0) at select_cons_res.c:1273
1273    in select_cons_res.c
(gdb) down
#4  0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870, job_ptr=0x5c00ee0) at select_cons_res.c:655
655     in select_cons_res.c
(gdb) down
#3  0x0000000000512f14 in bit_ffs (b=0x0) at bitstring.c:484
484     bitstring.c: No such file or directory.
(gdb) up
#4  0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870, job_ptr=0x5c00ee0) at select_cons_res.c:655
655     select_cons_res.c: No such file or directory.
(gdb) print *p_ptr
$1 = {
  next = 0x262f5b0,
  num_rows = 25,
  part_ptr = 0x240f0c0,
  row = 0x4041a60
}
(gdb) print *job_ptr
$2 = {
  account = 0x2813690 "m1304",
  alias_list = 0x0,
  alloc_node = 0x44e1300 "edison06",
  alloc_resp_port = 0,
  alloc_sid = 37672,
  array_job_id = 0,
  array_task_id = 4294967294,
  array_recs = 0x0,
  assoc_id = 7309,
  assoc_ptr = 0x16a8a80,
  batch_flag = 1,
  batch_host = 0x7fbff83da6b0 "nid06111",
  bit_flags = 0,
  burst_buffer = 0x0,
  check_job = 0x0,
  ckpt_interval = 0,
  ckpt_time = 0,
  comment = 0x0,
  cpu_cnt = 0,
  billable_tres = 2,
  cr_enabled = 1,
  deadline = 0,
  db_index = 13707499,
  derived_ec = 0,
  details = 0x5c01220,
  direct_set_prio = 0,
  end_time = 1478181082,
  end_time_exp = 4294967294,
  epilog_running = false,
  exit_code = 0,
  front_end_ptr = 0x0,
  gres = 0x5c01400 "craynetwork:0",
  gres_list = 0x0,
  gres_alloc = 0x7fbff83da5d0 "craynetwork:4",
  gres_req = 0x7fbff83da700 "craynetwork:0",
  gres_used = 0x0,
  group_id = 62595,
  job_id = 2571089,
  job_next = 0x0,
  job_array_next_j = 0x0,
  job_array_next_t = 0x0,
  job_resrcs = 0x7fbff8503d50,
  job_state = 3,
  kill_on_node_fail = 1,
  licenses = 0x5c01430 "cscratch1:1,project:1",
  license_list = 0x7fc012bcff60,
  limit_set = {
    qos = 0,
    time = 0,
    tres = 0x3c84680
  },
  mail_type = 0,
  mail_user = 0x0,
  magic = 4038539564,
  mcs_label = 0x0,
  name = 0x39585d0 "qsub_analyse-161103-RUN51186.sh",
  network = 0x0,
  next_step_id = 0,
  nodes = 0x7fbff81e09a0 "nid06111",
  node_addr = 0x7fbff83da630,
  node_bitmap = 0x7fbff80ac6f0,
  node_bitmap_cg = 0x7fbfd4173500,
  node_cnt = 0,
  node_cnt_wag = 1,
  nodes_completing = 0x0,
  other_port = 0,
  partition = 0x46735c0 "shared",
  part_ptr_list = 0x0,
  part_nodes_missing = false,
  part_ptr = 0x240f0c0,
  power_flags = 0 '\000',
  pre_sus_time = 0,
  preempt_time = 0,
  preempt_in_progress = false,
  priority = 57222,
  priority_array = 0x0,
  prio_factors = 0x3c84610,
  profile = 0,
  qos_id = 1,
  qos_ptr = 0x8f8520,
  reboot = 0 '\000',
  restart_cnt = 0,
  resize_time = 0,
  resv_id = 0,
  resv_name = 0x0,
  resv_ptr = 0x0,
  requid = 4294967295,
  resp_host = 0x0,
  sched_nodes = 0x0,
  select_jobinfo = 0x5c014e0,
  spank_job_env = 0x47a82d0,
  spank_job_env_size = 1,
  start_protocol_ver = 7680,
  start_time = 1478181072,
  state_desc = 0x0,
  state_reason = 0,
  state_reason_prev = 1,
  step_list = 0x456b700,
  suspend_time = 0,
  time_last_active = 1478181082,
  time_limit = 59,
  time_min = 0,
  tot_sus_time = 0,
  total_cpus = 2,
  total_nodes = 1,
  tres_req_cnt = 0x39fa0f0,
  tres_req_str = 0x44e1010 "1=1,2=1300,4=1",
  tres_fmt_req_str = 0x5c013a0 "cpu=1,mem=1300,node=1",
  tres_alloc_cnt = 0x7fbff857d120,
  tres_alloc_str = 0x7fbff857d230 "1=2,2=2600,4=1",
  tres_fmt_alloc_str = 0x7fbff818c840 "cpu=2,mem=2600M,node=1",
  user_id = 62595,
  wait_all_nodes = 0,
  warn_flags = 0,
  warn_signal = 0,
  warn_time = 0,
  wckey = 0x0,
  req_switch = 0,
  wait4switch = 0,
  best_switch = true,
  wait4switch_start = 0
}
(gdb) print *job_ptr->details
$3 = {
  acctg_freq = 0x0,
  argc = 1,
  argv = 0x3bdf730,
  begin_time = 1478181026,
  ckpt_dir = 0x5c01510 "/var/slurm/checkpoint",
  contiguous = 0,
  core_spec = 65534,
  cpu_bind = 0x0,
  cpu_bind_type = 65534,
  cpu_freq_min = 4294967294,
  cpu_freq_max = 4294967294,
  cpu_freq_gov = 4294967294,
  cpus_per_task = 1,
  depend_list = 0x0,
  dependency = 0x0,
  orig_dependency = 0x0,
  env_cnt = 0,
  env_sup = 0x0,
  exc_node_bitmap = 0x0,
  exc_nodes = 0x0,
  expanding_jobid = 0,
  feature_list = 0x0,
  features = 0x0,
  magic = 0,
  max_cpus = 4294967294,
  max_nodes = 1,
  mc_ptr = 0x276f5c0,
  mem_bind = 0x0,
  mem_bind_type = 0,
  min_cpus = 1,
  min_nodes = 1,
  nice = 2147483648,
  ntasks_per_node = 0,
  num_tasks = 1,
  open_mode = 0 '\000',
  overcommit = 0 '\000',
  plane_size = 0,
  pn_min_cpus = 1,
  pn_min_memory = 2147484948,
  pn_min_tmp_disk = 0,
  prolog_running = 0 '\000',
  reserved_resources = 0,
  req_node_bitmap = 0x0,
  req_node_layout = 0x0,
  preempt_start_time = 0,
  req_nodes = 0x0,
  requeue = 0,
  restart_dir = 0x0,
  share_res = 1 '\001',
  std_err = 0x0,
  std_in = 0x5c01460 "/dev/null",
  std_out = 0x0,
  submit_time = 1478181026,
  task_dist = 8192,
  usable_nodes = 0,
  whole_node = 0 '\000',
  work_dir = 0x5c01490 "/global/u1/e/epuesche/EBLtosummer2016/v480a/scripts/VTS"
}
(gdb) print *ss[x].tmpjobs
$4 = {
  core_bitmap = 0x0,
  core_bitmap_used = 0x0,
  cpu_array_cnt = 1,
  cpu_array_value = 0x0,
  cpu_array_reps = 0x0,
  cpus = 0x0,
  cpus_used = 0x0,
  cores_per_socket = 0x0,
  memory_allocated = 0x0,
  memory_used = 0x0,
  nhosts = 1,
  node_bitmap = 0x0,
  node_req = 0,
  nodes = 0x0,
  ncpus = 1,
  sock_core_rep_count = 0x0,
  sockets_per_node = 0x0,
  whole_node = 0 '\000'
}
(gdb)
Ah, well that is good confirmation this isn't related to bug 3211 then. We will continue to look into it. I am guessing this is more related to bug 3185 as previously thought.
Doug could you post an updated slurmctld log for the hour before this incident happened?
Created attachment 3675 [details]
slurmctld log for today

Sorry, I didn't have time to filter it.
I've been reviewing code and testing all day, but unfortunately don't have a solution for you. What I can tell you is that the root problem is a data structure in the select/cons_res plugin that contains a list of pointers to the active jobs' resource allocations. That structure contains a NULL node bitmap, which triggers the abort. The job in the backtrace is not the job with the bad data, but it triggers the failure due to a pre-existing bad data structure. My best guess is that the underlying issue is some sort of race condition involving NHC and job requeue, but that's just a guess. We could get you a patch that skips over the record with the NULL pointer, but at some point the pointer will likely be an invalid non-NULL pointer, and that could result in severe and unpredictable consequences, so I'll concentrate on getting to the bottom of the problem and fixing it.

Notes to self:
1. My best guess is that the free_job_resources() in cr_job_test() in the job_test.c module is getting called on a job that is still running the NHC/cleaning. When NHC completes, it will try to remove the job's record via other_job_fini() called from _job_fini() in select_cray.c, but not find it and leave an old/bad pointer.
2. The use of "job_list" in select/cons_res (see _rm_job_from_res() in select_cons_res.c) is fast and efficient, but not robust in the case of failures. It should probably be changed to a job pointer.
I was just able to reproduce this failure by adding some well-placed "sleep" calls in the Slurm logic. Any time a running job is requeued and the job/allocation NHC runs for longer than the Munge credential lifetime (default 5 minutes), a data structure is cleared and later referenced. There are a lot of moving parts and the failure cannot easily be reproduced, but the requeue/NHC timing issue described above makes this failure possible.
This commit fixes a race condition that I was able to exploit to reproduce this failure. I don't have sufficient information to be certain this is what caused your aborts, but it was definitely a bug that could cause this failure, and I don't see any other possibilities. https://github.com/SchedMD/slurm/commit/69af50af965d70af5642e156b4186ed6cc13aaa8
OK, this seems promising. I'll test this on alva a bit tonight for basic functionality and deploy to edison tomorrow.

Thanks,
Doug
(In reply to Doug Jacobsen from comment #24) > OK, this seems promising. I'll test this on alva a bit tonight for > basic functionality and deploy to edison tomorrow. Since I haven't heard back, I'll close the ticket based upon the assumption this patch fixed the problem. Please re-open if necessary.
Hi Moe,

Sorry about that. I applied the patch on edison (but not cori) and we haven't had any problems since. The key difference between edison and cori is that cori no longer runs the NHC via the Cray/Slurm plugin mechanisms. Though I gather from reading the patch that there are other pathways to hit this bug. I'll roll this patch into cori at the next opportunity.

-Doug