Hello,

We had another slurmctld crash on edison this morning, now with 16.05.5 (updated Thursday morning, just before the 16.05.6 announcement). I'll attach the logs shortly. This looks to be related to bug 3185. Our 16.05.5 does include the patch provided in 3185, but we still got this crash. We have an unrelated system maintenance tomorrow, so I will have an opportunity to upgrade edison from 16.05.5 to 16.05.6.

-Doug

Program terminated with signal 6, Aborted.
#0  0x00007f7bc12e8875 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f7bc12e8875 in raise () from /lib64/libc.so.6
#1  0x00007f7bc12e9e51 in abort () from /lib64/libc.so.6
#2  0x00007f7bc12e1740 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000512bb0 in bit_ffs (b=0x0) at bitstring.c:484
#4  0x00007f7bbb869ecc in _build_row_bitmaps (p_ptr=0x7f7b5415c680, job_ptr=0x4080900) at select_cons_res.c:655
#5  0x00007f7bbb86bdfa in _rm_job_from_res (part_record_ptr=0x7f7b5412a510, node_usage=0x7f7b547f5f70, job_ptr=0x4080900, action=0) at select_cons_res.c:1273
#6  0x00007f7bbb86e984 in select_p_job_fini (job_ptr=0x4080900) at select_cons_res.c:2397
#7  0x00007f7bc0aeb11b in other_job_fini (job_ptr=0x4080900) at other_select.c:386
#8  0x00007f7bc0ae69b6 in _job_fini (args=0x4080900) at select_cray.c:1083
#9  0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f7bc13949bd in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) set print pretty
(gdb) up
#1  0x00007f7bc12e9e51 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x00007f7bc12e1740 in __assert_fail () from /lib64/libc.so.6
(gdb) up
#3  0x0000000000512bb0 in bit_ffs (b=0x0) at bitstring.c:484
484     bitstring.c: No such file or directory.
(gdb) up
#4  0x00007f7bbb869ecc in _build_row_bitmaps (p_ptr=0x7f7b5415c680, job_ptr=0x4080900) at select_cons_res.c:655
655     select_cons_res.c: No such file or directory.
(gdb) print *job_ptr $1 = { account = 0x2ddc440 "m1820", alias_list = 0x0, alloc_node = 0x46dee20 "edison12", alloc_resp_port = 0, alloc_sid = 4664, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 9101, assoc_ptr = 0x137dbc0, batch_flag = 1, batch_host = 0x7f7b5c112da0 "nid06138", bit_flags = 0, burst_buffer = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 0, billable_tres = 2, cr_enabled = 1, deadline = 0, db_index = 13617048, derived_ec = 0, details = 0x3bc28b0, direct_set_prio = 0, end_time = 1478015968, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, front_end_ptr = 0x0, gres = 0x2dbc700 "craynetwork:0", gres_list = 0x0, gres_alloc = 0x7f7b5c141f30 "craynetwork:4", ---Type <return> to continue, or q <return> to quit--- gres_req = 0x7f7b5c112ca0 "craynetwork:0", gres_used = 0x0, group_id = 70270, job_id = 2546456, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7f7b5c1fa8d0, job_state = 3, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set = { qos = 0, time = 0, tres = 0x45fa4f0 }, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x3be1170 "runjob.sh", network = 0x0, next_step_id = 0, nodes = 0x7f7b5c112b70 "nid06138", node_addr = 0x7f7b5c1e4ad0, node_bitmap = 0x7f7b5c19d190, node_bitmap_cg = 0x3f10580, node_cnt = 0, node_cnt_wag = 1, nodes_completing = 0x0, other_port = 0, partition = 0x2722060 "shared", part_ptr_list = 0x0, part_nodes_missing = false, ---Type <return> to continue, or q <return> to quit--- part_ptr = 0x7f7b544441b0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 56933, priority_array = 0x0, prio_factors = 0x4116860, profile = 0, qos_id = 1, qos_ptr = 0x8f71d0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x34e8e30, 
spank_job_env = 0x4973f30, spank_job_env_size = 1, start_protocol_ver = 7680, start_time = 1478001339, state_desc = 0x0, state_reason = 0, state_reason_prev = 0, step_list = 0x42a5ff0, suspend_time = 0, time_last_active = 1478015968, time_limit = 360, time_min = 0, tot_sus_time = 1, total_cpus = 2, ---Type <return> to continue, or q <return> to quit--- total_nodes = 1, tres_req_cnt = 0x39555a0, tres_req_str = 0x8bef40 "1=1,2=1300,4=1", tres_fmt_req_str = 0x3543b10 "cpu=1,mem=1300,node=1", tres_alloc_cnt = 0x7f7b5c171a30, tres_alloc_str = 0x7f7b5c112df0 "1=2,2=2600,4=1", tres_fmt_alloc_str = 0x7f7b5c141ed0 "cpu=2,mem=2600M,node=1", user_id = 70270, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0 } (gdb) thread apply all bt Thread 31 (Thread 0x7f7bb1edf700 (LWP 29557)): #0 0x00007f7bc138b4d6 in poll () from /lib64/libc.so.6 #1 0x000000000057fa8e in slurm_recv_timeout (fd=24, buffer=0x7f7bb1edebac "{\177<", size=4, flags=0, timeout=60000) at slurm_protocol_socket_implementation.c:360 #2 0x000000000057f517 in slurm_msg_recvfrom_timeout (fd=24, pbuf=0x7f7bb1edec18, lenp=0x7f7bb1edec10, flags=0, tmout=60000) at slurm_protocol_socket_implementation.c:160 #3 0x0000000000541d6a in slurm_receive_msgs (fd=24, steps=0, timeout=60000) at slurm_protocol_api.c:3275 #4 0x0000000000543894 in _send_and_recv_msgs (fd=24, req=0x7f7bb1edee50, timeout=60000) at slurm_protocol_api.c:4099 #5 0x0000000000543f7b in slurm_send_addr_recv_msgs (msg=0x7f7bb1edee50, name=0x267b7c0 "nid01793", timeout=60000) at slurm_protocol_api.c:4377 #6 0x0000000000502289 in _fwd_tree_thread (arg=0x7f7b5435bfe0) at forward.c:398 #7 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #8 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #9 0x0000000000000000 in ?? 
() Thread 30 (Thread 0x7f7bb14d5700 (LWP 28213)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb14d4e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x3892b70) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 29 (Thread 0x7f7bb23e4700 (LWP 28305)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb23e3e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x3d9adc0) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 28 (Thread 0x7f7ba02c2700 (LWP 31261)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bc0ae6802 in _wait_job_completed (job_id=2548627, job_ptr=0x7f7ba83a2780) at select_cray.c:1036 #3 0x00007f7bc0ae6959 in _job_fini (args=0x7f7ba83a2780) at select_cray.c:1071 ---Type <return> to continue, or q <return> to quit--- #4 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? () Thread 27 (Thread 0x7f7bb25e6700 (LWP 27013)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb25e5e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x3b19950) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 26 (Thread 0x7f7bb11d2700 (LWP 29538)): #0 0x00007f7bc163d66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000000000502fb6 in start_msg_tree (hl=0x4cdbf80, msg=0x7f7bb11d1e40, timeout=0) at forward.c:727 #2 0x0000000000543dac in slurm_send_recv_msgs (nodelist=0x340b720 "nid0[0245,0614,1234,1322,1718,1793,3671,3799,4119,4230,4301,4551,5352,5453,5622-5639,6115]", msg=0x7f7bb11d1e40, timeout=0, quiet=true) at slurm_protocol_api.c:4327 #3 0x000000000043ed05 in _thread_per_group_rpc (args=0x24c9540) at agent.c:902 #4 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? () Thread 25 (Thread 0x7f7bb28e9700 (LWP 29537)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc138e574 in usleep () from /lib64/libc.so.6 #2 0x000000000043e03c in _wdog (args=0x3cc43f0) at agent.c:586 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 24 (Thread 0x7f7ba2ded700 (LWP 27984)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7ba2dece70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x7f7b4c133aa0) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 23 (Thread 0x7f7bb30f0700 (LWP 28295)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb30efe70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x7f7b4c132d30) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 22 (Thread 0x7f7bb09c9700 (LWP 27445)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb09c8e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x7f7ba8176160) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 21 (Thread 0x7f7bb2fef700 (LWP 29536)): #0 0x00007f7bc1639fe5 in pthread_join () from /lib64/libpthread.so.0 #1 0x000000000043d57d in agent (args=0x7f7b8c180e10) at agent.c:338 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 20 (Thread 0x7f7bb39f9700 (LWP 28311)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb39f8e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x413fc30) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 19 (Thread 0x7f7bb85c6700 (LWP 20601)): ---Type <return> to continue, or q <return> to quit--- #0 0x00007f7bc1641527 in do_sigwait () from /lib64/libpthread.so.0 #1 0x00007f7bc16415cd in sigwait () from /lib64/libpthread.so.0 #2 0x00000000004459ba in _slurmctld_signal_hand (no_data=0x0) at controller.c:876 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 18 (Thread 0x7f7bbc297700 (LWP 20524)): #0 0x00007f7bc1639fe5 in pthread_join () from /lib64/libpthread.so.0 #1 0x00007f7bbc39cc6b in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:443 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 17 (Thread 0x7f7b83bfb700 (LWP 31355)): #0 0x00007f7bc138b4d6 in poll () from /lib64/libc.so.6 #1 0x000000000057fa8e in slurm_recv_timeout (fd=4, buffer=0x7f7b83bfad5c "", size=4, flags=0, timeout=60000) at slurm_protocol_socket_implementation.c:360 #2 0x000000000057f517 in slurm_msg_recvfrom_timeout (fd=4, pbuf=0x7f7b83bfadc0, lenp=0x7f7b83bfadb8, flags=0, tmout=60000) at slurm_protocol_socket_implementation.c:160 #3 0x0000000000541723 in slurm_receive_msg (fd=4, msg=0x26c0380, timeout=60000) at slurm_protocol_api.c:3096 #4 0x0000000000446461 in _service_connection (arg=0x45cc930) at controller.c:1104 #5 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #6 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #7 0x0000000000000000 in ?? () Thread 16 (Thread 0x7f7bb84c5700 (LWP 20602)): #0 0x00007f7bc163d9fc in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000004d1ad5 in slurmctld_state_save (no_data=0x0) at state_save.c:205 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 15 (Thread 0x7f7bb1dde700 (LWP 27877)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb1ddde70) at select_cray.c:320 ---Type <return> to continue, or q <return> to quit--- #2 0x00007f7bc0ae6962 in _job_fini (args=0x3c74580) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 14 (Thread 0x7f7bb35f5700 (LWP 31350)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bc0ae6802 in _wait_job_completed (job_id=2548590, job_ptr=0x46b2570) at select_cray.c:1036 #3 0x00007f7bc0ae6959 in _job_fini (args=0x46b2570) at select_cray.c:1071 #4 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? () Thread 13 (Thread 0x7f7bb15d6700 (LWP 27216)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7bb15d5e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x498b590) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 12 (Thread 0x7f7bb89ca700 (LWP 20599)): #0 0x00007f7bc1639fe5 in pthread_join () from /lib64/libpthread.so.0 #1 0x00007f7bb8ad2c3e in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1425 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 11 (Thread 0x7f7b7d191700 (LWP 30950)): #0 0x00007f7bc16413bf in waitpid () from /lib64/libpthread.so.0 #1 0x00007f7bc0ae479a in _run_nhc (nhc_info=0x7f7b7d190e70) at select_cray.c:320 #2 0x00007f7bc0ae6962 in _job_fini (args=0x3ebaba0) at select_cray.c:1074 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- #5 0x0000000000000000 in ?? 
() Thread 10 (Thread 0x7f7bb8fdb700 (LWP 20539)): #0 0x00007f7bc138b4d6 in poll () from /lib64/libc.so.6 #1 0x00007f7bc0ae506b in _aeld_event_loop (args=0x0) at select_cray.c:522 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 9 (Thread 0x7f7ba24e4700 (LWP 31144)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bc0ae6802 in _wait_job_completed (job_id=2548559, job_ptr=0x412cdb0) at select_cray.c:1036 #3 0x00007f7bc0ae6959 in _job_fini (args=0x412cdb0) at select_cray.c:1071 #4 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #6 0x0000000000000000 in ?? () Thread 8 (Thread 0x7f7bbb455700 (LWP 20529)): #0 0x00007f7bc163d9fc in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f7bbb45a580 in _my_sleep (usec=180000000) at backfill.c:508 #2 0x00007f7bbb45b13f in backfill_agent (args=0x0) at backfill.c:798 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 7 (Thread 0x7f7bbbd85700 (LWP 20525)): #0 0x00007f7bc163d9fc in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00000000005b1615 in _agent (x=0x0) at slurmdbd_defs.c:2137 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? 
() Thread 6 (Thread 0x7f7bb86c7700 (LWP 20600)): #0 0x00007f7bc138dca3 in select () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- #1 0x0000000000446051 in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:1014 #2 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 5 (Thread 0x7f7bbc398700 (LWP 20523)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bbc39cc49 in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:435 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 4 (Thread 0x7f7bb8acb700 (LWP 20598)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bb8ad2b09 in _decay_thread (no_data=0x0) at priority_multifactor.c:1371 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 3 (Thread 0x7f7bbc6a5700 (LWP 20522)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc136137c in sleep () from /lib64/libc.so.6 #2 0x00007f7bbd12a4f4 in _lease_extender (args=0x0) at cookies.c:350 #3 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? 
() Thread 2 (Thread 0x7f7bc1bdb700 (LWP 20520)): #0 0x00007f7bc136155d in nanosleep () from /lib64/libc.so.6 #1 0x00007f7bc138e574 in usleep () from /lib64/libc.so.6 #2 0x0000000000447a73 in _slurmctld_background (no_data=0x0) at controller.c:1713 #3 0x0000000000445379 in main (argc=1, argv=0x7fff7e89cb98) at controller.c:605 ---Type <return> to continue, or q <return> to quit--- Thread 1 (Thread 0x7f7bb22e3700 (LWP 24099)): #0 0x00007f7bc12e8875 in raise () from /lib64/libc.so.6 #1 0x00007f7bc12e9e51 in abort () from /lib64/libc.so.6 #2 0x00007f7bc12e1740 in __assert_fail () from /lib64/libc.so.6 #3 0x0000000000512bb0 in bit_ffs (b=0x0) at bitstring.c:484 #4 0x00007f7bbb869ecc in _build_row_bitmaps (p_ptr=0x7f7b5415c680, job_ptr=0x4080900) at select_cons_res.c:655 #5 0x00007f7bbb86bdfa in _rm_job_from_res (part_record_ptr=0x7f7b5412a510, node_usage=0x7f7b547f5f70, job_ptr=0x4080900, action=0) at select_cons_res.c:1273 #6 0x00007f7bbb86e984 in select_p_job_fini (job_ptr=0x4080900) at select_cons_res.c:2397 #7 0x00007f7bc0aeb11b in other_job_fini (job_ptr=0x4080900) at other_select.c:386 #8 0x00007f7bc0ae69b6 in _job_fini (args=0x4080900) at select_cray.c:1083 #9 0x00007f7bc1639806 in start_thread () from /lib64/libpthread.so.0 #10 0x00007f7bc13949bd in clone () from /lib64/libc.so.6 #11 0x0000000000000000 in ?? () (gdb) (gdb)
Created attachment 3667 [details] slurmctld log leading up to the crash
Doug, is this log from 10/17?

I believe, though, that the problem may actually be related to Bug 3211. Commit 9c0a2f2bd9b9 might fix this.

From frame 4, could you send the output of

p *ss[x].tmpjobs
Created attachment 3668 [details] slurmctld log leading up to the crash
Oh shoot, I must have pulled down the wrong log. I'll check on that. I can't read bug #3211.

(gdb) p *ss[x].tmpjobs
$1 = {
  core_bitmap = 0x0,
  core_bitmap_used = 0x0,
  cpu_array_cnt = 1,
  cpu_array_value = 0x0,
  cpu_array_reps = 0x0,
  cpus = 0x0,
  cpus_used = 0x0,
  cores_per_socket = 0x0,
  memory_allocated = 0x0,
  memory_used = 0x0,
  nhosts = 1,
  node_bitmap = 0x0,
  node_req = 0,
  nodes = 0x0,
  ncpus = 1,
  sock_core_rep_count = 0x0,
  sockets_per_node = 0x0,
  whole_node = 0 '\000'
}
(gdb) print x
$2 = 435
(gdb)
Hmm, still only through the 17th :). Last 2 lines of both files:

[2016-10-17T14:43:40.601] debug: sched: already tested 402 jobs, breaking out
[2016-10-17T14:43:40.614] debug: sched: Running job scheduler

I've made bug 3211 readable now. Though that bug deals with gang scheduling, I believe it is related to this one as well. If you didn't have any jobs trying to complete on unresponsive nodes, then this might not fix the issue.
Created attachment 3669 [details] kept uploading the wrong path, stupid Documents folder
Third time's the charm!

Could you send me the slurmd log from nid03799?

I am sure you are aware that a host of nodes are running with a different slurm.conf according to the slurmctld, so I will only mention that once.
hmmm. The job_ptr indicates it was on nid06138, which is still up.

    next_step_id = 0,
    nodes = 0x7f7b5c112b70 "nid06138",
    node_addr = 0x7f7b5c1e4ad0,

nid02305:/var/tmp/slurm # sinfo --node=nid06138
PARTITION AVAIL  JOB_SIZE  TIMELIMIT   CPUS  S:C:T  NODES STATE NODELIST
system    up     1-infini  30:00       48    2:12:2     1 idle  nid06138
debug*    up     1-infini  30:00       0     0:0:0      0 n/a
regular   up     1-infini  4-00:00:00  0     0:0:0      0 n/a
regularx  up     1-infini  2-00:00:00  0     0:0:0      0 n/a
realtime  up     1-4       12:00:00    48    2:12:2     1 idle  nid06138
shared    up     1         2-00:00:00  48    2:12:2     1 idle  nid06138
special   up     1-infini  10-00:00:0  48    2:12:2     1 idle  nid06138
nid02305:/var/tmp/slurm #

boot:~ # ssh nid06138
Welcome to the initramfs
# /sbin/chroot /dsl uptime
 15:15pm up 12 days 7:03, 0 users, load average: 0.30, 0.30, 0.30
#

Guess I don't know how to use sacctmgr to see events:

nid01605:~ # sacctmgr show event where nodes=nid06138
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
nid01605:~ # sacctmgr show event where nodes=nid06138 cluster=edison
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
nid01605:~ #

but:

mysql> select * from edison_event_table where node_name="nid06138" order by time_start limit 5;
+------------+------------+-----------+---------------+----------------------------+------------+-------+--------------+
| time_start | time_end   | node_name | cluster_nodes | reason                     | reason_uid | state | tres         |
+------------+------------+-----------+---------------+----------------------------+------------+-------+--------------+
| 1452239802 | 1452239956 | nid06138  |               | Not responding             |          0 | 34849 | 1=48,2=64523 |
| 1452239956 | 1452240176 | nid06138  |               | Not responding             |          0 |  2049 | 1=48,2=64523 |
| 1453482650 | 1453573226 | nid06138  |               | Not responding             |          0 |  2049 | 1=48,2=64523 |
| 1453573226 | 1453574719 | nid06138  |               | Node unexpectedly rebooted |          0 |     1 | 1=48,2=64523 |
| 1453918943 | 1453919123 | nid06138  |               | Not responding             |          0 |  2049 | 1=48,2=64523 |
+------------+------------+-----------+---------------+----------------------------+------------+-------+--------------+
5 rows in set (0.00 sec)

mysql>

The last "event" on this node was completed (time_end non-zero). The slurmctld log indicates it was actively starting jobs on that node immediately before the crash:

[2016-11-01T09:02:08.109] sched: Allocate JobID=2548645 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.112] sched: Allocate JobID=2548646 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.115] sched: Allocate JobID=2548647 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.155] sched: Allocate JobID=2548648 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.157] sched: Allocate JobID=2548649 NodeList=nid06138 #CPUs=2 Partition=shared

That user has lots of jobs, though not many super-short ones, and nid06138 was definitely active for them.

edique01:~ # sacct -u heikki --format=job,start,end,elapsed,exit,nodelist -X
       JobID               Start                 End    Elapsed ExitCode        NodeList
------------ ------------------- ------------------- ---------- -------- ---------------
2544790      2016-10-31T23:33:32 2016-11-01T01:33:00   01:59:28      0:0        nid06135
2544791      2016-10-31T23:33:32 2016-11-01T04:30:49   04:57:17      0:0        nid06132
...
2548729      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548730      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548731      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548732      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548733      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548734      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06124
2548735      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06105
2548736      2016-11-01T09:02:08 2016-11-01T09:22:21   00:20:13      0:0        nid06105
2548737      2016-11-01T09:02:08 2016-11-01T09:22:22   00:20:14      0:0        nid06105
2548818      2016-11-01T09:54:21 2016-11-01T10:04:41   00:10:20      0:0        nid06143
2548820      2016-11-01T09:54:21 2016-11-01T10:04:15   00:09:54      0:0        nid06142
2548822      2016-11-01T09:54:21 2016-11-01T10:04:20   00:09:59      0:0        nid06142
2548823      2016-11-01T09:54:21 2016-11-01T10:05:57   00:11:36      0:0        nid06130
2548824      2016-11-01T09:54:21 2016-11-01T10:05:51   00:11:30      0:0        nid06130
2548825      2016-11-01T09:54:21 2016-11-01T10:05:20   00:10:59      0:0        nid06130
2548826      2016-11-01T09:54:21 2016-11-01T10:05:20   00:10:59      0:0        nid06130
2548827      2016-11-01T09:54:21 2016-11-01T10:04:49   00:10:28      0:0        nid06092
2548828      2016-11-01T09:54:21 2016-11-01T10:04:51   00:10:30      0:0        nid06092
2548829      2016-11-01T09:54:21 2016-11-01T10:04:46   00:10:25      0:0        nid06092
2548830      2016-11-01T09:54:21 2016-11-01T10:04:41   00:10:20      0:0        nid06092
2548831      2016-11-01T09:54:21 2016-11-01T10:04:41   00:10:20      0:0        nid06092
2548832      2016-11-01T09:54:21 2016-11-01T10:04:42   00:10:21      0:0        nid06092
...
A little bit of weird missing data here from a few hours after the crash:

2551128      2016-11-01T11:49:32 2016-11-01T11:59:10   00:09:38      0:0        nid06116
2551129      2016-11-01T11:49:32 2016-11-01T11:59:23   00:09:51      0:0        nid06116
2551130      2016-11-01T11:49:32 2016-11-01T11:59:27   00:09:55      0:0        nid06116
2551131      2016-11-01T11:49:32 2016-11-01T11:59:26   00:09:54      0:0        nid06116
2551132      2016-11-01T11:49:32 2016-11-01T11:59:32   00:10:00      0:0        nid06116
2551133      2016-11-01T11:49:32 2016-11-01T11:59:17   00:09:45      0:0        nid06116
2551387      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551388      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551389      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551390      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551391      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551392      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551393      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551394      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551395      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551396      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551397      2016-11-01T12:07:40 2016-11-01T12:07:40   00:00:00      0:0   None assigned
2551660      2016-11-01T12:24:35 2016-11-01T12:33:38   00:09:03      0:0        nid06116
2551661      2016-11-01T12:24:35 2016-11-01T12:33:55   00:09:20      0:0        nid06142
2551662      2016-11-01T12:24:35 2016-11-01T12:35:27   00:10:52      0:0        nid06138
2551663      2016-11-01T12:24:35 2016-11-01T12:35:24   00:10:49      0:0        nid06138
...

... and so on.
Ah, a start filter on that sacctmgr show event. That node did crash immediately before. Sadly the slurmd logs are in memory, and the node is gone:

boot:~ # ssh nid03799
ssh: connect to host nid03799 port 22: Network is unreachable
boot:~ # sinfo --node=nid03799
PARTITION AVAIL  JOB_SIZE  TIMELIMIT   CPUS  S:C:T  NODES STATE NODELIST
system    up     1-infini  30:00       48    2:12:2     1 down* nid03799
debug*    up     1-infini  30:00       48    2:12:2     1 down* nid03799
regular   up     1-infini  4-00:00:00  48    2:12:2     1 down* nid03799
regularx  up     1-infini  2-00:00:00  48    2:12:2     1 down* nid03799
realtime  up     1-4       12:00:00    48    2:12:2     1 down* nid03799
shared    up     1         2-00:00:00  0     0:0:0      0 n/a
special   up     1-infini  10-00:00:0  48    2:12:2     1 down* nid03799
boot:~ # sacctmgr show event nodes=nid03799
sacctmgr: error: Problem talking to the database: Network is unreachable
boot:~ # ssh ctl1 sacctmgr show event nodes=nid03799
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
    edison        nid03799 2016-10-30T02:47:37 2016-11-01T09:01:18  DOWN*                 Not responding    root(0)
    edison        nid03799 2016-11-01T09:29:04             Unknown  DOWN*                 Not responding    root(0)
boot:~ #

I've just implemented rotation and transfer of slurmd logs on cori, but that only happens at job completion, which is unlikely to happen if the node crashes mid-job.

Regarding the mismatch of slurm.conf: yes, this happens because we don't SIGHUP slurmd between jobs. That's now happening on cori as part of log rotation; edison will get it when we move to CLE6.

It looks like the node actually went down on 10/30, and hopefully a job wasn't stuck completing for that long! The database, at least, believes there wasn't one:

nid01605:~ # sacct --node=nid03799
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
nid01605:~ #
Sorry, to clarify my statement: the node didn't actually crash immediately before this, but it appears there was a state change in sinfo (for reasons I didn't dig into).
(Written before comment 8)

The job_ptr you printed out in the beginning isn't involved in the issue though ;). The bit_ffs is dying on ss[x].tmpjobs->node_bitmap, which is "in theory" pointing to a job that was running on nid03799; I wouldn't expect nid06138 to be involved at this point.

You can use

sacct -Nnid03799 -S9:00 -E10:00 --format=jobid,nodelist,state,start,end

which should hopefully give you what you need.

(Written afterwards)

Could you see if the sacct above gives you anything? I am sad about the logs; that would have been the easiest way to figure things out. squeue might have a hung job sitting around, or there may be a note in the slurmctld log after the fact, though it might be hard to figure out. But this does seem to sound like bug 3211 at this point (the fix for which is in 16.05.6).
Our console logs indicate that nid03799 crashed on 10/30 at 2:39am:

2016-10-30T02:39:41.896526-07:00 c3-2c2s5n3  [<ffffffff8124a3e4>] intel_idle+0xc4/0x150
2016-10-30T02:39:41.896554-07:00 c3-2c2s5n3  [<ffffffff813555a8>] cpuidle_idle_call+0x128/0x220
2016-10-30T02:39:41.896561-07:00 c3-2c2s5n3  [<ffffffff810020f6>] cpu_idle+0x76/0xd0
2016-10-30T02:39:41.896586-07:00 c3-2c2s5n3  [<ffffffff8141304d>] start_secondary+0x1dc/0x26f
2016-10-30T02:39:41.896598-07:00 c3-2c2s5n3 Stop NMI detected on CPU 0

sacct doesn't show anything on the node at that time:

nid01605:~ # sacct --node=nid03799 --start=2016-10-30 -X --format=job,start,end,state
       JobID               Start                 End      State
------------ ------------------- ------------------- ----------
2161624      2016-10-29T22:46:00 2016-10-30T00:29:02  COMPLETED
2485097      2016-10-30T02:22:06 2016-10-30T02:23:42     FAILED
2518692      2016-10-30T01:00:05 2016-10-30T01:45:43  COMPLETED
2518812      2016-10-30T01:47:19 2016-10-30T02:06:50  COMPLETED
nid01605:~ #

I don't see any evidence the node was allocated to a job after 2485097 before it crashed.

bzgrep 3799 slurmctld.log-20161031.bz2
...
...
[2016-10-30T02:22:31.932] debug: _slurm_rpc_job_alloc_info_lite JobId=2485097 NodeList=nid0[0296-0323,0328-0378,0393-0451,0456-0497,0502-0511,0516-0579,0584-0613,0615-0767,0772-0835,0840-0895,0897-0963,0968-1151,1156-1219,1224-1233,1255-1279,1281-1321,1587-1603,1608-1637,1649-1663,1666-1717,2536-2687,2692-2755,2760-3071,3076-3139,3144-3455,3460-3523,3528-3647,3694-3839,3844-3907,3912-4124,4126-4209,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6076] usec=2
[2016-10-30T02:23:05.688] debug: _slurm_rpc_job_alloc_info_lite JobId=2485097 NodeList=nid0[0296-0323,0328-0378,0393-0451,0456-0497,0502-0511,0516-0579,0584-0613,0615-0767,0772-0835,0840-0895,0897-0963,0968-1151,1156-1219,1224-1233,1255-1279,1281-1321,1587-1603,1608-1637,1649-1663,1666-1717,2536-2687,2692-2755,2760-3071,3076-3139,3144-3455,3460-3523,3528-3647,3694-3839,3844-3907,3912-4124,4126-4209,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6076] usec=1
[2016-10-30T02:23:25.461] debug: _slurm_rpc_job_alloc_info_lite JobId=2518665 NodeList=nid0[1248-1249] usec=1
[2016-10-30T02:25:12.554] debug: _slurm_rpc_job_alloc_info_lite JobId=2360338 NodeList=nid0[1248-1249] usec=858568
[2016-10-30T02:25:27.573] debug: _slurm_rpc_job_alloc_info_lite JobId=2518951 NodeList=nid00292 usec=641095
[2016-10-30T02:25:39.077] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=1
[2016-10-30T02:25:41.119] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=2
[2016-10-30T02:28:36.357] debug: _slurm_rpc_job_alloc_info_lite JobId=2360339 NodeList=nid0[1248-1249] usec=519370
[2016-10-30T02:29:19.824] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=2
[2016-10-30T02:29:24.127] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=1
[2016-10-30T02:32:34.602] debug: _slurm_rpc_job_alloc_info_lite JobId=2518952 NodeList=nid00292 usec=607561
[2016-10-30T02:32:58.390] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=1
[2016-10-30T02:33:02.855] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=1
[2016-10-30T02:38:05.341] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=2
[2016-10-30T02:38:12.040] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=2
[2016-10-30T02:39:32.328] debug: Note large processing time from _slurm_rpc_job_alloc_info_lite: usec=1247485 began=02:39:31.080
[2016-10-30T02:39:32.328] debug: _slurm_rpc_job_alloc_info_lite JobId=2518956 NodeList=nid000[08-23] usec=1247485
[2016-10-30T02:39:32.328] debug: _slurm_rpc_job_alloc_info_lite JobId=2518967 NodeList=nid00024 usec=31591
[2016-10-30T02:39:40.144] debug: _slurm_rpc_job_alloc_info_lite JobId=2518954 NodeList=nid0[0615-0767,0772-0835,0840-0895,0897-0963,0968-1127] usec=1
[2016-10-30T02:39:41.952] debug: _slurm_rpc_job_alloc_info_lite JobId=2518955 NodeList=nid0[5454-5759,5764-5827,5832-5861] usec=2
[2016-10-30T02:40:09.163] debug: _slurm_rpc_job_alloc_info_lite JobId=2518820 NodeList=nid00[296-323,328-383,388-431] usec=3
[2016-10-30T02:41:16.567] debug: _slurm_rpc_job_alloc_info_lite JobId=2518826 NodeList=nid0[5353-5436] usec=2
[2016-10-30T02:41:38.726] debug: _slurm_rpc_job_alloc_info_lite JobId=2517441 NodeList=nid00[498-501] usec=2
[2016-10-30T02:41:39.529] debug: _slurm_rpc_job_alloc_info_lite JobId=2518819 NodeList=nid0[4126-4209] usec=2
[2016-10-30T02:41:55.695] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=1
[2016-10-30T02:41:57.474] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=1
[2016-10-30T02:43:01.249] debug: _slurm_rpc_job_alloc_info_lite JobId=2516167 NodeList=nid0[5437-5443,5448] usec=899183
[2016-10-30T02:43:07.913] debug: _slurm_rpc_job_alloc_info_lite JobId=2518966 NodeList=nid0[4552-4607,4992-5335] usec=2
[2016-10-30T02:43:09.749] debug: _slurm_rpc_job_alloc_info_lite JobId=2161632 NodeList=nid0[2536-2687,2692-2755,2760-3071,3076-3139,3144-3238,3319-3455,3460-3523,3528-3647,5336-5351] usec=2
[2016-10-30T02:43:09.760] debug: _slurm_rpc_job_alloc_info_lite JobId=2161634 NodeList=nid0[1323-1347,1352-1535,1540-1603,1608-1647,3829-3839,3844-3907,3912-4118,4302-4550,5862-5951,5956-6045] usec=4968

nid02305:/var/tmp/slurm # bzgrep 3799 slurmctld.log-20161031.bz2
[2016-10-30T00:43:23.536] backfill test for JobID=2379803 Prio=73799 Partition=regular
[2016-10-30T00:46:48.402] backfill test for JobID=2379803 Prio=73799 Partition=regular
[2016-10-30T00:50:38.940] debug: reserved ports 63799 for step 2246906.72
[2016-10-30T02:00:56.224] Job 2512459 to start at 2016-10-31T07:05:00, end at 2016-10-31T19:05:00 on nid0[3790-3799]
[2016-10-30T02:34:24.605] error: Node nid03799 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-10-30T02:38:46.837] agent/is_node_resp: node:nid03799 RPC:REQUEST_PING : Socket timed out on send/recv operation
[2016-10-30T02:38:48.327] error: Nodes nid0[3239,3318,3799,3828] not responding
[2016-10-30T02:39:17.308] debug: Spawning ping agent for nid0[0008-0063,0072-0127,0136-0191,0200-0244,0246-0255,0264-0291,0293-0323,0328-0383,0388-0451,0456-0511,0516-0579,0584-0613,0615-0767,0772-0835,0840-0895,0897-0963,0968-1151,1156-1219,1224-1233,1235-1279,1281-1321,1323-1347,1352-1535,1540-1591,1593-1603,1608-1663,1666-1717,1719-1792,1794-1919,1924-1987,1992-2047,2050-2303,2308-2371,2376-2499,2504-2687,2692-2755,2760-3071,3076-3139,3144-3238,3240-3317,3319-3455,3460-3523,3528-3670,3672-3795,3799,3814-3827,3829-3831,3872-3907,3912-4223,4228-4229,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6104,6106-6114,6116-6121,6123-6129,6132,6134-6139,6141-6142]
[2016-10-30T02:40:17.594] Node nid03799 now responding
[2016-10-30T02:42:37.670] debug: Spawning ping agent for nid0[0008-0023,0025-0063,0072-0127,0136-0191,0200-0244,0246-0255,0264-0291,0293-0323,0328-0383,0388-0451,0456-0497,0502-0511,0516-0579,0584-0613,1124,1128-1151,1156-1219,1224-1233,1235-1279,1281-1321,1323-1347,1352-1535,1540-1591,1593-1603,1608-1663,1666-1717,1719-1792,1794-1919,1924-1987,1992-2047,2050-2303,2308-2371,2376-2499,2504-2687,2692-2755,2760-3012,3014-3071,3076-3139,3144-3238,3240-3317,3319-3455,3460-3523,3528-3670,3672-3795,3799,3814-3827,3829-3831,3854-3871,3912-4223,4228-4229,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6104,6106-6114,6116-6121,6123-6129,6132,6134-6139,6141]
...
...

It's a little odd that it keeps getting the "now responding" messages and then going down again. It looks like slurmdbd only recorded the initial down event 10 minutes after it crashed (which is about right, given that we allow nodes 600s to respond).
nid02305:/var/tmp/slurm # sacctmgr show event nodes=nid03799 start=2016-10-01
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
    edison        nid03799 2016-10-30T02:47:37 2016-11-01T09:01:18  DOWN*                 Not responding    root(0)
    edison        nid03799 2016-11-01T09:29:04             Unknown  DOWN*                 Not responding    root(0)
nid02305:/var/tmp/slurm #

So it seems unlikely a job was completing on this node at the time of the slurmctld crash.
Yeah, I don't think the incident at 2:39 was related (and perhaps this node isn't related altogether). According to your sacctmgr output the node came back up at 2016-11-01T09:01:18 and then went down again at 2016-11-01T09:29:04. That last down is the only time of interest. From the logs:

[2016-11-01T09:01:18.134] update_node: node nid03799 state set to IDLE

something appears to have made that node idle again. But going through the logs I can't find any evidence this node is related to the issue (meaning no jobs were running on it). I am not totally convinced this isn't related to bug 3211, though. In any case, it sounds like things are up and running and this only happened once. As you are going to put 16.05.6 on tomorrow, the problem might never show up again.
Well, I see nid03799 set to IDLE. What's concerning about that is that the node was fully down:

2016-10-30T02:39:41.896598-07:00 c3-2c2s5n3 Stop NMI detected on CPU 0

So it's unclear to me how slurmd could have responded at 09:01 this morning...
This looks like a scontrol thing, not anything from the node.
Odd. I wonder who/what/where is attempting to manipulate the node (cle52 didn't try to adjust the reason code nearly as much as cle6). We updated edison to Slurm 16.05.6 yesterday. I suppose we can wait and see if this issue recurs. -Doug
I've updated the title of the bug. The crash reoccurred in Slurm 16.05.6:

(gdb) bt
#0  0x00007fc03cdf2875 in raise () from /lib64/libc.so.6
#1  0x00007fc03cdf3e51 in abort () from /lib64/libc.so.6
#2  0x00007fc03cdeb740 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000512f14 in bit_ffs (b=0x0) at bitstring.c:484
#4  0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870, job_ptr=0x5c00ee0) at select_cons_res.c:655
#5  0x00007fc030e2adfa in _rm_job_from_res (part_record_ptr=0x26e59d0, node_usage=0x5391130, job_ptr=0x5c00ee0, action=0) at select_cons_res.c:1273
#6  0x00007fc030e2d984 in select_p_job_fini (job_ptr=0x5c00ee0) at select_cons_res.c:2397
#7  0x00007fc03c7ad11b in other_job_fini (job_ptr=0x5c00ee0) at other_select.c:386
#8  0x00007fc03c7a89b6 in _job_fini (args=0x5c00ee0) at select_cray.c:1083
#9  0x00007fc03d143806 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fc03ce9e9bd in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) set print pretty
(gdb) up
#1  0x00007fc03cdf3e51 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x00007fc03cdeb740 in __assert_fail () from /lib64/libc.so.6
(gdb) up
#3  0x0000000000512f14 in bit_ffs (b=0x0) at bitstring.c:484
484     bitstring.c: No such file or directory.
(gdb) up
#4  0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870, job_ptr=0x5c00ee0) at select_cons_res.c:655
655     select_cons_res.c: No such file or directory.
(gdb) up
#5  0x00007fc030e2adfa in _rm_job_from_res (part_record_ptr=0x26e59d0, node_usage=0x5391130, job_ptr=0x5c00ee0, action=0) at select_cons_res.c:1273
1273    in select_cons_res.c
(gdb) down
#4  0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870, job_ptr=0x5c00ee0) at select_cons_res.c:655
655     in select_cons_res.c
(gdb) down
#3  0x0000000000512f14 in bit_ffs (b=0x0) at bitstring.c:484
484     bitstring.c: No such file or directory.
(gdb) up
#4  0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870, job_ptr=0x5c00ee0) at select_cons_res.c:655
655     select_cons_res.c: No such file or directory.
(gdb) print *p_ptr
$1 = {
  next = 0x262f5b0,
  num_rows = 25,
  part_ptr = 0x240f0c0,
  row = 0x4041a60
}
(gdb) print *job_ptr
$2 = {
  account = 0x2813690 "m1304",
  alias_list = 0x0,
  alloc_node = 0x44e1300 "edison06",
  alloc_resp_port = 0,
  alloc_sid = 37672,
  array_job_id = 0,
  array_task_id = 4294967294,
  array_recs = 0x0,
  assoc_id = 7309,
  assoc_ptr = 0x16a8a80,
  batch_flag = 1,
  batch_host = 0x7fbff83da6b0 "nid06111",
  bit_flags = 0,
  burst_buffer = 0x0,
  check_job = 0x0,
  ckpt_interval = 0,
  ckpt_time = 0,
  comment = 0x0,
  cpu_cnt = 0,
  billable_tres = 2,
  cr_enabled = 1,
  deadline = 0,
  db_index = 13707499,
  derived_ec = 0,
  details = 0x5c01220,
  direct_set_prio = 0,
  end_time = 1478181082,
  end_time_exp = 4294967294,
  epilog_running = false,
  exit_code = 0,
  front_end_ptr = 0x0,
  gres = 0x5c01400 "craynetwork:0",
  gres_list = 0x0,
  gres_alloc = 0x7fbff83da5d0 "craynetwork:4",
  gres_req = 0x7fbff83da700 "craynetwork:0",
  gres_used = 0x0,
  group_id = 62595,
  job_id = 2571089,
  job_next = 0x0,
  job_array_next_j = 0x0,
  job_array_next_t = 0x0,
  job_resrcs = 0x7fbff8503d50,
  job_state = 3,
  kill_on_node_fail = 1,
  licenses = 0x5c01430 "cscratch1:1,project:1",
  license_list = 0x7fc012bcff60,
  limit_set = {
    qos = 0,
    time = 0,
    tres = 0x3c84680
  },
  mail_type = 0,
  mail_user = 0x0,
  magic = 4038539564,
  mcs_label = 0x0,
  name = 0x39585d0 "qsub_analyse-161103-RUN51186.sh",
  network = 0x0,
  next_step_id = 0,
  nodes = 0x7fbff81e09a0 "nid06111",
  node_addr = 0x7fbff83da630,
  node_bitmap = 0x7fbff80ac6f0,
  node_bitmap_cg = 0x7fbfd4173500,
  node_cnt = 0,
  node_cnt_wag = 1,
  nodes_completing = 0x0,
  other_port = 0,
  partition = 0x46735c0 "shared",
  part_ptr_list = 0x0,
  part_nodes_missing = false,
  part_ptr = 0x240f0c0,
  power_flags = 0 '\000',
  pre_sus_time = 0,
  preempt_time = 0,
  preempt_in_progress = false,
  priority = 57222,
  priority_array = 0x0,
  prio_factors = 0x3c84610,
  profile = 0,
  qos_id = 1,
  qos_ptr = 0x8f8520,
  reboot = 0 '\000',
  restart_cnt = 0,
  resize_time = 0,
  resv_id = 0,
  resv_name = 0x0,
  resv_ptr = 0x0,
  requid = 4294967295,
  resp_host = 0x0,
  sched_nodes = 0x0,
  select_jobinfo = 0x5c014e0,
  spank_job_env = 0x47a82d0,
  spank_job_env_size = 1,
  start_protocol_ver = 7680,
  start_time = 1478181072,
  state_desc = 0x0,
  state_reason = 0,
  state_reason_prev = 1,
  step_list = 0x456b700,
  suspend_time = 0,
  time_last_active = 1478181082,
  time_limit = 59,
  time_min = 0,
  tot_sus_time = 0,
  total_cpus = 2,
  total_nodes = 1,
  tres_req_cnt = 0x39fa0f0,
  tres_req_str = 0x44e1010 "1=1,2=1300,4=1",
  tres_fmt_req_str = 0x5c013a0 "cpu=1,mem=1300,node=1",
  tres_alloc_cnt = 0x7fbff857d120,
  tres_alloc_str = 0x7fbff857d230 "1=2,2=2600,4=1",
  tres_fmt_alloc_str = 0x7fbff818c840 "cpu=2,mem=2600M,node=1",
  user_id = 62595,
  wait_all_nodes = 0,
  warn_flags = 0,
  warn_signal = 0,
  warn_time = 0,
  wckey = 0x0,
  req_switch = 0,
  wait4switch = 0,
  best_switch = true,
  wait4switch_start = 0
}
(gdb) print *job_ptr->details
$3 = {
  acctg_freq = 0x0,
  argc = 1,
  argv = 0x3bdf730,
  begin_time = 1478181026,
  ckpt_dir = 0x5c01510 "/var/slurm/checkpoint",
  contiguous = 0,
  core_spec = 65534,
  cpu_bind = 0x0,
  cpu_bind_type = 65534,
  cpu_freq_min = 4294967294,
  cpu_freq_max = 4294967294,
  cpu_freq_gov = 4294967294,
  cpus_per_task = 1,
  depend_list = 0x0,
  dependency = 0x0,
  orig_dependency = 0x0,
  env_cnt = 0,
  env_sup = 0x0,
  exc_node_bitmap = 0x0,
  exc_nodes = 0x0,
  expanding_jobid = 0,
  feature_list = 0x0,
  features = 0x0,
  magic = 0,
  max_cpus = 4294967294,
  max_nodes = 1,
  mc_ptr = 0x276f5c0,
  mem_bind = 0x0,
  mem_bind_type = 0,
  min_cpus = 1,
  min_nodes = 1,
  nice = 2147483648,
  ntasks_per_node = 0,
  num_tasks = 1,
  open_mode = 0 '\000',
  overcommit = 0 '\000',
  plane_size = 0,
  pn_min_cpus = 1,
  pn_min_memory = 2147484948,
  pn_min_tmp_disk = 0,
  prolog_running = 0 '\000',
  reserved_resources = 0,
  req_node_bitmap = 0x0,
  req_node_layout = 0x0,
  preempt_start_time = 0,
  req_nodes = 0x0,
  requeue = 0,
  restart_dir = 0x0,
  share_res = 1 '\001',
  std_err = 0x0,
  std_in = 0x5c01460 "/dev/null",
  std_out = 0x0,
  submit_time = 1478181026,
  task_dist = 8192,
  usable_nodes = 0,
  whole_node = 0 '\000',
  work_dir = 0x5c01490 "/global/u1/e/epuesche/EBLtosummer2016/v480a/scripts/VTS"
}
(gdb) print *ss[x].tmpjobs
$4 = {
  core_bitmap = 0x0,
  core_bitmap_used = 0x0,
  cpu_array_cnt = 1,
  cpu_array_value = 0x0,
  cpu_array_reps = 0x0,
  cpus = 0x0,
  cpus_used = 0x0,
  cores_per_socket = 0x0,
  memory_allocated = 0x0,
  memory_used = 0x0,
  nhosts = 1,
  node_bitmap = 0x0,
  node_req = 0,
  nodes = 0x0,
  ncpus = 1,
  sock_core_rep_count = 0x0,
  sockets_per_node = 0x0,
  whole_node = 0 '\000'
}
(gdb)
Ah, well that is good confirmation this isn't related to bug 3211 then. We will continue to look into it. I am guessing this is more related to bug 3185 as previously thought.
Doug could you post an updated slurmctld log for the hour before this incident happened?
Created attachment 3675 [details]
slurmctld log for today

Sorry, I didn't have time to filter it.
I've been reviewing code and testing all day, but unfortunately don't have a solution for you. What I can tell you is that the root problem is a data structure in the select/cons_res plugin that contains a list of pointers to the active jobs' resource allocations. That structure contains a NULL node bitmap, which triggers the abort. The job in the backtrace is not the job with the bad data, but it triggers the failure due to a pre-existing bad data structure. My best guess is that the underlying issue is some sort of race condition involving NHC and job requeue, but that's just a guess. We could get you a patch that skips over the record with the NULL pointer, but at some point the pointer will likely be an invalid non-NULL pointer, and that could result in severe and unpredictable consequences, so I'll concentrate on getting to the bottom of the problem and fixing it.

Notes to self:
1. My best guess is that the free_job_resources() in cr_job_test() in the job_test.c module is getting called on a job that is still running the NHC/cleaning. When NHC completes, it will try to remove the job's record via other_job_fini() called from _job_fini() in select_cray.c, but not find it and leave an old/bad pointer.
2. The use of "job_list" in select/cons_res (see _rm_job_from_res() in select_cons_res.c) is fast and efficient, but not robust in the case of failures. It should probably be changed to a job pointer.
I was just able to reproduce this failure by adding some well-placed "sleep" calls in the Slurm logic. Any time a running job is requeued and the job/allocation NHC runs for longer than the Munge credential lifetime (default 5 minutes), a data structure is cleared and later referenced. There are a lot of moving parts and the failure cannot easily be reproduced, but the requeue/NHC timing issue described above makes this failure possible.
This commit fixes a race condition that I was able to exploit to reproduce this failure. I don't have sufficient information to be certain this is what caused your aborts, but it was definitely a bug that could cause this failure, and I don't see any other possibilities. https://github.com/SchedMD/slurm/commit/69af50af965d70af5642e156b4186ed6cc13aaa8
OK, this seems promising. I'll test this on alva a bit tonight for basic functionality and deploy to edison tomorrow.

Thanks,
Doug
(In reply to Doug Jacobsen from comment #24) > OK, this seems promising. I'll test this on alva a bit tonight for > basic functionality and deploy to edison tomorrow. Since I haven't heard back, I'll close the ticket based upon the assumption this patch fixed the problem. Please re-open if necessary.
Hi Moe,

Sorry about that. I applied the patch on edison (but not cori) and we haven't had any problems since. The key difference between edison and cori is that cori no longer runs the NHC via the Cray/Slurm plugin mechanisms. Though I gather from reading the patch that there are other pathways to hit this bug. I'll roll this patch into cori at the next opportunity.

-Doug