slurmctld version 16.05.9 segfaulted last weekend. I'm still looking for the slurmctld log file, but I do have the core file.

#1  0x000000000050b1c7 in _start_msg_tree_internal (hl=0x0, sp_hl=0x7fe968009d30, fwd_tree_in=0x7fe9e9edacd0, hl_count=42) at forward.c:538
        attr_agent = { __size = "\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\020", '\000' <repeats 16 times>, "\020", '\000' <repeats 20 times>, __align = 0}
        thread_agent = 140642610882304
        retries = 0
        j = 32
        fwd_tree = 0x7fe968000ba0
#2  0x000000000050b86a in start_msg_tree (hl=0x7fe968008b60, msg=0x7fe9e9edaec0, timeout=0) at forward.c:718
        fwd_tree = {notify = 0x7fe9e9edad40, p_thr_count = 0x7fe9e9edaca0, orig_msg = 0x7fe9e9edaec0, ret_list = 0x7fe9d0084080, timeout = 10000, tree_hl = 0x0, tree_mutex = 0x7fe9e9edad10}
        tree_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        notify = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}
        count = 0
        ret_list = 0x7fe9d0084080
        thr_count = 1
        host_count = 42
        sp_hl = 0x7fe968009d30
        hl_count = 42
        sp_hl = 0x7fe968009d30
        hl_count = 42
#3  0x000000000054d61b in slurm_send_recv_msgs (nodelist=0x7fe8d0018f00 "nid00[145-167,189-191,392-403,440-443]", msg=0x7fe9e9edaec0, timeout=0, quiet=true) at slurm_protocol_api.c:4345
        ret_list = 0x0
        hl = 0x7fe968008b60
#4  0x000000000043e316 in _thread_per_group_rpc (args=0x7fe8d00055f0) at agent.c:908
        rc = 0
        msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, conn_fd = -1, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 1008, protocol_version = 7680, forward = {cnt = 0, init = 65534, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        task_ptr = 0x7fe8d00055f0
        thread_mutex_ptr = 0x7fe8d0002ed0
        thread_cond_ptr = 0x7fe8d0002ef8
        threads_active_ptr = 0x7fe8d0002f2c
        thread_ptr = 0x7fe8d0005eb0
        thread_state = DSH_NO_RESP
        msg_type = REQUEST_PING
        is_kill_msg = false
        srun_agent = false
        ret_list = 0x0
        itr = 0x0
        ret_data_info = 0x0
        sig_array = {10, 0}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
        node_read_lock = {config = NO_LOCK, job = NO_LOCK, node = READ_LOCK, partition = NO_LOCK}
        node_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
#5  0x00007fe9f8ebe734 in pthread_create@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Hello David,

There isn't a lot to go on in that backtrace, but it appears to be memory-related because the crash happened inside a glibc call, pthread_create. In gdb, could you run:

    thread apply all bt full

and paste in the results. Also, please attach the controller log so I can go through it. I'm going to need to do some serious investigating here.

Thanks,
Tim
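For anyone reproducing this collection step: the full per-thread backtrace can be captured non-interactively with gdb's batch mode. This is a hedged sketch, not taken from this ticket; the binary and core file paths are placeholders that must be adjusted for the actual installation.

```shell
# Placeholder paths: substitute the real slurmctld binary and core file.
# --batch exits after running the -ex commands; output goes to bt_full.txt.
gdb /usr/sbin/slurmctld /var/spool/slurm/core.12345 \
    --batch -ex 'thread apply all bt full' > bt_full.txt 2>&1
```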
Created attachment 4665 [details] gdb thread apply all bt
David,

Can you get the controller log file for this crash?

Thanks,
Tim
David,

Have you heard back from the customer about getting the log for this crash?

Thanks,
Tim
Created attachment 4730 [details] Compressed slurmctld log

Here are the last few lines from the log file before the segfault:

[2017-05-29T06:24:49.218] error: bb_run_script: teardown poll timeout @ 300000 msec
[2017-05-29T06:24:49.229] _start_teardown: teardown for job 138990 ran for usec=300015463
[2017-05-29T06:24:49.230] error: burst_buffer cray plugin: _start_teardown: teardown for job 138990 status:9 response:
[2017-05-29T06:29:54.235] error: bb_run_script: teardown poll timeout @ 300000 msec
[2017-05-29T06:29:54.247] _start_teardown: teardown for job 138990 ran for usec=300015091
[2017-05-29T06:29:54.248] error: burst_buffer cray plugin: _start_teardown: teardown for job 138990 status:9 response:
David,

There's definite memory corruption going on here, since the crash occurs inside a pthread_create call. The log shows the burst buffer teardown taking over 5 minutes to complete, at which point the script is killed with SIGKILL (the status:9 in the log). So there may also be an issue with the API used to unmount the file system.

Do you see this issue with 17.02? I know there are some segfault fixes around the burst buffer code there.

Tim
Actually, this doesn't appear to be related to burst buffers after all. However, I would still be interested to know if this still happens in our latest 17.02.
We've only seen this once; never on 17.02.
David,

There have been a lot of memory fixes in 17.02. Because of that, I'm going to close this bug, but you can reopen it if you see the problem again on 17.02.

Regards,
Tim