Ticket 3854 - slurmctld segfault in _start_msg_tree_internal
Summary: slurmctld segfault in _start_msg_tree_internal
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 16.05.9
Hardware: Linux
Severity: 2 - High Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-05-31 13:17 MDT by David Gloe
Modified: 2017-06-19 16:09 MDT (History)
0 users

See Also:
Site: CRAY
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: Cray Internal
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
gdb thread apply all bt (54.00 KB, text/plain)
2017-05-31 13:46 MDT, David Gloe
Details
Compressed slurmctld log (26.73 MB, application/x-gzip)
2017-06-08 11:55 MDT, David Gloe
Details

Description David Gloe 2017-05-31 13:17:02 MDT
slurmctld version 16.05.9 segfaulted last weekend.
I'm still looking for the slurmctld log file, but I do have the core file.

#1  0x000000000050b1c7 in _start_msg_tree_internal (hl=0x0, sp_hl=0x7fe968009d30, fwd_tree_in=0x7fe9e9edacd0, hl_count=42)
    at forward.c:538
        attr_agent = {
          __size = "\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\020", '\000' <repeats 16 times>, "\020", '\000'
 <repeats 20 times>, __align = 0}
        thread_agent = 140642610882304
        retries = 0
        j = 32
        fwd_tree = 0x7fe968000ba0
#2  0x000000000050b86a in start_msg_tree (hl=0x7fe968008b60, msg=0x7fe9e9edaec0, timeout=0) at forward.c:718
        fwd_tree = {notify = 0x7fe9e9edad40, p_thr_count = 0x7fe9e9edaca0, orig_msg = 0x7fe9e9edaec0, ret_list = 0x7fe9d0084080, 
          timeout = 10000, tree_hl = 0x0, tree_mutex = 0x7fe9e9edad10}
        tree_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {
              __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        notify = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, 
            __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}
        count = 0
        ret_list = 0x7fe9d0084080
        thr_count = 1
        host_count = 42
        sp_hl = 0x7fe968009d30
        hl_count = 42
        sp_hl = 0x7fe968009d30
        hl_count = 42
#3  0x000000000054d61b in slurm_send_recv_msgs (nodelist=0x7fe8d0018f00 "nid00[145-167,189-191,392-403,440-443]", msg=0x7fe9e9edaec0, 
    timeout=0, quiet=true) at slurm_protocol_api.c:4345
        ret_list = 0x0
        hl = 0x7fe968008b60
#4  0x000000000043e316 in _thread_per_group_rpc (args=0x7fe8d00055f0) at agent.c:908
        rc = 0
        msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, 
          auth_cred = 0x0, conn_fd = -1, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 1008, 
          protocol_version = 7680, forward = {cnt = 0, init = 65534, nodelist = 0x0, timeout = 0, tree_width = 0}, 
          forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, 
            sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        task_ptr = 0x7fe8d00055f0
        thread_mutex_ptr = 0x7fe8d0002ed0
        thread_cond_ptr = 0x7fe8d0002ef8
        threads_active_ptr = 0x7fe8d0002f2c
        thread_ptr = 0x7fe8d0005eb0
        thread_state = DSH_NO_RESP
        msg_type = REQUEST_PING
        is_kill_msg = false
        srun_agent = false
        ret_list = 0x0
        itr = 0x0
        ret_data_info = 0x0
        sig_array = {10, 0}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
        node_read_lock = {config = NO_LOCK, job = NO_LOCK, node = READ_LOCK, partition = NO_LOCK}
        node_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
#5  0x00007fe9f8ebe734 in pthread_create@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Comment 3 Tim Shaw 2017-05-31 13:32:24 MDT
Hello David,

There isn't a lot to go on in that backtrace, but it appears to be memory related, because the crash occurred inside a glibc call (pthread_create).  In gdb, could you run:

thread apply all bt full

and paste in the results?  Also, please attach the controller log so I can go through it.  I'm going to need to do some serious investigating here.

Thanks

Tim
Comment 4 David Gloe 2017-05-31 13:46:15 MDT
Created attachment 4665 [details]
gdb thread apply all bt
Comment 6 Tim Shaw 2017-06-02 10:20:28 MDT
David,

Can you get the controller log file for this crash?

Thanks

Tim
Comment 7 Tim Shaw 2017-06-08 09:08:31 MDT
David,

Have you heard back from the customer about getting the log for this crash?

Thanks

Tim
Comment 8 David Gloe 2017-06-08 11:55:06 MDT
Created attachment 4730 [details]
Compressed slurmctld log

Here are the last few lines from the log file before the segfault:

[2017-05-29T06:24:49.218] error: bb_run_script: teardown poll timeout @ 300000 msec
[2017-05-29T06:24:49.229] _start_teardown: teardown for job 138990 ran for usec=300015463
[2017-05-29T06:24:49.230] error: burst_buffer cray plugin: _start_teardown: teardown for job 138990 status:9 response:
[2017-05-29T06:29:54.235] error: bb_run_script: teardown poll timeout @ 300000 msec
[2017-05-29T06:29:54.247] _start_teardown: teardown for job 138990 ran for usec=300015091
[2017-05-29T06:29:54.248] error: burst_buffer cray plugin: _start_teardown: teardown for job 138990 status:9 response:
Comment 9 Tim Shaw 2017-06-14 10:03:43 MDT
David,

There's some definite memory corruption going on here, because the crash occurs inside a pthread_create call.  The log shows the burst buffer teardown taking over 5 minutes to complete, so it is being killed with SIGKILL (-9).  There may also be an issue with the API used to unmount the file system.

Do you see this issue with 17.02?  I know there are some segfault fixes around the burst buffer code there.

Tim
Comment 10 Tim Shaw 2017-06-15 08:17:17 MDT
Actually, this doesn't appear to be related to burst buffers after all.  However, I would still be interested to know if this still happens in our latest 17.02.
Comment 11 David Gloe 2017-06-15 08:28:59 MDT
We've only seen this once; never on 17.02.
Comment 12 Tim Shaw 2017-06-19 16:09:58 MDT
David,

There have been a lot of memory fixes in 17.02.  Because of that, I'm going to close this ticket, but you can reopen it if you see this problem again in 17.02.

Regards.

Tim