Ticket 3854 - slurmctld segfault in _start_msg_tree_internal
Summary: slurmctld segfault in _start_msg_tree_internal
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 16.05.9
Hardware: Linux
Severity: 2 - High Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-05-31 13:17 MDT by David Gloe
Modified: 2017-06-19 16:09 MDT (History)
0 users

See Also:
Site: CRAY
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: Cray Internal
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
gdb thread apply all bt (54.00 KB, text/plain)
2017-05-31 13:46 MDT, David Gloe
Details
Compressed slurmctld log (26.73 MB, application/x-gzip)
2017-06-08 11:55 MDT, David Gloe
Details

Description David Gloe 2017-05-31 13:17:02 MDT
slurmctld version 16.05.9 segfaulted last weekend.
I'm still looking for the slurmctld log file, but I do have the core file.

#1  0x000000000050b1c7 in _start_msg_tree_internal (hl=0x0, sp_hl=0x7fe968009d30, fwd_tree_in=0x7fe9e9edacd0, hl_count=42)
    at forward.c:538
        attr_agent = {
          __size = "\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\020", '\000' <repeats 16 times>, "\020", '\000'
 <repeats 20 times>, __align = 0}
        thread_agent = 140642610882304
        retries = 0
        j = 32
        fwd_tree = 0x7fe968000ba0
#2  0x000000000050b86a in start_msg_tree (hl=0x7fe968008b60, msg=0x7fe9e9edaec0, timeout=0) at forward.c:718
        fwd_tree = {notify = 0x7fe9e9edad40, p_thr_count = 0x7fe9e9edaca0, orig_msg = 0x7fe9e9edaec0, ret_list = 0x7fe9d0084080, 
          timeout = 10000, tree_hl = 0x0, tree_mutex = 0x7fe9e9edad10}
        tree_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {
              __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}
        notify = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, 
            __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}
        count = 0
        ret_list = 0x7fe9d0084080
        thr_count = 1
        host_count = 42
        sp_hl = 0x7fe968009d30
        hl_count = 42
        sp_hl = 0x7fe968009d30
        hl_count = 42
#3  0x000000000054d61b in slurm_send_recv_msgs (nodelist=0x7fe8d0018f00 "nid00[145-167,189-191,392-403,440-443]", msg=0x7fe9e9edaec0, 
    timeout=0, quiet=true) at slurm_protocol_api.c:4345
        ret_list = 0x0
        hl = 0x7fe968008b60
#4  0x000000000043e316 in _thread_per_group_rpc (args=0x7fe8d00055f0) at agent.c:908
        rc = 0
        msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, 
          auth_cred = 0x0, conn_fd = -1, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 1008, 
          protocol_version = 7680, forward = {cnt = 0, init = 65534, nodelist = 0x0, timeout = 0, tree_width = 0}, 
          forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, 
            sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        task_ptr = 0x7fe8d00055f0
        thread_mutex_ptr = 0x7fe8d0002ed0
        thread_cond_ptr = 0x7fe8d0002ef8
        threads_active_ptr = 0x7fe8d0002f2c
        thread_ptr = 0x7fe8d0005eb0
        thread_state = DSH_NO_RESP
        msg_type = REQUEST_PING
        is_kill_msg = false
        srun_agent = false
        ret_list = 0x0
        itr = 0x0
        ret_data_info = 0x0
        sig_array = {10, 0}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
        node_read_lock = {config = NO_LOCK, job = NO_LOCK, node = READ_LOCK, partition = NO_LOCK}
        node_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
#5  0x00007fe9f8ebe734 in pthread_create@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Comment 3 Tim Shaw 2017-05-31 13:32:24 MDT
Hello David,

There isn't a lot to go on in that backtrace, but it appears to be memory related, because the crash occurred inside a glibc call (pthread_create).  In gdb, could you run:

thread apply all bt full

and paste in the results?  Also, please attach the controller log so I can go through it.  I'm going to need to do some serious investigating here.

Thanks

Tim
Comment 4 David Gloe 2017-05-31 13:46:15 MDT
Created attachment 4665 [details]
gdb thread apply all bt
Comment 6 Tim Shaw 2017-06-02 10:20:28 MDT
David,

Can you get the controller log file for this crash?

Thanks

Tim
Comment 7 Tim Shaw 2017-06-08 09:08:31 MDT
David,

Have you heard back from the customer about getting the log for this crash?

Thanks

Tim
Comment 8 David Gloe 2017-06-08 11:55:06 MDT
Created attachment 4730 [details]
Compressed slurmctld log

Here are the last few lines from the log file before the segfault:

[2017-05-29T06:24:49.218] error: bb_run_script: teardown poll timeout @ 300000 msec
[2017-05-29T06:24:49.229] _start_teardown: teardown for job 138990 ran for usec=300015463
[2017-05-29T06:24:49.230] error: burst_buffer cray plugin: _start_teardown: teardown for job 138990 status:9 response:
[2017-05-29T06:29:54.235] error: bb_run_script: teardown poll timeout @ 300000 msec
[2017-05-29T06:29:54.247] _start_teardown: teardown for job 138990 ran for usec=300015091
[2017-05-29T06:29:54.248] error: burst_buffer cray plugin: _start_teardown: teardown for job 138990 status:9 response:
Comment 9 Tim Shaw 2017-06-14 10:03:43 MDT
David,

There's some definite memory corruption going on here, because the crash occurs inside a pthread_create call.  The log shows the burst buffer teardown taking over 5 minutes to complete, so it is being killed with SIGKILL (-9).  There may also be an issue with the API used to unmount the file system.

Do you see this issue with 17.02?  I know there are some segfault fixes around the burst buffer code there.

Tim
Comment 10 Tim Shaw 2017-06-15 08:17:17 MDT
Actually, this doesn't appear to be related to burst buffers after all.  However, I would still be interested to know if this still happens in our latest 17.02.
Comment 11 David Gloe 2017-06-15 08:28:59 MDT
We've only seen this once; never on 17.02.
Comment 12 Tim Shaw 2017-06-19 16:09:58 MDT
David,

There have been a lot of memory fixes in 17.02.  Because of that, I'm going to close this ticket, but you can reopen it if you see this problem again in 17.02.

Regards.

Tim