Ticket 3237 - slurmd crashes randomly on restart
Summary: slurmd crashes randomly on restart
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 16.05.6
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-11-03 10:46 MDT by CSC sysadmins
Modified: 2016-11-11 08:05 MST

See Also:
Site: CSC - IT Center for Science
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 16.05.7
Target Release: ---
DevPrio: ---


Attachments
slurmd.log and relevant lines from slurmctld.log (3.71 KB, text/plain)
2016-11-08 07:12 MST, CSC sysadmins

Description CSC sysadmins 2016-11-03 10:46:29 MDT
If I restart the slurmd service on all compute nodes, about 5% of the nodes won't be able to start slurmd due to a crash in the RAPL plugin, and I need to start the service again.

Nov  3 18:27:37 c2 kernel: slurmd[21570]: segfault at 18 ip 00007fda7cc81df2 sp 00007fda7cf84d30 error 4 in acct_gather_energy_rapl.so[7fda7cc80000+3000]

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=energy=30,task=30
AcctGatherEnergyType=acct_gather_energy/rapl
AcctGatherNodeFreq=30
Comment 1 Alejandro Sanchez 2016-11-07 04:35:24 MST
Hi Tommi - did the crash generate any core files? If so, could you please attach the output of:

$ gdb path/to/slurmd path/to/core
(gdb) set pagination off
(gdb) bt
(gdb) thread apply all bt full

If it didn't generate any core files, could you please check whether your core file size ulimit is set to unlimited and try starting slurmd again, so that the crash generates a core file this time?
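For reference, the usual sequence for capturing a core from a daemon looks roughly like this (a sketch only; the core_pattern path below is an example, not something from this ticket):

```shell
# Raise the core file size limit in the shell that restarts slurmd;
# the daemon inherits the limit from that shell.
ulimit -c unlimited

# Optionally (as root) direct cores to a predictable path; the
# pattern here is only an example:
#   echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# After the crash, open the core in gdb:
#   gdb /usr/sbin/slurmd /var/tmp/core.slurmd.12345
```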

Also could you please attach the output of:

# addr2line -e /path/to/slurmd 00007fda7cc81df2

Your slurm.conf, slurmctld.log and slurmd.log from a node where the crash happens might be useful too. Thanks.
Comment 2 CSC sysadmins 2016-11-07 08:12:16 MST
I finally got a core dump; I had to run "service slurm restart" in a while-true loop. The config can be found in BR#3215.

GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmd...done.
[New Thread 4705]
[New Thread 4704]
[New Thread 4694]
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /lib64/libpci.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpci.so.3
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libxml2.so.2
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/slurm/select_cons_res.so...done.
Loaded symbols for /usr/lib64/slurm/select_cons_res.so
Reading symbols from /usr/lib64/slurm/gres_mic.so...done.
Loaded symbols for /usr/lib64/slurm/gres_mic.so
Reading symbols from /usr/lib64/slurm/gres_gpu.so...done.
Loaded symbols for /usr/lib64/slurm/gres_gpu.so
Reading symbols from /usr/lib64/slurm/topology_tree.so...done.
Loaded symbols for /usr/lib64/slurm/topology_tree.so
Reading symbols from /usr/lib64/slurm/route_default.so...done.
Loaded symbols for /usr/lib64/slurm/route_default.so
Reading symbols from /usr/lib64/slurm/proctrack_cgroup.so...done.
Loaded symbols for /usr/lib64/slurm/proctrack_cgroup.so
Reading symbols from /usr/lib64/slurm/task_cgroup.so...done.
Loaded symbols for /usr/lib64/slurm/task_cgroup.so
Reading symbols from /usr/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/lib64/slurm/crypto_munge.so...done.
Loaded symbols for /usr/lib64/slurm/crypto_munge.so
Reading symbols from /usr/lib64/slurm/jobacct_gather_linux.so...done.
Loaded symbols for /usr/lib64/slurm/jobacct_gather_linux.so
Reading symbols from /usr/lib64/slurm/job_container_none.so...done.
Loaded symbols for /usr/lib64/slurm/job_container_none.so
Reading symbols from /usr/lib64/slurm/core_spec_none.so...done.
Loaded symbols for /usr/lib64/slurm/core_spec_none.so
Reading symbols from /usr/lib64/slurm/switch_none.so...done.
Loaded symbols for /usr/lib64/slurm/switch_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_energy_rapl.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_energy_rapl.so
Reading symbols from /usr/lib64/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_profile_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_infiniband_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_infiniband_none.so
Core was generated by `/usr/sbin/slurmd -M'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f6a7fdfcdf2 in acct_gather_energy_p_update_node_energy () at acct_gather_energy_rapl.c:438
438     acct_gather_energy_rapl.c: No such file or directory.
        in acct_gather_energy_rapl.c
Missing separate debuginfos, use: debuginfo-install slurm-16.05.6-1.el6.x86_64
(gdb) set pagination off
(gdb) bt
#0  0x00007f6a7fdfcdf2 in acct_gather_energy_p_update_node_energy () at acct_gather_energy_rapl.c:438
#1  0x00000000005317ce in acct_gather_energy_g_update_node_energy () at slurm_acct_gather_energy.c:287
#2  0x0000000000437aaf in _rpc_acct_gather_update (msg=0x7f6a780008d0) at req.c:2712
#3  0x0000000000430cd8 in slurmd_req (msg=0x7f6a780008d0) at req.c:405
#4  0x000000000042b41f in _service_connection (arg=0x23991e0) at slurmd.c:606
#5  0x00007f6a8a93eaa1 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f6a8a68baad in clone () from /lib64/libc.so.6

(gdb) thread apply all bt full

Thread 3 (Thread 0x7f6a8b398700 (LWP 4694)):
#0  0x00007f6a8a945b2d in accept () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00000000004edee3 in slurm_accept_msg_conn (fd=4, addr=0x2396bb0) at slurm_protocol_socket_implementation.c:477
        len = 16
#2  0x000000000042add5 in _msg_engine () at slurmd.c:470
        cli = 0x2396bb0
        sock = 6
#3  0x000000000042aadb in main (argc=2, argv=0x7ffc1919ac08) at slurmd.c:380
        i = 32768
        pidfd = 5
        blocked_signals = {13, 0}
        cc = 32768
        oom_value = 0x0
        slurmd_uid = 0
        curr_uid = 0
        time_stamp = "Mon, 07 Nov 2016 17:05:19 +0200\000\001\000\000\000j\177\000\000\210q;\213j\177\000\000Ȣ ", '\000' <repeats 13 times>"\270, \346\061\212j\177\000\000\240Z*\002", '\000' <repeats 12 times>"\340, t;\213j\177\000\000\360\252\031\031\374\177\000\000\000\000\000\000\000\000\000\000\b\253\031\031\374\177\000\000\020N\021\212\001\000\000\000\001\000\000\000\000\000\000\000\070yA", '\000' <repeats 13 times>, "\027\000\000\000\000\000\000\000`X*\002\000\000\000\000թ\032\213j\177\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000p6Z", '\000' <repeats 13 times>, "{\223B\000\000\000\000\000\002\000\000\000\000\000\000\000\265"...
        lopts = {stderr_level = LOG_LEVEL_INFO, syslog_level = LOG_LEVEL_INFO, logfile_level = LOG_LEVEL_INFO, prefix_level = 1, buffered = 0}

Thread 2 (Thread 0x7f6a87cd8700 (LWP 4704)):
#0  0x00007f6a8b19f659 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#1  0x00007f6a8b19ffea in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#2  0x00007f6a8b1a1834 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#3  0x00007f6a8b1a8dba in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#4  0x00007f6a8b1a4366 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#5  0x00007f6a8b1a871a in _dl_open () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#6  0x00007f6a8ab54f66 in dlopen_doit () from /lib64/libdl.so.2
No symbol table info available.
#7  0x00007f6a8b1a4366 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#8  0x00007f6a8ab5529c in _dlerror_run () from /lib64/libdl.so.2
No symbol table info available.
#9  0x00007f6a8ab54ee1 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
No symbol table info available.
#10 0x00000000004854cd in plugin_load_from_file (p=0x7f6a87cd7968, fq_path=0x7f6a80000fb0 "/usr/lib64/slurm/acct_gather_infiniband_none.so") at plugin.c:189
        plug = 0x11
        init = 0x10
        version = 0x49066a
        type = 0x0
#11 0x0000000000485899 in plugin_load_and_link (type_name=0x7f6a80001d40 "acct_gather_infiniband/none", n_syms=4, names=0x7f8f40, ptrs=0x801380) at plugin.c:280
        got_colon = false
        plug = 0x0
        st = {st_dev = 2049, st_ino = 807936, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 10522, st_blksize = 4096, st_blocks = 24, st_atim = {tv_sec = 1478510745, tv_nsec = 0}, st_mtim = {tv_sec = 1477901738, tv_nsec = 0}, st_ctim = {tv_sec = 1477904537, tv_nsec = 0}, __unused = {0, 0, 0}}
        head = 0x7f6a80001d80 "/usr/lib64/slurm"
        dir_array = 0x7f6a80001d80 "/usr/lib64/slurm"
        so_name = 0x7f6a80000f00 "acct_gather_infiniband_none.so"
        file_name = 0x7f6a80000fb0 "/usr/lib64/slurm/acct_gather_infiniband_none.so"
        i = 16
        err = EPLUGIN_NOTFOUND
#12 0x0000000000485cac in plugin_context_create (plugin_type=0x5c3555 "acct_gather_infiniband", uler_type=0x7f6a800016b0 "acct_gather_infiniband/none", ptrs=0x801380, names=0x7f8f40, names_size=32) at plugin.c:431
        c = 0x7f6a80001730
        n_names = 4
#13 0x00000000005338d8 in acct_gather_infiniband_init () at slurm_acct_gather_infiniband.c:132
        retval = 0
        plugin_type = 0x5c3555 "acct_gather_infiniband"
        type = 0x7f6a800016b0 "acct_gather_infiniband/none"
#14 0x0000000000533c29 in acct_gather_infiniband_g_conf_options (full_options=0x7f6a87cd7b00, full_options_cnt=0x7f6a87cd7b1c) at slurm_acct_gather_infiniband.c:212
No locals.
#15 0x000000000052b1e3 in acct_gather_conf_init () at slurm_acct_gather.c:82
        tbl = 0x0
        conf_path = 0x0
        full_options = 0x0
        full_options_cnt = 0
        i = 4890160
        buf = {st_dev = 140095390748296, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 2147487616, __pad0 = 32618, st_rdev = 140095390748816, st_size = 8393056, st_blksize = 0, st_blocks = 140095568222535, st_atim = {tv_sec = 6, tv_nsec = 140095521651504}, st_mtim = {tv_sec = 4742316, tv_nsec = 140095521651472}, st_ctim = {tv_sec = 48, tv_nsec = 8359552}, __unused = {8392992, 140095390748272, 6040493}}
#16 0x00000000005311d7 in slurm_acct_gather_energy_init () at slurm_acct_gather_energy.c:156
        retval = 0
        plugin_type = 0x5c2bad "acct_gather_energy"
        type = 0x7f6a80000e70 "acct_gather_energy/rapl"
#17 0x00000000005317f1 in acct_gather_energy_g_get_data (data_type=ENERGY_DATA_NODE_ENERGY, data=0x7f6a800034b0) at slurm_acct_gather_energy.c:297
        retval = -1
#18 0x000000000042be3e in _fill_registration_msg (msg=0x7f6a800008d0) at slurmd.c:781
        steps = 0x0
        i = 0x2348980
        stepd = 0x0
        n = 0
        arch = 0x0
        os = 0x0
        buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "c913", '\000' <repeats 60 times>, release = "2.6.32-642.6.1.el6.x86_64", '\000' <repeats 39 times>, version = "#1 SMP Mon Oct 24 16:32:22 EEST 2016", '\000' <repeats 28 times>, machine = "x86_64", '\000' <repeats 58 times>, __domainname = "(none)", '\000' <repeats 58 times>}
        first_msg = false
        slurmd_start_time = 1478531120
        gres_info = 0x7f6a800009e0
        __func__ = "_fill_registration_msg"
#19 0x000000000042b523 in send_registration_msg (status=0, startup=true) at slurmd.c:627
        rc = 0
        ret_val = 0
        msg = 0x7f6a800008d0
#20 0x000000000042ad01 in _registration_engine (arg=0x0) at slurmd.c:440
No locals.
#21 0x00007f6a8a93eaa1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#22 0x00007f6a8a68baad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7f6a87bd7700 (LWP 4705)):
#0  0x00007f6a7fdfcdf2 in acct_gather_energy_p_update_node_energy () at acct_gather_energy_rapl.c:438
        rc = 0
#1  0x00000000005317ce in acct_gather_energy_g_update_node_energy () at slurm_acct_gather_energy.c:287
        retval = -1
#2  0x0000000000437aaf in _rpc_acct_gather_update (msg=0x7f6a780008d0) at req.c:2712
        resp_msg = {address = {sin_family = 28080, sin_port = 34749, sin_addr = {s_addr = 32618}, sin_zero = "`\n\000xj\177\000"}, auth_cred = 0xad, conn_fd = 1017, data = 0x0, data_size = 5957208, flags = 0, msg_index = 0, msg_type = 65535, protocol_version = 65535, forward = {cnt = 27456, init = 34749, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 2, sin_port = 28348, sin_addr = {s_addr = 16779786}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        acct_msg = {energy = 0x3f1, node_name = 0x5af1f0 "", sensor_cnt = 56224}
        rc = 0
        req_uid = 88
        first_msg = false
#3  0x0000000000430cd8 in slurmd_req (msg=0x7f6a780008d0) at req.c:405
        rc = 0
#4  0x000000000042b41f in _service_connection (arg=0x23991e0) at slurmd.c:606
        con = 0x23991e0
        msg = 0x7f6a780008d0
        rc = 0
#5  0x00007f6a8a93eaa1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6  0x00007f6a8a68baad in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) quit

addr2line -e /usr/sbin/slurmd 00007f6a7fdfcdf2
??:0
Comment 3 Alejandro Sanchez 2016-11-07 09:11:24 MST
Tommi - can you check whether the 'msr' module is missing on any of your nodes?

$ lsmod | grep msr

Anyhow, I've tried loading the RAPL plugin without the 'msr' module loaded and I get this error:

slurmd: error: MSR register problem (/dev/cpu/0/msr): No such file or directory
slurmd: error: lseek of /dev/cpu/#/msr: Bad file descriptor

But slurmd doesn't crash in my case.
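For context, the RAPL plugin gathers energy data by reading model-specific registers through these /dev/cpu/N/msr device files. Below is a minimal sketch of that access pattern, not Slurm's actual code; 0x611 is MSR_PKG_ENERGY_STATUS, the package energy counter that RAPL-style accounting reads:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one 64-bit MSR from CPU 0. Returns 0 on success, -1 on error.
 * The msr device exposes registers at their offsets, so a positioned
 * read at offset `reg` fetches that register's value. */
int read_msr(uint32_t reg, uint64_t *out)
{
	int fd = open("/dev/cpu/0/msr", O_RDONLY);
	if (fd < 0) {
		perror("open /dev/cpu/0/msr");  /* device missing or no permission */
		return -1;
	}
	if (pread(fd, out, sizeof(*out), reg) != (ssize_t) sizeof(*out)) {
		perror("pread");
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}
```

If the device node is absent (no msr support) or unreadable, this fails with an error message rather than crashing, which matches the non-crashing behavior described above.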
Comment 4 Alejandro Sanchez 2016-11-07 09:44:12 MST
Also please attach the slurmctld.log and the slurmd.log of a crashing node.
Comment 5 CSC sysadmins 2016-11-08 07:12:56 MST
Created attachment 3690 [details]
slurmd.log and relevant lines from slurmctld.log

msr is not a module in the el6 kernel; it is built in:

grep -i msr /boot/config-2.6.32-642.6.1.el6.x86_64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_X86_MSR=y
CONFIG_SCSI_ARCMSR=m
CONFIG_SCSI_ARCMSR_AER=y


ls -al /dev/cpu/0/msr
crw-rw---- 1 root root 202, 0 Oct 25 14:47 /dev/cpu/0/msr
Comment 6 Alejandro Sanchez 2016-11-08 10:16:32 MST
I don't think it is related to the crash, but you have warnings like this logged in your slurmd.log:

Nov  7 16:27:54 c913 slurmd[31568]: TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology

That message just informs you that no single switch reaches all nodes. I'm not sure whether that is intentional, but it is an uncommon configuration.

Since I also see these messages in slurmctld.log:

slurmctld.log
[2016-11-07T16:27:50.820] agent/is_node_resp: node:c913 RPC:REQUEST_ACCT_GATHER_UPDATE : Zero Bytes were transmitted or received
[2016-11-07T16:27:52.858] Node c913 now responding

I'm wondering if there are communication errors when you restart slurmd service on all compute nodes. A healthy RPC looks like this:

slurmctld.log:
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_ACCT_GATHER_UPDATE

slurmd.log:
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE

Can you please confirm whether no common top-level switch is intentional?
Comment 7 CSC sysadmins 2016-11-09 07:25:43 MST
> I'm wondering if there are communication errors when you restart slurmd service
> on all compute nodes. A healthy RPC looks like this:
> 
> slurmctld.log:
> slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_ACCT_GATHER_UPDATE
> 
> slurmd.log:
> slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE

Hi,

That "Zero Bytes.." error message appears only when slurmd fails to start on the node. I need to run the restart in a loop to reproduce the crash:

while true; do service slurm restart; sleep 1; done

> Can you please confirm whether no common top-level switch is intentional?

Yes, it's intentional, because there is no common top-level InfiniBand switch for all compute nodes. We have multiple islands: snb, hsw, two GPU islands, etc.
Comment 8 Alejandro Sanchez 2016-11-09 07:44:21 MST
I'm wondering if there's a synchronization problem at restart time. The crash occurs while accessing this global variable in the acct_gather_energy/rapl plugin, in the acct_gather_energy_p_update_node_energy() function:

if (local_energy->current_watts == NO_VAL)

local_energy is defined as a static global variable in the plugin like this:

static acct_gather_energy_t *local_energy = NULL;

When the plugin is unloaded, which I expect happens at slurmd restart, its fini() function is called:

extern int fini(void)
{
        int i;

        if (!_run_in_daemon())
                return SLURM_SUCCESS;

        for (i = 0; i < nb_pkg; i++) {
                if (pkg_fd[i] != -1) {
                        close(pkg_fd[i]);
                        pkg_fd[i] = -1;
                }
        }

        acct_gather_energy_destroy(local_energy);
        local_energy = NULL;
        return SLURM_SUCCESS;
}

I'm wondering if a REQUEST_ACCT_GATHER_UPDATE RPC is being processed by slurmd before local_energy has been allocated on the heap again. That scenario could lead to this crash.
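The hypothesis above can be sketched as a guard on that dereference. This is a stand-alone illustration, not Slurm's actual code: the struct is trimmed to one field, and update_node_energy() is a simplified stand-in for the plugin's update function (slurm.h does define NO_VAL as 0xfffffffe):

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for Slurm's definitions. */
#define SLURM_SUCCESS   0
#define SLURM_ERROR   (-1)
#define NO_VAL 0xfffffffe

typedef struct {
	unsigned int current_watts;
} acct_gather_energy_t;

/* The plugin's static global: fini() frees it and resets it to NULL
 * when slurmd shuts down. */
static acct_gather_energy_t *local_energy = NULL;

/* Guarded update: if an RPC races in before init() has reallocated
 * local_energy, log and bail out instead of dereferencing NULL. */
int update_node_energy(void)
{
	if (local_energy == NULL) {
		fprintf(stderr, "trying to update node energy, "
			"but no local_energy yet\n");
		return SLURM_ERROR;
	}
	if (local_energy->current_watts == NO_VAL)
		return SLURM_SUCCESS;	/* first reading not taken yet */
	/* ... the real plugin reads the RAPL counters here ... */
	return SLURM_SUCCESS;
}
```

Without the NULL check, the same RPC would dereference a NULL local_energy and produce exactly the segfault shown in the backtrace above.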
Comment 11 Alejandro Sanchez 2016-11-09 11:53:47 MST
Tommi - the following commit should prevent the segfault and will be available in 16.05.7:

https://github.com/SchedMD/slurm/commit/c00b3da7fdf3

You can apply it earlier as a local patch if desired. Closing the ticket; please reopen if you encounter further issues.
Comment 12 CSC sysadmins 2016-11-10 02:14:55 MST
I restarted slurmd 339 times with this patch and could not reproduce the crash anymore. Thanks for the fix.
Comment 13 CSC sysadmins 2016-11-10 02:27:52 MST
I found these two lines in the log, so the race was actually hit (and handled) two times out of 339 restarts.

Nov 10 11:06:54 c913 slurmd[14651]: debug:  acct_gather_energy_p_update_node_energy: trying to update node energy, but no local_energyyet.
Nov 10 11:09:54 c913 slurmd[18980]: debug:  acct_gather_energy_p_update_node_energy: trying to update node energy, but no local_energyyet.
Comment 14 Alejandro Sanchez 2016-11-10 03:29:27 MST
All right. I'm going to patch the missing space in "but no local_energyyet." as well.