If I restart slurmd service on all compute nodes, about 5% of nodes wont be able to start slurmd due to crash on rapl and I need to start service again. Nov 3 18:27:37 c2 kernel: slurmd[21570]: segfault at 18 ip 00007fda7cc81df2 sp 00007fda7cf84d30 error 4 in acct_gather_energy_rapl.so[7fda7cc80000+3000] # ACCOUNTING JobAcctGatherType=jobacct_gather/linux JobAcctGatherFrequency=energy=30,task=30 AcctGatherEnergyType=acct_gather_energy/rapl AcctGatherNodeFreq=30
Hi Tommi - did the crash generate any core files? if so you could you please attach the output of: $ gdb path/to/slurmd path/to/core (gdb) set pagination off (gdb) bt (gdb) thread apply all bt full If it didn't generate any core files, could you please check whether you have the ulimit core file size set to unlimited and try to start slurmd again, so the crash might generate a core file this time? Also could you please attach the output of: # addr2line -e /path/to/slurmd 00007fda7cc81df2 Your slurm.conf, slurmctld.log and slurmd.log from a node where the crash happens might be useful too. Thanks.
Finally I got a core dump, had to run service slurm restart in a while true loop. Config can be found on BR#3215 GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmd...done. [New Thread 4705] [New Thread 4704] [New Thread 4694] Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libz.so.1 Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libhwloc.so.5 Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. [Thread debugging using libthread_db enabled] Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libm.so.6 Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libnuma.so.1 Reading symbols from /lib64/libpci.so.3...(no debugging symbols found)...done. Loaded symbols for /lib64/libpci.so.3 Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libxml2.so.2 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Reading symbols from /usr/lib64/slurm/select_cons_res.so...done. Loaded symbols for /usr/lib64/slurm/select_cons_res.so Reading symbols from /usr/lib64/slurm/gres_mic.so...done. Loaded symbols for /usr/lib64/slurm/gres_mic.so Reading symbols from /usr/lib64/slurm/gres_gpu.so...done. Loaded symbols for /usr/lib64/slurm/gres_gpu.so Reading symbols from /usr/lib64/slurm/topology_tree.so...done. Loaded symbols for /usr/lib64/slurm/topology_tree.so Reading symbols from /usr/lib64/slurm/route_default.so...done. Loaded symbols for /usr/lib64/slurm/route_default.so Reading symbols from /usr/lib64/slurm/proctrack_cgroup.so...done. Loaded symbols for /usr/lib64/slurm/proctrack_cgroup.so Reading symbols from /usr/lib64/slurm/task_cgroup.so...done. Loaded symbols for /usr/lib64/slurm/task_cgroup.so Reading symbols from /usr/lib64/slurm/auth_munge.so...done. Loaded symbols for /usr/lib64/slurm/auth_munge.so Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libmunge.so.2 Reading symbols from /usr/lib64/slurm/crypto_munge.so...done. Loaded symbols for /usr/lib64/slurm/crypto_munge.so Reading symbols from /usr/lib64/slurm/jobacct_gather_linux.so...done. Loaded symbols for /usr/lib64/slurm/jobacct_gather_linux.so Reading symbols from /usr/lib64/slurm/job_container_none.so...done. Loaded symbols for /usr/lib64/slurm/job_container_none.so Reading symbols from /usr/lib64/slurm/core_spec_none.so...done. Loaded symbols for /usr/lib64/slurm/core_spec_none.so Reading symbols from /usr/lib64/slurm/switch_none.so...done. Loaded symbols for /usr/lib64/slurm/switch_none.so Reading symbols from /usr/lib64/slurm/acct_gather_energy_rapl.so...done. Loaded symbols for /usr/lib64/slurm/acct_gather_energy_rapl.so Reading symbols from /usr/lib64/slurm/acct_gather_profile_none.so...done. Loaded symbols for /usr/lib64/slurm/acct_gather_profile_none.so Reading symbols from /usr/lib64/slurm/acct_gather_infiniband_none.so...done. Loaded symbols for /usr/lib64/slurm/acct_gather_infiniband_none.so Core was generated by `/usr/sbin/slurmd -M'. Program terminated with signal 11, Segmentation fault. #0 0x00007f6a7fdfcdf2 in acct_gather_energy_p_update_node_energy () at acct_gather_energy_rapl.c:438 438 acct_gather_energy_rapl.c: No such file or directory. in acct_gather_energy_rapl.c Missing separate debuginfos, use: debuginfo-install slurm-16.05.6-1.el6.x86_64 (gdb) set pagination off (gdb) bt #0 0x00007f6a7fdfcdf2 in acct_gather_energy_p_update_node_energy () at acct_gather_energy_rapl.c:438 #1 0x00000000005317ce in acct_gather_energy_g_update_node_energy () at slurm_acct_gather_energy.c:287 #2 0x0000000000437aaf in _rpc_acct_gather_update (msg=0x7f6a780008d0) at req.c:2712 #3 0x0000000000430cd8 in slurmd_req (msg=0x7f6a780008d0) at req.c:405 #4 0x000000000042b41f in _service_connection (arg=0x23991e0) at slurmd.c:606 #5 0x00007f6a8a93eaa1 in start_thread () from /lib64/libpthread.so.0 #6 0x00007f6a8a68baad in clone () from /lib64/libc.so.6 (gdb) thread apply all bt full Thread 3 (Thread 0x7f6a8b398700 (LWP 4694)): #0 0x00007f6a8a945b2d in accept () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00000000004edee3 in slurm_accept_msg_conn (fd=4, addr=0x2396bb0) at slurm_protocol_socket_implementation.c:477 len = 16 #2 0x000000000042add5 in _msg_engine () at slurmd.c:470 cli = 0x2396bb0 sock = 6 #3 0x000000000042aadb in main (argc=2, argv=0x7ffc1919ac08) at slurmd.c:380 i = 32768 pidfd = 5 blocked_signals = {13, 0} cc = 32768 oom_value = 0x0 slurmd_uid = 0 curr_uid = 0 time_stamp = "Mon, 07 Nov 2016 17:05:19 +0200\000\001\000\000\000j\177\000\000\210q;\213j\177\000\000Ȣ ", '\000' <repeats 13 times>"\270, \346\061\212j\177\000\000\240Z*\002", '\000' <repeats 12 times>"\340, t;\213j\177\000\000\360\252\031\031\374\177\000\000\000\000\000\000\000\000\000\000\b\253\031\031\374\177\000\000\020N\021\212\001\000\000\000\001\000\000\000\000\000\000\000\070yA", '\000' <repeats 13 times>, "\027\000\000\000\000\000\000\000`X*\002\000\000\000\000թ\032\213j\177\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000p6Z", '\000' <repeats 13 times>, "{\223B\000\000\000\000\000\002\000\000\000\000\000\000\000\265"... lopts = {stderr_level = LOG_LEVEL_INFO, syslog_level = LOG_LEVEL_INFO, logfile_level = LOG_LEVEL_INFO, prefix_level = 1, buffered = 0} Thread 2 (Thread 0x7f6a87cd8700 (LWP 4704)): #0 0x00007f6a8b19f659 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2 No symbol table info available. #1 0x00007f6a8b19ffea in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2 No symbol table info available. #2 0x00007f6a8b1a1834 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2 No symbol table info available. #3 0x00007f6a8b1a8dba in dl_open_worker () from /lib64/ld-linux-x86-64.so.2 No symbol table info available. #4 0x00007f6a8b1a4366 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 No symbol table info available. #5 0x00007f6a8b1a871a in _dl_open () from /lib64/ld-linux-x86-64.so.2 No symbol table info available. #6 0x00007f6a8ab54f66 in dlopen_doit () from /lib64/libdl.so.2 No symbol table info available. #7 0x00007f6a8b1a4366 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 No symbol table info available. #8 0x00007f6a8ab5529c in _dlerror_run () from /lib64/libdl.so.2 No symbol table info available. #9 0x00007f6a8ab54ee1 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2 No symbol table info available. #10 0x00000000004854cd in plugin_load_from_file (p=0x7f6a87cd7968, fq_path=0x7f6a80000fb0 "/usr/lib64/slurm/acct_gather_infiniband_none.so") at plugin.c:189 plug = 0x11 init = 0x10 version = 0x49066a type = 0x0 #11 0x0000000000485899 in plugin_load_and_link (type_name=0x7f6a80001d40 "acct_gather_infiniband/none", n_syms=4, names=0x7f8f40, ptrs=0x801380) at plugin.c:280 got_colon = false plug = 0x0 st = {st_dev = 2049, st_ino = 807936, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 10522, st_blksize = 4096, st_blocks = 24, st_atim = {tv_sec = 1478510745, tv_nsec = 0}, st_mtim = {tv_sec = 1477901738, tv_nsec = 0}, st_ctim = {tv_sec = 1477904537, tv_nsec = 0}, __unused = {0, 0, 0}} head = 0x7f6a80001d80 "/usr/lib64/slurm" dir_array = 0x7f6a80001d80 "/usr/lib64/slurm" so_name = 0x7f6a80000f00 "acct_gather_infiniband_none.so" file_name = 0x7f6a80000fb0 "/usr/lib64/slurm/acct_gather_infiniband_none.so" i = 16 err = EPLUGIN_NOTFOUND #12 0x0000000000485cac in plugin_context_create (plugin_type=0x5c3555 "acct_gather_infiniband", uler_type=0x7f6a800016b0 "acct_gather_infiniband/none", ptrs=0x801380, names=0x7f8f40, names_size=32) at plugin.c:431 c = 0x7f6a80001730 n_names = 4 #13 0x00000000005338d8 in acct_gather_infiniband_init () at slurm_acct_gather_infiniband.c:132 retval = 0 plugin_type = 0x5c3555 "acct_gather_infiniband" type = 0x7f6a800016b0 "acct_gather_infiniband/none" #14 0x0000000000533c29 in acct_gather_infiniband_g_conf_options (full_options=0x7f6a87cd7b00, full_options_cnt=0x7f6a87cd7b1c) at slurm_acct_gather_infiniband.c:212 No locals. #15 0x000000000052b1e3 in acct_gather_conf_init () at slurm_acct_gather.c:82 tbl = 0x0 conf_path = 0x0 full_options = 0x0 full_options_cnt = 0 i = 4890160 buf = {st_dev = 140095390748296, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 2147487616, __pad0 = 32618, st_rdev = 140095390748816, st_size = 8393056, st_blksize = 0, st_blocks = 140095568222535, st_atim = {tv_sec = 6, tv_nsec = 140095521651504}, st_mtim = {tv_sec = 4742316, tv_nsec = 140095521651472}, st_ctim = {tv_sec = 48, tv_nsec = 8359552}, __unused = {8392992, 140095390748272, 6040493}} #16 0x00000000005311d7 in slurm_acct_gather_energy_init () at slurm_acct_gather_energy.c:156 retval = 0 plugin_type = 0x5c2bad "acct_gather_energy" type = 0x7f6a80000e70 "acct_gather_energy/rapl" #17 0x00000000005317f1 in acct_gather_energy_g_get_data (data_type=ENERGY_DATA_NODE_ENERGY, data=0x7f6a800034b0) at slurm_acct_gather_energy.c:297 retval = -1 #18 0x000000000042be3e in _fill_registration_msg (msg=0x7f6a800008d0) at slurmd.c:781 steps = 0x0 i = 0x2348980 stepd = 0x0 n = 0 arch = 0x0 os = 0x0 buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "c913", '\000' <repeats 60 times>, release = "2.6.32-642.6.1.el6.x86_64", '\000' <repeats 39 times>, version = "#1 SMP Mon Oct 24 16:32:22 EEST 2016", '\000' <repeats 28 times>, machine = "x86_64", '\000' <repeats 58 times>, __domainname = "(none)", '\000' <repeats 58 times>} first_msg = false slurmd_start_time = 1478531120 gres_info = 0x7f6a800009e0 __func__ = "_fill_registration_msg" #19 0x000000000042b523 in send_registration_msg (status=0, startup=true) at slurmd.c:627 rc = 0 ret_val = 0 msg = 0x7f6a800008d0 #20 0x000000000042ad01 in _registration_engine (arg=0x0) at slurmd.c:440 No locals. #21 0x00007f6a8a93eaa1 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #22 0x00007f6a8a68baad in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x7f6a87bd7700 (LWP 4705)): #0 0x00007f6a7fdfcdf2 in acct_gather_energy_p_update_node_energy () at acct_gather_energy_rapl.c:438 rc = 0 #1 0x00000000005317ce in acct_gather_energy_g_update_node_energy () at slurm_acct_gather_energy.c:287 retval = -1 #2 0x0000000000437aaf in _rpc_acct_gather_update (msg=0x7f6a780008d0) at req.c:2712 resp_msg = {address = {sin_family = 28080, sin_port = 34749, sin_addr = {s_addr = 32618}, sin_zero = "`\n\000xj\177\000"}, auth_cred = 0xad, conn_fd = 1017, data = 0x0, data_size = 5957208, flags = 0, msg_index = 0, msg_type = 65535, protocol_version = 65535, forward = {cnt = 27456, init = 34749, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 2, sin_port = 28348, sin_addr = {s_addr = 16779786}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0} acct_msg = {energy = 0x3f1, node_name = 0x5af1f0 "", sensor_cnt = 56224} rc = 0 req_uid = 88 first_msg = false #3 0x0000000000430cd8 in slurmd_req (msg=0x7f6a780008d0) at req.c:405 rc = 0 #4 0x000000000042b41f in _service_connection (arg=0x23991e0) at slurmd.c:606 con = 0x23991e0 msg = 0x7f6a780008d0 rc = 0 #5 0x00007f6a8a93eaa1 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #6 0x00007f6a8a68baad in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) quit addr2line -e /usr/sbin/slurmd 00007f6a7fdfcdf2 ??:0
Tommi - can you check if any of your nodes doesn't have the 'msr' module loaded? $ lsmod | grep msr Anyhow I've tried to load the RAPL plugin without the 'msr' module loaded and I get this error: slurmd: error: MSR register problem (/dev/cpu/0/msr): No such file or directory slurmd: error: lseek of /dev/cpu/#/msr: Bad file descriptor But slurmd doesn't crash in my case.
Also please attach the slurmctld.log and the slurmd.log of a crashing node.
Created attachment 3690 [details] slurmd.log and relevant lines from slurmctld.log msr is not module in el6 kernel grep -i msr /boot/config-2.6.32-642.6.1.el6.x86_64 CONFIG_X86_DEBUGCTLMSR=y CONFIG_X86_MSR=y CONFIG_SCSI_ARCMSR=m CONFIG_SCSI_ARCMSR_AER=y ls -al /dev/cpu/0/msr crw-rw---- 1 root root 202, 0 Oct 25 14:47 /dev/cpu/0/msr
Don't think it is related to the abort, but you have warnings logged in your slurmd.log like this: Nov 7 16:27:54 c913 slurmd[31568]: TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology that's just to inform you of no common top-level switch. Not sure if it is intentional, but it is an uncommon configuration. Since I also see these messages in slurmctld.log: slurmctld.log [2016-11-07T16:27:50.820] agent/is_node_resp: node:c913 RPC:REQUEST_ACCT_GATHER_UPDATE : Zero Bytes were transmitted or received [2016-11-07T16:27:52.858] Node c913 now responding I'm wondering if there are communication errors when you restart slurmd service on all compute nodes. A healthy RPC looks like this: slurmctld.log: slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_ACCT_GATHER_UPDATE slurmd.log: slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE Can you please confirm whether no common top-level switch is intentional?
> I'm wondering if there are communication errors when you restart slurmd service > on all compute nodes. A healthy RPC looks like this: > > slurmctld.log: > slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_ACCT_GATHER_UPDATE > > slurmd.log: > slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE Hi, That "Zero Bytes.." error message comes only when slurmd fails to start on the node. I need to run it in loop so reproduce crash: while true; do service slurm restart; sleep 1; done > Can you please confirm whether no common top-level switch is intentional? Yes it's intentional because there is no common top-level Infiniband switch for all compute nodes. We have multiple islands: snb, hsw, two gpu islands, etc.
I'm wondering if there's a sync problem at restart time. The crash occurs accessing this global variable in the acct_gather_energy/rapl.c plugin in the acct_gather_energy_p_update_node_energy(void) function: if (local_energy->current_watts == NO_VAL) local_energy is defined as a static global variable in the plugin like this: static acct_gather_energy_t *local_energy = NULL; When the plugin finishes, which I expect happens at slurmd restart, the plugin fini() function is called: extern int fini(void) { int i; if (!_run_in_daemon()) return SLURM_SUCCESS; for (i = 0; i < nb_pkg; i++) { if (pkg_fd[i] != -1) { close(pkg_fd[i]); pkg_fd[i] = -1; } } acct_gather_energy_destroy(local_energy); local_energy = NULL; return SLURM_SUCCESS; } I'm wondering if a REQUEST_ACCT_GATHER_UPDATE RPC request is trying to be processed by slurmd before local_energy is allocated in the heap again. Perhaps this scenario could lead to this crash.
Tommi - following commit should prevent the segfault and will be available in 16.05.7: https://github.com/SchedMD/slurm/commit/c00b3da7fdf3 You can apply it earlier as a local patch if desired. Closing the bug, please reopen if you encounter further issues.
I restarted slurmd 339 times with this patch and could not reproduce this crash anymore. Thanks for the fix.
Found these two lines from log, so failure rate was two times out of 339 restarts. Nov 10 11:06:54 c913 slurmd[14651]: debug: acct_gather_energy_p_update_node_energy: trying to update node energy, but no local_energyyet. Nov 10 11:09:54 c913 slurmd[18980]: debug: acct_gather_energy_p_update_node_energy: trying to update node energy, but no local_energyyet.
All right. Gonna patch the space typo "but no local_energyyet.".