| Summary: | squeue -w segfault | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Cox <ryan_cox> |
| Component: | Other | Assignee: | David Bigagli <david> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | BYU - Brigham Young University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 14.11.5 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: |
slurmconf-2015-02-13.tar.gz
squeue patch
squeue output after patch |
||
Hi, we cannot reproduce this. Could you build squeue with CFLAGS=-ggdb -O0 so we can print the values from the stack? We are interested in seeing the values of host and node.

(gdb) frame 1
(gdb) p * host
(gdb) p * node
(gdb) p cc
(gdb) p * node->node_array[cc]

Thanks, David

I assume that you want me to build like this?

root@sched2:/usr/local/src/slurm/src/squeue# make CFLAGS="-ggdb -O0" install
make[1]: Entering directory `/usr/local/src/slurm/src/api'
make[2]: Entering directory `/usr/local/src/slurm/src/common'
make[2]: `libeio.o' is up to date.
make[2]: Leaving directory `/usr/local/src/slurm/src/common'
make[2]: Entering directory `/usr/local/src/slurm/src/common'
make[2]: `libspank.o' is up to date.
make[2]: Leaving directory `/usr/local/src/slurm/src/common'
make[2]: Entering directory `/usr/local/src/slurm/src/common'
make[2]: `libcommon.o' is up to date.
make[2]: Leaving directory `/usr/local/src/slurm/src/common'
make[1]: Leaving directory `/usr/local/src/slurm/src/api'
make[1]: Entering directory `/usr/local/src/slurm/src/squeue'
make[2]: Entering directory `/usr/local/src/slurm/src/api'
make[3]: Entering directory `/usr/local/src/slurm/src/common'
make[3]: `libeio.o' is up to date.
make[3]: Leaving directory `/usr/local/src/slurm/src/common'
make[3]: Entering directory `/usr/local/src/slurm/src/common'
make[3]: `libspank.o' is up to date.
make[3]: Leaving directory `/usr/local/src/slurm/src/common'
make[3]: Entering directory `/usr/local/src/slurm/src/common'
make[3]: `libcommon.o' is up to date.
make[3]: Leaving directory `/usr/local/src/slurm/src/common'
make[2]: Leaving directory `/usr/local/src/slurm/src/api'
/bin/mkdir -p '/usr/local/bin'
/bin/bash ../../libtool --mode=install /usr/bin/install -c squeue '/usr/local/bin'
libtool: install: /usr/bin/install -c squeue /usr/local/bin/squeue
make[1]: Nothing to be done for `install-data-am'.
make[1]: Leaving directory `/usr/local/src/slurm/src/squeue'
root@sched2:/usr/local/src/slurm/src/squeue# ls -l /usr/local/bin/squeue
-rwxr-xr-x 1 root staff 4167346 Feb 13 11:08 /usr/local/bin/squeue
root@sched2:/usr/local/src/slurm/src/squeue# /usr/local/bin/squeue -w m8-1-1
Segmentation fault
root@sched2:/usr/local/src/slurm/src/squeue# gdb /usr/local/bin/squeue
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/bin/squeue...done.
(gdb) run -w m8-1-1
Starting program: /usr/local/bin/squeue -w m8-1-1
[Thread debugging using libthread_db enabled]
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff77644ea in ?? () from /lib/libc.so.6
(gdb) frame 1
#1 0x000000000043271c in _find_a_host (argc=<value optimized out>, argv=<value optimized out>) at opts.c:1974
1974 if (strcmp(host, node->node_array[cc].name) == 0)
(gdb) p * host
Cannot access memory at address 0x0
(gdb) p * node
Cannot access memory at address 0x0
(gdb) p cc
$1 = 0
(gdb) p * node->node_array[cc]
Cannot access memory at address 0x10
(gdb)

Interesting everything is NULL. Did you restart the nodes and the controller after the name change? I have prepared a little instrumentation to check what hostnames are being returned from slurmctld. Could you please apply it and send the output?

David

Created attachment 1641 [details]
squeue patch
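For readers without access to the attachment, the kind of instrumentation described above can be sketched roughly as follows. This is not the attached patch: `demo_node` and `demo_dump_node_names` are hypothetical names used only for illustration, and the struct carries just the one field that matters here. The sketch prints each node name and counts the records whose name is NULL, substituting a marker string because passing a NULL pointer to `%s` is undefined behavior in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a node record; only the name field is modeled. */
struct demo_node {
    const char *name;   /* may be NULL, as observed in this report */
};

/*
 * Print every node name to `out` and return the number of records whose
 * name is NULL. A NULL name is printed as the literal text "(null)" rather
 * than being handed to "%s" directly.
 */
size_t demo_dump_node_names(FILE *out, const struct demo_node *nodes,
                            size_t count)
{
    size_t missing = 0;

    assert(out != NULL);
    for (size_t i = 0; i < count; i++) {
        const char *name = nodes[i].name;

        if (name == NULL)
            missing++;
        fprintf(out, "node[%zu] name %s\n", i, name ? name : "(null)");
    }
    return missing;
}
```

Called on an array whose first two entries have no name, the function reports two missing names, which is the sort of signal that points at unnamed records in the controller's reply.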
(In reply to David Bigagli from comment #3)
> Interesting everything is NULL. Did you restart the nodes and the controller
> after the name change?

We restarted all slurmctld, slurmdbd, and slurmd instances everywhere in the correct order. I guess it's possible that a compute node was unresponsive at the time and hasn't been restarted or received the 14.11.4 update (yes, we pushed it out yesterday afternoon :) ).

Created attachment 1642 [details]
squeue output after patch
I see the problem, the first two records have a name NULL which causes the SEGV in strcmp(), otherwise the host m8-1-5 is there.

squeue: _check_node_names: node name (null)
squeue: _check_node_names: node name (null)

I wonder how this is possible. In gdb could you please print this from the core:

(gdb) frame 1
(gdb) p node->node_array[cc]

This will print the entire data structure.

Is there anything in the slurmctld.log when you run this command?

David

(In reply to David Bigagli from comment #7)
> I see the problem, the first two records have a name NULL which causes
> the SEGV in strcmp(), otherwise the host m8-1-5 is there.
>
> squeue: _check_node_names: node name (null)
> squeue: _check_node_names: node name (null)
>
> I wonder how this is possible. In gdb could you please print this from the
> core:
>
> (gdb) frame 1
> (gdb) p node->node_array[cc]
>
> This will print the entire data structure.
>
> Is there anything in the slurmctld.log when you run this command?
>
> David

(gdb) p node->node_array[cc]
$4 = {arch = 0x0, boards = 1, boot_time = 0, cores = 4, core_spec_cnt = 0, cpu_load = 4294967294, cpus = 8, cpu_spec_list = 0x0, energy = 0x7af230, ext_sensors = 0x7ad490, features = 0x7af1d0 "ib,dol4,m5,intel,nehalem,cpu2800MHz,mem1333MHz,private,sse4.1,sse4.2", gres = 0x0, gres_drain = 0x0, gres_used = 0x0, mem_spec_limit = 0, name = 0x0, node_addr = 0x7a3280 "m5-21-1", node_hostname = 0x75af10 "m5-21-1", node_state = 32806, os = 0x0, real_memory = 24576, reason = 0x7a32a0 "NO NETWORK ADDRESS FOUND", reason_time = 1423833902, reason_uid = 1000, select_nodeinfo = 0x7a3ea0, slurmd_start_time = 0, sockets = 2, threads = 1, tmp_disk = 0, weight = 1, version = 0x0}

I know what the problem is now. We retired that hardware and I accidentally left it in the Slurm config. I didn't think it was a problem since it is actually still alive and (I thought) in DNS. Turns out that someone else proactively removed it from DNS...

Somehow two nodes out of m5-21-[1-16] were non-existent in scontrol show node but still in slurm.conf. The others all showed up in scontrol show node. There is nothing relevant or interesting in the slurmctld logs about it but we're only at SlurmctldDebug=3.

I'm removing the retired nodes/partition that I missed out of all the others and I expect it to work after that. I'll update you shortly.

This means that we can have node records with name as NULL so we have to patch the squeue and perhaps some other places as well. Team work :-)

David

Everything now works as expected. Thanks

Commit c13e8540a.

David |
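The failure mode discussed in this thread, a node record whose name is NULL reaching strcmp(), can be illustrated with a small guarded lookup. This is only a sketch under stated assumptions and is not the code from commit c13e8540a: `sketch_node` and `sketch_find_host` are hypothetical names, and the struct models nothing but the name field. The idea is simply to skip unnamed records (and reject a NULL host) before comparing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a node record; only the name field is modeled. */
struct sketch_node {
    const char *name;   /* NULL for records that have no name */
};

/*
 * Return true if `host` matches the name of any record in `nodes`.
 * Records with a NULL name are skipped, and a NULL `host` never matches,
 * so strcmp() is only ever called with two valid strings.
 */
bool sketch_find_host(const char *host, const struct sketch_node *nodes,
                      size_t count)
{
    /* A non-empty table must actually be provided by the caller. */
    assert(count == 0 || nodes != NULL);

    if (host == NULL)
        return false;

    for (size_t i = 0; i < count; i++) {
        if (nodes[i].name == NULL)
            continue;   /* unnamed record: nothing to compare against */
        if (strcmp(host, nodes[i].name) == 0)
            return true;
    }
    return false;
}
```

With a table such as `{NULL}, {NULL}, {"m8-1-5"}`, a lookup for `m8-1-5` succeeds and a lookup for an unknown host returns false instead of crashing on the first unnamed record.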
Created attachment 1640 [details]
slurmconf-2015-02-13.tar.gz

I'm getting segfaults when I specify squeue -w $some_node. The node can exist or not exist and it segfaults. I think this started happening this week after nodes were added, removed, or renamed. It happens in 14.11.3 and .4.

Core was generated by `squeue -l -w m8-6-[1-16]'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f3db88832ba in __strcmp_sse42 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-14.11.4-1.el6fslgit201502121534.x86_64
(gdb) bt
#0 0x00007f3db88832ba in __strcmp_sse42 () from /lib64/libc.so.6
#1 0x000000000042e9cc in _find_a_host (argc=<value optimized out>, argv=<value optimized out>) at opts.c:1974
#2 _check_node_names (argc=<value optimized out>, argv=<value optimized out>) at opts.c:1955
#3 parse_command_line (argc=<value optimized out>, argv=<value optimized out>) at opts.c:381
#4 0x00000000004255ad in main (argc=4, argv=0x7fff705fab78) at squeue.c:80