Ticket 1455

Summary: squeue -w segfault
Product: Slurm
Reporter: Ryan Cox <ryan_cox>
Component: Other
Assignee: David Bigagli <david>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---
CC: brian, da
Version: 14.11.3
Hardware: Linux
OS: Linux
Site: BYU - Brigham Young University
Version Fixed: 14.11.5
Attachments:
  slurmconf-2015-02-13.tar.gz
  squeue patch
  squeue output after patch

Description Ryan Cox 2015-02-13 03:22:35 MST
Created attachment 1640 [details]
slurmconf-2015-02-13.tar.gz

I'm getting segfaults when I run squeue -w $some_node.  It segfaults whether or not the node exists.  I think this started happening this week after nodes were added, removed, or renamed.  It happens in 14.11.3 and 14.11.4.

Core was generated by `squeue -l -w m8-6-[1-16]'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f3db88832ba in __strcmp_sse42 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-14.11.4-1.el6fslgit201502121534.x86_64
(gdb) bt
#0  0x00007f3db88832ba in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x000000000042e9cc in _find_a_host (argc=<value optimized out>, argv=<value optimized out>) at opts.c:1974
#2  _check_node_names (argc=<value optimized out>, argv=<value optimized out>) at opts.c:1955
#3  parse_command_line (argc=<value optimized out>, argv=<value optimized out>) at opts.c:381
#4  0x00000000004255ad in main (argc=4, argv=0x7fff705fab78) at squeue.c:80
Comment 1 David Bigagli 2015-02-13 04:01:49 MST
Hi,
  we cannot reproduce this. Could you build squeue with CFLAGS="-ggdb -O0"
so we can print the values from the stack?
We are interested in seeing the value of host and node.

(gdb) frame 1
(gdb) p * host
(gdb) p * node
(gdb) p cc
(gdb) p * node->node_array[cc]

Thanks,
David
Comment 2 Ryan Cox 2015-02-13 04:10:28 MST
I assume that you want me to build like this?

root@sched2:/usr/local/src/slurm/src/squeue# make CFLAGS="-ggdb -O0" install
make[1]: Entering directory `/usr/local/src/slurm/src/api'
make[2]: Entering directory `/usr/local/src/slurm/src/common'
make[2]: `libeio.o' is up to date.
make[2]: Leaving directory `/usr/local/src/slurm/src/common'
make[2]: Entering directory `/usr/local/src/slurm/src/common'
make[2]: `libspank.o' is up to date.
make[2]: Leaving directory `/usr/local/src/slurm/src/common'
make[2]: Entering directory `/usr/local/src/slurm/src/common'
make[2]: `libcommon.o' is up to date.
make[2]: Leaving directory `/usr/local/src/slurm/src/common'
make[1]: Leaving directory `/usr/local/src/slurm/src/api'
make[1]: Entering directory `/usr/local/src/slurm/src/squeue'
make[2]: Entering directory `/usr/local/src/slurm/src/api'
make[3]: Entering directory `/usr/local/src/slurm/src/common'
make[3]: `libeio.o' is up to date.
make[3]: Leaving directory `/usr/local/src/slurm/src/common'
make[3]: Entering directory `/usr/local/src/slurm/src/common'
make[3]: `libspank.o' is up to date.
make[3]: Leaving directory `/usr/local/src/slurm/src/common'
make[3]: Entering directory `/usr/local/src/slurm/src/common'
make[3]: `libcommon.o' is up to date.
make[3]: Leaving directory `/usr/local/src/slurm/src/common'
make[2]: Leaving directory `/usr/local/src/slurm/src/api'
 /bin/mkdir -p '/usr/local/bin'
  /bin/bash ../../libtool   --mode=install /usr/bin/install -c squeue '/usr/local/bin'
libtool: install: /usr/bin/install -c squeue /usr/local/bin/squeue
make[1]: Nothing to be done for `install-data-am'.
make[1]: Leaving directory `/usr/local/src/slurm/src/squeue'
root@sched2:/usr/local/src/slurm/src/squeue# ls -l /usr/local/bin/squeue
-rwxr-xr-x 1 root staff 4167346 Feb 13 11:08 /usr/local/bin/squeue
root@sched2:/usr/local/src/slurm/src/squeue# /usr/local/bin/squeue -w m8-1-1
Segmentation fault
root@sched2:/usr/local/src/slurm/src/squeue# gdb /usr/local/bin/squeue 
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/bin/squeue...done.
(gdb) run -w m8-1-1
Starting program: /usr/local/bin/squeue -w m8-1-1
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff77644ea in ?? () from /lib/libc.so.6
(gdb) frame 1
#1  0x000000000043271c in _find_a_host (argc=<value optimized out>, argv=<value optimized out>) at opts.c:1974
1974			if (strcmp(host, node->node_array[cc].name) == 0)
(gdb) p * host
Cannot access memory at address 0x0
(gdb) p * node
Cannot access memory at address 0x0
(gdb) p cc
$1 = 0
(gdb) p * node->node_array[cc]
Cannot access memory at address 0x10
(gdb)
Comment 3 David Bigagli 2015-02-13 04:45:15 MST
Interesting, everything is NULL. Did you restart the nodes and the controller
after the name change?
I have prepared a small instrumentation patch to check what hostnames are being
returned from slurmctld. Could you please apply it and send the output?

David
Comment 4 David Bigagli 2015-02-13 04:45:45 MST
Created attachment 1641 [details]
squeue patch
Comment 5 Ryan Cox 2015-02-13 04:59:23 MST
(In reply to David Bigagli from comment #3)
> Interesting everything is NULL. Did you restart the nodes and the controller
> after the name change?

We restarted all slurmctld, slurmdbd, and slurmd instances everywhere in the correct order.  I guess it's possible that a compute node was unresponsive at the time and hasn't been restarted or received the 14.11.4 update (yes, we pushed it out yesterday afternoon :) ).
Comment 6 Ryan Cox 2015-02-13 05:00:04 MST
Created attachment 1642 [details]
squeue output after patch
Comment 7 David Bigagli 2015-02-13 05:10:20 MST
I see the problem: the first two records have a NULL name, which causes
the SEGV in strcmp(); otherwise the host m8-1-5 is there.

squeue: _check_node_names: node name (null)
squeue: _check_node_names: node name (null)

I wonder how this is possible. In gdb could you please print this from the
core:

(gdb) frame 1
(gdb) p node->node_array[cc]

This will print the entire data structure.

Is there anything in the slurmctld.log when you run this command?

David
Comment 8 Ryan Cox 2015-02-13 05:22:18 MST
(In reply to David Bigagli from comment #7)
> I see the problem, the first two records have a name NULL which causes
> the SEGV in strcmp(), otherwise the host m8-1-5 is there.
> 
> squeue: _check_node_names: node name (null)
> squeue: _check_node_names: node name (null)
> 
> I wonder how this is possible. In gdb could you please print this from the
> core:
> 
> (gdb) frame 1
> (gdb) p node->node_array[cc]
> 
> This will print the entire data structure.
> 
> Is there anything in the slurmctld.log when you run this command?
> 
> David

(gdb) p node->node_array[cc]
$4 = {arch = 0x0, boards = 1, boot_time = 0, cores = 4, core_spec_cnt = 0, cpu_load = 4294967294, cpus = 8, cpu_spec_list = 0x0, energy = 0x7af230, ext_sensors = 0x7ad490, features = 0x7af1d0 "ib,dol4,m5,intel,nehalem,cpu2800MHz,mem1333MHz,private,sse4.1,sse4.2", gres = 0x0, gres_drain = 0x0, gres_used = 0x0, 
  mem_spec_limit = 0, name = 0x0, node_addr = 0x7a3280 "m5-21-1", node_hostname = 0x75af10 "m5-21-1", node_state = 32806, os = 0x0, real_memory = 24576, reason = 0x7a32a0 "NO NETWORK ADDRESS FOUND", reason_time = 1423833902, reason_uid = 1000, select_nodeinfo = 0x7a3ea0, slurmd_start_time = 0, sockets = 2, 
  threads = 1, tmp_disk = 0, weight = 1, version = 0x0}


I know what the problem is now.  We retired that hardware and I accidentally left it in the Slurm config.  I didn't think it was a problem since the hardware is actually still alive and (I thought) in DNS.  It turns out someone else proactively removed it from DNS...  Somehow two nodes out of m5-21-[1-16] were missing from scontrol show node output but still in slurm.conf.  The others all showed up in scontrol show node.

There is nothing relevant or interesting in the slurmctld logs about it but we're only at SlurmctldDebug=3.

I'm removing the retired nodes/partition that I missed out of all the others and I expect it to work after that.  I'll update you shortly.
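
For illustration, the stale-config state described above would look roughly like this in slurm.conf (hypothetical parameter values; the actual config is in the attached tarball). slurmctld keeps a record for every NodeName entry, and once the retired names dropped out of DNS those records were left without a resolvable address, matching the "NO NETWORK ADDRESS FOUND" reason seen in the gdb dump:

```
# Hypothetical fragment: retired hardware left in slurm.conf.
# After m5-21-[1-16] were removed from DNS, slurmctld could no longer
# resolve them, leaving node records with
# Reason="NO NETWORK ADDRESS FOUND" and, in some cases, a name that
# comes back NULL through the node-info API.
NodeName=m5-21-[1-16] CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=24576
PartitionName=retired Nodes=m5-21-[1-16] State=UP
```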
Comment 9 David Bigagli 2015-02-13 05:24:41 MST
This means that we can have node records with a NULL name, so we have to
patch squeue and perhaps some other places as well. Team work :-)

David
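
The guard David describes can be sketched as follows. This is an illustrative reconstruction, not the contents of commit c13e8540a; the node_record_t type and find_a_host name are simplified stand-ins for Slurm's node_info_t and the static _find_a_host in opts.c:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for Slurm's node record; only the field
 * relevant to this crash is modeled. */
typedef struct {
    const char *name;   /* may be NULL for a stale/retired node record */
} node_record_t;

/* Return true if 'host' matches a named node.  Records whose name is
 * NULL are skipped -- the unguarded strcmp() against such a record is
 * exactly what produced the SIGSEGV in this ticket. */
static bool find_a_host(const char *host, const node_record_t *nodes,
                        size_t node_cnt)
{
    if (host == NULL || nodes == NULL)
        return false;
    for (size_t cc = 0; cc < node_cnt; cc++) {
        if (nodes[cc].name == NULL)
            continue;   /* stale record: skip instead of crashing */
        if (strcmp(host, nodes[cc].name) == 0)
            return true;
    }
    return false;
}
```

With the gdb output above (two leading records with name == NULL followed by the real m8-1-5 entry), the loop now skips the stale records and still finds the live host.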
Comment 10 Ryan Cox 2015-02-13 05:37:55 MST
Everything now works as expected.  Thanks.
Comment 11 David Bigagli 2015-02-13 05:46:42 MST
Commit c13e8540a.

David