Ticket 532

Summary: slurmctld crash after reconfigure + node brought online
Product: Slurm
Reporter: Tim Wickberg <wickberg>
Component: slurmctld
Assignee: David Bigagli <david>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: da, maclach
Version: 2.6.x   
Hardware: Linux   
OS: Linux   
Site: George Washington University, The
Attachments: GW slurm configuration

Description Tim Wickberg 2013-11-25 03:10:39 MST
I'll see if I can reproduce this in test mode, but I've seen this happen twice with 2.6.3. In both cases, I was adding a new node to the system: first by editing slurm.conf, then running "scontrol reconfigure".

Today (not sure if I did this last time), I then did:

scontrol update nodename=node097 state=drain reason=new

The node comes up successfully, then I run

scontrol update nodename=node097 state=resume

and slurmctld crashes. Last few lines of slurmctld.log are

[2013-11-25T11:48:46.260] Requeue JobId=352867 due to node failure
[2013-11-25T11:48:46.260] sched: job_complete for JobId=352867 successful, exit code=4294967294
[2013-11-25T11:48:46.260] Node node097 unexpectedly rebooted
[2013-11-25T11:49:18.074] completing job 352859
[2013-11-25T11:49:18.074] sched: job_complete for JobId=352859 successful, exit code=256
[2013-11-25T11:49:19.769] sched: Allocate JobId=352873 NodeList=node074 #CPUs=16
[2013-11-25T11:49:54.487] completing job 352862
[2013-11-25T11:49:54.487] sched: job_complete for JobId=352862 successful, exit code=256
[2013-11-25T11:50:09.105] backfill: Started JobId=352867 on node080
[2013-11-25T11:50:09.269] _slurm_rpc_submit_batch_job JobId=352888 usec=5401
[2013-11-25T11:50:20.533] completing job 352841
[2013-11-25T11:50:20.533] sched: job_complete for JobId=352841 successful, exit code=0
[2013-11-25T11:50:39.105] backfill: Started JobId=352874 on node072
[2013-11-25T11:50:39.264] _slurm_rpc_submit_batch_job JobId=352889 usec=5533
[2013-11-25T11:51:13.681] update_node: node node097 state set to IDLE
[2013-11-25T11:51:54.550] error: chdir(/var/log): Permission denied
[2013-11-25T11:51:57.894] slurmctld version 2.6.3 started on cluster colonialone
[2013-11-25T11:51:59.973] Recovered state of 98 nodes
Comment 1 David Bigagli 2013-11-25 08:31:43 MST
Hi we tried to reproduce the problem following your steps:

1) add host in slurm.conf
2) scontrol reconfig
3) scontrol update node=x state=drain reason=new
4) boot the new node
5) scontrol update node=x state=resume

is this the correct sequence? We were not able to generate
a core file using slurm 2.6.3.

We notice this error message when your slurmctld starts:

[2013-11-25T11:51:54.550] error: chdir(/var/log): Permission denied

where do you find your slurmctld.log? Could you also append 
your configuration? Please tar your etc directory and append it to this 
problem number.

Thanks,
       David
Comment 2 Tim Wickberg 2013-11-26 00:19:17 MST
Created attachment 536
GW slurm configuration

most of GW's current slurm configuration.

(slurmdbd.conf excluded for security reasons)
Comment 3 Tim Wickberg 2013-11-26 02:19:30 MST
(In reply to David Bigagli from comment #1)
> Hi we tried to reproduce the problem following your steps:
> 

To modify the specific set of steps a bit:

0) Start host, host launches slurmd and attempts to join
0.5) Realize I haven't added that node to slurm.conf yet
> 1) add host in slurm.conf
> 2) scontrol reconfig
> 3) scontrol update node=x state=drain reason=new

> 4) boot the new node
this is more accurately:
4) verify node is still functional, restart slurmd on node

> 5) scontrol update node=x state=resume



> is this the correct sequence? We were not able to generate
> a core file using slurm 2.6.3.
> 
> We notice this error message when your slurmctld starts:
> 
> [2013-11-25T11:51:54.550] error: chdir(/var/log): Permission denied
>
> where do you find your slurmctld.log? Could you also append 
> your configuration? Please tar your etc directory and append it to this 
> problem number.

It's under /var/log/slurmctld.log. I attached the configs. I don't think we're doing anything unusual, though; we're running a relatively simple config using multifactor priority with backfill.
Comment 4 David Bigagli 2013-11-26 05:04:28 MST
Hi Tim,
        this appears to be an elusive bug. I started a cluster with your 
configuration:

david@prometeo ~/customers/gw/c1/apps/slurm/2.6.3/etc>\sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
128gb        up 14-00:00:0     25  down* node[041-064,097]
128gb        up 14-00:00:0      1   idle prometeo
256gb        up 14-00:00:0      8  down* node[033-040]
defq*        up 14-00:00:0     65  down* node[033-097]
defq*        up 14-00:00:0      1   idle prometeo
gpu          up 14-00:00:0     32  down* node[001-032]

I followed your steps but got no core file and no errors
in the valgrind output.

If you still have the core file could you please send me the
stack output using gdb? After loading the executable and the core
file:

gdb slurmctld core.xxx

at the prompt type:

(gdb) where
(gdb) thread apply all where

Thanks,
         David

Comment 5 Tim Wickberg 2013-11-26 06:03:16 MST
Stack output from the more recent crash is:

[root@login2 ~]# gdb slurmctld core.28622 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /c1/apps/slurm/2.6.3/sbin/slurmctld...done.
[New Thread 28622]
[New Thread 28624]
[New Thread 28708]
[New Thread 28711]
[New Thread 28712]
[New Thread 28683]
[New Thread 28625]
[New Thread 28709]
[New Thread 28710]
[New Thread 28628]
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/80/1b9608daa2cd5f7035ad415e9c7dd06ebdb0a2
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_ldap.so.2
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/accounting_storage_slurmdbd.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/accounting_storage_slurmdbd.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/auth_munge.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/crypto_munge.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/crypto_munge.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/select_linear.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/select_linear.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/preempt_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/preempt_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/checkpoint_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/checkpoint_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/jobacct_gather_linux.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/jobacct_gather_linux.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/job_submit_require_timelimit.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/job_submit_require_timelimit.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/ext_sensors_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/ext_sensors_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/switch_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/switch_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/topology_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/topology_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/jobcomp_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/jobcomp_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/sched_backfill.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/sched_backfill.so
Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_dns.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/priority_multifactor.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/priority_multifactor.so
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/c1/apps/slurm/current/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00000034c6e328a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 munge-0.5.10-21_cm6.0.x86_64 nss-pam-ldapd-0.7.5-14.el6_2.1.x86_64
(gdb) where
#0  0x00000034c6e328a5 in raise () from /lib64/libc.so.6
#1  0x00000034c6e34085 in abort () from /lib64/libc.so.6
#2  0x00000034c6e2ba1e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000034c6e2bae0 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000049f57a in bit_set (b=Unhandled dwarf expression opcode 0xf3
) at bitstring.c:196
#5  0x000000000047feca in trigger_node_down (node_ptr=Unhandled dwarf expression opcode 0xf3
) at trigger_mgr.c:507
#6  0x0000000000451fca in _make_node_down (node_ptr=0x2aac0c01dca8, event_time=1385398273) at node_mgr.c:2938
#7  0x0000000000455cc4 in set_node_down_ptr (node_ptr=0x2aac0c01dca8, reason=0x5512a6 "Not responding") at node_mgr.c:2702
#8  0x000000000045f192 in ping_nodes () at ping_nodes.c:265
#9  0x0000000000432429 in _slurmctld_background (argc=Unhandled dwarf expression opcode 0xf3
) at controller.c:1505
#10 main (argc=Unhandled dwarf expression opcode 0xf3
) at controller.c:586
(gdb) thread apply all where

Thread 10 (Thread 0x2aaaab82b700 (LWP 28628)):
#0  0x00000034c720b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000050097a in _agent (x=Unhandled dwarf expression opcode 0xf3
) at slurmdbd_defs.c:2056
#2  0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x2aaaf4201700 (LWP 28710)):
#0  0x00000034c6ee0263 in select () from /lib64/libc.so.6
#1  0x00000000004300ae in _slurmctld_rpc_mgr (no_data=Unhandled dwarf expression opcode 0xf3
) at controller.c:958
#2  0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x2aaad4908700 (LWP 28709)):
#0  0x00000034c72080ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00002aaaf4204c12 in _cleanup_thread (no_data=Unhandled dwarf expression opcode 0xf3
) at priority_multifactor.c:1456
#2  0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x2aaaab31e700 (LWP 28625)):
#0  0x00000034c72080ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00002aaaaaf16cb2 in _cleanup_thread (no_data=Unhandled dwarf expression opcode 0xf3
) at accounting_storage_slurmdbd.c:380
#2  0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x2aaaabd33700 (LWP 28683)):
#0  0x00000034c720b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaaba2f6a2 in _my_sleep (secs=30) at backfill.c:381
#2  0x00002aaaaba31720 in backfill_agent (args=Unhandled dwarf expression opcode 0xf3
) at backfill.c:492
#3  0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x2aaac0100700 (LWP 28712)):
#0  0x00000034c720b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000477a79 in slurmctld_state_save (no_data=Unhandled dwarf expression opcode 0xf3
) at state_save.c:198
#2  0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x2aaaf4100700 (LWP 28711)):
#0  0x00000034c720f2a5 in sigwait () from /lib64/libpthread.so.0
#1  0x0000000000430d20 in _slurmctld_signal_hand (no_data=Unhandled dwarf expression opcode 0xf3
) at controller.c:831
#2  0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3  0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aaae0100700 (LWP 28708)):
#0  0x00000034c6eab91d in nanosleep () from /lib64/libc.so.6
#1  0x00000034c6eab790 in sleep () from /lib64/libc.so.6
#2  0x00002aaaf4206d26 in _decay_thread (no_data=Unhandled dwarf expression opcode 0xf3
) at priority_multifactor.c:1402
#3  0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aaaab21d700 (LWP 28624)):
#0  0x00000034c6eab91d in nanosleep () from /lib64/libc.so.6
#1  0x00000034c6eab790 in sleep () from /lib64/libc.so.6
#2  0x00002aaaaaf16ea0 in _set_db_inx_thread (no_data=Unhandled dwarf expression opcode 0xf3
) at accounting_storage_slurmdbd.c:372
#3  0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#4  0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaaaaac7ba0 (LWP 28622)):
#0  0x00000034c6e328a5 in raise () from /lib64/libc.so.6
#1  0x00000034c6e34085 in abort () from /lib64/libc.so.6
#2  0x00000034c6e2ba1e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000034c6e2bae0 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000049f57a in bit_set (b=Unhandled dwarf expression opcode 0xf3
) at bitstring.c:196
#5  0x000000000047feca in trigger_node_down (node_ptr=Unhandled dwarf expression opcode 0xf3
) at trigger_mgr.c:507
#6  0x0000000000451fca in _make_node_down (node_ptr=0x2aac0c01dca8, event_time=1385398273) at node_mgr.c:2938
#7  0x0000000000455cc4 in set_node_down_ptr (node_ptr=0x2aac0c01dca8, reason=0x5512a6 "Not responding") at node_mgr.c:2702
#8  0x000000000045f192 in ping_nodes () at ping_nodes.c:265
#9  0x0000000000432429 in _slurmctld_background (argc=Unhandled dwarf expression opcode 0xf3
) at controller.c:1505
#10 main (argc=Unhandled dwarf expression opcode 0xf3
) at controller.c:586
(gdb)
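
For triaging a listing like the one above, the `where` output can be reduced to function names and source locations. The following is a minimal sketch in Python, not something used in this ticket: the `parse_backtrace` helper and its regular expressions are illustrative assumptions that only cover the frame layout shown here, including frames that gdb wraps onto a second line.

```python
import re

# Start of a gdb frame line, e.g. "#4  0x000000000049f57a in bit_set (b=..."
# or, for the innermost inlined frame, "#10 main (argc=...".
FRAME_START = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+\s+in\s+)?(\S+)\s*\(")
# Trailing source location, e.g. ") at bitstring.c:196".
LOCATION = re.compile(r"\bat (\S+):(\d+)\s*$")
# Trailing shared-library name, e.g. "() from /lib64/libc.so.6".
LIBRARY = re.compile(r"\bfrom (\S+)\s*$")


def parse_backtrace(text):
    """Return a list of (frame_number, function, location) tuples.

    location is "file:line" when gdb printed one, the library path when
    it printed "from <library>", and None when neither was found.
    """
    frames = []
    for line in text.splitlines():
        start = FRAME_START.match(line)
        if start:
            frames.append([int(start.group(1)), start.group(2), None])
        if not frames:
            # Skip banner and symbol-loading lines before the first frame.
            continue
        # A wrapped frame carries its location on a following line,
        # so attach it to the most recent frame that still lacks one.
        if frames[-1][2] is None:
            loc = LOCATION.search(line)
            lib = LIBRARY.search(line)
            if loc:
                frames[-1][2] = f"{loc.group(1)}:{loc.group(2)}"
            elif lib:
                frames[-1][2] = lib.group(1)
    return [tuple(frame) for frame in frames]
```

Fed frames #3 to #5 of the trace above, the helper reports `__assert_fail` in `/lib64/libc.so.6`, then `bit_set` at `bitstring.c:196`, then `trigger_node_down` at `trigger_mgr.c:507`, which is the part of the stack that points at the failed assertion.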
Comment 6 David Bigagli 2013-11-26 07:38:50 MST
Hi,
    the core file gave us good insight, and we can now reproduce the 
problem. The key is to drain or down a different node before running 
reconfig. Then, after adding the new node to slurm.conf and running 
'scontrol reconfig', 'scontrol update node=x state=down|drain' will 
crash the controller.

However, when a node is added to the cluster, the controller must be 
restarted first. Only after that should the reconfigure be issued, so 
that all slurmd daemons read the new configuration file.

Please consult the FAQ: http://slurm.schedmd.com/faq.html#add_nodes

Nonetheless, we should detect that the configured nodes have changed 
during the reconfiguration process and not allow it to continue. We will 
investigate how to implement the fix.
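
The failure mode can be illustrated with a toy model. This is a sketch under stated assumptions, not Slurm code: the stack trace only shows an assertion failing in bit_set() when called from trigger_node_down(), so the idea that a bitmap keeps its old size after the node list grows is an assumption, and the `Bitmap` and `Controller` classes below are hypothetical. The ticket also does not describe how the eventual fix was implemented; `reconfigure_checked` only mirrors the intent stated above.

```python
class Bitmap:
    """Fixed-size bitmap, sized once when it is created."""

    def __init__(self, nbits):
        self.nbits = nbits
        self.bits = bytearray((nbits + 7) // 8)

    def set(self, bit):
        # Stand-in for the range assertion that aborted slurmctld in bit_set().
        if not 0 <= bit < self.nbits:
            raise IndexError(f"bit {bit} outside bitmap of {self.nbits} bits")
        self.bits[bit // 8] |= 1 << (bit % 8)

    def test(self, bit):
        return bool(self.bits[bit // 8] & (1 << (bit % 8)))


class Controller:
    """Toy controller: a node list plus a bitmap of nodes that went down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.down = Bitmap(len(self.nodes))

    def reconfigure_unchecked(self, nodes):
        # Assumed buggy path: the node list grows, the bitmap keeps its size.
        self.nodes = list(nodes)

    def reconfigure_checked(self, nodes):
        # Guard in the spirit of the comment above: refuse a changed node list.
        if list(nodes) != self.nodes:
            raise ValueError("node list changed; restart the controller")
        # Node list unchanged: nothing in this toy model needs rebuilding.

    def node_down(self, name):
        # The node's position in the list is its bit index.
        self.down.set(self.nodes.index(name))
```

With a controller started on two nodes, calling `reconfigure_unchecked` with a third node and then `node_down` on that node raises `IndexError`, because index 2 does not fit the 2-bit bitmap. Calling `reconfigure_checked` with the same three-node list raises `ValueError` up front instead, leaving the existing state untouched.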

Comment 7 David Bigagli 2013-12-06 09:02:48 MST
Fixed in commit 0f0dbfe99465 in the master branch.

David