I'll see if I can reproduce this in test mode, but I've seen this happen twice with 2.6.3.

In both cases, I was adding a new node to the system, first by editing slurm.conf, then running "scontrol reconfigure".

Today (not sure if I did this last time), I then did:

    scontrol update nodename=node097 state=drain reason=new

The node comes up successfully, then I run

    scontrol update nodename=node097 state=resume

and slurmctld crashes.

The last few lines of slurmctld.log are:

[2013-11-25T11:48:46.260] Requeue JobId=352867 due to node failure
[2013-11-25T11:48:46.260] sched: job_complete for JobId=352867 successful, exit code=4294967294
[2013-11-25T11:48:46.260] Node node097 unexpectedly rebooted
[2013-11-25T11:49:18.074] completing job 352859
[2013-11-25T11:49:18.074] sched: job_complete for JobId=352859 successful, exit code=256
[2013-11-25T11:49:19.769] sched: Allocate JobId=352873 NodeList=node074 #CPUs=16
[2013-11-25T11:49:54.487] completing job 352862
[2013-11-25T11:49:54.487] sched: job_complete for JobId=352862 successful, exit code=256
[2013-11-25T11:50:09.105] backfill: Started JobId=352867 on node080
[2013-11-25T11:50:09.269] _slurm_rpc_submit_batch_job JobId=352888 usec=5401
[2013-11-25T11:50:20.533] completing job 352841
[2013-11-25T11:50:20.533] sched: job_complete for JobId=352841 successful, exit code=0
[2013-11-25T11:50:39.105] backfill: Started JobId=352874 on node072
[2013-11-25T11:50:39.264] _slurm_rpc_submit_batch_job JobId=352889 usec=5533
[2013-11-25T11:51:13.681] update_node: node node097 state set to IDLE
[2013-11-25T11:51:54.550] error: chdir(/var/log): Permission denied
[2013-11-25T11:51:57.894] slurmctld version 2.6.3 started on cluster colonialone
[2013-11-25T11:51:59.973] Recovered state of 98 nodes
Hi,

we tried to reproduce the problem following your steps:

1) add host in slurm.conf
2) scontrol reconfig
3) scontrol update node=x state=drain reason=new
4) boot the new node
5) scontrol update node=x state=resume

Is this the correct sequence? We were not able to generate a core file using slurm 2.6.3.

We noticed this error message when your slurmctld starts:

[2013-11-25T11:51:54.550] error: chdir(/var/log): Permission denied

Where do you find your slurmctld.log? Could you also append your configuration? Please tar your etc directory and append it to this problem number.

Thanks, David
Created attachment 536: GW slurm configuration

Most of GW's current slurm configuration (slurmdbd.conf excluded for security reasons).
(In reply to David Bigagli from comment #1)
> Hi we tried to reproduce the problem following your steps:

To modify the specific set of steps a bit:

0) Start host, host launches slurmd and attempts to join
0.5) Realize I haven't added that node to slurm.conf yet

> 1) add host in slurm.conf
> 2) scontrol reconfig
> 3) scontrol update node=x state=drain reason=new
> 4) boot the new node

This is more accurately:
4) verify node is still functional, restart slurmd on node

> 5) scontrol update node=x state=resume

> is this the correct sequence? We were not able to generate
> a core file using slurm 2.6.3.
>
> We notice this error message when your slurmctld starts:
>
> [2013-11-25T11:51:54.550] error: chdir(/var/log): Permission denied
>
> where do you find your slurmctld.log? Could you also append
> your configuration? Please tar your etc directory and append it to this
> problem number.

It's under /var/log/slurmctld.log. I attached the configs. I don't think we're doing anything unusual, though; we're running a relatively simple config using multifactor priority with backfill.
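The revised sequence in this comment can be written down as a small script, which may help when trying it on a test cluster. This is only a sketch: the `run` wrapper and the `NODE` and `APPLY` variables are hypothetical names introduced here, the slurm.conf edit and the slurmd restart are left as comments because they are site specific, and by default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the reporter's sequence. NODE, APPLY and run() are
# illustrative names introduced here, not part of Slurm.
NODE="${NODE:-node097}"
APPLY="${APPLY:-0}"

# Print the command; execute it only when APPLY=1.
run() {
    echo "+ $*"
    if [ "$APPLY" = "1" ]; then
        "$@"
    fi
}

repro_steps() {
    # 0)   node boots, slurmd starts and attempts to join
    # 0.5) node is not in slurm.conf yet
    # 1)   add the node to slurm.conf (manual, site specific)
    run scontrol reconfigure
    run scontrol update nodename="$NODE" state=drain reason=new
    # 4)   verify the node is functional, restart slurmd on it (manual)
    run scontrol update nodename="$NODE" state=resume
}

repro_steps
```

Running it with `APPLY=1` would issue the real `scontrol` commands, so that should only be done on a test system.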
Hi Tim,

this appears to be an elusive bug. I started a cluster on your configuration:

david@prometeo ~/customers/gw/c1/apps/slurm/2.6.3/etc>\sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
128gb        up 14-00:00:0     25  down* node[041-064,097]
128gb        up 14-00:00:0      1   idle prometeo
256gb        up 14-00:00:0      8  down* node[033-040]
defq*        up 14-00:00:0     65  down* node[033-097]
defq*        up 14-00:00:0      1   idle prometeo
gpu          up 14-00:00:0     32  down* node[001-032]

I followed your steps but did not get a core file, and there were no errors in the valgrind output. If you still have the core file, could you please send me the stack output using gdb? After loading the executable and the core file:

gdb slurmctld core.xxx

at the prompt type:

(gdb) where
(gdb) thread apply all where

Thanks, David

On Tue 26 Nov 2013 08:19:30 AM PST, bugs@schedmd.com wrote:
> Comment #3 <http://bugs.schedmd.com/show_bug.cgi?id=532#c3> on bug 532 <http://bugs.schedmd.com/show_bug.cgi?id=532> from Tim Wickberg <mailto:wickberg@gwu.edu>
> [...]
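The two gdb commands requested in this comment ("where" and "thread apply all where") can also be collected without an interactive session by using gdb's batch mode. A minimal sketch, assuming the binary and core file paths are passed as arguments; the function name `collect_bt`, the `GDB` override variable and the default output file name are hypothetical:

```shell
#!/bin/sh
# collect_bt BINARY CORE [OUTFILE]
# Runs "where" and "thread apply all where" non-interactively and
# saves everything gdb prints. GDB may be set to use a different
# gdb executable; it defaults to "gdb".
collect_bt() {
    bin="$1"
    core="$2"
    out="${3:-backtrace.txt}"
    "${GDB:-gdb}" -batch \
        -ex "where" \
        -ex "thread apply all where" \
        "$bin" "$core" > "$out" 2>&1
}
```

For example, `collect_bt /path/to/slurmctld core.xxx slurmctld-bt.txt` would leave the output in `slurmctld-bt.txt`, ready to attach to the bug.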
Stack output from the more recent crash is:

[root@login2 ~]# gdb slurmctld core.28622
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /c1/apps/slurm/2.6.3/sbin/slurmctld...done.
[New Thread 28622]
[New Thread 28624]
[New Thread 28708]
[New Thread 28711]
[New Thread 28712]
[New Thread 28683]
[New Thread 28625]
[New Thread 28709]
[New Thread 28710]
[New Thread 28628]
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debug*' install /usr/lib/debug/.build-id/80/1b9608daa2cd5f7035ad415e9c7dd06ebdb0a2
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_ldap.so.2
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/accounting_storage_slurmdbd.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/accounting_storage_slurmdbd.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/auth_munge.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/crypto_munge.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/crypto_munge.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/select_linear.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/select_linear.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/preempt_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/preempt_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/checkpoint_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/checkpoint_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/jobacct_gather_linux.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/jobacct_gather_linux.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/job_submit_require_timelimit.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/job_submit_require_timelimit.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/ext_sensors_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/ext_sensors_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/switch_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/switch_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/topology_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/topology_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/jobcomp_none.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/jobcomp_none.so
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/sched_backfill.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/sched_backfill.so
Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_dns.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /c1/apps/slurm/2.6.3/lib/slurm/priority_multifactor.so...done.
Loaded symbols for /c1/apps/slurm/2.6.3/lib/slurm/priority_multifactor.so
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/c1/apps/slurm/current/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0 0x00000034c6e328a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 munge-0.5.10-21_cm6.0.x86_64 nss-pam-ldapd-0.7.5-14.el6_2.1.x86_64
(gdb) where
#0 0x00000034c6e328a5 in raise () from /lib64/libc.so.6
#1 0x00000034c6e34085 in abort () from /lib64/libc.so.6
#2 0x00000034c6e2ba1e in __assert_fail_base () from /lib64/libc.so.6
#3 0x00000034c6e2bae0 in __assert_fail () from /lib64/libc.so.6
#4 0x000000000049f57a in bit_set (b=Unhandled dwarf expression opcode 0xf3
) at bitstring.c:196
#5 0x000000000047feca in trigger_node_down (node_ptr=Unhandled dwarf expression opcode 0xf3
) at trigger_mgr.c:507
#6 0x0000000000451fca in _make_node_down (node_ptr=0x2aac0c01dca8, event_time=1385398273) at node_mgr.c:2938
#7 0x0000000000455cc4 in set_node_down_ptr (node_ptr=0x2aac0c01dca8, reason=0x5512a6 "Not responding") at node_mgr.c:2702
#8 0x000000000045f192 in ping_nodes () at ping_nodes.c:265
#9 0x0000000000432429 in _slurmctld_background (argc=Unhandled dwarf expression opcode 0xf3
) at controller.c:1505
#10 main (argc=Unhandled dwarf expression opcode 0xf3
) at controller.c:586
(gdb) thread apply all where

Thread 10 (Thread 0x2aaaab82b700 (LWP 28628)):
#0 0x00000034c720b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000050097a in _agent (x=Unhandled dwarf expression opcode 0xf3
) at slurmdbd_defs.c:2056
#2 0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x2aaaf4201700 (LWP 28710)):
#0 0x00000034c6ee0263 in select () from /lib64/libc.so.6
#1 0x00000000004300ae in _slurmctld_rpc_mgr (no_data=Unhandled dwarf expression opcode 0xf3
) at controller.c:958
#2 0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x2aaad4908700 (LWP 28709)):
#0 0x00000034c72080ad in pthread_join () from /lib64/libpthread.so.0
#1 0x00002aaaf4204c12 in _cleanup_thread (no_data=Unhandled dwarf expression opcode 0xf3
) at priority_multifactor.c:1456
#2 0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x2aaaab31e700 (LWP 28625)):
#0 0x00000034c72080ad in pthread_join () from /lib64/libpthread.so.0
#1 0x00002aaaaaf16cb2 in _cleanup_thread (no_data=Unhandled dwarf expression opcode 0xf3
) at accounting_storage_slurmdbd.c:380
#2 0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x2aaaabd33700 (LWP 28683)):
#0 0x00000034c720b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aaaaba2f6a2 in _my_sleep (secs=30) at backfill.c:381
#2 0x00002aaaaba31720 in backfill_agent (args=Unhandled dwarf expression opcode 0xf3
) at backfill.c:492
#3 0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#4 0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x2aaac0100700 (LWP 28712)):
#0 0x00000034c720b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000477a79 in slurmctld_state_save (no_data=Unhandled dwarf expression opcode 0xf3
) at state_save.c:198
#2 0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x2aaaf4100700 (LWP 28711)):
#0 0x00000034c720f2a5 in sigwait () from /lib64/libpthread.so.0
#1 0x0000000000430d20 in _slurmctld_signal_hand (no_data=Unhandled dwarf expression opcode 0xf3
) at controller.c:831
#2 0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aaae0100700 (LWP 28708)):
#0 0x00000034c6eab91d in nanosleep () from /lib64/libc.so.6
#1 0x00000034c6eab790 in sleep () from /lib64/libc.so.6
#2 0x00002aaaf4206d26 in _decay_thread (no_data=Unhandled dwarf expression opcode 0xf3
) at priority_multifactor.c:1402
#3 0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#4 0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aaaab21d700 (LWP 28624)):
#0 0x00000034c6eab91d in nanosleep () from /lib64/libc.so.6
#1 0x00000034c6eab790 in sleep () from /lib64/libc.so.6
#2 0x00002aaaaaf16ea0 in _set_db_inx_thread (no_data=Unhandled dwarf expression opcode 0xf3
) at accounting_storage_slurmdbd.c:372
#3 0x00000034c7207851 in start_thread () from /lib64/libpthread.so.0
#4 0x00000034c6ee767d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaaaaac7ba0 (LWP 28622)):
#0 0x00000034c6e328a5 in raise () from /lib64/libc.so.6
#1 0x00000034c6e34085 in abort () from /lib64/libc.so.6
#2 0x00000034c6e2ba1e in __assert_fail_base () from /lib64/libc.so.6
#3 0x00000034c6e2bae0 in __assert_fail () from /lib64/libc.so.6
#4 0x000000000049f57a in bit_set (b=Unhandled dwarf expression opcode 0xf3
) at bitstring.c:196
#5 0x000000000047feca in trigger_node_down (node_ptr=Unhandled dwarf expression opcode 0xf3
) at trigger_mgr.c:507
#6 0x0000000000451fca in _make_node_down (node_ptr=0x2aac0c01dca8, event_time=1385398273) at node_mgr.c:2938
#7 0x0000000000455cc4 in set_node_down_ptr (node_ptr=0x2aac0c01dca8, reason=0x5512a6 "Not responding") at node_mgr.c:2702
#8 0x000000000045f192 in ping_nodes () at ping_nodes.c:265
#9 0x0000000000432429 in _slurmctld_background (argc=Unhandled dwarf expression opcode 0xf3
) at controller.c:1505
#10 main (argc=Unhandled dwarf expression opcode 0xf3
) at controller.c:586
(gdb)
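Most of a gdb session like this one is symbol-loading noise, so when reading a saved copy it can help to list only the source locations that appear in the frames. A rough sketch, assuming the gdb output was saved to a file with gdb's original line breaks; the function name `source_locations` is hypothetical, and the pattern only matches `.c` files:

```shell
#!/bin/sh
# source_locations FILE
# Prints every "file.c:line" that follows " at " in saved gdb output,
# in the order it appears. Frames from libraries without debug info
# ("from /lib64/...") carry no source location and are skipped.
source_locations() {
    sed -n 's/.* at \([A-Za-z0-9_]*\.c:[0-9][0-9]*\).*/\1/p' "$1"
}
```

Because it works line by line, it also copes with frames that gdb split in two after an "Unhandled dwarf expression" message, since the location lands on the continuation line.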
Hi,

indeed, the core file gave us good insight and we can now reproduce the problem. The key is to drain or down a different node before running reconfig; then, after adding the node to slurm.conf and running 'scontrol reconfig', 'scontrol update node=x state=down|drain' will crash the controller.

However, when a node is added to the cluster, the controller must be restarted; only after that should the reconfigure be issued, so that all slurmd daemons read the new configuration file. Please consult the FAQ:

http://slurm.schedmd.com/faq.html#add_nodes

Nonetheless, we should detect that the configured nodes have changed during the reconfiguration process and not allow it to continue. We will investigate how to implement the fix.

On Tue 26 Nov 2013 12:03:16 PM PST, bugs@schedmd.com wrote:
> Comment #5 <http://bugs.schedmd.com/show_bug.cgi?id=532#c5> on bug 532 <http://bugs.schedmd.com/show_bug.cgi?id=532> from Tim Wickberg <mailto:wickberg@gwu.edu>
> Stack output from the more recent crash is:
> [...]
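The order described in this comment (restart slurmctld first, then issue the reconfigure) can be sketched as a script. This is an illustration of that order only, not a transcription of the FAQ: `RESTART_CTLD` is a placeholder for whatever command the site uses to restart the controller, `run` and `APPLY` are hypothetical names, and by default nothing is executed.

```shell
#!/bin/sh
# Sketch of the order described above. RESTART_CTLD is a placeholder
# for the site's own restart command; APPLY and run() are illustrative
# names introduced here, not part of Slurm.
RESTART_CTLD="${RESTART_CTLD:-echo restart-slurmctld-placeholder}"
APPLY="${APPLY:-0}"

# Print the command; execute it only when APPLY=1.
run() {
    echo "+ $*"
    if [ "$APPLY" = "1" ]; then
        "$@"
    fi
}

add_node_steps() {
    # 1) add the node to slurm.conf (manual, site specific)
    # 2) restart the controller so it starts with the new node list
    #    (RESTART_CTLD is intentionally unquoted to allow arguments)
    run $RESTART_CTLD
    # 3) have the slurmd daemons re-read the configuration file
    run scontrol reconfigure
}

add_node_steps
```

The FAQ entry linked in the comment remains the reference for the full procedure.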
Fixed in commit 0f0dbfe99465 in the master branch. David