Ticket 2935 - slurmctld segfaulting continually
Summary: slurmctld segfaulting continually
Status: RESOLVED DUPLICATE of ticket 2925
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 16.05.2
Hardware: Linux
OS: Linux
Severity: 2 - High Impact
Assignee: Tim Wickberg
Reported: 2016-07-24 21:30 MDT by David Paul
Modified: 2016-07-24 22:28 MDT
Site: NERSC
Description David Paul 2016-07-24 21:30:05 MDT
slurmctld segfaulted and now continues to segfault after each restart. The system is unusable. Sample gdb output:

ctl1:/var/tmp/slurm # ll
total 2060088
-rw------- 1 root root  164728832 Jul 24 19:24 core
-rw------- 1 root root  233345024 Jul 24 17:49 core.201607241749
-rw------- 1 root root  178831360 Jul 24 18:26 core.201607241826
-rw------- 1 root root  180555776 Jul 24 18:49 core.201607241849
-rw------- 1 root root  162451456 Jul 24 18:54 core.201607241854
-rw------- 1 root root  256122880 Jul 20 11:22 core.weird
-rw------- 1 root root    2298695 Jul 24 19:24 slurmctld.log
-rw------- 1 root root 1436168612 Jul 24 18:26 slurmctld.log.save1
ctl1:/var/tmp/slurm # gdb /usr/sbin/slurmctld core 
GNU gdb (GDB; SUSE Linux Enterprise 12) 7.9.1
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 16039]
[New LWP 15977]
[New LWP 16043]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  _free_step_rec (step_ptr=0x30da500) at step_mgr.c:313
313	step_mgr.c: No such file or directory.
Missing separate debuginfos, use: zypper install slurm-debuginfo-16.05.2-20160715210839head.x86_64
(gdb) bt
#0  _free_step_rec (step_ptr=0x30da500) at step_mgr.c:313
#1  0x00000000004d8b7c in delete_step_record (job_ptr=0x30d39a0, step_id=4294967295) at step_mgr.c:374
#2  0x00007f53949ec0f9 in _step_fini (args=0x30da500) at select_cray.c:1173
#3  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f539592104d in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 17 (Thread 0x7f537fcf1700 (LWP 16043)):
#0  0x00007f53958f2ded in nanosleep () from /lib64/libc.so.6
#1  0x00007f53958f2c84 in sleep () from /lib64/libc.so.6
#2  0x00007f537fcf8c13 in _decay_thread (no_data=0x0) at priority_multifactor.c:1369
#3  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f538f87b700 (LWP 15977)):
#0  0x00007f53958f2ded in nanosleep () from /lib64/libc.so.6
#1  0x00007f53958f2c84 in sleep () from /lib64/libc.so.6
#2  0x00007f538f87fdd7 in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:435
#3  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f537ede2700 (LWP 16180)):
#0  0x00007f53958f2ded in nanosleep () from /lib64/libc.so.6
#1  0x00007f53958f2c84 in sleep () from /lib64/libc.so.6
#2  0x00007f53949ebba6 in _wait_job_completed (job_id=2771293, job_ptr=0x3236fa0) at select_cray.c:1030
#3  0x00007f53949ebd17 in _job_fini (args=0x3236fa0) at select_cray.c:1065
#4  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f5396215700 (LWP 15975)):
#0  0x00007f53958f2ded in nanosleep () from /lib64/libc.so.6
#1  0x00007f539591a9d4 in usleep () from /lib64/libc.so.6
#2  0x0000000000447212 in _slurmctld_background (no_data=0x0) at controller.c:1713
#3  0x0000000000444940 in main (argc=1, argv=0x7fffb9599ab8) at controller.c:605

Thread 13 (Thread 0x7f537faef700 (LWP 16123)):
#0  0x00007f539591a2b3 in select () from /lib64/libc.so.6
#1  0x00000000004456c0 in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:1014
#2  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f538c89a700 (LWP 16038)):
#0  0x00007f5395918c1d in poll () from /lib64/libc.so.6
#1  0x00007f53949ea319 in _aeld_event_loop (args=0x0) at select_cray.c:516
#2  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f538e93c700 (LWP 16036)):
#0  0x00007f5395bef408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f538e941585 in _my_sleep (usec=120000000) at backfill.c:488
#2  0x00007f538e941fdf in backfill_agent (args=0x0) at backfill.c:742
#3  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f5396213700 (LWP 15976)):
#0  0x00007f53958f2ded in nanosleep () from /lib64/libc.so.6
#1  0x00007f53958f2c84 in sleep () from /lib64/libc.so.6
#2  0x00007f539051f5ef in _lease_extender (args=0x0) at cookies.c:350
#3  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f538f475700 (LWP 15979)):
#0  0x00007f5395bef408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005c0b07 in _agent (x=0x0) at slurmdbd_defs.c:2137
#2  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f537f9ee700 (LWP 16124)):
#0  0x00007f5395bf25c9 in do_sigwait () from /lib64/libpthread.so.0
#1  0x00007f5395bf2653 in sigwait () from /lib64/libpthread.so.0
#2  0x0000000000445010 in _slurmctld_signal_hand (no_data=0x0) at controller.c:876
#3  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f537f8ed700 (LWP 16125)):
#0  0x00007f5395bef05f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004d7842 in slurmctld_state_save (no_data=0x0) at state_save.c:208
#2  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f538f77a700 (LWP 15978)):
#0  0x00007f5395bec4c2 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f538f87fdfc in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:443
#2  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f537fbf0700 (LWP 16044)):
#0  0x00007f5395bec4c2 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f537fcf8d54 in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1423
#2  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f537ffff700 (LWP 16042)):
#0  0x00007f5395bef408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f538c2902db in bb_sleep (state_ptr=0x7f538c496880 <bb_state>, add_secs=30) at burst_buffer_common.c:926
#2  0x00007f538c27e17b in _bb_agent (args=0x0) at burst_buffer_cray.c:413
#3  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f538c597700 (LWP 16041)):
#0  0x00007f53958f2ded in nanosleep () from /lib64/libc.so.6
#1  0x00007f53958f2c84 in sleep () from /lib64/libc.so.6
#2  0x00007f53949ebba6 in _wait_job_completed (job_id=2766118, job_ptr=0x30e1600) at select_cray.c:1030
#3  0x00007f53949ebd17 in _job_fini (args=0x30e1600) at select_cray.c:1065
#4  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f538c698700 (LWP 16040)):
#0  0x00007f5395bf2489 in waitpid () from /lib64/libpthread.so.0
#1  0x00007f53949e99a4 in _run_nhc (nhc_info=0x7f538c697f20) at select_cray.c:314
#2  0x00007f53949ebd23 in _job_fini (args=0x30d39a0) at select_cray.c:1068
#3  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f539592104d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f538c799700 (LWP 16039)):
#0  _free_step_rec (step_ptr=0x30da500) at step_mgr.c:313
#1  0x00000000004d8b7c in delete_step_record (job_ptr=0x30d39a0, step_id=4294967295) at step_mgr.c:374
#2  0x00007f53949ec0f9 in _step_fini (args=0x30da500) at select_cray.c:1173
#3  0x00007f5395beb0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f539592104d in clone () from /lib64/libc.so.6
(gdb) 
(gdb) quit
Comment 1 Tim Wickberg 2016-07-24 21:36:56 MDT
Can you confirm which 16.05 series commit you're running at this point?

On July 24, 2016 11:30:05 PM EDT, bugs@schedmd.com wrote:
>https://bugs.schedmd.com/show_bug.cgi?id=2935
>
>           Site: NERSC - National Energy Research Supercomputing Center
>            Bug ID: 2935
>           Summary: slurmctld segfaulting continually
>           Product: Slurm
>           Version: 16.05.2
>          Hardware: Linux
>                OS: Linux
>            Status: UNCONFIRMED
>          Severity: 1 - System not usable
>          Priority: ---
>         Component: slurmctld
>          Assignee: support@schedmd.com
>          Reporter: dpaul@nersc.gov
>
Comment 3 David Paul 2016-07-24 22:23:33 MDT
Doug J said: "Owing to a number of critical fixes (for NERSC) post-16.05.2, we've been running against the 16.05 HEAD for about two weeks.  We just updated again to c39f9ac9179aeef1874e0d3da775c34af6518f41.  Prior to that we were on 16.05.1 when cori was made available following the CLE6 update.

Before this we were running 0c7bd6d024359e7de5146534e8f1cc150f28be38 (from 20160715).  This was done to integrate a fix for a slurmctld crash that occurred when pending steps were being confused with completing steps.  I believe this is corrected in 3b914e5b4fd1a1e2591bcd2c4437a14ebf09c6ef."
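One way to confirm whether a running build already contains a given fix is to ask git whether the fix commit is an ancestor of the running commit. The following is a minimal sketch of that check. It assumes a local clone of the Slurm repository that has both commits, and the function names and the repo_dir parameter are hypothetical. The only git behavior it relies on is "git merge-base --is-ancestor", which exits 0 when the first commit is an ancestor of the second, 1 when it is not, and another status on error.

```python
import subprocess


def is_ancestor_cmd(ancestor, descendant):
    """Build the git command that tests whether `ancestor` is reachable
    from `descendant`."""
    return ["git", "merge-base", "--is-ancestor", ancestor, descendant]


def contains_commit(repo_dir, ancestor, descendant):
    """Return True if `descendant` includes `ancestor`, False if it does not.
    Raises RuntimeError if git reports an error (e.g. an unknown commit)."""
    result = subprocess.run(
        is_ancestor_cmd(ancestor, descendant),
        cwd=repo_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    raise RuntimeError(result.stderr.decode(errors="replace").strip())
```

For the commits named in this comment, the call would be contains_commit(repo_dir, "3b914e5b4fd1a1e2591bcd2c4437a14ebf09c6ef", "c39f9ac9179aeef1874e0d3da775c34af6518f41"). The ticket does not record the result of such a check.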
Comment 4 David Paul 2016-07-24 22:24:31 MDT
We are up now with this latest version.
Comment 5 Tim Wickberg 2016-07-24 22:28:21 MDT
Alright, closing this as a duplicate of 2925. Please let me know if you think that's in error.

*** This ticket has been marked as a duplicate of ticket 2925 ***