| Summary: | after core network outage slurmctld will not start - fails with segfault | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jenny Williams <jennyw> |
| Component: | slurmctld | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 17.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of North Carolina at Chapel Hill | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | RHEL |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | patch_to_avoid_segfault | ||
Description
Jenny Williams
2018-12-19 10:28:22 MST
My contact number is 919-923-3987

Hi Jenny,

Would you please attach the output of "thread apply all bt full" from the core file. You can do this by locating the core file and running:

gdb /path/to/slurmctld corefile
(gdb) thread apply all bt full

-Jason

[root@longleaf-sched slurmctld]# which slurmctld
/sbin/slurmctld
[root@longleaf-sched slurmctld]# gdb /sbin/slurmctld core.110347
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 110347]
[New LWP 110348]
[New LWP 110351]
[New LWP 110349]
[New LWP 110350]
[New LWP 110354]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0x482e980) at step_mgr.c:2081
2081            i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1.el7.x86_64
(gdb) thread apply all bt full

Thread 6 (Thread 0x7f2ea3688700 (LWP 110354)):
#0  0x00007f2ea8120d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        No symbol table info available.
#1  0x00007f2ea368cbe6 in _my_sleep (usec=30000000) at backfill.c:540
        err = <optimized out>
        nsec = <optimized out>
        sleep_time = 0
        ts = {tv_sec = 1545240145, tv_nsec = 227789000}
        tv1 = {tv_sec = 1545240115, tv_usec = 227789}
        tv2 = {tv_sec = 0, tv_usec = 0}
        __func__ = "_my_sleep"
#2  0x00007f2ea3693002 in backfill_agent (args=<optimized out>) at backfill.c:876
        now = <optimized out>
        wait_time = <optimized out>
        last_backfill_time = 1545240115
        all_locks = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
        load_config = <optimized out>
        short_sleep = false
        backfill_cnt = 0
        __func__ = "backfill_agent"
#3  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
        No symbol table info available.
#4  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
        No symbol table info available.

Thread 5 (Thread 0x7f2ea48a5700 (LWP 110350)):
#0  0x00007f2ea811df47 in pthread_join () from /lib64/libpthread.so.0
        No symbol table info available.
#1  0x00007f2ea49aa010 in _cleanup_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:445
        No locals.
#2  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
        No symbol table info available.
#3  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
        No symbol table info available.

Thread 4 (Thread 0x7f2ea49a6700 (LWP 110349)):
#0  0x00007f2ea7e0ce2d in nanosleep () from /lib64/libc.so.6
        No symbol table info available.
#1  0x00007f2ea7e0ccc4 in sleep () from /lib64/libc.so.6
        No symbol table info available.
#2  0x00007f2ea49aa8c8 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:437
        local_job_list = <optimized out>
        job_ptr = <optimized out>
        itr = <optimized out>
        job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        __func__ = "_set_db_inx_thread"
#3  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
        No symbol table info available.
#4  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
        No symbol table info available.

Thread 3 (Thread 0x7f2ea45a0700 (LWP 110351)):
#0  0x00007f2ea812369d in write () from /lib64/libpthread.so.0
        No symbol table info available.
#1  0x00007f2ea8600037 in slurm_persist_send_msg (persist_conn=0x232fb30, buffer=buffer@entry=0xa3e0f90) at slurm_persist_conn.c:847
        msg_size = 221
        nw_size = 3707764736
        msg = <optimized out>
        msg_wrote = <optimized out>
        rc = <optimized out>
        retry_cnt = 0
#2  0x00007f2ea8677d6c in _agent (x=<optimized out>) at slurmdbd_defs.c:2034
        cnt = <optimized out>
        rc = <optimized out>
        buffer = 0xa3e0f90
        abs_time = {tv_sec = 1545240126, tv_nsec = 0}
        fail_time = 0
        sigarray = {10, 0}
        list_req = {msg_type = 1474, data = 0x7f2ea459feb0}
        list_msg = {my_list = 0x0, return_code = 0}
        __func__ = "_agent"
#3  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
        No symbol table info available.
#4  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
        No symbol table info available.

Thread 2 (Thread 0x7f2ea8aea700 (LWP 110348)):
#0  0x00007f2ea8120d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        No symbol table info available.
#1  0x000000000041e3c7 in _agent_init (arg=<optimized out>) at agent.c:1313
        err = <optimized out>
        min_wait = <optimized out>
        mail_too = <optimized out>
        ts = {tv_sec = 1545240117, tv_nsec = 0}
        __func__ = "_agent_init"
#2  0x00007f2ea811cdd5 in start_thread () from /lib64/libpthread.so.0
        No symbol table info available.
#3  0x00007f2ea7e45ead in clone () from /lib64/libc.so.6
        No symbol table info available.
Thread 1 (Thread 0x7f2ea8aeb740 (LWP 110347)):
#0  _step_dealloc_lps (step_ptr=0x482e980) at step_mgr.c:2081
        cpus_alloc = <optimized out>
        i_last = <optimized out>
        i_node = <optimized out>
        i_first = <optimized out>
        step_node_inx = -1
        job_ptr = 0x482df40
        job_resrcs_ptr = 0x0
        job_node_inx = -1
        cray_simulate_logged = false
#1  post_job_step (step_ptr=step_ptr@entry=0x482e980) at step_mgr.c:4652
        job_ptr = 0x482df40
        error_code = <optimized out>
#2  0x00000000004ab053 in _post_job_step (step_ptr=0x482e980) at step_mgr.c:266
        select_cray_plugin = 0
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x482df40, step_ptr=step_ptr@entry=0x482e980) at step_mgr.c:307
        jobacct = <optimized out>
#4  0x00000000004ab0d1 in delete_step_records (job_ptr=job_ptr@entry=0x482df40) at step_mgr.c:336
        cleaning = 0
        step_iterator = 0x232d420
        step_ptr = 0x482e980
#5  0x0000000000463dd5 in cleanup_completing (job_ptr=job_ptr@entry=0x482df40) at job_scheduler.c:4700
        delay = <optimized out>
        __func__ = "cleanup_completing"
#6  0x000000000046ff7b in deallocate_nodes (job_ptr=job_ptr@entry=0x482df40, timeout=timeout@entry=false, suspended=suspended@entry=false, preempted=preempted@entry=false) at node_scheduler.c:610
        select_serial = 0
        i = <optimized out>
        kill_job = 0xa9691e0
        agent_args = 0xa969500
        down_node_cnt = <optimized out>
        node_ptr = 0x2964d48
        __func__ = "deallocate_nodes"
#7  0x000000000049435d in _sync_nodes_to_comp_job () at read_config.c:2404
        job_ptr = 0x482df40
        job_iterator = 0x232d400
        update_cnt = 1
#8  read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1241
        tv1 = {tv_sec = 1545240115, tv_usec = 212445}
        tv2 = {tv_sec = 36751520, tv_usec = 20}
        tv_str = '\000' <repeats 19 times>
        delta_t = 72340172821299456
        error_code = 0
        i = <optimized out>
        rc = <optimized out>
        load_job_ret = 0
        old_node_record_count = <optimized out>
        old_node_table_ptr = <optimized out>
        node_ptr = <optimized out>
        do_reorder_nodes = <optimized out>
        old_part_list = <optimized out>
        old_def_part_name = <optimized out>
        old_auth_type = <optimized out>
        old_bb_type = <optimized out>
        old_checkpoint_type = <optimized out>
        old_crypto_type = <optimized out>
        old_preempt_mode = <optimized out>
        old_preempt_type = 0x29010c0 "preempt/none"
        old_sched_type = <optimized out>
        old_select_type = <optimized out>
        old_switch_type = <optimized out>
        state_save_dir = 0x0
        mpi_params = 0x0
        old_select_type_p = <optimized out>
        __func__ = "read_slurm_conf"
#9  0x00000000004279b6 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:497
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        stat_buf = {st_dev = 64768, st_ino = 68778753, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 392880, st_blksize = 4096, st_blocks = 768, st_atim = {tv_sec = 1545175099, tv_nsec = 500946750}, st_mtim = {tv_sec = 1518025023, tv_nsec = 0}, st_ctim = {tv_sec = 1530169016, tv_nsec = 524506863}, __unused = {0, 0, 0}}
        rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        node_part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        callbacks = {acct_full = 0x4ac9f5 <trigger_primary_ctld_acct_full>, dbd_fail = 0x4acc04 <trigger_primary_dbd_fail>, dbd_resumed = 0x4acc92 <trigger_primary_dbd_res_op>, db_fail = 0x4acd17 <trigger_primary_db_fail>, db_resumed = 0x4acda5 <trigger_primary_db_res_op>}
        create_clustername_file = false
        __func__ = "main"
(gdb)

This installation has the state files on a shared network drive; when the network went out my guess is the state was horked.

(In reply to Jenny Williams from comment #1)
> My contact number is 919-923-3987

Updating this to my desk - 919-962-4751

Created attachment 8709 [details]
patch_to_avoid_segfault
I have attached a patch to help you move past the segfault with slurm-17.11.7-1. Would you please apply it and rebuild Slurm on the head node:
cd /your/source
patch -p1 < patch_to_avoid_segfault.patch
Then re-make.
-Jason
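
(For context: frame #0 of the backtrace shows job_resrcs_ptr = 0x0 when _step_dealloc_lps() reaches bit_ffs(job_resrcs_ptr->node_bitmap), i.e. a step recovered from the damaged state files belongs to a job with no job_resources record. The attachment itself is the fix to apply; the fragment below is only a hypothetical sketch of the kind of NULL guard involved, written against slurmctld's internal headers, and is not the contents of the attached patch.)

```c
/* Hypothetical sketch only -- not the attached patch_to_avoid_segfault.
 * It illustrates a NULL guard at the top of _step_dealloc_lps() in
 * src/slurmctld/step_mgr.c so that a step whose job lost its
 * job_resources record (e.g. after state-file corruption) is skipped
 * instead of dereferencing NULL at bit_ffs(job_resrcs_ptr->node_bitmap).
 */
static void _step_dealloc_lps(struct step_record *step_ptr)
{
	struct job_record *job_ptr = step_ptr->job_ptr;
	job_resources_t *job_resrcs_ptr = job_ptr->job_resrcs;
	int i_first;

	if (!job_resrcs_ptr || !job_resrcs_ptr->node_bitmap) {
		/* Log and bail out rather than crash during recovery. */
		error("%s: job %u has no job_resources/node_bitmap; "
		      "skipping step deallocation", __func__, job_ptr->job_id);
		return;
	}

	i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
	/* ... existing deallocation logic continues here ... */
}
```

The attachment should still be applied as described above; the sketch is only meant to show why slurmctld dies while recovering the COMPLETING job in read_slurm_conf() -> _sync_nodes_to_comp_job() -> cleanup_completing().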
Will work on that. I assume this will recover pending jobs but running jobs will have been terminated by now due to slurmd timeout – is that correct?

Jenny,
> Will work on that. I assume this will recover pending jobs but running jobs will have been terminated by now due to slurmd timeout – is that correct?
The issue only affects slurmctld and should not trigger the slurmd timeout code you mention. Your jobs on the slurmds should not be affected unless the slurmctld has lost all record of them; in that case, they will be canceled.
-Jason
Close ... I started the build again after changing Python versions - but a hint as to what Python it wants would be appreciated.
make[4]: Nothing to be done for `all-am'.
make[4]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite/slurm_unit'
make[3]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite/slurm_unit'
make[3]: Entering directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite'
make[3]: Nothing to be done for `all-am'.
make[3]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite'
make[2]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/testsuite'
Making all in doc
make[2]: Entering directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc'
Making all in man
make[3]: Entering directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc/man'
Making all in man1
make[4]: Entering directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc/man/man1'
`dirname sacct.1`/../man2html.py 17.11 ./../../html/header.txt ./../../html/footer.txt sacct.1
`dirname sacctmgr.1`/../man2html.py 17.11 ./../../html/header.txt ./../../html/footer.txt sacctmgr.1
`dirname salloc.1`/../man2html.py 17.11 ./../../html/header.txt ./../../html/footer.txt salloc.1
Fatal Python error: Py_Initialize: Unable to get the locale encoding
Fatal Python error: Py_Initialize: Unable to get the locale encoding
File "/nas/longleaf/apps/python/2.7.12/lib/python2.7/encodings/__init__.py", line 123
File "/nas/longleaf/apps/python/2.7.12/lib/python2.7/encodings/__init__.py", line 123
raise CodecRegistryError,\
raise CodecRegistryError,\
^
^
SyntaxError: SyntaxErrorinvalid syntax:
invalid syntax
Current thread 0xCurrent thread 0x00007f238323c74000007f9853a00740 (most recent call first):
(most recent call first):
Fatal Python error: Py_Initialize: Unable to get the locale encoding
File "/nas/longleaf/apps/python/2.7.12/lib/python2.7/encodings/__init__.py", line 123
raise CodecRegistryError,\
^
SyntaxError: invalid syntax
Current thread 0x00007f6163e79740 (most recent call first):
make[4]: *** [sacct.html] Aborted
make[4]: *** Waiting for unfinished jobs....
make[4]: *** [salloc.html] Aborted
make[4]: *** [sacctmgr.html] Aborted
make[4]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc/man/man1'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc/man'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7/doc'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/nas/longleaf/apps/slurm/rpmbuild/BUILD/slurm-17.11.7'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.x5gG0X (%build)
RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.x5gG0X (%build)
^C
[jennyw@longleaf-sched rpmbuild]$ which python
/nas/longleaf/apps/python/2.7.12/bin/python
[jennyw@longleaf-sched rpmbuild]$ module list
Currently Loaded Modules:
1) python/2.7.12

It is up. Wow. Thank you. Any follow on to this?

Jenny,

What version of RHEL is this, EL6? Also, what did you change to resolve your python issue (PYTHONPATH)?

-Jason

RHEL 7.6

I moved to a "build" node; I had been running the compile on a node that may not have had all the prerequisites. The Python version that ended up working for me was 3.6.3. PYTHONPATH was empty - no value for PYTHONPATH.

Jenny,

Just a quick update on this other issue. I could investigate this further if you attach the slurmctld logs from the day of the crash.

-Jason

Hi Jenny,

I am marking this issue resolved. If you need any further assistance with this please feel free to re-open this case.
It is up. Wow. Thank you. Any follow on to this ? Jenny, What version of RHEL is this, EL6. Also, what did you change to resolve your python issue (PYTHONPATH)? -Jason RHEL 7.6 I moved to a "build" node; I had been running the compile on a node that may not have had all the prerequisites. The python version that ended up working for me was 3.6.3 PYTHONPATH was empty - no value for PYTHONPATH Jenny, Just a quick update on this other issue. I could investigate this further if you attach the slurmctld logs from the day of the crash. -Jason Hi Jenny, I am marking this issue resolved. If you need any further assistance with this please feel free to re-open this case. |