Summary: | Slurmctld segfaulting on startup | ||
---|---|---|---|
Product: | Slurm | Reporter: | Steve Ford <fordste5> |
Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 1 - System not usable | ||
Priority: | --- | CC: | marshall |
Version: | 18.08.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | MSU | ||
Attachments: | Workaround segfault (from bug 6739) |
Description
Steve Ford
2019-04-29 10:52:39 MDT
Thread 6 (Thread 0x7f7f4d45d700 (LWP 60369)):
#0 0x00007f7f4ca6dd42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000004202b6 in _agent_init (arg=<optimized out>) at agent.c:1383
#2 0x00007f7f4ca69e25 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f7f4c793bad in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f7f4a0f2700 (LWP 60370)):
#0 0x00007f7f4ca6dd42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f7f4a0f7ae0 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:447
#2 0x00007f7f4ca69e25 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f7f4c793bad in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f7f45525700 (LWP 60374)):
#0 0x00007f7f4ca6dd42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f7f45529fa8 in _my_sleep (usec=60000000) at backfill.c:597
#2 0x00007f7f45530aeb in backfill_agent (args=<optimized out>) at backfill.c:956
#3 0x00007f7f4ca69e25 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f7f4c793bad in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f7f45835700 (LWP 60373)):
#0 0x00007f7f4c75a56d in nanosleep () from /lib64/libc.so.6
#1 0x00007f7f4c75a404 in sleep () from /lib64/libc.so.6
#2 0x00007f7f498e6300 in _process_jobs (x=<optimized out>) at jobcomp_elasticsearch.c:899
#3 0x00007f7f4ca69e25 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f7f4c793bad in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f7f49ded700 (LWP 60372)):
#0 0x00007f7f4c788f0d in poll () from /lib64/libc.so.6
#1 0x00007f7f4cf54c86 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7f7f49decda0) at /usr/include/bits/poll2.h:46
#2 _conn_readable (persist_conn=persist_conn@entry=0x194f390) at slurm_persist_conn.c:138
#3 0x00007f7f4cf56277 in slurm_persist_recv_msg (persist_conn=0x194f390) at slurm_persist_conn.c:905
#4 0x00007f7f4a0fcd44 in _handle_mult_rc_ret () at slurmdbd_agent.c:168
#5 _agent (x=<optimized out>) at slurmdbd_agent.c:678
#6 0x00007f7f4ca69e25 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f7f4c793bad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f7f4d45e740 (LWP 60368)):
#0 0x00007f7f4c6cb277 in raise () from /lib64/libc.so.6
#1 0x00007f7f4c6cc968 in abort () from /lib64/libc.so.6
#2 0x00007f7f4c6c4096 in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f7f4c6c4142 in __assert_fail () from /lib64/libc.so.6
#4 0x00007f7f4ceea0b1 in bit_nclear (b=b@entry=0xbd0b7a0, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5 0x00007f7f4ceec667 in bit_unfmt_hexmask (bitmap=0xbd0b7a0, str=<optimized out>) at bitstring.c:1397
#6 0x00007f7f4cf0491d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7ffd2d33edf8, buffer=buffer@entry=0xa144140, job_id=18790198, protocol_version=protocol_version@entry=8448) at gres.c:4318
#7 0x000000000045d129 in _load_job_state (buffer=buffer@entry=0xa144140, protocol_version=<optimized out>) at job_mgr.c:1519
#8 0x00000000004609f1 in load_all_job_state () at job_mgr.c:988
#9 0x000000000049c12d in read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1326
#10 0x000000000042bc22 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:663

Hi Steve,
This looks like a duplicate of 6739. This commit fixes this issue in 18.08:
https://github.com/SchedMD/slurm/commit/4c48a84a6edb

Created attachment 10050
Workaround segfault (from bug 6739)

Hi Steve,
It looks to me like a dup of bug 6739. I've attached a patch from there that prevented the segfault. Can you apply it and let us know if it prevents the segfault?
The following commit is in slurm 18.08.8 and fixes that issue.
https://github.com/SchedMD/slurm/commit/4c48a84a6edb

Thank you, Marshall. I'm applying this patch now. I'll let you know how it goes.

We're up and running again. Thank you both for your help.

Awesome. I'm closing this as a duplicate of 6739. Please re-open it if you run into more issues.

*** This ticket has been marked as a duplicate of ticket 6739 ***
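
For readers following the Thread 1 frames: the abort comes from an assertion inside bit_nclear, which was called with start=0 and stop=-1. The toy model below is only a sketch of that general failure mode, written in Python for brevity. It is not Slurm's code and not the patch in commit 4c48a84a6edb. The names ToyBitmap, clear_all, and clear_all_guarded are hypothetical, and the idea that the caller derived stop as "size minus one" on an empty bitmap is an assumption; the backtrace itself only shows the argument values.

```python
class ToyBitmap:
    """Hypothetical stand-in for a fixed-size bit string (not Slurm's bitstr_t)."""

    def __init__(self, nbits: int) -> None:
        self.bits = [False] * nbits

    def _check(self, pos: int) -> None:
        # A position is valid only if it indexes an existing bit.
        if not (0 <= pos < len(self.bits)):
            raise AssertionError(f"bit position {pos} out of range")

    def nclear(self, start: int, stop: int) -> None:
        """Clear bits start..stop inclusive, validating both bounds first."""
        self._check(start)
        self._check(stop)
        for i in range(start, stop + 1):
            self.bits[i] = False


def clear_all(bitmap: ToyBitmap) -> None:
    # Assumed caller pattern: clear from bit 0 through the last bit.
    # On an empty bitmap this passes stop = -1, which fails the bounds check.
    bitmap.nclear(0, len(bitmap.bits) - 1)


def clear_all_guarded(bitmap: ToyBitmap) -> None:
    # Defensive variant for illustration: skip the call when there is nothing to clear.
    if len(bitmap.bits) > 0:
        bitmap.nclear(0, len(bitmap.bits) - 1)
```

In this model, clear_all raises on a zero-length bitmap while clear_all_guarded returns quietly, which is the kind of difference a workaround would need to make for slurmctld to get past loading saved job state.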