| Summary: | slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed. | | |
| --- | --- | --- | --- |
| Product: | Slurm | Reporter: | HMS Research Computing <rc> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | Severity: | 1 - System not usable |
| Version: | 18.08.4 | CC: | bart |
| Hardware: | Linux | OS: | Linux |
| Site: | Harvard Medical School | | |
| Attachments: | Core file from prod01; workaround initial crash | | |
Description
HMS Research Computing, 2020-02-27 02:39:59 MST
Hi,

Can you load the core file into gdb and share the backtrace with us? E.g.:

```
gdb -ex 't a a bt' -batch slurmctld <corefile>
```

(`t a a bt` is gdb shorthand for `thread apply all bt`, which prints a backtrace for every thread.)

Dominik

Hi Dominik,

Here it is:

```
# gdb -ex 't a a bt' -batch /usr/sbin/slurmctld core.26464
[New LWP 26464]
[New LWP 16525]
[New LWP 16527]
[New LWP 16524]
[New LWP 16529]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007fa691c085f7 in raise () from /usr/lib64/libc.so.6

Thread 5 (Thread 0x7fa68e614700 (LWP 16529)):
#0  0x00007fa691fa0a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fa67c1beed6 in _my_sleep (usec=60000000) at backfill.c:591
#2  0x00007fa67c1c55eb in backfill_agent (args=<optimized out>) at backfill.c:934
#3  0x00007fa691f9cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007fa691cc9ced in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x7fa68e816700 (LWP 16524)):
#0  0x00007fa691fa0a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fa68ddedae0 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:447
#2  0x00007fa691f9cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fa691cc9ced in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x7fa69298f700 (LWP 16527)):
#0  0x00007fa691c90efd in nanosleep () from /usr/lib64/libc.so.6
#1  0x00007fa691c90d94 in sleep () from /usr/lib64/libc.so.6
#2  0x00007fa68d1dd300 in _process_jobs (x=<optimized out>) at jobcomp_elasticsearch.c:899
#3  0x00007fa691f9cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007fa691cc9ced in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x7fa68e715700 (LWP 16525)):
#0  0x00007fa691cbf69d in poll () from /usr/lib64/libc.so.6
#1  0x00007fa692487626 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fa68e714da0) at /usr/include/bits/poll2.h:46
#2  _conn_readable (persist_conn=persist_conn@entry=0x1f57ec0) at slurm_persist_conn.c:138
#3  0x00007fa692488b2f in slurm_persist_recv_msg (persist_conn=0x1f57ec0) at slurm_persist_conn.c:882
#4  0x00007fa68ddf2cb8 in _handle_mult_rc_ret () at slurmdbd_agent.c:168
#5  _agent (x=<optimized out>) at slurmdbd_agent.c:667
#6  0x00007fa691f9cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#7  0x00007fa691cc9ced in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7fa692990740 (LWP 26464)):
#0  0x00007fa691c085f7 in raise () from /usr/lib64/libc.so.6
#1  0x00007fa691c09ce8 in abort () from /usr/lib64/libc.so.6
#2  0x00007fa691c01566 in __assert_fail_base () from /usr/lib64/libc.so.6
#3  0x00007fa691c01612 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00007fa69241cf21 in bit_nclear (b=b@entry=0x75904b0, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00007fa69241f4cd in bit_unfmt_hexmask (bitmap=0x75904b0, str=<optimized out>) at bitstring.c:1395
#6  0x00007fa69243761d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7ffefaa2eea8, buffer=buffer@entry=0x3392580, job_id=66584097, protocol_version=protocol_version@entry=8448) at gres.c:4275
#7  0x000000000045cd18 in _load_job_state (buffer=buffer@entry=0x3392580, protocol_version=<optimized out>) at job_mgr.c:1517
#8  0x000000000046059f in load_all_job_state () at job_mgr.c:988
#9  0x000000000049b3c7 in read_slurm_conf (recover=recover@entry=2, reconfig=reconfig@entry=false) at read_config.c:1306
#10 0x0000000000424d32 in run_backup (callbacks=callbacks@entry=0x7ffefaa2fa20) at backup.c:257
#11 0x000000000042b845 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607
```

Created attachment 13198 [details]
Core file from prod01
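Thread 1 is the informative one. While the backup controller loads saved job state (`load_all_job_state()` → `_load_job_state()` → `gres_plugin_job_state_unpack()`), the job's saved GRES bitmap is decoded with `bit_unfmt_hexmask()`, which begins by clearing the whole bitmap. Frame #4 shows that call as `bit_nclear(b, start=0, stop=-1)`: the bitmap has zero bits, so `stop = bit_size - 1 = -1`, and the assertion `(start) < ((b)[1])` fails because `0 < 0` is false. The following is a minimal sketch of that failure, assuming only what the assertion text implies (header word `[1]` of a Slurm bitstring holds its bit count); it is not Slurm's actual source.

```c
/*
 * Minimal sketch of the failing path, NOT Slurm's bitstring.c: a
 * bitstring whose header word [1] stores the bit count, and a clear
 * routine that asserts start < b[1] the way bitstring.c:292 does.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int64_t bitstr_t;   /* simplified stand-in for Slurm's word type */

static bitstr_t *sketch_bit_alloc(int64_t nbits)
{
    /* two header words (magic, bit count) plus the data words */
    bitstr_t *b = calloc(2 + (nbits + 63) / 64, sizeof(bitstr_t));
    b[1] = nbits;           /* header word [1] = number of bits */
    return b;
}

static void sketch_bit_nclear(bitstr_t *b, int64_t start, int64_t stop)
{
    assert(start < b[1]);   /* the "(start) < ((b)[1])" assert from the ticket */
    assert(stop < b[1]);
    for (int64_t i = start; i <= stop; i++)
        b[2 + i / 64] &= ~(1ULL << (i % 64));
}

int main(void)
{
    /* A job record restored with a zero-sized GRES bitmap ... */
    bitstr_t *b = sketch_bit_alloc(0);
    /* ... and the clear-everything call made before decoding the hex
     * mask: stop = bit_size - 1 = -1, so assert(0 < 0) aborts. This is
     * the SIGABRT seen in Thread 1 of the core. */
    sketch_bit_nclear(b, 0, b[1] - 1);
    free(b);
    return 0;
}
```

Compiled and run, this aborts on the first assertion with signal 6, mirroring the crash in the core file.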
Created attachment 13199 [details]
workaround initial crash

Hi,

Can you apply this patch and restart the slurmctld? This should at least move where the crash happens, if nothing else. This commit should prevent this issue in the future: https://github.com/SchedMD/slurm/commit/4c48a84a6edb

Dominik

I will see what I can do. I may need to wait until someone else is available who knows our build environment.

Thanks --Mick

Will have a build ready soon. What does the patch fix?

Thanks --Mick

Hi,

This is just a workaround. The patch skips loading zero-sized bitmaps, which should allow the slurmctld to start.

Dominik
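The patch itself (attachment 13199) is not quoted in this ticket, so the following is only a guess at the shape of "skips loading zero-sized bitmaps", not the actual fix: guard the hex-mask decode in the unpack path on a non-zero saved size. `unpack_gres_bitmap_sketch`, `saved_bit_cnt`, and `saved_hexmask` are invented names for illustration; `bitstr_t`, `bit_alloc()`, and `bit_unfmt_hexmask()` are the real bitstring identifiers from the backtrace.

```c
#include <stdint.h>
#include "src/common/bitstring.h"  /* Slurm source tree: bitstr_t, bit_alloc(),
                                    * bit_unfmt_hexmask(); build with -I<slurm-src> */

/*
 * Hypothetical guard: decode a job's saved GRES bitmap only when a
 * non-zero size was recorded in the state file. A zero size means
 * nothing usable was saved, so leave the bitmap unset instead of
 * handing a zero-sized bitmap to bit_unfmt_hexmask() and tripping the
 * bit_nclear() assertion during load_all_job_state().
 */
static bitstr_t *unpack_gres_bitmap_sketch(uint32_t saved_bit_cnt,
                                           const char *saved_hexmask)
{
    if (saved_bit_cnt == 0 || !saved_hexmask)
        return NULL;                /* skip zero-sized/absent bitmaps */

    bitstr_t *bitmap = bit_alloc(saved_bit_cnt);
    bit_unfmt_hexmask(bitmap, saved_hexmask);
    return bitmap;
}
```

Per Dominik's later explanation, a guard like this only sidesteps bad state already written to disk; the linked commit addresses the creation of the zero-sized bitmap in the first place, which is presumably why he calls the patch a workaround.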
I rebuilt and installed my new RPMs and am seeing the same issue. I installed them on our secondary server. Either the patch doesn't do what it says on the tin, or the patch wasn't applied. Any suggestions on how to confirm that the patch is in the RPMs that I built, or other suggestions?

Thanks --Mick

I can see the patch being applied during the build process:

```
Step 4 : ENV PATCH https://bugs.schedmd.com/attachment.cgi?id=13199
 ---> Using cache
 ---> c41e68d05365
```

Hi,

If you had applied this patch, the backtrace would definitely look different (the line number reported for bit_unfmt_hexmask() would have to change).

Dominik

Hi Dominik,

This is what I'm seeing in the backtrace:

```
# gdb -ex 't a a bt' -batch /usr/sbin/slurmctld core.6734
[New LWP 6734]
[New LWP 6788]
[New LWP 6739]
[New LWP 6785]
[New LWP 6786]
[New LWP 6784]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007f61f832e5f7 in raise () from /lib64/libc.so.6

Thread 6 (Thread 0x7f61f4f3c700 (LWP 6784)):
#0  0x00007f61f86c6a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61f4513ae0 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:447
#2  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f61f90b6700 (LWP 6786)):
#0  0x00007f61f83b6efd in nanosleep () from /lib64/libc.so.6
#1  0x00007f61f83b6d94 in sleep () from /lib64/libc.so.6
#2  0x00007f61f4108300 in _process_jobs (x=<optimized out>) at jobcomp_elasticsearch.c:899
#3  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f61f4e3b700 (LWP 6785)):
#0  0x00007f61f83e569d in poll () from /lib64/libc.so.6
#1  0x00007f61f8bad626 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7f61f4e3ada0) at /usr/include/bits/poll2.h:46
#2  _conn_readable (persist_conn=persist_conn@entry=0x26fae40) at slurm_persist_conn.c:138
#3  0x00007f61f8baeb2f in slurm_persist_recv_msg (persist_conn=0x26fae40) at slurm_persist_conn.c:882
#4  0x00007f61f4518cb8 in _handle_mult_rc_ret () at slurmdbd_agent.c:168
#5  _agent (x=<optimized out>) at slurmdbd_agent.c:667
#6  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f61f4d3a700 (LWP 6739)):
#0  0x00007f61f83b6efd in nanosleep () from /lib64/libc.so.6
#1  0x00007f61f83b6d94 in sleep () from /lib64/libc.so.6
#2  0x00007f61f8bb38c9 in slurm_send_recv_controller_msg (request_msg=request_msg@entry=0x7f61f4d39e20, response_msg=response_msg@entry=0x7f61f4d39d60, comm_cluster_rec=0x0) at slurm_protocol_api.c:4515
#3  0x00007f61f8bb4158 in slurm_send_recv_controller_rc_msg (req=req@entry=0x7f61f4d39e20, rc=rc@entry=0x7f61f4d39e0c, comm_cluster_rec=<optimized out>) at slurm_protocol_api.c:4893
#4  0x00007f61f8b36439 in slurm_pull_trigger (trigger_pull=trigger_pull@entry=0x7f61f4d39ed0) at triggers.c:171
#5  0x00000000004241a0 in _trigger_slurmctld_event (arg=<optimized out>) at backup.c:722
#6  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f61f4a35700 (LWP 6788)):
#0  0x00007f61f86c6a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61e2938ed6 in _my_sleep (usec=60000000) at backfill.c:591
#2  0x00007f61e293f5eb in backfill_agent (args=<optimized out>) at backfill.c:934
#3  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f61f90b7740 (LWP 6734)):
#0  0x00007f61f832e5f7 in raise () from /lib64/libc.so.6
#1  0x00007f61f832fce8 in abort () from /lib64/libc.so.6
#2  0x00007f61f8327566 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f61f8327612 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f61f8b42f21 in bit_nclear (b=b@entry=0x7d26340, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00007f61f8b454cd in bit_unfmt_hexmask (bitmap=0x7d26340, str=<optimized out>) at bitstring.c:1395
#6  0x00007f61f8b5d61d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7ffe6b7a9018, buffer=buffer@entry=0x3b35380, job_id=66584097, protocol_version=protocol_version@entry=8448) at gres.c:4275
#7  0x000000000045cd18 in _load_job_state (buffer=buffer@entry=0x3b35380, protocol_version=<optimized out>) at job_mgr.c:1517
#8  0x000000000046059f in load_all_job_state () at job_mgr.c:988
#9  0x000000000049b3c7 in read_slurm_conf (recover=recover@entry=2, reconfig=reconfig@entry=false) at read_config.c:1306
#10 0x0000000000424d32 in run_backup (callbacks=callbacks@entry=0x7ffe6b7a9b90) at backup.c:257
#11 0x000000000042b845 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607
```

Ignore my last comment; my colleague confirmed that I missed something in our build process and didn't apply the patch. Will rebuild shortly.

Thanks --Mick

Hi Dominik,

We finally got the patch applied and Slurm appears to be working. I'll update later if there are further problems. We will look into upgrading to 19.x in the near future. Thanks for the help.

Cheers -Mick

Any chance you can tell me the exact reason for this bug so that I can communicate it to my management?

Thanks! --Mick

Hi,

This is a duplicate of bug 6739. In some situations, slurmctld wrongly creates a zero-sized bitmap in the struct describing a job's GRES. 18.08.4 is old and no longer supported; furthermore, it contains plenty of known and already-fixed issues. The following commit is in Slurm 18.08.8 and fixes this issue: https://github.com/SchedMD/slurm/commit/4c48a84a6edb

I'm glad it fixed it for you. I'm closing this as resolved/duplicate of 6739.

Dominik

*** This ticket has been marked as a duplicate of ticket 6739 ***