| Summary: | slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed. | | |
| --- | --- | --- | --- |
| Product: | Slurm | Reporter: | HMS Research Computing <rc> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | Severity: | 1 - System not usable |
| Version: | 18.08.4 | CC: | bart |
| Hardware: | Linux | OS: | Linux |
| Site: | Harvard Medical School | | |
| Attachments: | Core file from prod01; workaround initial crash | | |
Description
HMS Research Computing, 2020-02-27 02:39:59 MST
Hi,

Can you load the core file into gdb and share the backtrace with us? E.g.:

```
gdb -ex 't a a bt' -batch slurmctld <corefile>
```

(`t a a bt` is gdb shorthand for `thread apply all bt`, which prints a backtrace for every thread.)

Dominik

Hi Dominik,

Here it is:

```
# gdb -ex 't a a bt' -batch /usr/sbin/slurmctld core.26464
[New LWP 26464]
[New LWP 16525]
[New LWP 16527]
[New LWP 16524]
[New LWP 16529]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007fa691c085f7 in raise () from /usr/lib64/libc.so.6

Thread 5 (Thread 0x7fa68e614700 (LWP 16529)):
#0  0x00007fa691fa0a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fa67c1beed6 in _my_sleep (usec=60000000) at backfill.c:591
#2  0x00007fa67c1c55eb in backfill_agent (args=<optimized out>) at backfill.c:934
#3  0x00007fa691f9cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007fa691cc9ced in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x7fa68e816700 (LWP 16524)):
#0  0x00007fa691fa0a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fa68ddedae0 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:447
#2  0x00007fa691f9cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fa691cc9ced in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x7fa69298f700 (LWP 16527)):
#0  0x00007fa691c90efd in nanosleep () from /usr/lib64/libc.so.6
#1  0x00007fa691c90d94 in sleep () from /usr/lib64/libc.so.6
#2  0x00007fa68d1dd300 in _process_jobs (x=<optimized out>) at jobcomp_elasticsearch.c:899
#3  0x00007fa691f9cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007fa691cc9ced in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x7fa68e715700 (LWP 16525)):
#0  0x00007fa691cbf69d in poll () from /usr/lib64/libc.so.6
#1  0x00007fa692487626 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fa68e714da0) at /usr/include/bits/poll2.h:46
#2  _conn_readable (persist_conn=persist_conn@entry=0x1f57ec0) at slurm_persist_conn.c:138
#3  0x00007fa692488b2f in slurm_persist_recv_msg (persist_conn=0x1f57ec0) at slurm_persist_conn.c:882
#4  0x00007fa68ddf2cb8 in _handle_mult_rc_ret () at slurmdbd_agent.c:168
#5  _agent (x=<optimized out>) at slurmdbd_agent.c:667
#6  0x00007fa691f9cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#7  0x00007fa691cc9ced in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7fa692990740 (LWP 26464)):
#0  0x00007fa691c085f7 in raise () from /usr/lib64/libc.so.6
#1  0x00007fa691c09ce8 in abort () from /usr/lib64/libc.so.6
#2  0x00007fa691c01566 in __assert_fail_base () from /usr/lib64/libc.so.6
#3  0x00007fa691c01612 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00007fa69241cf21 in bit_nclear (b=b@entry=0x75904b0, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00007fa69241f4cd in bit_unfmt_hexmask (bitmap=0x75904b0, str=<optimized out>) at bitstring.c:1395
#6  0x00007fa69243761d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7ffefaa2eea8, buffer=buffer@entry=0x3392580, job_id=66584097, protocol_version=protocol_version@entry=8448) at gres.c:4275
#7  0x000000000045cd18 in _load_job_state (buffer=buffer@entry=0x3392580, protocol_version=<optimized out>) at job_mgr.c:1517
#8  0x000000000046059f in load_all_job_state () at job_mgr.c:988
#9  0x000000000049b3c7 in read_slurm_conf (recover=recover@entry=2, reconfig=reconfig@entry=false) at read_config.c:1306
#10 0x0000000000424d32 in run_backup (callbacks=callbacks@entry=0x7ffefaa2fa20) at backup.c:257
#11 0x000000000042b845 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607
```

Created attachment 13198 [details]
Core file from prod01
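Thread 1 is the informative one. While the backup controller loads saved job state (`load_all_job_state()` → `_load_job_state()` → `gres_plugin_job_state_unpack()`), the job's saved GRES bitmap is decoded with `bit_unfmt_hexmask()`, which begins by clearing the whole bitmap. Frame #4 shows that call as `bit_nclear(b, start=0, stop=-1)`: the bitmap has zero bits, so `stop = bit_size - 1 = -1`, and the assertion `(start) < ((b)[1])` fails because `0 < 0` is false. The following is a minimal sketch of that failure, assuming only what the assertion text implies (header word `[1]` of a Slurm bitstring holds its bit count); it is not Slurm's actual source.

```c
/*
 * Minimal sketch of the failing path, NOT Slurm's bitstring.c: a
 * bitstring whose header word [1] stores the bit count, and a clear
 * routine that asserts start < b[1] the way bitstring.c:292 does.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int64_t bitstr_t;   /* simplified stand-in for Slurm's word type */

static bitstr_t *sketch_bit_alloc(int64_t nbits)
{
    /* two header words (magic, bit count) plus the data words */
    bitstr_t *b = calloc(2 + (nbits + 63) / 64, sizeof(bitstr_t));
    b[1] = nbits;           /* header word [1] = number of bits */
    return b;
}

static void sketch_bit_nclear(bitstr_t *b, int64_t start, int64_t stop)
{
    assert(start < b[1]);   /* the "(start) < ((b)[1])" assert from the ticket */
    assert(stop < b[1]);
    for (int64_t i = start; i <= stop; i++)
        b[2 + i / 64] &= ~(1ULL << (i % 64));
}

int main(void)
{
    /* A job record restored with a zero-sized GRES bitmap ... */
    bitstr_t *b = sketch_bit_alloc(0);
    /* ... and the clear-everything call made before decoding the hex
     * mask: stop = bit_size - 1 = -1, so assert(0 < 0) aborts. This is
     * the SIGABRT seen in Thread 1 of the core. */
    sketch_bit_nclear(b, 0, b[1] - 1);
    free(b);
    return 0;
}
```

Compiled and run, this aborts on the first assertion with signal 6, mirroring the crash in the core file.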
Created attachment 13199 [details]
workaround initial crash

Hi,

Can you apply this patch and restart the slurmctld? This should at least move where the crash happens, if nothing else. This commit should prevent this issue in the future: https://github.com/SchedMD/slurm/commit/4c48a84a6edb

Dominik

I will see what I can do. I may need to wait until someone else is available who knows our build environment.

Thanks --Mick

Will have a build ready soon. What does the patch fix?

Thanks --Mick

Hi,

This is just a workaround. The patch skips loading zero-sized bitmaps, which should allow the slurmctld to start.

Dominik
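The patch itself (attachment 13199) is not quoted in this ticket, so the following is only a guess at the shape of "skips loading zero-sized bitmaps", not the actual fix: guard the hex-mask decode in the unpack path on a non-zero saved size. `unpack_gres_bitmap_sketch`, `saved_bit_cnt`, and `saved_hexmask` are invented names for illustration; `bitstr_t`, `bit_alloc()`, and `bit_unfmt_hexmask()` are the real bitstring identifiers from the backtrace.

```c
#include <stdint.h>
#include "src/common/bitstring.h"  /* Slurm source tree: bitstr_t, bit_alloc(),
                                    * bit_unfmt_hexmask(); build with -I<slurm-src> */

/*
 * Hypothetical guard: decode a job's saved GRES bitmap only when a
 * non-zero size was recorded in the state file. A zero size means
 * nothing usable was saved, so leave the bitmap unset instead of
 * handing a zero-sized bitmap to bit_unfmt_hexmask() and tripping the
 * bit_nclear() assertion during load_all_job_state().
 */
static bitstr_t *unpack_gres_bitmap_sketch(uint32_t saved_bit_cnt,
                                           const char *saved_hexmask)
{
    if (saved_bit_cnt == 0 || !saved_hexmask)
        return NULL;                /* skip zero-sized/absent bitmaps */

    bitstr_t *bitmap = bit_alloc(saved_bit_cnt);
    bit_unfmt_hexmask(bitmap, saved_hexmask);
    return bitmap;
}
```

Per Dominik's later explanation, a guard like this only sidesteps bad state already written to disk; the linked commit addresses the creation of the zero-sized bitmap in the first place, which is presumably why he calls the patch a workaround.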
I rebuilt and installed my new RPMs and am seeing the same issue. I installed them on our secondary server. Either the patch doesn't do what it says on the tin, or the patch wasn't applied. Any suggestions on how to confirm that the patch is in the RPMs that I built, or other suggestions?

Thanks --Mick

I can see the patch being applied during the build process:

```
Step 4 : ENV PATCH https://bugs.schedmd.com/attachment.cgi?id=13199
 ---> Using cache
 ---> c41e68d05365
```

Hi,

If you had applied this patch, the backtrace would definitely look different (the line number reported for bit_unfmt_hexmask() would have to change).

Dominik

Hi Dominik,

This is what I'm seeing in the backtrace:

```
# gdb -ex 't a a bt' -batch /usr/sbin/slurmctld core.6734
[New LWP 6734]
[New LWP 6788]
[New LWP 6739]
[New LWP 6785]
[New LWP 6786]
[New LWP 6784]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007f61f832e5f7 in raise () from /lib64/libc.so.6

Thread 6 (Thread 0x7f61f4f3c700 (LWP 6784)):
#0  0x00007f61f86c6a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61f4513ae0 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:447
#2  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f61f90b6700 (LWP 6786)):
#0  0x00007f61f83b6efd in nanosleep () from /lib64/libc.so.6
#1  0x00007f61f83b6d94 in sleep () from /lib64/libc.so.6
#2  0x00007f61f4108300 in _process_jobs (x=<optimized out>) at jobcomp_elasticsearch.c:899
#3  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f61f4e3b700 (LWP 6785)):
#0  0x00007f61f83e569d in poll () from /lib64/libc.so.6
#1  0x00007f61f8bad626 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7f61f4e3ada0) at /usr/include/bits/poll2.h:46
#2  _conn_readable (persist_conn=persist_conn@entry=0x26fae40) at slurm_persist_conn.c:138
#3  0x00007f61f8baeb2f in slurm_persist_recv_msg (persist_conn=0x26fae40) at slurm_persist_conn.c:882
#4  0x00007f61f4518cb8 in _handle_mult_rc_ret () at slurmdbd_agent.c:168
#5  _agent (x=<optimized out>) at slurmdbd_agent.c:667
#6  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f61f4d3a700 (LWP 6739)):
#0  0x00007f61f83b6efd in nanosleep () from /lib64/libc.so.6
#1  0x00007f61f83b6d94 in sleep () from /lib64/libc.so.6
#2  0x00007f61f8bb38c9 in slurm_send_recv_controller_msg (request_msg=request_msg@entry=0x7f61f4d39e20, response_msg=response_msg@entry=0x7f61f4d39d60, comm_cluster_rec=0x0) at slurm_protocol_api.c:4515
#3  0x00007f61f8bb4158 in slurm_send_recv_controller_rc_msg (req=req@entry=0x7f61f4d39e20, rc=rc@entry=0x7f61f4d39e0c, comm_cluster_rec=<optimized out>) at slurm_protocol_api.c:4893
#4  0x00007f61f8b36439 in slurm_pull_trigger (trigger_pull=trigger_pull@entry=0x7f61f4d39ed0) at triggers.c:171
#5  0x00000000004241a0 in _trigger_slurmctld_event (arg=<optimized out>) at backup.c:722
#6  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f61f4a35700 (LWP 6788)):
#0  0x00007f61f86c6a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61e2938ed6 in _my_sleep (usec=60000000) at backfill.c:591
#2  0x00007f61e293f5eb in backfill_agent (args=<optimized out>) at backfill.c:934
#3  0x00007f61f86c2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61f83efced in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f61f90b7740 (LWP 6734)):
#0  0x00007f61f832e5f7 in raise () from /lib64/libc.so.6
#1  0x00007f61f832fce8 in abort () from /lib64/libc.so.6
#2  0x00007f61f8327566 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f61f8327612 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f61f8b42f21 in bit_nclear (b=b@entry=0x7d26340, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00007f61f8b454cd in bit_unfmt_hexmask (bitmap=0x7d26340, str=<optimized out>) at bitstring.c:1395
#6  0x00007f61f8b5d61d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7ffe6b7a9018, buffer=buffer@entry=0x3b35380, job_id=66584097, protocol_version=protocol_version@entry=8448) at gres.c:4275
#7  0x000000000045cd18 in _load_job_state (buffer=buffer@entry=0x3b35380, protocol_version=<optimized out>) at job_mgr.c:1517
#8  0x000000000046059f in load_all_job_state () at job_mgr.c:988
#9  0x000000000049b3c7 in read_slurm_conf (recover=recover@entry=2, reconfig=reconfig@entry=false) at read_config.c:1306
#10 0x0000000000424d32 in run_backup (callbacks=callbacks@entry=0x7ffe6b7a9b90) at backup.c:257
#11 0x000000000042b845 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607
```

Ignore my last comment; my colleague confirmed that I missed something in our build process and didn't apply the patch. Will rebuild shortly.

Thanks --Mick

Hi Dominik,

We finally got the patch applied and Slurm appears to be working. I'll update later if there are further problems. We will look into upgrading to 19.x in the near future. Thanks for the help.

Cheers -Mick

Any chance you can tell me the exact reason for this bug so that I can communicate it to my management?

Thanks! --Mick

Hi,

This is a duplicate of bug 6739. In some situations, slurmctld wrongly creates a zero-sized bitmap in the struct describing a job's GRES. 18.08.4 is old and no longer supported; furthermore, it contains plenty of known and already-fixed issues. The following commit is in Slurm 18.08.8 and fixes this issue: https://github.com/SchedMD/slurm/commit/4c48a84a6edb

I'm glad it fixed it for you. I'm closing this as resolved/duplicate of 6739.

Dominik

*** This ticket has been marked as a duplicate of ticket 6739 ***