Ticket 8006

Summary: Slurmctld - crashes - slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed.
Product: Slurm Reporter: Damien <damien.leong>
Component: slurmctld    Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact    
Priority: --- CC: bart
Version: 18.08.6   
Hardware: Linux   
OS: Linux   
Site: Monash University
Attachments: slurm core dump
slurm core dump
workaround initial crash
slurmctld log

Description Damien 2019-10-28 02:13:25 MDT
Hi Slurm Support

We are using slurm 18.08.6.

Our slurmctld crashes, producing multiple core dumps. We couldn't bring it back up even after multiple restarts.


./slurmctld -D
slurmctld: debug:  Log file re-opened
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: slurmctld version 18.08.6-2 started on cluster m3
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug:  init: Gres GPU plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 20
slurmctld: preempt/qos loaded
slurmctld: debug:  Checkpoint plugin loaded: checkpoint/none
slurmctld: debug:  AcctGatherEnergy NONE plugin loaded
slurmctld: debug:  AcctGatherProfile NONE plugin loaded
slurmctld: debug:  AcctGatherInterconnect NONE plugin loaded
slurmctld: debug:  AcctGatherFilesystem NONE plugin loaded
slurmctld: debug:  Job accounting gather cgroup plugin loaded
slurmctld: job_submit.lua: initialized
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug:  switch NONE plugin loaded
slurmctld: debug:  power_save module disabled, SuspendTime < 0
slurmctld: debug:  Requesting control from backup controller m3-mgmt1
.....
.....
.....
slurmctld:   gres_per_node:1 node_cnt:0
slurmctld: Recovered JobId=11534397 Assoc=2472
slurmctld: recovered JobId=11528146_11(11538958) StepId=Extern
slurmctld: Recovered JobId=11528146_11(11538958) Assoc=361
slurmctld: gres:gpu(7696487) type:P4(13392) job:11538967 state
slurmctld:   gres_per_node:1 node_cnt:1
slurmctld:   gres_bit_step_alloc:NULL
slurmctld:   gres_bit_alloc[0]:1
slurmctld:   gres_cnt_step_alloc[0]:0
slurmctld: recovered JobId=11538967 StepId=Extern
slurmctld: Recovered JobId=11538967 Assoc=3026
slurmctld: recovered JobId=11539461 StepId=Extern
slurmctld: Recovered JobId=11539461 Assoc=2220
slurmctld: recovered JobId=11539464 StepId=Extern
slurmctld: Recovered JobId=11539464 Assoc=2220
slurmctld: recovered JobId=11539468 StepId=Extern
slurmctld: Recovered JobId=11539468 Assoc=2220
slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed.
Aborted



Investigation leads to this:

# gdb /opt/slurm-18.08.6-2/sbin/slurmctld  core.17463 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm-18.08.6-2/sbin/slurmctld...done.
[New LWP 17463]
[New LWP 17464]
[New LWP 17470]
[New LWP 17466]
[New LWP 17468]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm-18.08.6-2/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007ff67b1b6207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.3.x86_64 lua-5.1.4-15.el7.x86_64 sssd-client-1.16.2-13.el7_6.5.x86_64
(gdb) where
#0  0x00007ff67b1b6207 in raise () from /lib64/libc.so.6
#1  0x00007ff67b1b78f8 in abort () from /lib64/libc.so.6
#2  0x00007ff67b1af026 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ff67b1af0d2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ff67b9d2291 in bit_nclear (b=b@entry=0x18f7130, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00007ff67b9d4790 in bit_unfmt_hexmask (bitmap=0x18f7130, str=<optimized out>) at bitstring.c:1397
#6  0x00007ff67b9ebbf5 in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7ffeb93ae7e0, buffer=buffer@entry=0x17518c0, 
    job_id=11539588, protocol_version=protocol_version@entry=8448) at gres.c:4318
#7  0x000000000045b77e in _load_job_state (buffer=buffer@entry=0x17518c0, protocol_version=<optimized out>) at job_mgr.c:1519
#8  0x000000000045f21c in load_all_job_state () at job_mgr.c:988
#9  0x0000000000499583 in read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1326
#10 0x000000000042b172 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:663
(gdb) 




### The problematic job seems to be job_id=11539588
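
Reading frames #4-#6, bit_unfmt_hexmask() is called on the job's recovered GRES bitmap and then calls bit_nclear() with start=0 and stop=-1, i.e. the bitmap appears to have zero bits. A minimal stand-alone sketch of that failure mode (simplified layout for illustration, not the actual Slurm source):

/* Sketch only: assumes element [1] of the bitstring holds the bit count,
 * matching the failed assertion "(start) < ((b)[1])". With a zero-size
 * bitmap, clearing the range [0, size-1] becomes bit_nclear(b, 0, -1)
 * and the assertion 0 < 0 fails, aborting the daemon. */
#include <assert.h>
#include <stdint.h>

typedef int64_t bitstr_t;

static void sketch_bit_nclear(bitstr_t *b, int64_t start, int64_t stop)
{
        assert(start < b[1]);   /* fails: 0 < 0 for an empty bitmap */
        assert(stop < b[1]);
        (void) stop;            /* real code would clear bits start..stop */
}

int main(void)
{
        bitstr_t empty[2] = { 0, 0 };               /* bitmap with zero bits */
        sketch_bit_nclear(empty, 0, empty[1] - 1);  /* mirrors frame #4 */
        return 0;
}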

Further investigations:

# cd job.11539588/
[root@m3-mgmt2 job.11539588]# ls
environment  script
[root@m3-mgmt2 job.11539588]# cat script 
#!/bin/bash
#SBATCH --job-name=labeling
#SBATCH --account=pd87
#SBATCH --time=10:00:00
#SBATCH --ntasks=1
#SBATCH --mem=64000
#SBATCH --cpus-per-task=26
#SBATCH --gpus-per-task=1
#SBATCH --partition=m3g
source activate MLenv
python labeling.py


### --gpus-per-task=1 is only available from v19.05, right? Could it have caused problems for our slurmctld v18.08.6?


Since our slurmctld is down, how can we recover from this? I have gone through most of our compute nodes and most jobs are still running. slurmctld, sinfo, and squeue are dead. I don't want to scancel the other running jobs.


Kindly advise and help


Thanks


Damien
Comment 1 Damien 2019-10-28 02:17:48 MDT
Created attachment 12106 [details]
slurm core dump
Comment 2 Damien 2019-10-28 02:18:49 MDT
Created attachment 12107 [details]
slurm core dump
Comment 3 Dominik Bartkiewicz 2019-10-28 02:24:47 MDT
Created attachment 12108 [details]
workaround initial crash

Hi

Can you apply this and restart the slurmctld?
If nothing else, this should move where the crash happens.

This commit should prevent this issue in the future:
https://github.com/SchedMD/slurm/commit/4c48a84a6edb

Dominik
Comment 4 Damien 2019-10-28 02:30:40 MDT
Thanks for your reply.


We are running v18.08.6

Your patch is for v19.05 ... 


Can it work?
Comment 5 Damien 2019-10-28 02:33:22 MDT
Is the patch for this file only?

src/common/gres.c
Comment 6 Dominik Bartkiewicz 2019-10-28 03:05:54 MDT
Hi

You need to apply attachment 12108 [details] and restart slurmctld.

Commit https://github.com/SchedMD/slurm/commit/4c48a84a6edb is included in 18.08.8. 
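
(For reference, applying the workaround against an installed 18.08 source tree would look roughly like the following; the paths and patch file name are placeholders, not taken from this ticket.)

# Placeholder paths -- adjust to the local source and install locations.
cd /usr/local/src/slurm-18.08.6-2
patch -p1 < /tmp/bug8006-workaround.patch    # workaround touches src/common/gres.c
make && make install                         # rebuild and reinstall
/opt/slurm-18.08.6-2/sbin/slurmctld -D -vvv  # restart in the foreground and watch for the assert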

Dominik
Comment 7 Damien 2019-10-28 03:20:31 MDT
Created attachment 12111 [details]
slurmctld log
Comment 8 Dominik Bartkiewicz 2019-10-28 03:34:57 MDT
Hi

Is slurmctld still segfaulting?
Was this log taken after applying the patch?

Dominik
Comment 9 Damien 2019-10-28 03:38:02 MDT
Thanks, the patch seems to work.


Our Slurmctld is back...
Comment 10 Dominik Bartkiewicz 2019-10-28 03:50:44 MDT
Hi

I'm glad to hear that the slurmctld is working.
Can we lower the severity of this ticket to 3?

Is there any reason why you are on 18.08.6 rather than 18.08.8?

Dominik
Comment 11 Damien 2019-10-28 03:57:49 MDT
Hi Dominik


Yes, please.


We are planning to upgrade to v19.05.X soon, but I am worried about existing users' scripts that use "--gres=gpu:V100:1", since v19.05.X no longer seems to have GPU info in gres.conf and everything is moving towards TRES.



Cheers

Damien
Comment 12 Dominik Bartkiewicz 2019-10-28 06:04:33 MDT
Hi

Syntax like "--gres=gpu:V100:1" is supported in 19.05 and we have no plans to remove it in the future.

Slurm still takes GRES info from gres.conf.
To enable AutoDetect you need to set it explicitly in gres.conf.
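
For example, a gres.conf along these lines keeps working unchanged in 19.05 (the node names and device paths below are placeholders, not taken from this site):

# Explicitly listed GPUs continue to work in 19.05:
NodeName=m3g[001-010] Name=gpu Type=V100 File=/dev/nvidia[0-1]
# AutoDetect is opt-in and only used if you add it yourself:
#AutoDetect=nvml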

Dominik
Comment 13 Dominik Bartkiewicz 2019-10-29 04:28:54 MDT
Hi

Is the situation still stable?

Did I answer your concerns?
If you have more doubts, please let me know here
or open a separate ticket.

Dominik
Comment 14 Damien 2019-10-29 06:51:59 MDT
Hi Dominik

Thanks for your reply.

Our slurmctld is running with the mentioned patch applied.


Current plan:

1) Prepare v18.08.8, just in case...

2) Gather clarity for v19.05.x upgrades
   - compatibility issues
   - any deprecated features or commands
   - Testing
  This should be a separate ticket if needed.




Once again, thanks for your help.

Cheers

Damien
Comment 15 Dominik Bartkiewicz 2019-10-31 08:30:06 MDT
Hi

If you can create a new ticket, that would be the best option.

Dominik
Comment 16 Damien 2019-10-31 16:38:54 MDT
Thanks, I will do that.



Cheers

Damien
Comment 17 Dominik Bartkiewicz 2019-11-01 03:01:25 MDT
Hi

Closing as duplicate of 6739, 
please reopen if you have further questions.

Dominik

*** This ticket has been marked as a duplicate of ticket 6739 ***