Ticket 8651

Summary: After upgrade from 18.08.9 to 19.05.5 slurmctld fails, bit_alloc: Assertion `(nbits) > 0' failed.
Product: Slurm
Reporter: ARC Admins <arc-slurm-admins>
Component: slurmctld
Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: marshall, nate
Version: 19.05.5
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8285
https://bugs.schedmd.com/show_bug.cgi?id=6739
https://bugs.schedmd.com/show_bug.cgi?id=8360
Site: University of Michigan
Version Fixed: 20.02.0
Attachments: workaround to get slurmctld running
workaround v2

Description ARC Admins 2020-03-10 10:45:40 MDT
Hello,

We updated our Slurm install from 18.08.9 to 19.05.5 but are running into a problem starting slurmctld.

# debug5: last 8 lines before the core dump
slurmctld: debug3: Version string in node_state header is PROTOCOL_VERSION
slurmctld: Recovered state of 407 nodes
slurmctld: Down nodes: gl3269
slurmctld: debug3: create_mmap_buf: loaded file `/var/spool/slurm.state/job_state` as Buf
slurmctld: debug3: Version string in job_state header is PROTOCOL_VERSION
slurmctld: debug3: Job id in job_state header is 4793557
slurmctld: bitstring.c:166: bit_alloc: Assertion `(nbits) > 0' failed.
Aborted (core dumped) 
 


Update steps taken:
1. slurmctld stopped
2. created a backup of /var/spool/slurm.state/ directory
3. the server running slurmctld was rebuilt from scratch; end state was Slurm 19.05.5
4. restored the /var/spool/slurm.state/ directory
5. attempted to start slurmctld and it crashes

Please advise

Thanks
Vasile
Comment 1 Marshall Garey 2020-03-10 10:52:00 MDT
Can you upload a backtrace from the core dump?

Also, do you still have the band-aid patch from bug 8360 applied? I haven't resolved that one yet and I'm wondering if it is causing this crash.
Comment 2 ARC Admins 2020-03-10 11:05:55 MDT
(In reply to Marshall Garey from comment #1)
> Can you upload a backtrace from the core dump?

(gdb) bt
#0  0x00002adfc01e2337 in raise () from /lib64/libc.so.6
#1  0x00002adfc01e3a28 in abort () from /lib64/libc.so.6
#2  0x00002adfc01db156 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00002adfc01db202 in __assert_fail () from /lib64/libc.so.6
#4  0x00002adfbfa04e5c in bit_alloc (nbits=0) at bitstring.c:166
#5  0x00002adfbfa245fa in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7fff487048f8, buffer=buffer@entry=0x430a840, job_id=3222565, protocol_version=protocol_version@entry=8448) at gres.c:6006
#6  0x0000000000460816 in _load_job_state (buffer=buffer@entry=0x430a840, protocol_version=<optimized out>) at job_mgr.c:1879
#7  0x00000000004634f5 in load_all_job_state () at job_mgr.c:1114
#8  0x000000000049f56c in read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1334
#9  0x000000000042d699 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:662
(gdb) 


 
> Also, do you still have the band-aid patch from bug 8360 applied? I haven't
> resolved that one yet and I'm wondering if it is causing this crash.

Yes, that one was resolved; we had a node that had been removed but was still listed in slurm.conf.
Comment 3 Nate Rini 2020-03-10 11:26:36 MDT
(In reply to ARCTS Admins from comment #2)
> (In reply to Marshall Garey from comment #1)
> > Can you upload a backtrace from the core dump?
> 
> (gdb) bt

Please call the following in gdb with the core:
> set pagination off
> set print pretty on
> f 5
> info locals
> info args
Comment 5 ARC Admins 2020-03-10 11:32:29 MDT
(In reply to Nate Rini from comment #3)

> Please call the following in gdb with the core:
> set pagination off
> set print pretty on
> f 5
> info locals
> info args

(gdb) set pagination off
(gdb) set print pretty on
(gdb) f 5
#5  0x00002adfbfa245fa in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7fff487048f8, buffer=buffer@entry=0x430a840, job_id=3222565, protocol_version=protocol_version@entry=8448) at gres.c:6006
6006    gres.c: No such file or directory.
(gdb) info locals
tmp_str = 0x4323480 "0x"
_size = 0
_tmp_uint32 = 3
i = 0
rc = 0
magic = 1133130964
plugin_id = 7696487
utmp32 = 1
rec_cnt = 0
has_more = 1 '\001'
gres_ptr = <optimized out>
gres_job_ptr = 0x4323380
__func__ = "gres_plugin_job_state_unpack"
(gdb) info args
gres_list = 0x7fff487048f8
buffer = 0x430a840
job_id = 3222565
protocol_version = 8448
(gdb)
Comment 6 Marshall Garey 2020-03-10 11:38:18 MDT
Can you also upload the output of the following from gdb?

frame 5
p *gres_job_ptr
Comment 8 ARC Admins 2020-03-10 11:41:57 MDT
(In reply to Marshall Garey from comment #6)
> Can you also upload the output of the following from gdb?
> 
> frame 5
> p *gres_job_ptr

(gdb) frame 5
#5  0x00002adfbfa245fa in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7fff487048f8, buffer=buffer@entry=0x430a840, job_id=3222565, protocol_version=protocol_version@entry=8448) at gres.c:6006
6006    in gres.c
(gdb) p *gres_job_ptr
$1 = {
  gres_name = 0x0, 
  type_id = 0, 
  type_name = 0x0, 
  flags = 0, 
  cpus_per_gres = 0, 
  gres_per_job = 0, 
  gres_per_node = 0, 
  gres_per_socket = 0, 
  gres_per_task = 2, 
  mem_per_gres = 4096, 
  def_cpus_per_gres = 0, 
  def_mem_per_gres = 0, 
  total_node_cnt = 0, 
  gres_bit_select = 0x0, 
  gres_cnt_node_select = 0x0, 
  total_gres = 0, 
  node_cnt = 1, 
  gres_bit_alloc = 0x4323460, 
  gres_cnt_node_alloc = 0x4323440, 
  gres_bit_step_alloc = 0x0, 
  gres_cnt_step_alloc = 0x0
}
(gdb)
Comment 9 Nate Rini 2020-03-10 11:42:08 MDT
Please also call the following in gdb with the core:
> p *buffer
> p buffer->size
> p buffer->processed
> p buffer->mmaped
> p buffer->head
> x/32xb buffer->head
Comment 10 ARC Admins 2020-03-10 11:47:22 MDT
(In reply to Nate Rini from comment #9)
> Please also call the following in gdb with the core:
> > p *buffer
> > p buffer->size
> > p buffer->processed
> > p buffer->mmaped
> > p buffer->head
> > x/32xb buffer->head

(gdb) p buffer->size
$3 = 1324725
(gdb) p buffer->processed
$4 = 759
(gdb) p buffer->mmaped
$5 = true
(gdb) p buffer->head
$6 = 0x2adfc855f000 <Address 0x2adfc855f000 out of bounds>
(gdb) x/32xb buffer->head
0x2adfc855f000: Cannot access memory at address 0x2adfc855f000
(gdb)
Comment 11 Marshall Garey 2020-03-10 11:49:21 MDT
Created attachment 13324 [details]
workaround to get slurmctld running

Can you apply this patch and restart slurmctld? This is a workaround to make slurmctld start running while we investigate the cause. Also, can you save a copy of the (pre-patch) slurmctld binary, libraries, and coredump so we can keep asking for information about it?
Comment 12 Marshall Garey 2020-03-10 11:49:54 MDT
Oh, and can you also keep a copy of your current state save location?
Comment 13 Nate Rini 2020-03-10 12:13:52 MDT
(In reply to Marshall Garey from comment #12)
> Oh, and can you also keep a copy of your current state save location?

Please attach a copy of your statesave directory when convenient.
Comment 14 Marshall Garey 2020-03-10 12:15:02 MDT
Created attachment 13327 [details]
workaround v2

I missed something in the previous patch that I believe will still allow slurmctld to crash - the previous patch failed to unpack something that was packed. I believe this one will correctly handle a size 0 bitmap, which is what appears to have happened. Can you apply this one instead? There still might be a crash somewhere else because of the 0 size bitmap but hopefully this at least lets slurmctld progress further.
Comment 15 ARC Admins 2020-03-10 12:19:37 MDT
(In reply to Marshall Garey from comment #14)
> Created attachment 13327 [details]
> workaround v2
> 
> I missed something in the previous patch that I believe will still allow
> slurmctld to crash - the previous patch failed to unpack something that was
> packed. I believe this one will correctly handle a size 0 bitmap, which is
> what appears to have happened. Can you apply this one instead? There still
> might be a crash somewhere else because of the 0 size bitmap but hopefully
> this at least lets slurmctld progress further.

This reply is a few comments out of date, but I think it might be of some value:
I have a copy of the slurm.state directory prior to the upgrade saved.

I applied the patch and ran
  slurmctld -D

That resulted in this error:

slurmctld: debug3: create_mmap_buf: loaded file `/var/spool/slurm.state/node_state` as Buf
slurmctld: debug3: Version string in node_state header is PROTOCOL_VERSION
slurmctld: Recovered state of 407 nodes
slurmctld: Down nodes: gl3269
slurmctld: debug3: create_mmap_buf: loaded file `/var/spool/slurm.state/job_state` as Buf
slurmctld: debug3: Version string in job_state header is PROTOCOL_VERSION
slurmctld: debug3: Job id in job_state header is 4793557
slurmctld: error: unpackmem_xmalloc: Buffer to be unpacked is too large (1936994115 > 100000000)
slurmctld: error: Incomplete job record
slurmctld: fatal: Incomplete job state save file, start with '-i' to ignore this

I then ran:
  slurmctld -D -i 

It seems to be up and staying up, but my queue is empty, I suspect that is the outcome of the -i
Comment 16 Nate Rini 2020-03-10 12:21:22 MDT
(In reply to ARCTS Admins from comment #15)
> It seems to be up and staying up, but my queue is empty, I suspect that is
> the outcome of the -i

Correct, do you still want to try to recover those jobs or would you rather lose those jobs and stay online?
Comment 17 ARC Admins 2020-03-10 12:23:21 MDT
(In reply to Marshall Garey from comment #14)
> Created attachment 13327 [details]
> workaround v2
> 
> I missed something in the previous patch that I believe will still allow
> slurmctld to crash - the previous patch failed to unpack something that was
> packed. I believe this one will correctly handle a size 0 bitmap, which is
> what appears to have happened. Can you apply this one instead? There still
> might be a crash somewhere else because of the 0 size bitmap but hopefully
> this at least lets slurmctld progress further.

I will apply this patch. Also attached is the /var/spool/slurm* backup from prior to the upgrade.
Comment 18 ARC Admins 2020-03-10 12:24:24 MDT
Created attachment 13328 [details]
Slurm state files
Comment 19 ARC Admins 2020-03-10 12:25:04 MDT
(In reply to Nate Rini from comment #16)
> (In reply to ARCTS Admins from comment #15)
> > It seems to be up and staying up, but my queue is empty, I suspect that is
> > the outcome of the -i
> 
> Correct, do you still want to try to recover those jobs or would you rather
> lose those jobs and stay online?

Yes, if possible, we would like to restore the jobs that were in the queue prior to the update.
Comment 20 Nate Rini 2020-03-10 12:26:20 MDT
(In reply to ARCTS Admins from comment #18)
> Created attachment 13328 [details]
> Slurm state files

Please also attach your slurm.conf (& friends).
Comment 21 Nate Rini 2020-03-10 12:27:39 MDT
(In reply to ARCTS Admins from comment #19)
> (In reply to Nate Rini from comment #16)
> > (In reply to ARCTS Admins from comment #15)
> > > It seems to be up and staying up, but my queue is empty, I suspect that is
> > > the outcome of the -i
> > 
> > Correct, do you still want to try to recover those jobs or would you rather
> > lose those jobs and stay online?
> 
> Yes, if possible, we would like to restore the jobs from that were in the
> queue prior to the update.

You will need to do the following:
1. stop slurmctld
2. apply patch from comment #14
3. restore the state directory to before the upgrade
4. run slurmctld (without -i)

Please attach logs and bt if it crashes
Comment 23 ARC Admins 2020-03-10 12:38:52 MDT
Created attachment 13329 [details]
Slurm.conf and friends
Comment 24 ARC Admins 2020-03-10 12:43:06 MDT
(In reply to Nate Rini from comment #21)
> (In reply to ARCTS Admins from comment #19)
> > (In reply to Nate Rini from comment #16)
> > > (In reply to ARCTS Admins from comment #15)
> > > > It seems to be up and staying up, but my queue is empty, I suspect that is
> > > > the outcome of the -i
> > > 
> > > Correct, do you still want to try to recover those jobs or would you rather
> > > lose those jobs and stay online?
> > 
> > Yes, if possible, we would like to restore the jobs from that were in the
> > queue prior to the update.
> 
> You will need to do the following:
> 1. stop slurmctld
> 2. apply patch from comment #14
> 3. restore the state directory to before the upgrade
> 4. run slurmctld (without -i)
> 
> Please attach logs and bt if it crashes

I've applied the patch, followed the steps provided and slurmctld seems to be up and the queue is showing jobs.
Comment 25 Nate Rini 2020-03-10 12:45:36 MDT
(In reply to ARCTS Admins from comment #24)
> I've applied the patch, followed the steps provided and slurmctld seems to
> be up and the queue is showing jobs.

Great, feel free to remove the patch when convenient as it should only be used to read the old 18.08 state.

Lowering ticket severity. We have your state and can recreate the issue locally.
Comment 26 ARC Admins 2020-03-10 12:48:32 MDT
(In reply to Nate Rini from comment #25)
> (In reply to ARCTS Admins from comment #24)
> > I've applied the patch, followed the steps provided and slurmctld seems to
> > be up and the queue is showing jobs.
> 
> Great, feel free to remove the patch when convenient as it should only be
> used to read the old 18.08 state.
> 
> Lowering ticket severity. We have your state and can recreate the issue
> locally.

Thanks, I will stop the slurmctld service, revert to the un-patched version and will report back.
Comment 27 ARC Admins 2020-03-10 13:36:22 MDT
(In reply to ARCTS Admins from comment #26)
> (In reply to Nate Rini from comment #25)
> > (In reply to ARCTS Admins from comment #24)
> > > I've applied the patch, followed the steps provided and slurmctld seems to
> > > be up and the queue is showing jobs.
> > 
> > Great, feel free to remove the patch when convenient as it should only be
> > used to read the old 18.08 state.
> > 
> > Lowering ticket severity. We have your state and can recreate the issue
> > locally.
> 
> Thanks, I will stop the slurmctld service, revert to the un-patched version
> and will report back.

Running the un-patched 19.05.5 looks good, jobs showing up in the queue.
Comment 35 Marshall Garey 2020-03-30 09:01:49 MDT
It turns out this was already fixed in commit 6e94ef316. However, that fix is only in 20.02, since it changed a macro used in the protocol-layer code and we didn't want to make that sort of change in a stable version of Slurm. I also updated the documentation and the slurmctld -i log message to warn that it will throw out any unrecoverable data from the StateSaveLocation; that's in commit cc07bc9341 in 20.02.1.

I'm closing this ticket as resolved/fixed in 20.02.0 per commit 6e94ef316.