Ticket 8285

Summary:	slurmctld crashes on startup, same as 6739, but updating to 18.08.9 doesn't help.
Product:	Slurm	Reporter:	Mikael Öhman <mikael.ohman>
Component:	slurmctld	Assignee:	Marshall Garey <marshall>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	bart
Version:	18.08.9
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=6739 https://bugs.schedmd.com/show_bug.cgi?id=8360 https://bugs.schedmd.com/show_bug.cgi?id=8651
Site:	SNIC	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	C3SE	Tzag Elita Sites:	---
Linux Distro:	CentOS	Machine Name:	Vera
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	workaround crash

Description Mikael Öhman 2020-01-03 13:56:17 MST

Was running slurmctld 18.08.3 when problem originated. Was trying out a few gres things and tried to restart slurmctld, which subsequently can't be started again. Symptoms were (and still are) identical as in 6739.

I saw this issue was supposedly resolved in 18.08.8, so I went to update to 18.08.9. 
This has not helped. Problems are identical, and system is still dead.

Output from 18.08.9:

...
slurmctld: Recovered JobId=885656 Assoc=15
slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed.


(gdb) backtrace
#0  0x00007ffff7258337 in raise () from /lib64/libc.so.6
#1  0x00007ffff7259a28 in abort () from /lib64/libc.so.6
#2  0x00007ffff7251156 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff7251202 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ffff7a781d1 in bit_nclear (b=b@entry=0xaed8c0, start=start@entry=0, stop=stop@entry=-1) at bitstring.c:292
#5  0x00007ffff7a7a787 in bit_unfmt_hexmask (bitmap=0xaed8c0, str=<optimized out>) at bitstring.c:1397
#6  0x00007ffff7a92a2d in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7fffffffd188, buffer=buffer@entry=0x8e1970, job_id=885657, 
    protocol_version=protocol_version@entry=8448) at gres.c:4316
#7  0x000000000045d609 in _load_job_state (buffer=buffer@entry=0x8e1970, protocol_version=<optimized out>) at job_mgr.c:1647
#8  0x0000000000460e73 in load_all_job_state () at job_mgr.c:1116
#9  0x000000000049c79c in read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1346
#10 0x000000000042bc02 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:664


885656 is one of my jobs that I was trying to use gpu types with the gres string "gpu:V100:1". 
I know which 3 jobs are probably the offending ones, and would be happy to clear them out from the queue, though this seems impossible.

Comment 2 Marshall Garey 2020-01-03 14:09:40 MST

Created attachment 12667 [details]
workaround crash

I think the fix in bug 6739 (https://github.com/SchedMD/slurm/commit/4c48a84a6edb) prevented the bug from happening that caused the segfault. But once the system is in that state I don't think the commit fixes that state. Tim had provided an initial workaround in bug 6739 to at least get the system running:

https://bugs.schedmd.com/attachment.cgi?id=9663&action=diff

I'm attaching a patch for the latest version of 18.08. This will at the very least move the crash to a later place, at best get slurmctld started so you can cancel the offending jobs.

Once slurmctld is running, could you cancel the jobs that are causing problems?

Comment 3 Mikael Öhman 2020-01-03 14:13:27 MST

Alright I will hack that into my slurmctld temporarily locally and try to get that up and running.

Comment 4 Marshall Garey 2020-01-03 14:27:08 MST

Sounds good. Please let us know if the patch works or fails as soon as you're able to test it.

Comment 5 Mikael Öhman 2020-01-03 14:47:57 MST

It seems to work, though, I changed it to "return 0;" because that what the other ticket suggested.

I don't see a good reason why not to include this fix (or an equivalent workaround). Without it, if bitsize is ever zero, it will just exit on an assert on the next line when it calls bit_nclear. Sure, it shouldn't occur and can't always check every value, but since it's known to have happened (at least twice now)? I fail to see much of a downside.

Comment 6 Marshall Garey 2020-01-03 15:30:29 MST

(In reply to Mikael Öhman from comment #5)
> It seems to work, though, I changed it to "return 0;" because that what the
> other ticket suggested.

Great! I'm moving this to a sev-3 then. (By the way, SLURM_SUCCESS is a macro that equates to 0, so it's functionally the same.)

> I don't see a good reason why not to include this fix (or an equivalent
> workaround). Without it, if bitsize is ever zero, it will just exit on an
> assert on the next line when it calls bit_nclear. Sure, it shouldn't occur
> and can't always check every value, but since it's known to have happened
> (at least twice now)? I fail to see much of a downside.

Our reasoning is from a developer's point of view:

* If you expect that it's possible for something to be NULL, use an if to prevent a NULL dereference (or out of bounds check, or some similar kind of undefined behavior).
* If you expect that something must always be true, use an assert so that the system crashes and you have a backtrace. Having a backtrace is desirable to understand and fix the underlying issue to prevent the crash. A band-aid solution may prevent a crash right now, but it also means that there is a bug lurking that may cause a problem elsewhere. This kind of bug can be extremely difficult to find and fix without a backtrace.

I've seen the asserts in bitstring.c catch several different bugs.

Slurm follows this philosophy in a lot of places, not just in bitstring.c.

Does that make sense?

That said, I'm looking to see if there is a better workaround (like in the caller rather than the callee).

Comment 7 Marshall Garey 2020-01-03 15:33:54 MST

(In reply to Marshall Garey from comment #6)
> Our reasoning is from a developer's point of view:
> 
> * If you expect that it's possible for something to be NULL, use an if to
> prevent a NULL dereference (or out of bounds check, or some similar kind of
> undefined behavior).
> * If you expect that something must always be true, use an assert so that
> the system crashes and you have a backtrace. Having a backtrace is desirable
> to understand and fix the underlying issue to prevent the crash. A band-aid
> solution may prevent a crash right now, but it also means that there is a
> bug lurking that may cause a problem elsewhere. This kind of bug can be
> extremely difficult to find and fix without a backtrace.

I forgot to add that assert()'s are only active in developer mode. So without them the code will just bypass the assert() and (hopefully) crash/segfault and produce a backtrace normally.

Comment 14 Mikael Öhman 2020-01-07 04:07:51 MST

(i thought i replied already, but I must have missed actually pressing send)

The function does have a return value for errors, though I suspect the return value isn't respected in the caller anyway.

I understand your rationale, but at the same time, it's a pretty harsh bug; since nothing can be started, nothing can be fixed, even if I was prepared to drop the offending job. Either way, I consider this fixed, so unless 

> That said, I'm looking to see if there is a better workaround (like in the caller rather than the callee).

you wish to look into this more, it can be closed.

Comment 16 Marshall Garey 2020-03-30 10:48:48 MDT

I'm finally closing this one. We haven't seen it in 19.05 at all so far, I haven't been able to reproduce it on my own, and without that I don't think I can justify a change to the 19.05 and 20.02 protocol code. If we do ever see it again we will have this bug and the other see also bugs as extra points of data.