Summary: | slurmctld crashes on startup, same as 6739, but updating to 18.08.9 doesn't help. | ||
---|---|---|---|
Product: | Slurm | Reporter: | Mikael Öhman <mikael.ohman> |
Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | bart |
Version: | 18.08.9 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: |
https://bugs.schedmd.com/show_bug.cgi?id=6739 https://bugs.schedmd.com/show_bug.cgi?id=8360 https://bugs.schedmd.com/show_bug.cgi?id=8651 |
||
Site: | SNIC | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | C3SE | Tzag Elita Sites: | --- |
Linux Distro: | CentOS | Machine Name: | Vera |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Attachments: | workaround crash |
Description
Mikael Öhman
2020-01-03 13:56:17 MST
Created attachment 12667 [details] workaround crash I think the fix in bug 6739 (https://github.com/SchedMD/slurm/commit/4c48a84a6edb) prevented the bug from happening that caused the segfault. But once the system is in that state I don't think the commit fixes that state. Tim had provided an initial workaround in bug 6739 to at least get the system running: https://bugs.schedmd.com/attachment.cgi?id=9663&action=diff I'm attaching a patch for the latest version of 18.08. This will at the very least move the crash to a later place, at best get slurmctld started so you can cancel the offending jobs. Once slurmctld is running, could you cancel the jobs that are causing problems? Alright I will hack that into my slurmctld temporarily locally and try to get that up and running. Sounds good. Please let us know if the patch works or fails as soon as you're able to test it. It seems to work, though, I changed it to "return 0;" because that what the other ticket suggested. I don't see a good reason why not to include this fix (or an equivalent workaround). Without it, if bitsize is ever zero, it will just exit on an assert on the next line when it calls bit_nclear. Sure, it shouldn't occur and can't always check every value, but since it's known to have happened (at least twice now)? I fail to see much of a downside. (In reply to Mikael Öhman from comment #5) > It seems to work, though, I changed it to "return 0;" because that what the > other ticket suggested. Great! I'm moving this to a sev-3 then. (By the way, SLURM_SUCCESS is a macro that equates to 0, so it's functionally the same.) > I don't see a good reason why not to include this fix (or an equivalent > workaround). Without it, if bitsize is ever zero, it will just exit on an > assert on the next line when it calls bit_nclear. Sure, it shouldn't occur > and can't always check every value, but since it's known to have happened > (at least twice now)? I fail to see much of a downside. Our reasoning is from a developer's point of view: * If you expect that it's possible for something to be NULL, use an if to prevent a NULL dereference (or out of bounds check, or some similar kind of undefined behavior). * If you expect that something must always be true, use an assert so that the system crashes and you have a backtrace. Having a backtrace is desirable to understand and fix the underlying issue to prevent the crash. A band-aid solution may prevent a crash right now, but it also means that there is a bug lurking that may cause a problem elsewhere. This kind of bug can be extremely difficult to find and fix without a backtrace. I've seen the asserts in bitstring.c catch several different bugs. Slurm follows this philosophy in a lot of places, not just in bitstring.c. Does that make sense? That said, I'm looking to see if there is a better workaround (like in the caller rather than the callee). (In reply to Marshall Garey from comment #6) > Our reasoning is from a developer's point of view: > > * If you expect that it's possible for something to be NULL, use an if to > prevent a NULL dereference (or out of bounds check, or some similar kind of > undefined behavior). > * If you expect that something must always be true, use an assert so that > the system crashes and you have a backtrace. Having a backtrace is desirable > to understand and fix the underlying issue to prevent the crash. A band-aid > solution may prevent a crash right now, but it also means that there is a > bug lurking that may cause a problem elsewhere. This kind of bug can be > extremely difficult to find and fix without a backtrace. I forgot to add that assert()'s are only active in developer mode. So without them the code will just bypass the assert() and (hopefully) crash/segfault and produce a backtrace normally. (i thought i replied already, but I must have missed actually pressing send)
The function does have a return value for errors, though I suspect the return value isn't respected in the caller anyway.
I understand your rationale, but at the same time, it's a pretty harsh bug; since nothing can be started, nothing can be fixed, even if I was prepared to drop the offending job. Either way, I consider this fixed, so unless
> That said, I'm looking to see if there is a better workaround (like in the caller rather than the callee).
you wish to look into this more, it can be closed.
I'm finally closing this one. We haven't seen it in 19.05 at all so far, I haven't been able to reproduce it on my own, and without that I don't think I can justify a change to the 19.05 and 20.02 protocol code. If we do ever see it again we will have this bug and the other see also bugs as extra points of data. |