Summary: | slurmctld abrt (slurm_xmalloc/hostlist_ranged_string_xmalloc_dims) | |
---|---|---|---
Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | bart, regine.gaudin |
Version: | 18.08.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6659 | ||
Site: | Stanford | Machine Name: | Sherlock
Attachments: | "thread apply all bt full" output; slurmctld log
Description
Kilian Cavalotti
2019-04-17 21:45:55 MDT
Created attachment 9943 [details]
"thread apply all bt full" output
Here's the output of "thread apply all bt full".
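For reference, a dump like this is typically produced by pointing gdb at the slurmctld binary and the core file it left behind; a minimal sketch (the binary and core paths here are assumptions, adjust to the local install):

```
# Assumed paths; point gdb at the actual slurmctld binary and core file.
gdb -batch \
    -ex "set pagination off" \
    -ex "thread apply all bt full" \
    /usr/sbin/slurmctld /var/spool/slurmctld/core.13973 > bt_full.txt
```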
Marshall Garey

Thanks for the bug report. I'm looking into it. Can you let us know if it aborts again?

Kilian Cavalotti

(In reply to Marshall Garey from comment #2)
> Thanks for the bug report. I'm looking into it. Can you let us know if it
> aborts again?

Will do, and thanks! It has not aborted again so far.

Cheers,
--
Kilian

Marshall Garey

Can you upload the slurmctld log file leading up to and including that abort? Because the abort happens inside of a malloc() call, I'd like to get some context about what was happening. Maybe there was heap corruption somewhere?

Kilian Cavalotti

Created attachment 10037 [details]
slurmctld log
Sure! Here it is. The crash happened on Apr 17 20:36:18.
Cheers,
--
Kilian
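Since the abort surfaced inside a malloc() call and heap corruption elsewhere in the daemon is the leading suspect, one generic way to make the corruption show up earlier is to run the daemon in the foreground with glibc's malloc consistency checks, or under valgrind on a test controller. This is a sketch of standard glibc/valgrind tooling, not a procedure from this ticket, and the binary path is an assumption:

```
# glibc's MALLOC_CHECK_=3 prints a diagnostic and aborts as soon as heap
# corruption is detected, instead of at some later allocation.
MALLOC_CHECK_=3 /usr/sbin/slurmctld -D

# On a test controller only (far too slow for production), valgrind's memcheck
# reports the exact invalid heap write.
valgrind --tool=memcheck --error-limit=no /usr/sbin/slurmctld -D
```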
Marshall Garey

Probably unrelated error messages:

There are a lot of these. Internal bug 6659 - we haven't fixed it yet, but we know about it (assigned to me).

Apr 17 20:35:59 sh-sl01 slurmctld[13973]: error: _remove_accrue_time_internal: QOS owner user 312359 accrue_cnt underflow

I'll mention it to Kilian, but most likely he's already on top of this:

Apr 18 03:20:50 sh-sl01 slurmctld[174836]: error: Node sh-106-30 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

There are a lot of these. Internal bug 6769 - we also haven't fixed it yet, but we're aware of it (also assigned to me).

Apr 16 19:13:33 sh-sl01 slurmctld[13973]: error: select/cons_res: node sh-102-27 memory is under-allocated (63036-65536) for JobId=40525305_394(40674029)

I'll get those fixed so we can clean up your slurmctld log while I also look for whatever is causing this abort in malloc - Googling for sigabrt in malloc indicates that it's most likely heap corruption somewhere else in the program.

Kilian Cavalotti

Hi Marshall,

(In reply to Marshall Garey from comment #6)
> There are a lot of these. Internal bug 6659 - we haven't fixed it yet, but
> know about it (assigned to me).
> Apr 17 20:35:59 sh-sl01 slurmctld[13973]: error:
> _remove_accrue_time_internal: QOS owner user 312359 accrue_cnt underflow

Great to know!

> I'll mention it to Kilian, but most likely he's already on top of this:
> Apr 18 03:20:50 sh-sl01 slurmctld[174836]: error: Node sh-106-30 appears to
> have a different slurm.conf than the slurmctld. This could cause issues
> with communication and functionality. Please review both files and make
> sure they are the same. If this is expected ignore, and set
> DebugFlags=NO_CONF_HASH in your slurm.conf.

Yes, it can happen at times that individual compute nodes get a slightly out-of-sync config file when we change things. It usually doesn't last.

> There are a lot of these. Internal bug 6769 - also haven't fixed it yet, but
> we're aware of it (also assigned to me).
> Apr 16 19:13:33 sh-sl01 slurmctld[13973]: error: select/cons_res: node
> sh-102-27 memory is under-allocated (63036-65536) for
> JobId=40525305_394(40674029)
>
> I'll get those fixed so we can clean up your slurmctld log

Great, that would be awesome, thanks!

> while I also look
> for whatever is causing this abort in malloc - Googling for sigabrt in
> malloc indicates that it's most likely heap corruption somewhere else in the
> program.

Got it. And aborts are pretty rare too, so that would make sense.

Cheers,
--
Kilian

Marshall Garey

Hi Kilian,

Have there been any more aborts of this nature?

I haven't made any progress on this bug specifically. However, there are proposed fixes for both issues I mentioned (accrue_cnt underflow and memory under-allocated errors).

- Marshall

Kilian Cavalotti

Hi Marshall,

(In reply to Marshall Garey from comment #8)
> Have there been any more aborts of this nature?

Not in a while, no.

> I haven't made any progress on this bug specifically. However, there are
> proposed fixes for both issues I mentioned (accrue_cnt underflow and memory
> under-allocated errors).

Nice, thanks for the update!

Cheers,
--
Kilian
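Regarding the slurm.conf hash warning quoted in comment #6 above: a quick way to confirm whether a node's copy has drifted from the controller's is to compare checksums, and the message itself says the warning can be silenced with DebugFlags=NO_CONF_HASH when the divergence is expected. A sketch, assuming the default config path and using the hosts named in the log:

```
# Assumed config path (/etc/slurm/slurm.conf); adjust to the local install.
md5sum /etc/slurm/slurm.conf                  # on the controller (sh-sl01)
ssh sh-106-30 md5sum /etc/slurm/slurm.conf    # on the node flagged in the log

# If out-of-sync configs are expected, the warning can be suppressed by adding
# the option quoted in the error message to slurm.conf:
#   DebugFlags=NO_CONF_HASH
```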
Regine Gaudin

Hi,

I'm updating this bug as CEA is also encountering the memory under-allocated errors you mentioned (bug 6769), which are filling slurmctld.log:

error: select/cons_res: node machine1234 memory is under-allocated (0-188800) for JobID=XXXXXX

As you wrote that "there are proposed fixes for both issues I mentioned (accrue_cnt underflow and memory under-allocated errors)", I'd like to let you know that CEA would also be interested in the proposed fixes. The Slurm controller is 18.08.06 and the clients are on 17.11.6, but they will be upgraded to 18.08.06 soon.

Thanks,
Regine

Marshall Garey

Regine - the patches for both bugs are pending internal QA/review. They'll both definitely be in 19.05, and probably will both be in 18.08. Although I hope they'll both be in the next tag, I can't promise that. If you'd like patches provided before they're in the public repo, can you create a new ticket for that?

Kilian - we've been looking into adding address sanitizer into our QA toolbox. I'm hopeful it can help me find possible heap corruption bugs, although I think it will be hard to tell if whatever I find/fix (assuming I do find something) is the exact same bug that you encountered. I'd like to keep this bug open for a bit longer to give me time to look into it more before I close it. Let me know if the slurmctld hits an abort like this again.

Tangentially, we are taking tangible steps at improving our QA. It is, unfortunately, a long process. ;)

Kilian Cavalotti

Hi Marshall,

(In reply to Marshall Garey from comment #11)
> Kilian - we've been looking into adding address sanitizer into our QA
> toolbox. I'm hopeful it can help me find possible heap corruption bugs,
> although I think it will be hard to tell if whatever I find/fix (assuming I
> do find something) is the exact same bug that you encountered. I'd like to
> keep this bug open for a bit longer to give me time to look into it more
> before I close it. Let me know if the slurmctld hits an abort like this
> again.

I will!

> Tangentially, we are taking tangible steps at improving our QA. It is,
> unfortunately, a long process. ;)

This is much much much appreciated, thanks!

Cheers,
--
Kilian

Marshall Garey

Hi Kilian,

I haven't managed to reproduce this at all, and I haven't heard about it happening again, so I'm going to close it for now as cannot reproduce. Please re-open it if you see this abort again.

In other news, bug 6769 (memory underallocated errors) is closed with patches upstream; the patches in bug 6659 (accrue errors) are still pending review.

- Marshall

Kilian Cavalotti

Hi Marshall,

(In reply to Marshall Garey from comment #13)
> I haven't managed to reproduce this at all, and I haven't heard about it
> happening again, so I'm going to close it for now as cannot reproduce.
> Please re-open it if you see this abort again.

Will do, no problem. Thanks for looking into it anyway!

> In other news, bug 6769 (memory underallocated errors) is closed with
> patches upstream; the patches in bug 6659 (accrue errors) are still pending
> review.

Thanks for the update, I appreciate the follow-up.

Cheers,
--
Kilian
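As a footnote to the address-sanitizer plan mentioned in comment #11: a minimal sketch of how an autoconf-based build like Slurm's could be instrumented with AddressSanitizer for test runs. These are generic GCC/Clang and autoconf flags, not a documented Slurm build recipe, and the install path is an assumption:

```
# Build with ASan instrumentation; -O1 -g keeps the reports readable.
./configure CFLAGS="-O1 -g -fsanitize=address -fno-omit-frame-pointer" \
            LDFLAGS="-fsanitize=address"
make -j"$(nproc)" && make install

# Run the instrumented controller in the foreground; ASan reports the first
# invalid heap access directly, rather than a later abort inside malloc().
ASAN_OPTIONS=abort_on_error=1 /usr/sbin/slurmctld -D    # assumed install path
```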