Summary: | extern step reports an unexplained oom event/possibly doesn't terminate | ||
---|---|---|---|
Product: | Slurm | Reporter: | Tim McMullan <mcmullan> |
Component: | slurmstepd | Assignee: | Nate Rini <nate> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | alex, ezellma, felip.moll, fullop, marshall, sts, tim, tmerritt |
Version: | 20.02.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9714 https://bugs.schedmd.com/show_bug.cgi?id=9737 https://bugs.schedmd.com/show_bug.cgi?id=9760 |
Site: | LANL | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | | Version Fixed: | 20.11pre1 |
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Ticket Depends on: | |||
Ticket Blocks: | 8656 |
Description
Tim McMullan
2020-06-09 08:02:32 MDT
*** Ticket 9385 has been marked as a duplicate of this ticket. ***

I have a user that seems to replicate this reliably under 19.05.6. Is there any workaround that would allow them to run their job?

(In reply to Todd Merritt from comment #12)
> I have a user that seems to replicate this reliably under 19.05.6. Is there
> any workaround that would allow them to run their job?

We can replicate this issue too and are working on a patchset. Please note that Slurm doesn't kill the job when there is an OOM event in the extern step; it only notes the event in the job results.

The patch to catch extern step OOM events is now upstream for 20.11:
> https://github.com/SchedMD/slurm/commit/f93b16670f3b07f6209099c24425036f9c54d136

Due to a slightly related issue, batch or extern step OOM events can cause other (numeric) steps to claim to have been OOMed. This is being debugged in bug#9737.

Thanks,
--Nate
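
For anyone triaging an affected job, one way to see which steps were flagged is to scan the accounting records for the OUT_OF_MEMORY state. The following is only a minimal sketch, assuming the records were produced with something like `sacct -j <jobid> --format=JobID,State --parsable2 --noheader`. The helper name `oom_steps`, the job ID 1234, and the sample output are hypothetical and not taken from this ticket.

```python
def oom_steps(sacct_output):
    """Return the step IDs whose State column is OUT_OF_MEMORY.

    Expects pipe-separated "JobID|State" lines with no header row.
    """
    flagged = []
    for line in sacct_output.splitlines():
        line = line.strip()
        if not line:
            continue
        step_id, _, state = line.partition("|")
        # Some states carry a suffix (e.g. "CANCELLED by 1000"),
        # so compare only the first word of the State column.
        if state.split()[:1] == ["OUT_OF_MEMORY"]:
            flagged.append(step_id)
    return flagged


# Hypothetical sample for an invented job; not real output from this bug.
sample = """\
1234|COMPLETED
1234.batch|COMPLETED
1234.extern|OUT_OF_MEMORY
1234.0|COMPLETED
"""

print(oom_steps(sample))
```

In this invented sample only the extern step is flagged while the job itself shows as completed, matching the note above that an extern step OOM event is recorded but does not kill the job. Given the issue tracked in bug#9737, a flag on a numeric step should be treated with some caution.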