Ticket 9202

Summary: extern step reports an unexplained oom event/possibly doesn't terminate
Product: Slurm Reporter: Tim McMullan <mcmullan>
Component: slurmstepd    Assignee: Nate Rini <nate>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex, ezellma, felip.moll, fullop, marshall, sts, tim, tmerritt
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9714
https://bugs.schedmd.com/show_bug.cgi?id=9737
https://bugs.schedmd.com/show_bug.cgi?id=9760
Site: LANL
Version Fixed: 20.11pre1
Ticket Blocks: 8656    

Description Tim McMullan 2020-06-09 08:02:32 MDT
In a few cases some of us have seen extern steps that do not appear to complete, and in some cases they also report an OOM condition when it does not appear that one should have occurred.

Example of the OOM report:
[2020-06-05T11:33:04.146] [130.batch] debug2: task_cgroup_memory_check_oom: oom stop msg write success.
[2020-06-05T11:33:04.146] [130.batch] debug2: task_cgroup_memory_check_oom: attempt to join oom_thread.
[2020-06-05T11:33:04.146] [130.batch] debug2: _oom_event_monitor: stop msg read.
[2020-06-05T11:33:04.146] [130.batch] debug:  _oom_event_monitor: No oom events detected.
[2020-06-05T11:33:04.146] [130.batch] debug:  _oom_event_monitor: stopping.
...
[2020-06-05T11:33:04.830] [130.extern] debug2: Rank 0 got all children completions
[2020-06-05T11:33:04.830] [130.extern] debug2: _one_step_complete_msg: first=0, last=1
[2020-06-05T11:33:04.831] [130.extern] _oom_event_monitor: oom-kill event count: 1
[2020-06-05T11:33:04.841] [130.extern] debug2:   false, shutdown
[2020-06-05T11:33:04.841] [130.extern] debug:  Message thread exited
[2020-06-05T11:33:04.841] [130.extern] done with job

I have replicated this on 18.08.8 and 20.02.3 running Debian with kernels 4.19.0-8-amd64 and 5.4.0-0.bpo.4-amd64.
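
For context, the extern-step log lines above come from the task/cgroup memory plugin's OOM event monitor. Below is a minimal sketch of the cgroup v1 notification mechanism that kind of monitor relies on; it is illustrative only (not Slurm source) and assumes cgroup v1 with the memory controller mounted, sufficient permissions, and a made-up step cgroup path.

/*
 * Sketch of cgroup v1 OOM-event notification (illustrative, not Slurm code).
 * The kernel bumps an eventfd once per OOM event in the cgroup; a pipe
 * stands in for the "oom stop msg" seen in the logs above.
 */
#include <fcntl.h>
#include <inttypes.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	/* Illustrative path; the real plugin derives it from the job/step. */
	const char *cg =
		"/sys/fs/cgroup/memory/slurm/uid_1000/job_130/step_extern";
	char path[4096], buf[64];
	int efd, ocfd, ecfd, stop_pipe[2];
	uint64_t events = 0;

	if (pipe(stop_pipe) < 0)
		return 1;

	/* eventfd the kernel will increment on every OOM event. */
	efd = eventfd(0, EFD_CLOEXEC);

	/* memory.oom_control of the step cgroup being watched. */
	snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
	ocfd = open(path, O_RDONLY | O_CLOEXEC);

	/* Register "<eventfd> <oom_control fd>" with cgroup.event_control. */
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
	ecfd = open(path, O_WRONLY | O_CLOEXEC);
	snprintf(buf, sizeof(buf), "%d %d", efd, ocfd);
	if (efd < 0 || ocfd < 0 || ecfd < 0 ||
	    write(ecfd, buf, strlen(buf)) < 0) {
		perror("oom notification setup");
		return 1;
	}

	/* Wait for OOM events or the stop message.  In slurmstepd another
	 * thread writes the stop message at step teardown; a 60 s poll
	 * timeout keeps this standalone sketch from blocking forever. */
	while (1) {
		struct pollfd pfd[2] = {
			{ .fd = efd,          .events = POLLIN },
			{ .fd = stop_pipe[0], .events = POLLIN },
		};
		uint64_t n;
		int rc = poll(pfd, 2, 60000);

		if (rc <= 0)
			break;
		if ((pfd[0].revents & POLLIN) &&
		    read(efd, &n, sizeof(n)) == sizeof(n))
			events += n;	/* n new OOM events in the cgroup */
		if (pfd[1].revents & POLLIN)
			break;		/* "stop msg read" in the logs */
	}

	printf("oom-kill event count: %" PRIu64 "\n", events);
	return 0;
}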
Comment 11 Marshall Garey 2020-07-14 10:07:26 MDT
*** Ticket 9385 has been marked as a duplicate of this ticket. ***
Comment 12 Todd Merritt 2020-07-14 10:25:12 MDT
I have a user that seems to replicate this reliably under 19.05.6. Is there any workaround that would allow them to run their job?
Comment 18 Nate Rini 2020-08-14 14:37:24 MDT
(In reply to Todd Merritt from comment #12)
> I have a user that seems to replicate this reliably under 19.05.6. Is there
> any workaround that would allow them to run their job?

We can replicate this issue too and are working on a patchset. Please note that Slurm doesn't kill the job when there is an OOM event in the extern step but will instead only note it in the job results.
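
If it helps with triage, the kernel also keeps its own per-cgroup counter that can be checked independently of Slurm's accounting: on cgroup v1 with kernels >= 4.13, memory.oom_control exposes an "oom_kill" field. The helper below is hypothetical (not part of Slurm), and the cgroup path is illustrative.

/*
 * Hypothetical helper: read the kernel's oom_kill counter from a step's
 * cgroup v1 memory.oom_control file (field present on kernels >= 4.13).
 */
#include <stdio.h>
#include <string.h>

/* Return the oom_kill count for the cgroup, or -1 if unavailable. */
static long read_oom_kill_count(const char *cgroup_dir)
{
	char path[4096], key[64];
	long val, count = -1;
	FILE *fp;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	fp = fopen(path, "r");
	if (!fp)
		return -1;

	/* File contents: oom_kill_disable N, under_oom N, oom_kill N */
	while (fscanf(fp, "%63s %ld", key, &val) == 2)
		if (!strcmp(key, "oom_kill"))
			count = val;

	fclose(fp);
	return count;
}

int main(void)
{
	/* Illustrative path; adjust to the actual cgroup layout in use. */
	long n = read_oom_kill_count(
		"/sys/fs/cgroup/memory/slurm/uid_1000/job_130/step_extern");

	if (n < 0)
		fprintf(stderr, "oom_kill counter unavailable\n");
	else
		printf("kernel oom_kill count: %ld\n", n);
	return 0;
}

If that kernel counter stays at 0 while the extern step still reports "oom-kill event count: 1", that would point at the notification/accounting path rather than a real OOM kill.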
Comment 37 Nate Rini 2020-09-22 08:47:06 MDT
The patch to catch extern step OOM events is now upstream for 20.11:
> https://github.com/SchedMD/slurm/commit/f93b16670f3b07f6209099c24425036f9c54d136

Due to a slightly related issue, batch or extern step OOM events can cause other (numeric) steps to claim to have been OOMed. This is being debugged in bug#9737.

Thanks,
--Nate