Ticket 9202 - extern step reports an unexplained oom event/possibly doesn't terminate
Summary: extern step reports an unexplained oom event/possibly doesn't terminate
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd (show other tickets)
Version: 20.02.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Duplicates: 9385
Depends on:
Blocks: 8656
Reported: 2020-06-09 08:02 MDT by Tim McMullan
Modified: 2020-09-22 08:47 MDT
CC: 8 users

See Also:
Site: LANL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.11pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Tim McMullan 2020-06-09 08:02:32 MDT
Several of us have seen extern steps that do not appear to complete, and in some cases they also report an OOM condition when none appears to have occurred.

Example of the spurious OOM report:
[2020-06-05T11:33:04.146] [130.batch] debug2: task_cgroup_memory_check_oom: oom stop msg write success.
[2020-06-05T11:33:04.146] [130.batch] debug2: task_cgroup_memory_check_oom: attempt to join oom_thread.
[2020-06-05T11:33:04.146] [130.batch] debug2: _oom_event_monitor: stop msg read.
[2020-06-05T11:33:04.146] [130.batch] debug:  _oom_event_monitor: No oom events detected.
[2020-06-05T11:33:04.146] [130.batch] debug:  _oom_event_monitor: stopping.
...
[2020-06-05T11:33:04.830] [130.extern] debug2: Rank 0 got all children completions
[2020-06-05T11:33:04.830] [130.extern] debug2: _one_step_complete_msg: first=0, last=1
[2020-06-05T11:33:04.831] [130.extern] _oom_event_monitor: oom-kill event count: 1
[2020-06-05T11:33:04.841] [130.extern] debug2:   false, shutdown
[2020-06-05T11:33:04.841] [130.extern] debug:  Message thread exited
[2020-06-05T11:33:04.841] [130.extern] done with job

I have replicated this on 18.08.8 and 20.02.3 on Debian, running kernels 4.19.0-8-amd64 and 5.4.0-0.bpo.4-amd64.
Comment 11 Marshall Garey 2020-07-14 10:07:26 MDT
*** Ticket 9385 has been marked as a duplicate of this ticket. ***
Comment 12 Todd Merritt 2020-07-14 10:25:12 MDT
I have a user that seems to replicate this reliably under 19.05.6. Is there any workaround that would allow them to run their job?
Comment 18 Nate Rini 2020-08-14 14:37:24 MDT
(In reply to Todd Merritt from comment #12)
> I have a user that seems to replicate this reliably under 19.05.6. Is there
> any workaround that would allow them to run their job?

We can replicate this issue too and are working on a patchset. Please note that Slurm doesn't kill the job when there is an OOM event in the extern step; it only notes the event in the job results.
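Since the OOM event is only noted in the job results, it surfaces through accounting rather than as a job kill. A usage sketch (the job ID is taken from the log excerpt above and is illustrative):

```shell
# Query accounting for the job seen in the logs above. A step whose
# OOM counter fired is reported with State=OUT_OF_MEMORY; the job
# itself is not killed because of it.
sacct -j 130 --format=JobID,JobName,State,ExitCode
```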
Comment 37 Nate Rini 2020-09-22 08:47:06 MDT
The patch to catch extern step OOM events is now upstream for 20.11:
> https://github.com/SchedMD/slurm/commit/f93b16670f3b07f6209099c24425036f9c54d136

Due to a slightly related issue, batch or extern step OOM events can cause other (numeric) steps to claim to have been OOMed. This is being debugged in bug#9737.

Thanks,
--Nate