In a few cases some of us have seen extern steps that do not appear to complete, and in some cases they also report an OOM condition when it does not look like one should have occurred.

Example of the OOM report:

[2020-06-05T11:33:04.146] [130.batch] debug2: task_cgroup_memory_check_oom: oom stop msg write success.
[2020-06-05T11:33:04.146] [130.batch] debug2: task_cgroup_memory_check_oom: attempt to join oom_thread.
[2020-06-05T11:33:04.146] [130.batch] debug2: _oom_event_monitor: stop msg read.
[2020-06-05T11:33:04.146] [130.batch] debug: _oom_event_monitor: No oom events detected.
[2020-06-05T11:33:04.146] [130.batch] debug: _oom_event_monitor: stopping.
...
[2020-06-05T11:33:04.830] [130.extern] debug2: Rank 0 got all children completions
[2020-06-05T11:33:04.830] [130.extern] debug2: _one_step_complete_msg: first=0, last=1
[2020-06-05T11:33:04.831] [130.extern] _oom_event_monitor: oom-kill event count: 1
[2020-06-05T11:33:04.841] [130.extern] debug2: false, shutdown
[2020-06-05T11:33:04.841] [130.extern] debug: Message thread exited
[2020-06-05T11:33:04.841] [130.extern] done with job

I have replicated this on 18.08.8 and 20.02.3 on Debian, with kernels 4.19.0-8-amd64 and 5.4.0-0.bpo.4-amd64.
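For anyone trying to confirm the same pattern on their own nodes, a rough way to see which step reported the OOM is to scan the slurmd log for the `oom-kill event count` message. The following is only a sketch: the regular expression is modelled on the log lines quoted in this report, and the helper name `oom_events_by_step` is made up for illustration. Other Slurm versions may format these lines differently.

```python
import re

# Pattern modelled on the log excerpt in this report (an assumption, not a
# documented format). Example of a matching line:
# [2020-06-05T11:33:04.831] [130.extern] _oom_event_monitor: oom-kill event count: 1
OOM_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+"                 # timestamp
    r"\[(?P<job>\d+)\.(?P<step>[^\]]+)\]\s+"  # job id and step name
    r"(?:debug\d*:\s+)?"                      # optional debug level prefix
    r"_oom_event_monitor: oom-kill event count: (?P<count>\d+)"
)


def oom_events_by_step(lines):
    """Return {(job_id, step): highest oom-kill count seen} for the given log lines."""
    counts = {}
    for line in lines:
        match = OOM_LINE.search(line)
        if match is None:
            continue  # not an oom-kill count line
        key = (int(match.group("job")), match.group("step"))
        counts[key] = max(counts.get(key, 0), int(match.group("count")))
    return counts


# Lines copied from the excerpt in this report.
sample = [
    "[2020-06-05T11:33:04.146] [130.batch] debug: _oom_event_monitor: No oom events detected.",
    "[2020-06-05T11:33:04.830] [130.extern] debug2: Rank 0 got all children completions",
    "[2020-06-05T11:33:04.831] [130.extern] _oom_event_monitor: oom-kill event count: 1",
    "[2020-06-05T11:33:04.841] [130.extern] done with job",
]

if __name__ == "__main__":
    for (job, step), count in sorted(oom_events_by_step(sample).items()):
        print(f"job {job} step {step}: oom-kill event count {count}")
```

To run it against a real log, pass an open file object instead of `sample`. The log path depends on the site's `SlurmdLogFile` setting.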
*** Ticket 9385 has been marked as a duplicate of this ticket. ***
I have a user that seems to replicate this reliably under 19.05.6. Is there any workaround that would allow them to run their job?
(In reply to Todd Merritt from comment #12)
> I have a user that seems to replicate this reliably under 19.05.6. Is there
> any workaround that would allow them to run their job?

We can replicate this issue too and are working on a patchset. Please note that Slurm doesn't kill the job when there is an OOM event in the extern step but will instead only note it in the job results.
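To check how a given job's steps were recorded, one option is to list the per-step state with `sacct`. The sketch below assumes the OOM note shows up as `OUT_OF_MEMORY` in the `State` column of the accounting records. That is an assumption about where "the job results" are surfaced, and the helper names are hypothetical.

```python
import subprocess


def sacct_command(job_id):
    """Build a sacct call that prints one pipe-delimited row per job step."""
    return [
        "sacct",
        "-j", str(job_id),
        "--format=JobID,State,ExitCode",
        "--parsable2",   # pipe-delimited, no trailing delimiter
        "--noheader",
    ]


def parse_sacct(output):
    """Parse 'JobID|State|ExitCode' rows into a list of dicts."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        step, state, exit_code = line.split("|")
        rows.append({"step": step, "state": state, "exit_code": exit_code})
    return rows


def oom_steps(rows):
    """Return the step ids whose state is OUT_OF_MEMORY (assumed marker)."""
    return [row["step"] for row in rows if row["state"] == "OUT_OF_MEMORY"]


def steps_for_job(job_id):
    """Run sacct for one job; requires Slurm accounting to be enabled."""
    result = subprocess.run(
        sacct_command(job_id), check=True, capture_output=True, text=True
    )
    return parse_sacct(result.stdout)
```

On a cluster with accounting enabled, `oom_steps(steps_for_job(130))` would list the steps of job 130 flagged this way. If only the extern step appears, that matches the behaviour described above, where the job itself is not killed.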
The patch to catch extern step OOM events is now upstream for 20.11:
> https://github.com/SchedMD/slurm/commit/f93b16670f3b07f6209099c24425036f9c54d136

Due to a slightly related issue, batch or extern step OOM events can cause other (numeric) steps to claim to have been OOMed. This is being debugged in bug#9737.

Thanks,
--Nate