Ticket 9385 - unexplained _oom_event_monitor: oom-kill event count: 1 event
Summary: unexplained _oom_event_monitor: oom-kill event count: 1 event
Status: RESOLVED DUPLICATE of ticket 9202
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 19.05.6
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-07-13 07:18 MDT by Todd Merritt
Modified: 2020-07-14 10:07 MDT
CC List: 1 user

See Also:
Site: U of AZ
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Todd Merritt 2020-07-13 07:18:47 MDT
Hi, we have a user job that is triggering

[2020-07-11T20:30:11.321] [42688.extern] _oom_event_monitor: oom-kill event count: 1

However, I cannot find any OOM events in the system messages file. The user believes this job ran fine earlier in testing, which may have been before cgroups were fully set up. Looking at the job in question, it doesn't look like it exceeded the requested memory either.

root@ericidle:~ # sacct -j 42688 --parsable2 --format jobid,account,user,exitcode,state,reqmem,MaxRSS,MaxRSSNode,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask,jobname --state FAILED
JobID|Account|User|ExitCode|State|ReqMem|MaxRSS|MaxRSSNode|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|JobName
42688|windfall|mmazloff|9:0|FAILED|5Gc||||||SCCM_fwd
42688.batch|windfall||9:0|FAILED|5Gc|16172K|r1u12n2|3174408K|r1u12n2|0|batch
42688.extern|windfall||0:0|COMPLETED|5Gc|1152K|r1u15n1|142612K|r1u16n1|2|extern
42688.0|windfall||0:9|CANCELLED by 25123|5Gc|3050228K|r1u29n1|15955136K|r1u27n2|1000|mitgcmuv
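
For completeness, the node-side log checks that came up empty were along these lines (commands are illustrative; a standard syslog setup is assumed):

[root@r1u12n2 ~]# grep -Ei 'oom|killed process' /var/log/messages
[root@r1u12n2 ~]# dmesg -T | grep -Ei 'out of memory|oom-kill'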

Our cgroup config is

[root@r1u07n1 ~]# cat /etc/slurm/cgroup.conf 
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

#ConstrainDevices=yes
#TaskAffinity=yes
#TaskAffinity=no

with task affinity enabled in slurm.conf.
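
For reference, the corresponding slurm.conf setting looks roughly like this (paraphrased; task/cgroup is what enforces the Constrain* options above):

TaskPlugin=task/affinity,task/cgroup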

Here are the slurmd.log entries for this job

[root@r1u12n2 slurm]# zgrep 42688 slurmd.log-20200712.gz 
[2020-07-11T20:16:19.206] task_p_slurmd_batch_request: 42688
[2020-07-11T20:16:19.206] task/affinity: job 42688 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFF0
[2020-07-11T20:16:19.206] task/affinity: job 42688 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFF0
[2020-07-11T20:16:19.209] _run_prolog: prolog with lock for job 42688 ran for 0 seconds
[2020-07-11T20:16:19.231] [42688.extern] task/cgroup: /slurm/uid_25123/job_42688: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:19.239] [42688.extern] task/cgroup: /slurm/uid_25123/job_42688/step_extern: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:20.497] Launching batch job 42688 for UID 25123
[2020-07-11T20:16:20.503] [42688.batch] task/cgroup: /slurm/uid_25123/job_42688: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:20.509] [42688.batch] task/cgroup: /slurm/uid_25123/job_42688/step_batch: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:20.552] [42688.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.106] launch task 42688.0 request from UID:25123 GID:340 HOST:10.141.16.34 PORT:57994
[2020-07-11T20:16:21.106] lllp_distribution jobid [42688] auto binding off: mask_cpu,one_thread
[2020-07-11T20:16:21.179] [42688.0] task/cgroup: /slurm/uid_25123/job_42688: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:21.185] [42688.0] task/cgroup: /slurm/uid_25123/job_42688/step_0: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:21.278] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.278] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.278] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.278] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.280] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.284] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.284] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.284] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.284] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.285] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.285] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.285] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.285] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.289] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.294] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.295] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.295] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.295] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.298] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.298] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.298] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.301] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.301] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.301] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.304] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.304] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.304] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.305] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.304] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.307] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.307] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.310] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.310] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.310] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.310] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.313] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.315] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.315] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.315] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.315] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.316] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.317] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.318] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.320] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.320] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.321] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.321] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.324] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.324] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.324] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.324] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.326] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.326] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.326] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.329] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.329] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.329] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.329] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.332] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.332] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.332] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.335] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.336] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.336] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.336] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.336] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.339] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.339] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.339] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.342] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.342] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.344] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.344] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:29:54.665] [42688.0] error: *** STEP 42688.0 ON r1u12n2 CANCELLED AT 2020-07-11T20:29:54 ***
[2020-07-11T20:30:11.298] [42688.0] done with job
[2020-07-11T20:30:11.306] [42688.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:35072
[2020-07-11T20:30:11.310] [42688.batch] done with job
[2020-07-11T20:30:11.321] [42688.extern] _oom_event_monitor: oom-kill event count: 1
[2020-07-11T20:30:25.915] [42688.extern] done with job

Thanks!
Comment 4 Marshall Garey 2020-07-14 10:07:26 MDT
Hi Todd,

Thanks for the information. It turns out we've already encountered this bug and have bug 9202 open to handle it. It was originally private, but it's public now, and I've marked bug 9202 comment 0 as public.

About this bug: we think there weren't actually any OOM events, but that there is a bug in the extern slurmstepd. We have reproduced it, but not consistently.
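
For what it's worth, the kernel's own OOM accounting for the extern step cgroup can be read directly to cross-check the count, though only while the step is still running (the cgroup is removed at step end). On a cgroup v1 system it looks something like this (mount point may differ, and the oom_kill counter only appears on newer kernels):

cat /sys/fs/cgroup/memory/slurm/uid_25123/job_42688/step_extern/memory.oom_control
oom_kill_disable 0
under_oom 0
oom_kill 0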

Feel free to post comments or questions on bug 9202.

I'm marking this bug as a duplicate of bug 9202.

*** This ticket has been marked as a duplicate of ticket 9202 ***