Hi,

We have a user job that is triggering

[2020-07-11T20:30:11.321] [42688.extern] _oom_event_monitor: oom-kill event count: 1

However, I cannot find any OOM events in the system messages file. The user believes that this job ran fine early in testing, which may have been before cgroups were fully set up. Looking at the job in question, it doesn't look like it exceeded the requested memory either.

root@ericidle:~ # sacct -j 42688 --parsable2 --format jobid,account,user,exitcode,state,reqmem,MaxRSS,MaxRSSNode,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask,jobname --state FAILED
JobID|Account|User|ExitCode|State|ReqMem|MaxRSS|MaxRSSNode|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|JobName
42688|windfall|mmazloff|9:0|FAILED|5Gc||||||SCCM_fwd
42688.batch|windfall||9:0|FAILED|5Gc|16172K|r1u12n2|3174408K|r1u12n2|0|batch
42688.extern|windfall||0:0|COMPLETED|5Gc|1152K|r1u15n1|142612K|r1u16n1|2|extern
42688.0|windfall||0:9|CANCELLED by 25123|5Gc|3050228K|r1u29n1|15955136K|r1u27n2|1000|mitgcmuv

Our cgroup config is

[root@r1u07n1 ~]# cat /etc/slurm/cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
#ConstrainDevices=yes
#TaskAffinity=yes
#TaskAffinity=no

with task affinity enabled in slurm.conf.
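
As a rough cross-check of the sacct figures above, here is a minimal Python sketch that parses the --parsable2 output and compares each step's MaxRSS with the 5Gc request. It rests on several assumptions: the K/M/G suffixes are treated as 1024-based, a bare number is treated as bytes, and each task is taken to run on one core so that a per-task MaxRSS can be compared with a per-core ReqMem. The helper names (to_bytes, parse_reqmem, steps_vs_request) are made up for illustration and are not part of Slurm.

```python
# Sketch only: compares per-step MaxRSS with ReqMem from `sacct --parsable2`.
# Assumptions: 1024-based suffixes, bare numbers are bytes, one core per task.

SACCT_OUTPUT = """\
JobID|Account|User|ExitCode|State|ReqMem|MaxRSS|MaxRSSNode|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|JobName
42688|windfall|mmazloff|9:0|FAILED|5Gc||||||SCCM_fwd
42688.batch|windfall||9:0|FAILED|5Gc|16172K|r1u12n2|3174408K|r1u12n2|0|batch
42688.extern|windfall||0:0|COMPLETED|5Gc|1152K|r1u15n1|142612K|r1u16n1|2|extern
42688.0|windfall||0:9|CANCELLED by 25123|5Gc|3050228K|r1u29n1|15955136K|r1u27n2|1000|mitgcmuv
"""

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a size such as '3050228K' to bytes; an empty field gives None."""
    if not size:
        return None
    if size[-1] in UNITS:
        return int(float(size[:-1]) * UNITS[size[-1]])
    return int(size)  # assumption: no suffix means bytes


def parse_reqmem(reqmem):
    """Split '5Gc' into (bytes, scope); scope is 'c' (per core), 'n' (per node) or ''."""
    scope = reqmem[-1] if reqmem[-1] in "cn" else ""
    value = reqmem[:-1] if scope else reqmem
    return to_bytes(value), scope


def steps_vs_request(text):
    """Return (JobID, MaxRSS bytes, ReqMem bytes, exceeded?) for rows that report MaxRSS."""
    lines = text.strip().splitlines()
    header = lines[0].split("|")
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        rss = to_bytes(row["MaxRSS"])
        if rss is None:  # the job-level row carries no MaxRSS
            continue
        req, _scope = parse_reqmem(row["ReqMem"])
        rows.append((row["JobID"], rss, req, rss > req))
    return rows


if __name__ == "__main__":
    for job_id, rss, req, exceeded in steps_vs_request(SACCT_OUTPUT):
        print(f"{job_id}: MaxRSS {rss / 1024**3:.2f} GiB, "
              f"ReqMem {req / 1024**3:.2f} GiB per core, exceeded={exceeded}")
```

Under those assumptions none of the three steps is flagged; the largest value, 3050228K on step 42688.0, is a little under 3 GiB. As far as I understand, MaxRSS is gathered by periodic sampling, so a short spike between samples would not show up in these numbers.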

Here are the slurmd.log entries for this job

[root@r1u12n2 slurm]# zgrep 42688 slurmd.log-20200712.gz
[2020-07-11T20:16:19.206] task_p_slurmd_batch_request: 42688
[2020-07-11T20:16:19.206] task/affinity: job 42688 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFF0
[2020-07-11T20:16:19.206] task/affinity: job 42688 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFF0
[2020-07-11T20:16:19.209] _run_prolog: prolog with lock for job 42688 ran for 0 seconds
[2020-07-11T20:16:19.231] [42688.extern] task/cgroup: /slurm/uid_25123/job_42688: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:19.239] [42688.extern] task/cgroup: /slurm/uid_25123/job_42688/step_extern: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:20.497] Launching batch job 42688 for UID 25123
[2020-07-11T20:16:20.503] [42688.batch] task/cgroup: /slurm/uid_25123/job_42688: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:20.509] [42688.batch] task/cgroup: /slurm/uid_25123/job_42688/step_batch: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:20.552] [42688.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.106] launch task 42688.0 request from UID:25123 GID:340 HOST:10.141.16.34 PORT:57994
[2020-07-11T20:16:21.106] lllp_distribution jobid [42688] auto binding off: mask_cpu,one_thread
[2020-07-11T20:16:21.179] [42688.0] task/cgroup: /slurm/uid_25123/job_42688: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:21.185] [42688.0] task/cgroup: /slurm/uid_25123/job_42688/step_0: alloc=471040MB mem.limit=471040MB memsw.limit=471040MB
[2020-07-11T20:16:21.278] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.278] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.278] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.278] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.280] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.283] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.284] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.284] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.284] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.284] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.285] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.285] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.285] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.285] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.288] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.289] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.291] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.294] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.295] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.295] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.295] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.298] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.298] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.298] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.301] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.301] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.301] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.304] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.304] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.304] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.305] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.304] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.307] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.307] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.310] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.310] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.310] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.310] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.313] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.315] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.315] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.315] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.315] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.316] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.317] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.318] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.320] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.320] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.321] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.321] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.324] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.324] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.324] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.324] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.326] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.326] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.326] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.329] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.329] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.329] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.329] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.332] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.332] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.332] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.335] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.336] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.336] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.336] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.336] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.339] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.339] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.339] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.342] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.342] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.344] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:16:21.344] [42688.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-11T20:29:54.665] [42688.0] error: *** STEP 42688.0 ON r1u12n2 CANCELLED AT 2020-07-11T20:29:54 ***
[2020-07-11T20:30:11.298] [42688.0] done with job
[2020-07-11T20:30:11.306] [42688.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:35072
[2020-07-11T20:30:11.310] [42688.batch] done with job
[2020-07-11T20:30:11.321] [42688.extern] _oom_event_monitor: oom-kill event count: 1
[2020-07-11T20:30:25.915] [42688.extern] done with job
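
To sanity-check the cgroup limit in these entries, here is a small Python sketch that counts the CPUs set in the affinity mask and multiplies by the 5 GB per core request. It assumes, without confirmation from the Slurm documentation, that the node-level limit is simply (allocated CPUs) x (ReqMem per core); the function names are illustrative only, and the input values are copied from the log and sacct output in this ticket.

```python
import re

# Values copied from the slurmd.log excerpt and the sacct output in this ticket.
# The mask is written in three 8-digit groups only to make it easier to check.
CPU_MASK = "0x" "FFFFFFFF" "FFFFFFFF" "FFFFFFF0"
CGROUP_LINE = (
    "[42688.0] task/cgroup: /slurm/uid_25123/job_42688/step_0: "
    "alloc=471040MB mem.limit=471040MB memsw.limit=471040MB"
)
REQMEM_PER_CORE_MB = 5 * 1024  # ReqMem "5Gc", assuming 1 G = 1024 MB


def cpus_in_mask(mask):
    """Count the set bits in a hexadecimal CPU mask such as '0xF0'."""
    return bin(int(mask, 16)).count("1")


def cgroup_limits_mb(line):
    """Extract alloc, mem.limit and memsw.limit (in MB) from a task/cgroup log line."""
    pairs = re.findall(r"(alloc|mem\.limit|memsw\.limit)=(\d+)MB", line)
    return {name: int(value) for name, value in pairs}


cpus = cpus_in_mask(CPU_MASK)
limits = cgroup_limits_mb(CGROUP_LINE)
expected_mb = cpus * REQMEM_PER_CORE_MB  # assumption: limit = CPUs x per-core request

if __name__ == "__main__":
    print(f"CPUs in mask: {cpus}")
    print(f"expected limit: {expected_mb} MB, logged mem.limit: {limits['mem.limit']} MB")
    print(f"match: {expected_mb == limits['mem.limit']}")
```

If the product matches the logged mem.limit, the limit on this node looks consistent with the per-core request. That says nothing about the total memory the step used on the node, which the sacct output above does not report.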
Thanks!
Hi Todd,

Thanks for the information. It turns out we've already encountered this bug and have bug 9202 open to handle it. It was originally private, but it's public now, and I've marked bug 9202 comment 0 as public.

About this bug: we think that there weren't actually any OOM events, but that there is a bug in the extern slurmstepd. We have reproduced it, but not consistently.

Feel free to post comments or questions on bug 9202. I'm marking this bug as a duplicate of bug 9202.

*** This ticket has been marked as a duplicate of ticket 9202 ***