Description
ARC Admins
2020-11-19 11:29:47 MST
I can confirm we're observing the same behavior, FWIW.

Cheers,
--
Kilian

I am investigating the issue. Bug 10122 (Kaust) is also affected.

Could you tell me which kernel version, OS, and systemd version you are running? Can you upload your latest slurm.conf?

Ignore my last comments. I reproduced this on my CentOS 7 system and I am investigating the cause:

[slurm@moll0 inst]$ sbatch --wrap "sleep 10"
Submitted batch job 34
[slurm@moll0 inst]$ squeue
JOBID PARTITION NAME  USER ST TIME NODES NODELIST(REASON)
   34     debug wrap slurm  R 0:01     1 moll1
[slurm@moll0 inst]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34                 wrap      debug      slurm          1  COMPLETED      0:0
34.batch          batch                 slurm          1  COMPLETED      0:0
34.extern        extern                 slurm          1 OUT_OF_ME+    0:125
[slurm@moll0 inst]$ uname -a
Linux moll0 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[slurm@moll0 inst]$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

*** Ticket 10122 has been marked as a duplicate of this ticket. ***

Hi,

I am currently out of office, returning on November 30. If you need to reach Research Computing, please email srcc-support@stanford.edu

Cheers,

Hi,

The specific case where the extern step always ended in OOM happened because the extern step was registered as a listener for events on the memory cgroup in order to count OOMs, and on termination the cgroup directory was deleted before the counter was read.

According to the cgroups v1 API, a cgroup rmdir generates an event notification, so the rmdir was counted as an OOM.

This has been fixed in commit 272c636d507e1 and will be in 20.02.7.

I am closing this bug. Thanks!
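For readers following the explanation above, the ordering problem can be sketched with a toy model. This is not Slurm code and is not taken from commit 272c636d507e1; `FakeMemoryCgroup`, `finish_step_buggy` and `finish_step_fixed` are hypothetical names, and the class only imitates the one behavior described in this ticket: a registered listener is notified both on a real OOM and when the cgroup directory is removed.

```python
class FakeMemoryCgroup:
    """Toy stand-in for a cgroup v1 memory cgroup with one event listener.

    Hypothetical model: it only reproduces the behavior described in the
    ticket, namely that removing the cgroup also notifies the listener.
    """

    def __init__(self):
        self.pending_events = 0  # what a read of the listener would return
        self.removed = False

    def trigger_oom(self):
        # A real OOM kill inside the cgroup notifies the listener once.
        self.pending_events += 1

    def rmdir(self):
        # Removing the cgroup notifies the listener as well.
        self.removed = True
        self.pending_events += 1

    def read_events(self):
        # Return the pending notification count and reset it.
        count, self.pending_events = self.pending_events, 0
        return count


def finish_step_buggy(cg):
    """Remove the cgroup first, then read: the rmdir is counted as an OOM."""
    cg.rmdir()
    return cg.read_events()


def finish_step_fixed(cg):
    """Read the counter first, then remove the cgroup."""
    oom_count = cg.read_events()
    cg.rmdir()
    return oom_count


if __name__ == "__main__":
    print("buggy order:", finish_step_buggy(FakeMemoryCgroup()))
    print("fixed order:", finish_step_fixed(FakeMemoryCgroup()))
```

In this model a step that never ran out of memory reports one OOM event with the buggy ordering and none with the read-before-rmdir ordering, which matches the spurious OUT_OF_ME+ state seen on 34.extern. The actual fix in Slurm may be implemented differently.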
Hi Felip,

(In reply to Felip Moll from comment #21)
> The specific case where the extern always ended in OOM was happening because
> extern step was registered as listener on events on the memory cgroup in
> order to count OOMs, and on termination the cgroup directory was deleted
> before reading the counter.
>
> According to cgroups v1 API, a cgroup rmdir generates an event notification,
> so the rmdir was counted as an OOM.
>
> This has been fixed in commit 272c636d507e1 and will be in 20.02.7.

Thanks for the fix and the explanation!
Will the fix also be in 20.11.1?

Cheers,
--
Kilian

(In reply to Kilian Cavalotti from comment #22)
> Thanks for the fix and the explanation!
> Will the fix also be in 20.11.1?

Yes indeed, this will be merged into every version >= 20.02.7.

Regards

*** Ticket 10380 has been marked as a duplicate of this ticket. ***

*** Ticket 12122 has been marked as a duplicate of this ticket. ***