Ticket 10255 - Job extern step always ends in OUT_OF_MEMORY in 20.02.6
Summary: Job extern step always ends in OUT_OF_MEMORY in 20.02.6
Severity: 4 - Minor Issue
Assignee: Felip Moll
QA Contact: Alejandro Sanchez
Reported: 2020-11-19 11:29 MST by ARC Admins
Modified: 2021-10-29 07:34 MDT

Site: University of Michigan
Description ARC Admins 2020-11-19 11:29:47 MST

We recently patched to 20.02.6 (which, the "Version" drop down doesn't have listed, btw) and are seeing something peculiar with jobs. Every job that is submitted, started, and completed after the patch has OUT_OF_MEMORY for its extern step exit code. Here's an example:

[drhey@glctld ~]$ cat test-for-dan.sbat
#SBATCH --job-name=hello_world
#SBATCH --time=10:00:00
#SBATCH --mail-user=drhey@umich.edu
#SBATCH --mail-type=none
#SBATCH --account=support
#SBATCH --partition=standard
#SBATCH --mem=19g

sleep 60
echo "test"

[drhey@glctld ~]$ sbatch test-for-dan.sbat
Submitted batch job 15345859

[drhey@glctld ~]$ sq
          15345859  standard hello_wo    drhey  support  R       0:02      1 gl3096

[drhey@glctld ~]$ ssh gl3096
Last login: Thu Nov 19 13:22:41 2020 from

[drhey@gl3096 ~]$ cat /sys/fs/cgroup/memory/slurm/uid_228441/job_15345859/step_extern/memory.limit_in_bytes
[drhey@gl3096 ~]$ logout
Connection to gl3096 closed.

[drhey@glctld ~]$ sacct -j 15345859 --format=User,JobName,JobID,Account,Partition,AllocTRES%40,Submit,Start,End,Elapsed,TimeLimit,ExitCode,State%25
     User    JobName        JobID    Account  Partition                                AllocTRES              Submit               Start                 End    Elapsed  Timelimit ExitCode                     State
--------- ---------- ------------ ---------- ---------- ---------------------------------------- ------------------- ------------------- ------------------- ---------- ---------- -------- -------------------------
    drhey hello_wor+ 15345859        support   standard         billing=116,cpu=1,mem=19G,node=1 2020-11-19T13:23:26 2020-11-19T13:23:26 2020-11-19T13:24:26   00:01:00   10:00:00      0:0                 COMPLETED
               batch 15345859.ba+    support                                cpu=1,mem=19G,node=1 2020-11-19T13:23:26 2020-11-19T13:23:26 2020-11-19T13:24:26   00:01:00                 0:0                 COMPLETED
              extern 15345859.ex+    support                    billing=116,cpu=1,mem=19G,node=1 2020-11-19T13:23:26 2020-11-19T13:23:26 2020-11-19T13:24:26   00:01:00               0:125             OUT_OF_MEMORY
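
To check how many finished jobs show this pattern, the accounting records can be scanned for jobs whose extern step is OUT_OF_MEMORY while every other step completed. The following is only a rough sketch: it assumes the records were exported as pipe-separated text with the columns JobID, State, ExitCode (for example with something like `sacct -n -P --format=JobID,State,ExitCode`), and the function name and sample text are made up for illustration, with the sample values taken from the job above.

```python
def extern_only_oom(sacct_text):
    """Return job IDs whose extern step is OUT_OF_MEMORY while all other steps COMPLETED.

    Assumes each line looks like "JobID|State|ExitCode" (pipe-separated, no header).
    """
    states = {}  # job id -> {step name: state}
    for line in sacct_text.strip().splitlines():
        job_step, state, _exit_code = line.split("|")
        # "15345859.extern" -> ("15345859", "extern"); the job record itself has no step
        job_id, _, step = job_step.partition(".")
        states.setdefault(job_id, {})[step or "job"] = state

    flagged = []
    for job_id, steps in states.items():
        others = [s for name, s in steps.items() if name != "extern"]
        if (steps.get("extern") == "OUT_OF_MEMORY"
                and others
                and all(s == "COMPLETED" for s in others)):
            flagged.append(job_id)
    return flagged


# Illustrative input built from the job shown above (full step names are assumed).
SAMPLE = """\
15345859|COMPLETED|0:0
15345859.batch|COMPLETED|0:0
15345859.extern|OUT_OF_MEMORY|0:125
"""

print(extern_only_oom(SAMPLE))  # ['15345859']
```

A job that really ran out of memory would normally show OUT_OF_MEMORY on the job or batch record as well, so it is not flagged by this check.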

Comment 1 Kilian Cavalotti 2020-11-19 15:48:30 MST
I can confirm we're observing the same behavior, FWIW.

Comment 2 Felip Moll 2020-11-20 05:05:39 MST
I am investigating the issue. Bug 10122 (KAUST) is also affected.

Could you tell me which kernel version, OS, and systemd version you are running? Could you also upload your latest slurm.conf?

Comment 3 Felip Moll 2020-11-20 06:56:42 MST
Ignore my last comments.

I reproduced this on my CentOS 7 system and am investigating the cause:

[slurm@moll0 inst]$ sbatch --wrap "sleep 10" 
Submitted batch job 34
[slurm@moll0 inst]$ squeue
                34     debug     wrap    slurm  R       0:01      1 moll1 
[slurm@moll0 inst]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
34                 wrap      debug      slurm          1  COMPLETED      0:0 
34.batch          batch                 slurm          1  COMPLETED      0:0 
34.extern        extern                 slurm          1 OUT_OF_ME+    0:125 
[slurm@moll0 inst]$ uname -a
Linux moll0 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[slurm@moll0 inst]$ cat /etc/redhat-release 
CentOS Linux release 7.4.1708 (Core)
Comment 4 Felip Moll 2020-11-20 06:58:36 MST

Comment 21 Felip Moll 2020-12-01 08:25:21 MST

The specific case where the extern step always ended in OOM happened because the extern step was registered as a listener for events on the memory cgroup in order to count OOMs, and on termination the cgroup directory was deleted before the counter was read.

According to the cgroups v1 API, a cgroup rmdir generates an event notification, so the rmdir was counted as an OOM.
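
As a rough illustration of that ordering problem, here is a toy model in Python. It is not Slurm code and not the actual change in the commit: `ToyMemoryCgroup`, `run_step` and the "stop listening before rmdir" remedy are all hypothetical, and the model only assumes what is stated above, namely that the listener is woken both by real OOM events and by the removal of the cgroup directory.

```python
class ToyMemoryCgroup:
    """Hypothetical stand-in for a cgroup v1 memory cgroup with one event listener."""

    def __init__(self):
        self.wakeups = []       # notifications delivered to the listener
        self.listening = True   # whether the listener is still registered

    def oom(self):
        # A real out-of-memory event inside the cgroup.
        if self.listening:
            self.wakeups.append("oom")

    def stop_listening(self):
        # Unregister the listener; later events are no longer delivered.
        self.listening = False

    def rmdir(self):
        # Removing the cgroup also wakes a registered listener.
        if self.listening:
            self.wakeups.append("rmdir")


def run_step(real_ooms, stop_listener_before_rmdir):
    """Return the OOM count a step would report if every wakeup is counted as an OOM."""
    cg = ToyMemoryCgroup()
    for _ in range(real_ooms):
        cg.oom()
    if stop_listener_before_rmdir:
        cg.stop_listening()
    cg.rmdir()
    return len(cg.wakeups)


print(run_step(0, stop_listener_before_rmdir=False))  # 1: the rmdir is miscounted as an OOM
print(run_step(0, stop_listener_before_rmdir=True))   # 0
print(run_step(2, stop_listener_before_rmdir=True))   # 2: real OOMs are still counted
```

In this model a step with no real OOM reports a count of 1 whenever the directory is removed while the listener is still registered, which matches the symptom of every extern step ending in OUT_OF_MEMORY.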

This has been fixed in commit 272c636d507e1 and will be in 20.02.7.

I am closing this bug.

Comment 22 Kilian Cavalotti 2020-12-01 09:00:36 MST
Hi Felip, 

(In reply to Felip Moll from comment #21)
> The specific case where the extern always ended in OOM was happening because
> extern step was registered as listener on events on the memory cgroup in
> order to count OOMs, and on termination the cgroup directory was deleted
> before reading the counter.
> According to cgroups v1 API, a cgroup rmdir generates an event notification,
> so the rmdir was counted as an OOM.
> This has been fixed in commit 272c636d507e1 and will be in 20.02.7.

Thanks for the fix and the explanation!
Will the fix also be in 20.11.1?

Comment 23 Felip Moll 2020-12-01 10:03:29 MST
(In reply to Kilian Cavalotti from comment #22)
> Hi Felip, 
> (In reply to Felip Moll from comment #21)
> > The specific case where the extern always ended in OOM was happening because
> > extern step was registered as listener on events on the memory cgroup in
> > order to count OOMs, and on termination the cgroup directory was deleted
> > before reading the counter.
> > 
> > According to cgroups v1 API, a cgroup rmdir generates an event notification,
> > so the rmdir was counted as an OOM.
> > 
> > This has been fixed in commit 272c636d507e1 and will be in 20.02.7.
> Thanks for the fix and the explanation!
> Will the fix also be in 20.11.1?
> Cheers,
> -- 
> Kilian

Yes indeed, this will be merged into every version >= 20.02.7.

Comment 24 Albert Gil 2020-12-09 07:45:07 MST
Comment 25 Albert Gil 2021-10-29 07:34:43 MDT
