Ticket 12122 - OOM While Job Completed
Summary: OOM While Job Completed
Status: RESOLVED DUPLICATE of ticket 10255
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 20.02.6
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-07-26 11:56 MDT by Yang Liu
Modified: 2021-10-29 07:34 MDT
CC: 1 user

See Also:
Site: Brown Univ
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Yang Liu 2021-07-26 11:56:40 MDT
Our user had a job that finished with correct output. However, the job is shown as OOM by the sacct command, and the slurmd log indicates that an OOM was detected.
* no dmesg logs of an OOM for the job
* compute nodes are stateless, i.e., no local disks
* the user reran the same job and did not receive the OOM error
* the job still has exit code 0, despite the reported OOM

$ sacct -j 1802599 --format='jobid%30,state,reason,start,end,timelimit,MaxRSS,ReqMem,nodeList,jobIdRaw,partition,Exitcode'
                         JobID      State                 Reason               Start                 End  Timelimit     MaxRSS     ReqMem        NodeList     JobIDRaw  Partition ExitCode 
------------------------------ ---------- ---------------------- ------------------- ------------------- ---------- ---------- ---------- --------------- ------------ ---------- -------- 
                   1796948_176  COMPLETED         QOSGrpMemLimit 2021-07-20T12:55:07 2021-07-20T15:24:39 1-00:00:00                  32Gn        node1635 1802599           batch      0:0 
             1796948_176.batch  COMPLETED                        2021-07-20T12:55:07 2021-07-20T15:24:39             23684604K       32Gn        node1635 1802599.bat+                 0:0 
            1796948_176.extern OUT_OF_ME+                        2021-07-20T12:55:07 2021-07-20T15:24:39                 2100K       32Gn        node1635 1802599.ext+               0:125 
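
(For reference, other jobs that hit the same state on this node around that time could be listed with a query along these lines; the date range and format string are illustrative, and the grep simply matches the truncated OUT_OF_ME+ state shown above.)

$ sacct -a -S 2021-07-20 -E 2021-07-21 --nodelist=node1635 \
        --format='jobid%30,state,exitcode,nodelist' | grep OUT_OF_ME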



======================Slurmd Log============================
[2021-07-20T15:24:08.478] [1802599.batch] debug:  jag_common_poll_data: Task 0 pid 83664 ave_freq = 432 mem size/max 18041962496/24253034496 vmem size/max 18534100992/24745234432, disk read size/max (31123267534/31123267534), disk write size/max (287918719/287918719), time 7821.000000(7687+134) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[2021-07-20T15:24:38.415] [1802599.extern] debug:  jag_common_poll_data: Task 0 pid 83469 ave_freq = 3499829 mem size/max 360448/2150400 vmem size/max 110546944/202862592, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[2021-07-20T15:24:38.478] [1802599.batch] debug:  jag_common_poll_data: Task 0 pid 83664 ave_freq = 127 mem size/max 18515107840/24253034496 vmem size/max 19007246336/24745234432, disk read size/max (31123267534/31123267534), disk write size/max (287918719/287918719), time 7851.000000(7716+134) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[2021-07-20T15:24:39.748] [1802599.batch] task 0 (83664) exited with exit code 0.
[2021-07-20T15:24:39.748] [1802599.batch] debug:  _oom_event_monitor: No oom events detected.
[2021-07-20T15:24:39.748] [1802599.batch] debug:  _oom_event_monitor: stopping.
[2021-07-20T15:24:39.749] [1802599.batch] debug:  step_terminate_monitor_stop signaling condition
[2021-07-20T15:24:39.809] [1802599.batch] job 1802599 completed with slurm_rc = 0, job_rc = 0
[2021-07-20T15:24:39.809] [1802599.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2021-07-20T15:24:39.812] [1802599.batch] debug:  Message thread exited
[2021-07-20T15:24:39.812] [1802599.batch] done with job
[2021-07-20T15:24:39.813] debug:  credential for job 1802599 revoked
[2021-07-20T15:24:39.813] [1802599.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2021-07-20T15:24:39.813] [1802599.extern] debug:  _handle_signal_container for step=1802599.4294967295 uid=508 signal=18
[2021-07-20T15:24:39.814] [1802599.extern] Sent signal 18 to 1802599.4294967295
[2021-07-20T15:24:39.814] [1802599.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2021-07-20T15:24:39.814] [1802599.extern] debug:  _handle_signal_container for step=1802599.4294967295 uid=508 signal=15
[2021-07-20T15:24:39.814] [1802599.extern] debug:  step_terminate_monitor_stop signaling condition
[2021-07-20T15:24:39.814] [1802599.extern] Sent signal 15 to 1802599.4294967295
[2021-07-20T15:24:39.814] [1802599.extern] debug:  _oom_event_monitor: stopping.
[2021-07-20T15:24:39.814] [1802599.extern] error: Detected 1 oom-kill event(s) in step 1802599.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-07-20T15:24:39.814] [1802599.extern] debug:  task_g_post_term: task/cgroup: Cannot allocate memory
[2021-07-20T15:24:39.815] [1802599.extern] debug:  Handling REQUEST_STATE
[2021-07-20T15:24:39.830] [1802599.extern] debug:  Message thread exited
[2021-07-20T15:24:39.830] [1802599.extern] done with job
[2021-07-20T15:24:39.835] debug:  Waiting for job 1802599's prolog to complete
[2021-07-20T15:24:39.835] debug:  Finished wait for job 1802599's prolog to complete
[2021-07-20T15:24:39.835] debug:  [job 1802599] attempting to run epilog [/usr/local/etc/slurm/epilog]
[2021-07-20T15:24:39.838] debug:  completed epilog for jobid 1802599
[2021-07-20T15:24:39.839] debug:  Job 1802599: sent epilog complete msg: rc = 0


Yang
Comment 1 Albert Gil 2021-07-28 04:58:00 MDT
Hi Yang,

Can you post the OS and kernel versions that node1635 is using?
Have you done any OS/kernel update recently?

As you can see in the See Also bugs, we have faced similar cases related to OOMs detected by cgroups, so we'll rule that out first.

Regards,
Albert
Comment 2 Yang Liu 2021-07-28 06:11:08 MDT
Hi Albert,
No recent upgrade.

[yliu385@node1635 ~]$ cat /etc/*release*
cat: /etc/lsb-release.d: Is a directory
NAME="Red Hat Enterprise Linux Server"
VERSION="7.7 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.7"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.7:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.7"
Red Hat Enterprise Linux Server release 7.7 (Maipo)
Red Hat Enterprise Linux Server release 7.7 (Maipo)
cpe:/o:redhat:enterprise_linux:7.7:ga:server
[yliu385@node1635 ~]$ uname -r
3.10.0-1062.el7.x86_64

Linux version 3.10.0-1062.el7.x86_64 (mockbuild@x86-040.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jul 18 20:25:13 UTC 2019



Yang
Comment 4 Albert Gil 2021-08-17 05:16:44 MDT
Hi Yang,

We have detected some scenarios that could lead to a false OOM like the one yours seems to be.
We are investigating them all together and we are close to confirming the root cause of all of them.

Maybe it's a bit late, but could you attach the dmesg/syslog of node node1635 for 2021-07-20?

Anyway, we'll keep you posted on our internal progress about this.

Regards,
Albert
Comment 5 Albert Gil 2021-08-17 07:06:15 MDT
Hi Yang,

> Maybe it's a bit late, but could you attach the dmesg/syslog of node
> node1635 for 2021-07-20?

I see that you already mentioned in comment 0 that there was no OOM in dmesg, sorry.

Then this ticket actually looks like a duplicate of bug 10255, which was already fixed in versions >= 20.02.7.

Do you see a similar behavior in other jobs, or only on 1796948_176?

Regards,
Albert
Comment 6 Yang Liu 2021-08-17 07:30:21 MDT
Hi Albert,
Thank you for pointing out bug 10255. I tested the test job script and 'sbatch --wrap "sleep 10"', and didn't receive an OOM for the test jobs (4 of them). Our user also reran her job and didn't receive the OOM again. In bug 10255 the OOM state is always on the extern step. So it seems our case is different from bug 10255.

Thanks,
Yang
Comment 7 Albert Gil 2021-08-17 08:56:59 MDT
Hi Yang,

Just to avoid confusion:

- Can you confirm that you are still using version 20.02.6, especially for the slurmds?
- Are you able to reproduce the issue somehow?

Thanks,
Albert
Comment 8 Albert Gil 2021-08-17 09:13:11 MDT
Hi Yang,

Are you using Bright?
Can you show me the output of cat /proc/mounts of node1635?

Regards,
Albert
Comment 9 Yang Liu 2021-08-17 09:25:13 MDT
Hi Albert,
1. Yes, slurmd is 20.02, from 'sinfo -o "%v"'

2. I am asking the user if there is a job which always receives the OOM error

3. I don't think we are using Bright. We have gpfs+slurm+infiniband 

4. 
[yliu385@node1635 ~]$ cat /proc/mounts
rootfs / rootfs rw,size=98303864k,nr_inodes=24575966 0 0
sysfs /sys sysfs rw,relatime 0 0
proc /proc proc rw,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=98312196k,nr_inodes=24578049,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
rootfs / tmpfs rw,relatime,mode=755 0 0
rw /.sllocal/log tmpfs rw,relatime 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=232861 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
gpfs /gpfs gpfs rw,relatime 0 0
mgt5:/install /install nfs4 ro,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.20.212.35,local_lock=none,addr=172.20.0.6 0 0
/etc/auto.misc /misc autofs rw,relatime,fd=5,pgrp=14327,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=276510 0 0
-hosts /net autofs rw,relatime,fd=11,pgrp=14327,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=276516 0 0
/etc/auto.cvmfs /cvmfs autofs rw,relatime,fd=17,pgrp=14327,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=276521 0 0
auto.direct /nfs/scratch autofs rw,relatime,fd=23,pgrp=14327,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=276525 0 0
auto.direct /truenas autofs rw,relatime,fd=23,pgrp=14327,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=276525 0 0
auto.direct /nfs/data autofs rw,relatime,fd=23,pgrp=14327,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=276525 0 0
auto.direct /nfs/jbailey5/baileyweb autofs rw,relatime,fd=23,pgrp=14327,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=276525 0 0
auto.tserre_cifs /cifs/data autofs rw,relatime,fd=28,pgrp=14327,timeout=600,minproto=5,maxproto=5,indirect,pipe_ino=276529 0 0
auto.tserre_lrs /cifs/data/tserre_lrs autofs rw,relatime,fd=34,pgrp=14327,timeout=600,minproto=5,maxproto=5,indirect,pipe_ino=276533 0 0
auto.lrs /lrs autofs rw,relatime,fd=40,pgrp=14327,timeout=600,minproto=5,maxproto=5,indirect,pipe_ino=276537 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=19666796k,mode=700 0 0
tmpfs /run/user/140348764 tmpfs rw,nosuid,nodev,relatime,size=19666796k,mode=700,uid=140348764,gid=601 0 0
Comment 10 Albert Gil 2021-08-18 03:52:57 MDT
Hi Yang,

> 1. Yes, slurmd is 20.02, from 'sinfo -o "%v"'

Actually I'm more interested in the minor release number.
Note that the problem in bug 10255 was actually a regression introduced in 20.02.6 and fixed in 20.02.7.
This ticket is marked as "Version 20.02.6", so I'm wondering if that user hit the issue because of that; maybe you've since updated Slurm to a version >= 20.02.7, which would explain why the bug is no longer reproduced.

You can also run "slurmd -V" in node1635 to check it.
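
If it helps, a quick way to see the slurmd version reported by every node at once is to build on the sinfo format you already used; the sort/uniq below is only there to summarize:

$ sinfo -h -N -o "%v" | sort | uniq -c    # count of nodes reporting each slurmd version

Adding %N to the format string would also show which specific nodes report an older version.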

> 2. I am asking the user if there is a job which always receives the OOM error

Thanks.

> 3. I don't think we are using Bright. We have gpfs+slurm+infiniband 
> 
> 4. 
> [yliu385@node1635 ~]$ cat /proc/mounts

Ok, this rules out some possible sources of problems.

Thanks,
Albert
Comment 11 Yang Liu 2021-08-18 07:33:09 MDT
Hi Albert,
1. [yliu385@node945 yliu385]$ slurmd -V
slurm 20.02.6

[yliu385@node945 yliu385]$ sbatch --wrap "sleep 10" 
Submitted batch job 2052515

[yliu385@node945 yliu385]$ sacct -j 2052515
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
2052515            wrap      batch    default          1  COMPLETED      0:0 
2052515.bat+      batch               default          1  COMPLETED      0:0 
2052515.ext+     extern               default          1  COMPLETED      0:0 

Bug 10255 listed slurm version of 20.02.5


Yang
Comment 12 Albert Gil 2021-08-18 08:40:39 MDT
Hi Yang,

> Bug 10255 listed slurm version of 20.02.5

Yes, but although the reporter of that ticket set the version to .5, in our debug process we actually identified the issue as an undesired side effect of this commit, which landed in .6:

- https://github.com/SchedMD/slurm/commit/f93b16670f3b07f6209099c24425036f9c54d136

That undesired effect was fixed by this commit, which landed in .7:

- https://github.com/SchedMD/slurm/commit/272c636d507e1dc59d987da478d42f6713d88ae1

> 1. [yliu385@node945 yliu385]$ slurmd -V
> slurm 20.02.6

Thanks for double-checking.

> [yliu385@node945 yliu385]$ sbatch --wrap "sleep 10" 
> Submitted batch job 2052515
> 
> [yliu385@node945 yliu385]$ sacct -j 2052515
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
> ------------ ---------- ---------- ---------- ---------- ---------- -------- 
> 2052515            wrap      batch    default          1  COMPLETED      0:0 
> 2052515.bat+      batch               default          1  COMPLETED      0:0 
> 2052515.ext+     extern               default          1  COMPLETED      0:0 

I assume that all your nodes are running slurmd 20.02.6 but that you are not able to reproduce the issue, right?
And you are also using jobacct_gather/cgroup, right?

If that's true, then I agree that this is not *exactly* what users on bug 10255 experienced (they were able to reproduce it quite easily), but we still cannot rule out that the same Slurm version (20.02.6) behaves slightly differently with different kernel versions (cgroups sometimes behave slightly differently between kernel versions, which is why in comment 1 I asked for your kernel version).
Note that the known bug was essentially a race condition between Slurm and the kernel's cgroups, so just a faster/slower kernel can make it happen always or even never.

So, although we cannot be 100% sure that bug 10255 is the root cause of your issue, we cannot rule it out either. Therefore, my recommendation would be to update to the latest 20.02 (or, if you are able, better to the latest 20.11). This way we can rule out the known issue in 20.02.6 that could be impacting you (even if not as hard as other users).
Once you update, if you are not able to reproduce the issue in >= 20.02.7, then I think we should close this as a duplicate of bug 10255.
If you are able to reproduce the issue, then we should do further investigation knowing that the known bug 10255 is not the root cause (and also having some more clues about how to reproduce the issue).
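
In the meantime, one way to gauge how widespread the false OOMs are is to grep the slurmd logs on the compute nodes for the exact message from your excerpt; the log path below is only a placeholder for wherever slurmd logs on your nodes:

$ grep -c "Detected .* oom-kill event" /var/log/slurmd.log    # path is site-specific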

I hope that this is ok for you too.

Regards,
Albert
Comment 13 Yang Liu 2021-08-18 08:48:33 MDT
Hi Albert,
Yes, all nodes have slurmd 20.02.6:
[yliu385@node945 10255]$ slurmd -V
slurm 20.02.6

Surprisingly, our Slurm is configured with jobacct_gather/linux, not jobacct_gather/cgroup. Does this configuration have any impact?
[yliu385@node945 10255]$ scontrol show config|grep -i jobacct
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)


Yang
Comment 14 Yang Liu 2021-08-18 09:46:10 MDT
Hi Albert,
Our user reported that the issue still persists randomly. I read the slurmd log and it seems that <job_id>.batch exited before <job_id>.extern. Is that an error, and could it have caused the OOM for <job_id>.extern?

[2021-07-20T15:24:39.748] [1802599.batch] task 0 (83664) exited with exit code 0.
[2021-07-20T15:24:39.748] [1802599.batch] debug:  _oom_event_monitor: No oom events detected.
[2021-07-20T15:24:39.748] [1802599.batch] debug:  _oom_event_monitor: stopping.
[2021-07-20T15:24:39.749] [1802599.batch] debug:  step_terminate_monitor_stop signaling condition
[2021-07-20T15:24:39.809] [1802599.batch] job 1802599 completed with slurm_rc = 0, job_rc = 0
[2021-07-20T15:24:39.809] [1802599.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2021-07-20T15:24:39.812] [1802599.batch] debug:  Message thread exited
[2021-07-20T15:24:39.812] [1802599.batch] done with job
[2021-07-20T15:24:39.813] debug:  credential for job 1802599 revoked
[2021-07-20T15:24:39.813] [1802599.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2021-07-20T15:24:39.813] [1802599.extern] debug:  _handle_signal_container for step=1802599.4294967295 uid=508 signal=18
[2021-07-20T15:24:39.814] [1802599.extern] Sent signal 18 to 1802599.4294967295
[2021-07-20T15:24:39.814] [1802599.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2021-07-20T15:24:39.814] [1802599.extern] debug:  _handle_signal_container for step=1802599.4294967295 uid=508 signal=15
[2021-07-20T15:24:39.814] [1802599.extern] debug:  step_terminate_monitor_stop signaling condition
[2021-07-20T15:24:39.814] [1802599.extern] Sent signal 15 to 1802599.4294967295
[2021-07-20T15:24:39.814] [1802599.extern] debug:  _oom_event_monitor: stopping.
[2021-07-20T15:24:39.814] [1802599.extern] error: Detected 1 oom-kill event(s) in step 1802599.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-07-20T15:24:39.814] [1802599.extern] debug:  task_g_post_term: task/cgroup: Cannot allocate memory
[2021-07-20T15:24:39.815] [1802599.extern] debug:  Handling REQUEST_STATE
[2021-07-20T15:24:39.830] [1802599.extern] debug:  Message thread exited
[2021-07-20T15:24:39.830] [1802599.extern] done with job

Yang
Comment 15 Albert Gil 2021-08-18 10:36:45 MDT
Hi Yang,

> Yes, all nodes have slurmd 20.02.6:
> [yliu385@node945 10255]$ slurmd -V
> slurm 20.02.6

Thanks for confirming.

> Surprisingly, our Slurm is configured with jobacct_gather/linux, not
> jobacct_gather/cgroup. Does this configuration have any impact?
> [yliu385@node945 10255]$ scontrol show config|grep -i jobacct
> JobAcctGatherFrequency  = 30
> JobAcctGatherType       = jobacct_gather/linux
> JobAcctGatherParams     = (null)

Ok, that could also explain some behavior that differs from other customers.
I assume that you do use task/cgroup, right?
Could you attach your slurm.conf?
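
In the meantime, the plugin settings that matter here can be pulled the same way you did for the jobacct ones; this just widens your earlier grep:

$ scontrol show config | grep -E -i 'jobacctgather|taskplugin|proctracktype'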

> Our user reported that the issue still persists randomly.

Ok, assuming that we are impacted by the known bug mentioned before, that would be expected until you update to >= 20.02.7.


> I read the slurmd
> log and it seems that <job_id>.batch exited before <job_id>.extern. Is
> that an error, and could it have caused the OOM for <job_id>.extern?

Well, the error is not exactly that one step ends before/after another (that is not a problem). It has more to do with how the kernel reports "cgroup events" (which Slurm assumes are OOM events but which may be other events), and with how 20.02.6 cleans up the job's cgroup hierarchy. In 20.02.7 we fixed the cleanup code in Slurm so that we avoid generating undesired cgroup events, and therefore stop detecting and reporting them as false OOMs.
But that is a rather low-level detail.
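
If you are curious, you can look at what the kernel itself exposes for a step's memory cgroup while a job is running on a node. This is only a sketch: the path layout below is an assumption about how the memory cgroup is mounted (your /proc/mounts shows /sys/fs/cgroup/memory) and how Slurm names the hierarchy, and <uid>/<jobid> are placeholders:

$ cd /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/step_extern
$ cat memory.oom_control          # oom_kill_disable / under_oom flags for this step's cgroup
$ cat memory.failcnt              # number of times the memory limit was hit
$ cat memory.max_usage_in_bytes   # peak memory usage recorded by the cgroup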

Do you think that you can manage to update to >= 20.02.7 so we can rule out the known issue?

Regards,
Albert
Comment 16 Albert Gil 2021-08-27 07:23:50 MDT
Hi Yang,

Have you been able to update Slurm so we can rule out that you are impacted by the known issue in 20.02.6?

Regards,
Albert
Comment 17 Albert Gil 2021-09-03 02:33:11 MDT
Hi Yang,

Any news or plans about updating to fix or rule out the known issue in 20.02.6?

Regards,
Albert
Comment 18 Yang Liu 2021-09-03 06:32:32 MDT
Hi Albert,
No, not yet. Our system team will plan and schedule the update. Once Slurm
is updated, I will let you know.

Best,
Yang

Comment 19 Albert Gil 2021-10-22 08:53:57 MDT
Hi Yang,

Have you been able to upgrade and rule out the known issue in 20.02.6 (bug 10255)?

Regards,
Albert
Comment 20 Albert Gil 2021-10-29 07:34:43 MDT
Hi Yang,

If this is ok with you, I'm closing this ticket on the assumption that it's a duplicate of bug 10255 and that an upgrade should fix it.
But if you need further related support, please don't hesitate to reopen it.

Regards,
Albert

*** This ticket has been marked as a duplicate of ticket 10255 ***