Created attachment 20548 [details]
slurmd debug log containing sbatch wrap, slurm.conf, cgroup.conf

Hi,

We're running a fresh install of Slurm v20.11.7. I'm experiencing an error that looks similar to bug 10255 and related reports: job extern steps end in the OUT_OF_MEMORY state. I've configured SlurmdDebug=debug and uploaded the file "slurmd.log" containing log output from an `sbatch --wrap "sleep 10"` test. The job ID is 2738808. All other extern steps I've investigated look similar. I've also uploaded our slurm.conf and cgroup.conf.

Please let me know if you need additional info. Any thoughts on how to resolve this?

Thanks!

Sebastian Smith
HPC Engineer
University of Nevada, Reno
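For reference, a minimal sketch of the reproduction and the per-step check (assuming sacct accounting is enabled; the job ID is the one from the attached log):

```
# Submit a trivial job; the wrapped command just sleeps, so there is no
# real memory pressure
sbatch --wrap "sleep 10"

# After completion, inspect the per-step states. With this symptom the
# .extern step reports OUT_OF_MEMORY even though nothing was killed.
sacct -j 2738808 --format=JobID,JobName,State,ExitCode
```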
Hi, I am out of the office until 8/2. Please contact hpc@unr.edu if you need immediate assistance. Thank you, Sebastian
Hello Sebastian,

Considering your 20.11 version, and after looking at your configuration and logs, I see only two possibilities:

1. You have a pam_slurm_adopt session that adds PIDs to the extern step and consumes a considerable amount of memory, so the OOM is real. I think this is unlikely, but just to be sure: do you have any SSH session attached to the job (and therefore to the extern step)?

2. Your kernel's cgroup implementation is sending notify events to sibling cgroups instead of keeping the event confined to the child. I have seen and studied this case in our private internal bug 9737: on certain kernels, simply removing a cgroup directory generates an event, and in some cases that event is propagated to siblings.

Questions:

a) Does it happen on all nodes where the job ran? (for multiple-node jobs)
b) What OS and kernel version do you use?

Thanks
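One way to sanity-check possibility 1 is to inspect the extern step's memory cgroup while the job is running. A rough sketch, assuming the default cgroup v1 layout under /sys/fs/cgroup/memory/slurm (the exact path depends on CgroupMountpoint, and the uid/job IDs below are placeholders):

```
# Substitute your own uid and job ID from squeue
STEP=/sys/fs/cgroup/memory/slurm/uid_1000/job_2738808/step_extern

# under_oom is 1 only while the kernel is actually handling an OOM
cat $STEP/memory.oom_control

# A real OOM requires peak usage to have reached the limit
cat $STEP/memory.max_usage_in_bytes $STEP/memory.limit_in_bytes
```

If max_usage_in_bytes never approaches limit_in_bytes, the OOM state is almost certainly a spurious event rather than real memory pressure.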
Hi,

I'll start by addressing 'a':

a) This appears to be happening on all nodes. Most extern steps are exiting with an OOM status. There are a few outlier jobs that complete without error, but I haven't tracked them to specific nodes.

1) pam_slurm_adopt is enabled, but most jobs exiting with extern OOM errors never had active SSH sessions. I've logged into several nodes while testing but haven't left the session open for long, and memory consumption was nominal. All extern steps in my tests ended in OOM, but that's the same result we're experiencing without opening a session.

b) We are running CentOS 7.9.2009. The kernel release is 3.10.0-1160.31.1.el7.x86_64.

Thanks!

Sebastian Smith
High-Performance Computing Engineer
University of Nevada, Reno
> b) We are running CentOS 7.9.2009. The kernel release is
> 3.10.0-1160.31.1.el7.x86_64.

Sebastian, are you using Bright? Can you show me the output of cat /proc/mounts?

I am fairly sure that, under certain conditions, deleting the batch step creates an OOM event which is propagated to its siblings (the extern step). I have not been able to replicate it yet, nor can I find a path in the code that could allow it.

Are you on 20.11.7 or greater on *all* compute nodes?
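One quick way to answer the last question in bulk, assuming pdsh (or a similar parallel shell) is available; the node range is a placeholder:

```
# slurmd -V prints the daemon's own version, regardless of the shared config;
# strip the pdsh hostname prefix so identical versions collapse together
pdsh -w cpu-[0-99] slurmd -V | awk -F': ' '{print $2}' | sort | uniq -c
```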
Hi,

We're using Bright v9.0. Our Slurm configuration is frozen and managed manually. Below is /proc/mounts:

```
[root@cpu-0 ~]# cat /proc/mounts
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=132016412k,nr_inodes=33004103,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
tmpfs / tmpfs rw,relatime,size=237647388k,mpol=interleave:0-1 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=382409 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
172.19.0.2:/apps /apps nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.0.2,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.19.0.2 0 0
master:/cm/shared /cm/shared nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.200.255.254,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.200.255.254 0 0
master:/home /home nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.200.255.254,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.200.255.254 0 0
pronghorn-0 /data/gpfs gpfs rw,relatime 0 0
```

Based on the related bugs, I agree with the cgroup event propagation idea. I've verified that we're on 20.11.7 on all nodes. I'll investigate epilog scripts to see if there might be an issue there.
Thanks,

Sebastian
I am really concerned about this:

```
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct 0 0
```

You have all of these controllers mounted under the same mountpoint. JoinControllers= was used by Bright < 9.0 to mix these controllers, but that did not work well with Slurm; see bug 7536.

Bright >= 9.1 doesn't use this anymore, and in fact JoinControllers is deprecated in systemd (https://bugs.schedmd.com/show_bug.cgi?id=7536#c29). If you don't have any strong reason to use it, can you please separate the controllers as on a standard system? That should look like this:

```
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/misc cgroup rw,nosuid,nodev,noexec,relatime,misc 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
```

There are some tips on configuring this correctly here: https://bugs.schedmd.com/show_bug.cgi?id=9041#c12

To be completely clear: I think that when proctrack removes the stepd cgroup directories, the event is propagated to its siblings, causing it to be caught as an OOM. Unfortunately, on your kernel there is no way to detect OOMs other than watching these events.

Let me know if fixing this configuration works for you.
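A minimal sketch of what undoing this might look like on a CentOS 7 node, assuming Bright set JoinControllers= in /etc/systemd/system.conf (Bright may manage this setting elsewhere in the software image, so treat the path as an assumption):

```
# Look for the joined-controller setting systemd was booted with
grep '^JoinControllers' /etc/systemd/system.conf

# Comment it out so systemd falls back to mounting each controller
# (or its default pairs) separately
sed -i 's/^JoinControllers=/#JoinControllers=/' /etc/systemd/system.conf

# systemd mounts the cgroup hierarchies at boot, so a reboot is required;
# on diskless/image-based nodes the change must be made in the image first
reboot
```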
(In reply to Felip Moll from comment #7)
> Let me know if fixing this configuration works for you.

Hi, have you been able to change the configuration? Do you have any feedback?

Thank you
Hi, I am out of the office until 8/30 but will be checking email. Please expect a response delay. Thank you, Sebastian
(In reply to Sebastian Smith from comment #9)
> I am out of the office until 8/30 but will be checking email.

Hi, please reopen this issue when you have more feedback.

Thanks