Created attachment 20548 [details]
slurmd debug log containing sbatch wrap, slurm.conf, cgroup.conf

Hi,

We're running a fresh install of Slurm v20.11.7. I'm experiencing an error that looks similar to bug 10255 and related reports: job extern steps end in the OUT_OF_MEMORY state. I've configured SlurmdDebug=debug and uploaded the file "slurmd.log" containing log output from an `sbatch --wrap "sleep 10"` test. The job ID is 2738808. All other extern steps I've investigated look similar. I've also uploaded our slurm.conf and cgroup.conf.

Please let me know if you need additional info. Any thoughts on how to resolve this?

Thanks!

Sebastian Smith
HPC Engineer
University of Nevada, Reno
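For reference, a minimal sketch of the reproduction and the per-step check (assuming sacct accounting is enabled; the job ID is the one from the attached log):

```
# Submit a trivial job; the wrapped command just sleeps, so there is no
# real memory pressure
sbatch --wrap "sleep 10"

# After completion, inspect the per-step states. With this symptom the
# .extern step reports OUT_OF_MEMORY even though nothing was killed.
sacct -j 2738808 --format=JobID,JobName,State,ExitCode
```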
Hi, I am out of the office until 8/2. Please contact hpc@unr.edu if you need immediate assistance. Thank you, Sebastian
Hello Sebastian,

Considering your 20.11 version, and after looking at your configuration and logs, I see only two possibilities:

1. You have a pam_slurm_adopt session that adds PIDs to the extern step and consumes a considerable amount of memory, so the OOM is real. I think this is unlikely, but just to be sure: do you have any SSH session attached to the job (and therefore to the extern step)?

2. Your kernel's cgroup implementation is sending notify events to sibling cgroups instead of keeping the event confined to the child. I have seen and studied this case in our private internal bug 9737: on certain kernels, simply removing a cgroup directory generates an event, and in some cases that event is propagated to siblings.

Questions:

a) Does it happen on all nodes where the job ran? (for multiple-node jobs)
b) What OS and kernel version do you use?

Thanks
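One way to sanity-check possibility 1 is to inspect the extern step's memory cgroup while the job is running. A rough sketch, assuming the default cgroup v1 layout under /sys/fs/cgroup/memory/slurm (the exact path depends on CgroupMountpoint, and the uid/job IDs below are placeholders):

```
# Substitute your own uid and job ID from squeue
STEP=/sys/fs/cgroup/memory/slurm/uid_1000/job_2738808/step_extern

# under_oom is 1 only while the kernel is actually handling an OOM
cat $STEP/memory.oom_control

# A real OOM requires peak usage to have reached the limit
cat $STEP/memory.max_usage_in_bytes $STEP/memory.limit_in_bytes
```

If max_usage_in_bytes never approaches limit_in_bytes, the OOM state is almost certainly a spurious event rather than real memory pressure.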
Hi,

I'll start by addressing 'a':

a) This appears to be happening on all nodes. Most extern steps are exiting with an OOM status. There are a few outlier jobs that complete without error, but I haven't tracked them to specific nodes.

1) pam_slurm_adopt is enabled, but most jobs exiting with extern OOM errors never had active SSH sessions. I've logged into several nodes while testing but haven't left the session open for long, and memory consumption was nominal. All extern steps in my tests ended in OOM, but that's the same result we're experiencing without opening a session.

b) We are running CentOS 7.9.2009. The kernel release is 3.10.0-1160.31.1.el7.x86_64.

Thanks!

Sebastian Smith
High-Performance Computing Engineer
University of Nevada, Reno
> b) We are running CentOS 7.9.2009. The kernel release is
> 3.10.0-1160.31.1.el7.x86_64.

Sebastian, are you using Bright? Can you show me the output of cat /proc/mounts?

I am fairly sure that, under certain conditions, deleting the batch step creates an OOM event which is propagated to its siblings (the extern step). I have not been able to replicate it yet, nor can I find a path in the code that could allow it.

Are you on 20.11.7 or greater on *all* compute nodes?
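One quick way to answer the last question in bulk, assuming pdsh (or a similar parallel shell) is available; the node range is a placeholder:

```
# slurmd -V prints the daemon's own version, regardless of the shared config;
# strip the pdsh hostname prefix so identical versions collapse together
pdsh -w cpu-[0-99] slurmd -V | awk -F': ' '{print $2}' | sort | uniq -c
```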
Hi,

We're using Bright v9.0. Our Slurm configuration is frozen and managed manually. Below is /proc/mounts:

```
[root@cpu-0 ~]# cat /proc/mounts
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=132016412k,nr_inodes=33004103,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
tmpfs / tmpfs rw,relatime,size=237647388k,mpol=interleave:0-1 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=382409 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
172.19.0.2:/apps /apps nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.0.2,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.19.0.2 0 0
master:/cm/shared /cm/shared nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.200.255.254,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.200.255.254 0 0
master:/home /home nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.200.255.254,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.200.255.254 0 0
pronghorn-0 /data/gpfs gpfs rw,relatime 0 0
```

Based on the related bugs, I agree with the cgroup event propagation idea. I've verified that we're on 20.11.7 on all nodes. I'll investigate epilog scripts to see if there might be an issue there.
Thanks,

Sebastian
I am really concerned about this:

```
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct 0 0
```

You have all of these controllers mounted under the same mountpoint. JoinControllers= was used by Bright < 9.0 to mix these controllers, but that did not work well with Slurm; see bug 7536.

Bright >= 9.1 doesn't use this anymore, and in fact JoinControllers is deprecated in systemd (https://bugs.schedmd.com/show_bug.cgi?id=7536#c29). If you don't have any strong reason to use it, can you please separate the controllers as on a standard system? That should look like this:

```
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/misc cgroup rw,nosuid,nodev,noexec,relatime,misc 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
```

There are some tips on configuring this correctly here: https://bugs.schedmd.com/show_bug.cgi?id=9041#c12

To be completely clear: I think that when proctrack removes the stepd cgroup directories, the event is propagated to its siblings, causing it to be caught as an OOM. Unfortunately, on your kernel there is no way to detect OOMs other than watching these events.

Let me know if fixing this configuration works for you.
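A minimal sketch of what undoing this might look like on a CentOS 7 node, assuming Bright set JoinControllers= in /etc/systemd/system.conf (Bright may manage this setting elsewhere in the software image, so treat the path as an assumption):

```
# Look for the joined-controller setting systemd was booted with
grep '^JoinControllers' /etc/systemd/system.conf

# Comment it out so systemd falls back to mounting each controller
# (or its default pairs) separately
sed -i 's/^JoinControllers=/#JoinControllers=/' /etc/systemd/system.conf

# systemd mounts the cgroup hierarchies at boot, so a reboot is required;
# on diskless/image-based nodes the change must be made in the image first
reboot
```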
(In reply to Felip Moll from comment #7)
> Let me know if fixing this configuration works for you.

Hi, have you been able to change the configuration? Do you have any feedback?

Thank you
Hi, I am out of the office until 8/30 but will be checking email. Please expect a response delay. Thank you, Sebastian
(In reply to Sebastian Smith from comment #9)
> I am out of the office until 8/30 but will be checking email.

Hi, please reopen this issue when you have more feedback.

Thanks