Created attachment 31115 [details] slurmctld log file 1

We experienced a rather large crash on July 1st around 14:00:00. The affected nodes all seemed to have been running a large array of Julia jobs from user ID 385621. All of the nodes crashed with OOM messages. Attached are the slurmctld.log and a couple of node messages logs from the time period in question. Thanks
Created attachment 31116 [details] node message file
Please attach your slurm.conf and cgroup.conf, along with the slurmd.log from one or two of the compute nodes that ran job 10889643. The logs attached show this took place on node209, so those slurmd.logs would be greatly appreciated. Finally, please also include the output of the command "mount" on node209.
Created attachment 31119 [details] slurm.conf

Hello, see attached. Below is the output of the mount command:

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=32854008k,nr_inodes=8213502,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/VolGroup00-LogVol00 on / type xfs (rw,relatime,attr2,inode64,noquota)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=21,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=46214)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/sda1 on /boot type xfs (rw,relatime,attr2,inode64,noquota)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
/etc/auto.misc on /misc type autofs (rw,relatime,fd=5,pgrp=2377,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=61507)
-hosts on /net type autofs (rw,relatime,fd=12,pgrp=2377,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=61512)
ldap://ldap.uvm.edu/ou=auto.netfiles,ou=nfs,dc=uvm,dc=edu on /netfiles type autofs (rw,relatime,fd=18,pgrp=2377,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=61516)
/etc/auto.master.d/auto.netfiles02 on /netfiles02 type autofs (rw,relatime,fd=24,pgrp=2377,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=61520)
gpfs2 on /gpfs2 type gpfs (rw,relatime)
gpfs1 on /gpfs1 type gpfs (rw,relatime)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=6574316k,mode=700)
Created attachment 31120 [details] cgroup.conf
Created attachment 31121 [details] slurmd.log
It looks like the slurmd.log you attached is from a much later date. The attached log covers 2023-07-01 through 2023-07-06, yet the controller logs show 2023-06-25 through 2023-06-26 for that job. Please check whether the logs from that period are still around on that compute node and attach them.

> [2023-06-25T23:48:46.997] _slurm_rpc_submit_batch_job: JobId=10889643 InitPrio=2253 usec=1366
> [2023-06-25T23:48:56.937] sched/backfill: _start_job: Started JobId=10889643 in bluemoon on node209
> [2023-06-26T00:52:41.745] _job_complete: JobId=10889643 OOM failure
> [2023-06-26T00:52:41.745] _job_complete: JobId=10889643 done
Created attachment 31127 [details] slurmd.log-20230701.gz

Sorry about that. See attached.
Since you are using CR_Core_Memory with cgroups, we see the reporting via cgroups and the OOM event in the node messages log. If the entire node were OOMing you might need some MemSpecLimit set aside to account for system memory usage; however, looking over your logs you have "RealMemory=62000" (MB) defined for the node, and the job requested 6144MB, which is a fraction of what is available. I suspect you mean that the steps OOMed, since cgroup memory management killed the step at 6290.596MB of usage.

> Jun 26 00:52:27 node209 kernel: Memory cgroup stats for /slurm/uid_385621/job_10889643/step_batch: cache:860KB rss:6290596KB rss_huge:12288KB mapped_file:860KB swap:7511920KB inactive_anon:1252444KB active_anon:5039008KB inactive_file:0KB active_file:0KB unevictable:0KB
> Jun 26 00:52:27 node209 kernel: Killed process 6597 (julia), UID 385621, total-vm:15450864kB, anon-rss:6286304kB, file-rss:12180kB, shmem-rss:1128kB
> Jun 26 00:52:27 node209 kernel: Memory cgroup stats for /slurm/uid_385621/job_10889652/step_batch: cache:696KB rss:6290760KB rss_huge:10240KB mapped_file:696KB swap:7266924KB inactive_anon:1382032KB active_anon:4909424KB inactive_file:0KB active_file:0KB unevictable:0KB
> [2023-06-25T23:48:57.654] [10889643.batch] task/cgroup: _memcg_initialize: step: alloc=6144MB mem.limit=6144MB memsw.limit=unlimited
> [2023-06-26T00:52:40.914] [10889643.batch] task/cgroup: task_cgroup_memory_check_oom: StepId=10889643.batch hit memory limit at least once during execution. This may or may not result in some failure.
> [2023-06-26T00:52:40.924] [10889643.batch] error: Detected 1 oom-kill event(s) in StepId=10889643.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
> [2023-06-26T00:52:41.176] [10889643.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:35072
> [2023-06-26T00:52:41.748] [10889643.batch] done with job
> [2023-06-26T00:52:45.884] [10889643.extern] done with job

It is important to note that the developer or the software should use caution when allocating memory; both need to take the memory requested through Slurm into account so as not to over-allocate it. Cgroup memory constraints are enforced outside of Slurm, which means Slurm is at the mercy of cgroups and systemd. Would you check with the user or the application to see whether the memory requested is accounted for in the application?
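For reference, here is a minimal, illustrative sketch (not anything from your site or shipped with Slurm) of how the application could look up the memory ceiling it is running under before sizing its allocations. It assumes the cgroup v1 layout shown in your mount output; the function names and the SLURM_MEM_PER_NODE fallback are illustrative only.

#!/usr/bin/env python3
# Minimal sketch, assuming the cgroup v1 memory controller mounted at
# /sys/fs/cgroup/memory as in the mount output above. Helper names are
# hypothetical; adapt to the actual job environment.
import os

def cgroup_v1_memory_limit():
    """Return memory.limit_in_bytes for this process's memory cgroup, or None."""
    try:
        with open("/proc/self/cgroup") as f:
            for line in f:
                # v1 entries look like "4:memory:/slurm/uid_385621/job_10889643/step_batch"
                _, controllers, path = line.strip().split(":", 2)
                if "memory" in controllers.split(","):
                    limit_file = os.path.join("/sys/fs/cgroup/memory",
                                              path.lstrip("/"), "memory.limit_in_bytes")
                    with open(limit_file) as lf:
                        return int(lf.read().strip())
    except (OSError, ValueError):
        pass
    return None

def slurm_requested_mb():
    """Memory Slurm reports as requested per node (MB), if exported to the job."""
    val = os.environ.get("SLURM_MEM_PER_NODE", "")
    return int(val) if val.isdigit() else None

if __name__ == "__main__":
    limit = cgroup_v1_memory_limit()
    print("cgroup memory limit :", f"{limit / 2**20:.0f} MB" if limit else "unknown")
    print("SLURM_MEM_PER_NODE  :", slurm_requested_mb())

An application that keeps a safe margin below whichever of those values is present would stay clear of the cgroup OOM killer; how the user's Julia code actually manages its allocations is of course up to them.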
Please see comment #8. That update may not have been sent out correctly; however, the comment is viewable directly in Bugzilla.
Hi Sean, do you have a response to Jason's comment #8? This seems like an issue with the job allocating more memory than it requested. As Jason suggested, it would be a good idea to check the user's job submission. Regards.
Hello, thank you for checking back on this. I believe the issue was faulty memory in a node in our mpi partition. Since replacing it we have not seen a recurrence. Thanks
Hi Sean, then I am closing this bug. Regards.