Ticket 17140

Summary: 28 compute nodes crashed
Product: Slurm
Component: slurmd
Version: 21.08.7
Reporter: sblacker
Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
Hardware: Linux
OS: Linux
Site: U of Vermont
Attachments: slurmctld log file 1
node message file
slurm.conf
cgroup.conf
slurmd.log
slurmd.log-20230701.gz

Description sblacker 2023-07-06 09:58:22 MDT
Created attachment 31115 [details]
slurmctld log file 1

We experienced a rather large crash on July 1st around 14:00. The affected nodes all seemed to be running a large array of Julia jobs from user ID 385621, and all of them crashed with OOM messages.

Attached is the slurmctld.log and a couple of node messages logs from the time period in question.

thanks
Comment 1 sblacker 2023-07-06 09:59:13 MDT
Created attachment 31116 [details]
node message file
Comment 2 Jason Booth 2023-07-06 11:10:48 MDT
Please attach your slurm.conf, cgroup.conf, and the slurmd.log from one or two of the compute nodes that ran job 10889643. The logs attached show this took place on node209; those slurmd.logs would be greatly appreciated.

Finally, please also include the output of the command "mount" on node209.
Comment 3 Sean Blackerby 2023-07-06 11:18:33 MDT
Created attachment 31119 [details]
slurm.conf

Hello

See attached

Below is the output of the mount command.

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=32854008k,nr_inodes=8213502,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/VolGroup00-LogVol00 on / type xfs (rw,relatime,attr2,inode64,noquota)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=21,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=46214)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/sda1 on /boot type xfs (rw,relatime,attr2,inode64,noquota)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
/etc/auto.misc on /misc type autofs (rw,relatime,fd=5,pgrp=2377,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=61507)
-hosts on /net type autofs (rw,relatime,fd=12,pgrp=2377,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=61512)
ldap://ldap.uvm.edu/ou=auto.netfiles,ou=nfs,dc=uvm,dc=edu on /netfiles type autofs (rw,relatime,fd=18,pgrp=2377,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=61516)
/etc/auto.master.d/auto.netfiles02 on /netfiles02 type autofs (rw,relatime,fd=24,pgrp=2377,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=61520)
gpfs2 on /gpfs2 type gpfs (rw,relatime)
gpfs1 on /gpfs1 type gpfs (rw,relatime)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=6574316k,mode=700)


Comment 4 Sean Blackerby 2023-07-06 11:18:33 MDT
Created attachment 31120 [details]
cgroup.conf
Comment 5 Sean Blackerby 2023-07-06 11:18:33 MDT
Created attachment 31121 [details]
slurmd.log
Comment 6 Jason Booth 2023-07-06 13:37:12 MDT
It looks like the slurmd.log you attached is from a much later date.

The attached logs cover 2023-07-01 through 2023-07-06, yet the controller logs place that job on 2023-06-25 through 2023-06-26. Please see if those logs are still around and attach them from that compute node.


> [2023-06-25T23:48:46.997] _slurm_rpc_submit_batch_job: JobId=10889643 InitPrio=2253 usec=1366
> [2023-06-25T23:48:56.937] sched/backfill: _start_job: Started JobId=10889643 in bluemoon on node209
> [2023-06-26T00:52:41.745] _job_complete: JobId=10889643 OOM failure
> [2023-06-26T00:52:41.745] _job_complete: JobId=10889643 done
Comment 7 Sean Blackerby 2023-07-06 13:51:43 MDT
Created attachment 31127 [details]
slurmd.log-20230701.gz

Sorry about that.

See attached


Comment 8 Jason Booth 2023-07-06 14:50:50 MDT
Since you are using CR_Core_Memory with cgroups, the memory accounting is reported via cgroups, and the OOM event shows up in the node's messages log.

If the entire node were OOMing, you might need a MemSpecLimit to set memory aside for the system. However, looking over your logs, you have "RealMemory=62000" (MB) defined for the node, and the job requested 6144MB, which is a fraction of what is available.

I suspect you just mean the steps OOMed, since the cgroup memory management killed the step at 6290.596MB of usage.
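
For reference, the node-level sizing above lives in the node definition in slurm.conf. A minimal sketch, using node209's RealMemory value from this ticket (the MemSpecLimit value is illustrative only, not a recommendation for your site):

```shell
# slurm.conf node definition (RealMemory taken from this ticket's logs).
# MemSpecLimit reserves memory for slurmd and system daemons so jobs can
# never claim the whole node; it is only enforced when cgroup memory
# constraint is active (ConstrainRAMSpace=yes in cgroup.conf).
NodeName=node209 RealMemory=62000 MemSpecLimit=2000
```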


> Jun 26 00:52:27 node209 kernel: Memory cgroup stats for /slurm/uid_385621/job_10889643/step_batch: cache:860KB rss:6290596KB rss_huge:12288KB mapped_file:860KB swap:7511920KB inactive_anon:1252444KB active_anon:5039008KB inactive_file:0KB active_file:0KB unevictable:0KB
> Jun 26 00:52:27 node209 kernel: Killed process 6597 (julia), UID 385621, total-vm:15450864kB, anon-rss:6286304kB, file-rss:12180kB, shmem-rss:1128kB
> Jun 26 00:52:27 node209 kernel: Memory cgroup stats for /slurm/uid_385621/job_10889652/step_batch: cache:696KB rss:6290760KB rss_huge:10240KB mapped_file:696KB swap:7266924KB inactive_anon:1382032KB active_anon:4909424KB inactive_file:0KB active_file:0KB unevictable:0KB



> [2023-06-25T23:48:57.654] [10889643.batch] task/cgroup: _memcg_initialize: step: alloc=6144MB mem.limit=6144MB memsw.limit=unlimited
> [2023-06-26T00:52:40.914] [10889643.batch] task/cgroup: task_cgroup_memory_check_oom: StepId=10889643.batch hit memory limit at least once during execution. This may or may not result in some failure.
> [2023-06-26T00:52:40.924] [10889643.batch] error: Detected 1 oom-kill event(s) in StepId=10889643.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
> [2023-06-26T00:52:41.176] [10889643.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:35072
> [2023-06-26T00:52:41.748] [10889643.batch] done with job
> [2023-06-26T00:52:45.884] [10889643.extern] done with job
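
If accounting is enabled, the OOM kills above can also be confirmed after the fact from the accounting database; steps killed by the cgroup OOM handler report an OUT_OF_MEMORY state (the job ID below is the one from this ticket):

```shell
# Per-step state, peak resident memory, and requested memory for the job.
sacct -j 10889643 --format=JobID,State,MaxRSS,ReqMem,ExitCode
```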

It is important to note that the developer, or the software itself, should use caution when allocating memory: both need to take the memory requested through Slurm into account so they do not over-allocate it. Cgroup memory constraints are enforced outside of Slurm, which means Slurm is at the mercy of cgroups and systemd.

Would you check with the user or application to see whether the memory requested is accounted for in their application?
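
As one way to do that check, a batch script can read the Slurm-requested memory and cap the application below it. A minimal sketch, assuming a job submitted with --mem (SLURM_MEM_PER_NODE is a standard Slurm-exported variable; the 6144 fallback and 90% headroom factor are assumptions for illustration):

```shell
#!/bin/sh
# Pre-launch memory sanity check inside a batch script (sketch).
req_mb="${SLURM_MEM_PER_NODE:-6144}"   # MB requested from Slurm (fallback is illustrative)
safe_mb=$(( req_mb * 90 / 100 ))       # keep ~10% headroom below the cgroup limit
echo "requested=${req_mb}MB cap=${safe_mb}MB"
# Julia 1.9+ can be asked to keep its GC heap near a target size, e.g.:
#   julia --heap-size-hint=${safe_mb}M myscript.jl
```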
Comment 9 Jason Booth 2023-07-06 14:54:31 MDT
Please see comment#8. That update may not have been sent out correctly; however, the comment is viewable directly in Bugzilla.
Comment 10 Oriol Vilarrubi 2023-08-07 11:12:40 MDT
Hi Sean,

Do you have a response to Jason's comment#8?

This seems like an issue with the job allocating more memory than it requested. As Jason suggested, it would be a good idea to check the user's job submission.

Regards.
Comment 11 Sean Blackerby 2023-08-08 06:37:52 MDT
Hello

Thank you for checking back on this.

I believe the issue was faulty memory in a node in our mpi partition. Since replacing it, we have not seen a recurrence.


thanks
Comment 12 Oriol Vilarrubi 2023-08-08 09:07:15 MDT
Hi Sean,

Then I am closing this bug.

Regards.