| Summary: | Slurm detecting OOM issues lately | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Tana Vinod <vinodkumar.tana> |
| Component: | Other | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | felip.moll, vinodkumar.tana |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Cerence AI | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
> Slurm Version-20.02.7
The OOM code has been improved since 20.02. You should upgrade to take advantage of the fixes we have made around OOM detection. I will have Oriol follow up with you should you have additional questions.
Thanks Jason. We will test with the latest version and get back to you.

Hello Tana, in case you are not already doing this in your 20.02 installation, we recommend using the cgroup task plugin rather than jobacct_gather/linux for OOM control, as explained here: https://slurm.schedmd.com/slurm.conf.html#OPT_OverMemoryKill. Would you agree to closing this bug? If you still see OOM-related issues when testing with a recent Slurm version, please reopen it. Regards.

Hi, we are already using JobAcctGatherType=jobacct_gather/cgroup in the current environment. You may close this ticket, as we are planning to move to a newer version of Slurm.

Resolving.
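For reference, the cgroup-based setup recommended above typically involves entries like the following in slurm.conf and cgroup.conf. This is a minimal sketch, not a copy of this site's configuration; whether swap constraining is appropriate depends on the cluster:

```
# slurm.conf -- use the cgroup task plugin so OOM events are caught by
# the kernel's cgroup OOM handler rather than jobacct_gather sampling
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup

# cgroup.conf -- enforce the job's requested memory via the memory cgroup
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
```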
Hello Team, some jobs are being killed with OOM errors, sometimes only after several hours or days. Any idea why it takes Slurm so long to detect the OOM event? We can reproduce this issue. Is there a way to check the real-time memory usage of a job between the time it starts and the time it ends with OOM?

Slurm Version: 20.02.7

For example, job 51427778:

```
[2023-01-13T11:00:20.680] [51427778.0] _oom_event_monitor: oom-kill event count: 1
[2023-01-13T11:00:20.998] [51427778.0] Step 51427778.0 hit memory+swap limit at least once during execution. This may or may not result in some failure.
[2023-01-13T11:00:20.999] [51427778.0] error: Detected 1 oom-kill event(s) in step 51427778.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
```

```
# sacct -j 51427778 --format=start,end
              Start                 End
------------------- -------------------
2023-01-13T04:00:14 2023-01-13T11:00:21
2023-01-13T04:00:14 2023-01-13T11:00:21
2023-01-13T04:00:15 2023-01-13T11:00:21
```

```
[root@cn013 ~]# seff 51427778
Job ID: 51427778
Cluster: crg2
User/Group: fuqiang_luo/users
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 06:57:30
CPU Efficiency: 99.38% of 07:00:07 core-walltime
Job Wall-clock time: 07:00:07
Memory Utilized: 3.75 GB
Memory Efficiency: 125.00% of 3.00 GB
```

FYI: we do not have real-time monitoring graphs for memory/CPU usage of jobs. We are planning to implement them with the help of Slurm support next month.
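On the real-time question: while a step is still running, Slurm's `sstat` can report live accounting data, e.g. `sstat -j 51427778.0 --format=MaxRSS,AveRSS` (the exact step ID here is taken from the log above). The "Memory Efficiency" figure that seff prints is just peak usage as a percentage of the requested limit; a minimal sketch of that arithmetic, with values copied from the seff output above and a function name of my own choosing:

```python
def memory_efficiency(used_gb: float, requested_gb: float) -> float:
    """Memory utilization as a percentage of the requested limit,
    the way seff reports it (values over 100% mean the job exceeded
    its request, which is what triggers the cgroup OOM kill)."""
    return 100.0 * used_gb / requested_gb

# Values from the seff output: 3.75 GB utilized vs. 3.00 GB requested.
print(f"{memory_efficiency(3.75, 3.00):.2f}%")  # 125.00%
```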