| Summary: | Cgroup memory limits higher than AllocTRES, allowing system OOM | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Tony Racho <antonio-ii.racho> |
| Component: | Other | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | alex, antonio-ii.racho, csamuel, felip.moll, marshall, peter.maxwell |
| Version: | 20.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9229 https://bugs.schedmd.com/show_bug.cgi?id=12009 https://bugs.schedmd.com/show_bug.cgi?id=13879 | | |
| Site: | CRAY | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | NIWA/WELLINGTON | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 20.11.6 21.08pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Recent changes to take note of:

- The recent change to AllocTRES: https://github.com/SchedMD/slurm/commit/49a7d7f9fb9d554c3f51a33bc5de3bb3e9249a35
- Unchanged slurmd code calculating the memory limit independently: https://github.com/SchedMD/slurm/blob/46b4748c45e923d4a7006281c08920e413f27f82/src/slurmd/slurmd/req.c#L1187
- Related bug: 5562, raised by B. Gilmer (Cray)

My test script above doesn't work properly when JobAcctGatherType=jobacct_gather/cgroup, so I have since replaced
```
/sys/fs/cgroup/memory/$(cat /proc/$$/cgroup | awk -F: '/memory/ {print $3}')/memory.limit_in_bytes
```

with

```
/sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
```
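A small helper can make the reproduction script tolerant of either layout, trying the static Slurm hierarchy first and falling back to the path derived from `/proc/self/cgroup`. This is a sketch only, assuming cgroup v1 mounted at `/sys/fs/cgroup/memory` and the usual `SLURM_JOB_ID`/`UID` environment inside a job step; the function names are illustrative, not part of Slurm:

```shell
#!/bin/bash
# Sketch (hypothetical helper names): read this job's cgroup-v1 memory limit.
job_mem_limit_bytes() {
    # Static layout used by Slurm's cgroup plugins (assumption: cgroup v1).
    local static="/sys/fs/cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes"
    if [ -r "$static" ]; then
        cat "$static"
        return
    fi
    # Fallback: derive the relative path from this process's own cgroup entry.
    local rel
    rel=$(awk -F: '/memory/ {print $3}' /proc/$$/cgroup)
    cat "/sys/fs/cgroup/memory${rel}/memory.limit_in_bytes"
}

# Convert a byte count to whole GiB for comparison against --mem-per-cpu.
bytes_to_gib() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}
```

With the doubled-limit bug present, a job submitted with `--mem-per-cpu=25G` would report 50 from `bytes_to_gib "$(job_mem_limit_bytes)"` instead of 25.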
Hi there: Has this been looked into, please? Cheers, Tony

Regarding this minor aspect of the reported behaviour:

> Only 3 jobs start at once, the 4th waits 30 seconds, consistent with
> planning for 25 GB per job, and counting running jobs as consuming 25 GB,
> but testing for 50 GB free when trying to launch.

When we remove CR_ONE_TASK_PER_CORE this changes to a less surprising 4 jobs starting at once, consistent with 25 GB being used at all stages. The odd behaviour with CR_ONE_TASK_PER_CORE present probably belongs on its own ticket (or ticket 5562) rather than this one.

More importantly: with or without CR_ONE_TASK_PER_CORE, OTHER_CONS_RES, or cons_tres, the cgroup memory limits get set twice as high as they should be when --threads-per-core=1.

I can reproduce the issue and I'm working on the fix.

cheers, Marcin

The value applied to memory cgroups should be fixed by e8b7ec47f9ce803e5 [1]. That will be released in Slurm 20.11.6. I'm closing this report as fixed now.

cheers, Marcin

[1] https://github.com/SchedMD/slurm/commit/e8b7ec47f9ce803e5d810cb3f1f22452886575f5
```bash
#!/bin/bash
#SBATCH --reservation=POPS-1374   # A whole idle 105 GB node
#SBATCH --mem-per-cpu=25G         # Just under 1/4 of the memory
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --array=1-4               # should be able to run 4 of these at once.
#SBATCH --time=1

CGROUPMEM=$(cat /sys/fs/cgroup/memory/$(cat /proc/$$/cgroup | awk -F: '/memory/ {print $3}')/memory.limit_in_bytes)
echo "Job $SLURM_ARRAY_TASK_ID on $HOSTNAME at $(date) allowed to use $(($CGROUPMEM/(1024**3))) GB"
sleep 30  # so we can see the overlap of these 4 jobs
```

Result on Mahuika, with:

- 20.11.4,
- cons_res,
- CR_Core_Memory,CR_ONE_TASK_PER_CORE,OTHER_CONS_RES,
- hyperthreading enabled.

Only 3 jobs start at once, the 4th waits 30 seconds, consistent with planning for 25 GB per job, and counting running jobs as consuming 25 GB, but testing for 50 GB free when trying to launch.

ReqTRES and AllocTRES show 25 GB, but their cgroup memory.limit_in_bytes is 50 GB. 3 jobs * 50 GB/job = 150 GB > 105 GB. So there is no protection against system OOM.
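The overcommit arithmetic above can be checked mechanically by comparing the cgroup limit each job actually receives against the memory it requested. A minimal sketch, with a hypothetical helper name and byte counts passed in explicitly (inside a real job, the limit would come from memory.limit_in_bytes as in the script above):

```shell
#!/bin/bash
# Sketch (hypothetical helper): flag the doubled-limit symptom by comparing
# the enforced cgroup limit against the requested memory, both in bytes.
check_limit() {
    local req_bytes=$1 limit_bytes=$2
    if [ "$limit_bytes" -gt "$req_bytes" ]; then
        # The bug reported here: cgroup enforces more than was accounted for,
        # so the node's memory can be oversubscribed.
        echo "MISMATCH: cgroup allows $limit_bytes > requested $req_bytes"
    else
        echo "OK"
    fi
}

# Example from the report: 25 GiB requested, 50 GiB enforced.
check_limit $((25 * 1024**3)) $((50 * 1024**3))
```

With the buggy behaviour this prints the MISMATCH line; after the fix in 20.11.6 the two values should agree and it prints OK.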