Ticket 11148 - Cgroup memory limits higher than AllocTRES, allowing system OOM
Summary: Cgroup memory limits higher than AllocTRES, allowing system OOM
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.11.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-03-18 20:52 MDT by Tony Racho
Modified: 2022-04-22 18:18 MDT
6 users

See Also:
Site: CRAY
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: NIWA/WELLINGTON
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.11.6, 21.08pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Tony Racho 2021-03-18 20:52:35 MDT
#!/bin/bash
#SBATCH --reservation=POPS-1374    # A whole idle 105 GB node
#SBATCH --mem-per-cpu=25G          # Just under 1/4 of the memory
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --array=1-4                # should be able to run 4 of these at once.
#SBATCH --time=1
CGROUPMEM=$(cat /sys/fs/cgroup/memory/$(cat /proc/$$/cgroup | awk -F: '/memory/ {print $3}')/memory.limit_in_bytes)
echo "Job $SLURM_ARRAY_TASK_ID on $HOSTNAME at $(date) allowed to use $(($CGROUPMEM/(1024**3))) GB"
sleep 30 # so we can see the overlap of these 4 jobs
# Result on Mahuika, with 
#  - 20.11.4, 
#  - cons_res
#  - CR_Core_Memory,CR_ONE_TASK_PER_CORE,OTHER_CONS_RES
#  - hyperthreading enabled
#
# Only 3 jobs start at once, the 4th waits 30 seconds, consistent with planning 
# for 25 GB per job, and counting running jobs as consuming 25 GB, but testing 
# for 50 GB free when trying to launch.
# ReqTRES and AllocTRES show 25 GB, but their cgroup memory.limit_in_bytes is 
# 50 GB.  
# 3 jobs * 50 GB/job = 150 GB > 105 GB. So no protection against system OOM.
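The overcommit arithmetic in the description can be sketched in a few lines of shell (values taken from this report; the factor-of-two cgroup limit is the observed behaviour, not anything documented):

```shell
#!/bin/bash
# Overcommit arithmetic from this report: a 105 GB node, 25 GB requested
# per array task, but a 50 GB cgroup limit actually applied per job.
node_mem_gb=105
requested_gb=25
cgroup_limit_gb=$((2 * requested_gb))  # observed: twice the request
concurrent_jobs=3                      # jobs the scheduler starts at once

planned=$((concurrent_jobs * requested_gb))
allowed=$((concurrent_jobs * cgroup_limit_gb))
echo "scheduler plans for: ${planned} GB (fits in ${node_mem_gb} GB)"
echo "cgroups allow:       ${allowed} GB (> ${node_mem_gb} GB, so no OOM protection)"
```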
Comment 1 Tony Racho 2021-03-18 20:53:05 MDT
Recent changes to take note of:

The recent change to AllocTRES:
- https://github.com/SchedMD/slurm/commit/49a7d7f9fb9d554c3f51a33bc5de3bb3e9249a35
The unchanged slurmd code calculating the memory limit independently:
- https://github.com/SchedMD/slurm/blob/46b4748c45e923d4a7006281c08920e413f27f82/src/slurmd/slurmd/req.c#L1187
Comment 2 Tony Racho 2021-03-18 20:53:27 MDT
Related ticket:

#5562 - raised by B. Gilmer (Cray)
Comment 3 Peter Maxwell 2021-03-22 02:08:00 MDT
My test script above doesn't work properly when JobAcctGatherType=jobacct_gather/cgroup. So I have since replaced

/sys/fs/cgroup/memory/$(cat /proc/$$/cgroup | awk -F: '/memory/ {print $3}')/memory.limit_in_bytes

with 

/sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
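With that substitution, the probe reads the job-level limit directly from the Slurm cgroup hierarchy rather than parsing /proc/$$/cgroup. A minimal sketch of the revised line, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory (the fallback IDs here are placeholders for running outside a job):

```shell
#!/bin/bash
# Build the job-level cgroup path from the Slurm-provided IDs. Outside a
# real job, placeholder IDs stand in for $UID and $SLURM_JOB_ID.
uid=${UID:-1000}
job_id=${SLURM_JOB_ID:-12345}
limit_file="/sys/fs/cgroup/memory/slurm/uid_${uid}/job_${job_id}/memory.limit_in_bytes"
echo "would read: ${limit_file}"
# Inside a job, with jobacct_gather/cgroup enabled:
#   CGROUPMEM=$(cat "$limit_file")
#   echo "allowed to use $((CGROUPMEM / 1024**3)) GB"
```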
Comment 4 Tony Racho 2021-03-22 20:40:20 MDT
Hi there:

Has this been looked into, please?

Cheers,
Tony
Comment 5 Peter Maxwell 2021-03-22 21:01:54 MDT
Regarding this minor aspect of the reported behaviour:
> Only 3 jobs start at once, the 4th waits 30 seconds, consistent with 
> planning for 25 GB per job, and counting running jobs as consuming 25 GB, 
> but testing for 50 GB free when trying to launch.
When we remove CR_ONE_TASK_PER_CORE this changes to a less surprising 4 jobs starting at once, consistent with 25 GB being used at all stages. The odd behaviour with CR_ONE_TASK_PER_CORE present probably belongs on its own ticket (or ticket 5562) rather than this one.

More importantly: With or without CR_ONE_TASK_PER_CORE, OTHER_CONS_RES, or cons_tres, the cgroup memory limits get set twice as high as they should be when --threads-per-core=1.
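A hedged sketch of the suspected miscount, assuming the doubling comes from slurmd multiplying mem-per-cpu by the hardware-thread count of the allocated core rather than the billed CPU count (the numbers in this report are consistent with that, but the actual fix may describe the cause differently):

```shell
#!/bin/bash
# With 2 hardware threads per core and --threads-per-core=1, the job is
# billed 1 CPU (AllocTRES) but owns both threads of its core. The observed
# cgroup limit matches mem-per-cpu multiplied by the thread count.
mem_per_cpu_gb=25
billed_cpus=1        # what ReqTRES / AllocTRES account for
threads_per_core=2   # hardware threads on a hyperthreaded node

expected=$((mem_per_cpu_gb * billed_cpus))
observed=$((mem_per_cpu_gb * billed_cpus * threads_per_core))
echo "AllocTRES memory: ${expected} GB"
echo "cgroup limit set: ${observed} GB (twice as high)"
```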
Comment 6 Marcin Stolarek 2021-03-23 03:58:38 MDT
I can reproduce the issue and I'm working on the fix.

cheers,
Marcin
Comment 17 Marcin Stolarek 2021-04-14 01:30:23 MDT
The value applied to memory cgroups is fixed by commit e8b7ec47f9ce803e5[1], which will be released in Slurm 20.11.6.

I'm closing this report as fixed now.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/e8b7ec47f9ce803e5d810cb3f1f22452886575f5