| Summary: | Cgroup memory limits higher than AllocTRES, allowing system OOM | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Tony Racho <antonio-ii.racho> |
| Component: | Other | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | alex, antonio-ii.racho, csamuel, felip.moll, marshall, peter.maxwell |
| Version: | 20.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9229 https://bugs.schedmd.com/show_bug.cgi?id=12009 https://bugs.schedmd.com/show_bug.cgi?id=13879 | | |
| Site: | CRAY | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | NIWA/WELLINGTON | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 20.11.6 21.08pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Recent changes to take note of:

- The recent change to AllocTRES: https://github.com/SchedMD/slurm/commit/49a7d7f9fb9d554c3f51a33bc5de3bb3e9249a35
- Unchanged slurmd code calculating the memory limit independently: https://github.com/SchedMD/slurm/blob/46b4748c45e923d4a7006281c08920e413f27f82/src/slurmd/slurmd/req.c#L1187
- Related bug: 5562, raised by B. Gilmer (Cray)

My test script above doesn't work properly when JobAcctGatherType=jobacct_gather/cgroup, so I have since replaced
```
/sys/fs/cgroup/memory/$(cat /proc/$$/cgroup | awk -F: '/memory/ {print $3}')/memory.limit_in_bytes
```

with

```
/sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
```
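A small helper can make the reproduction script tolerant of either layout, trying the static Slurm hierarchy first and falling back to the path derived from `/proc/self/cgroup`. This is a sketch only, assuming cgroup v1 mounted at `/sys/fs/cgroup/memory` and the usual `SLURM_JOB_ID`/`UID` environment inside a job step; the function names are illustrative, not part of Slurm:

```shell
#!/bin/bash
# Sketch (hypothetical helper names): read this job's cgroup-v1 memory limit.
job_mem_limit_bytes() {
    # Static layout used by Slurm's cgroup plugins (assumption: cgroup v1).
    local static="/sys/fs/cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes"
    if [ -r "$static" ]; then
        cat "$static"
        return
    fi
    # Fallback: derive the relative path from this process's own cgroup entry.
    local rel
    rel=$(awk -F: '/memory/ {print $3}' /proc/$$/cgroup)
    cat "/sys/fs/cgroup/memory${rel}/memory.limit_in_bytes"
}

# Convert a byte count to whole GiB for comparison against --mem-per-cpu.
bytes_to_gib() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}
```

With the doubled-limit bug present, a job submitted with `--mem-per-cpu=25G` would report 50 from `bytes_to_gib "$(job_mem_limit_bytes)"` instead of 25.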
Hi there: Has this been looked into, please? Cheers, Tony

Regarding this minor aspect of the reported behaviour:

> Only 3 jobs start at once, the 4th waits 30 seconds, consistent with
> planning for 25 GB per job, and counting running jobs as consuming 25 GB,
> but testing for 50 GB free when trying to launch.

When we remove CR_ONE_TASK_PER_CORE this changes to a less surprising 4 jobs starting at once, consistent with 25 GB being used at all stages. The odd behaviour with CR_ONE_TASK_PER_CORE present probably belongs on its own ticket (or ticket 5562) rather than this one.

More importantly: with or without CR_ONE_TASK_PER_CORE, OTHER_CONS_RES, or cons_tres, the cgroup memory limits get set twice as high as they should be when --threads-per-core=1.

I can reproduce the issue and I'm working on the fix.

cheers, Marcin

The value applied to memory cgroups should be fixed by e8b7ec47f9ce803e5 [1]. That will be released in Slurm 20.11.6. I'm closing this report as fixed now.

cheers, Marcin

[1] https://github.com/SchedMD/slurm/commit/e8b7ec47f9ce803e5d810cb3f1f22452886575f5
```bash
#!/bin/bash
#SBATCH --reservation=POPS-1374   # A whole idle 105 GB node
#SBATCH --mem-per-cpu=25G         # Just under 1/4 of the memory
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --array=1-4               # should be able to run 4 of these at once.
#SBATCH --time=1

CGROUPMEM=$(cat /sys/fs/cgroup/memory/$(cat /proc/$$/cgroup | awk -F: '/memory/ {print $3}')/memory.limit_in_bytes)
echo "Job $SLURM_ARRAY_TASK_ID on $HOSTNAME at $(date) allowed to use $(($CGROUPMEM/(1024**3))) GB"
sleep 30  # so we can see the overlap of these 4 jobs
```

Result on Mahuika, with:

- 20.11.4,
- cons_res,
- CR_Core_Memory,CR_ONE_TASK_PER_CORE,OTHER_CONS_RES,
- hyperthreading enabled.

Only 3 jobs start at once, the 4th waits 30 seconds, consistent with planning for 25 GB per job, and counting running jobs as consuming 25 GB, but testing for 50 GB free when trying to launch.

ReqTRES and AllocTRES show 25 GB, but their cgroup memory.limit_in_bytes is 50 GB. 3 jobs * 50 GB/job = 150 GB > 105 GB. So there is no protection against system OOM.
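The overcommit arithmetic above can be checked mechanically by comparing the cgroup limit each job actually receives against the memory it requested. A minimal sketch, with a hypothetical helper name and byte counts passed in explicitly (inside a real job, the limit would come from memory.limit_in_bytes as in the script above):

```shell
#!/bin/bash
# Sketch (hypothetical helper): flag the doubled-limit symptom by comparing
# the enforced cgroup limit against the requested memory, both in bytes.
check_limit() {
    local req_bytes=$1 limit_bytes=$2
    if [ "$limit_bytes" -gt "$req_bytes" ]; then
        # The bug reported here: cgroup enforces more than was accounted for,
        # so the node's memory can be oversubscribed.
        echo "MISMATCH: cgroup allows $limit_bytes > requested $req_bytes"
    else
        echo "OK"
    fi
}

# Example from the report: 25 GiB requested, 50 GiB enforced.
check_limit $((25 * 1024**3)) $((50 * 1024**3))
```

With the buggy behaviour this prints the MISMATCH line; after the fix in 20.11.6 the two values should agree and it prints OK.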