#!/bin/bash
#SBATCH --reservation=POPS-1374   # A whole idle 105 GB node
#SBATCH --mem-per-cpu=25G         # Just under 1/4 of the memory
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --array=1-4               # should be able to run 4 of these at once.
#SBATCH --time=1

CGROUPMEM=$(cat /sys/fs/cgroup/memory/$(cat /proc/$$/cgroup | awk -F: '/memory/ {print $3}')/memory.limit_in_bytes)
echo "Job $SLURM_ARRAY_TASK_ID on $HOSTNAME at $(date) allowed to use $(($CGROUPMEM/(1024**3))) GB"
sleep 30   # so we can see the overlap of these 4 jobs

# Result on Mahuika, with
# - 20.11.4,
# - cons_res
# - CR_Core_Memory,CR_ONE_TASK_PER_CORE,OTHER_CONS_RES
# - hyperthreading enabled
#
# Only 3 jobs start at once; the 4th waits 30 seconds, consistent with planning
# for 25 GB per job, and counting running jobs as consuming 25 GB, but testing
# for 50 GB free when trying to launch.
#
# ReqTRES and AllocTRES show 25 GB, but their cgroup memory.limit_in_bytes is
# 50 GB.
#
# 3 jobs * 50 GB/job = 150 GB > 105 GB. So no protection against system OOM.
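The overcommit arithmetic above can be checked with a few lines of shell. This is only a worked restatement of the observed numbers; nothing here queries Slurm or the cgroups themselves:

```shell
# Numbers observed in the result above: 3 jobs running at once, each given
# a 50 GB cgroup limit (twice the 25 GB requested), on a 105 GB node.
node_mem_gb=105
jobs_running=3
cgroup_limit_gb=50   # observed memory.limit_in_bytes per job, in GB

total_gb=$((jobs_running * cgroup_limit_gb))
echo "sum of cgroup limits: ${total_gb} GB on a ${node_mem_gb} GB node"
if [ "$total_gb" -gt "$node_mem_gb" ]; then
    # The per-job limits sum to more than physical memory, so they cannot
    # protect the node from a system-wide OOM.
    echo "overcommitted"
fi
```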
Recent changes to take note of:
- The recent change to AllocTRES: https://github.com/SchedMD/slurm/commit/49a7d7f9fb9d554c3f51a33bc5de3bb3e9249a35
- Unchanged slurmd code calculating the memory limit independently: https://github.com/SchedMD/slurm/blob/46b4748c45e923d4a7006281c08920e413f27f82/src/slurmd/slurmd/req.c#L1187
Related bug: #5562, raised by B. Gilmer (Cray)
My test script above doesn't work properly when JobAcctGatherType=jobacct_gather/cgroup, so I have since replaced

  /sys/fs/cgroup/memory/$(cat /proc/$$/cgroup | awk -F: '/memory/ {print $3}')/memory.limit_in_bytes

with

  /sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
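The two lookups could be combined into one helper that prefers Slurm's own hierarchy and falls back to the process's memory cgroup. This is a sketch, not part of the original test script: the helper name is hypothetical, it assumes the cgroup v1 layout described above, and it takes the cgroup root as a parameter so the logic can be exercised outside a Slurm job:

```shell
# Sketch: read this job's memory cgroup limit under either accounting setup.
# $1 is the memory cgroup root; on a real node it would be
# /sys/fs/cgroup/memory (cgroup v1 layout assumed).
cgroup_limit_bytes() {
    local root="$1"
    local slurm_path="$root/slurm/uid_${UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes"
    if [ -r "$slurm_path" ]; then
        # Layout used when JobAcctGatherType=jobacct_gather/cgroup
        cat "$slurm_path"
    else
        # Otherwise follow this process's own memory cgroup from /proc/$$/cgroup
        local rel
        rel=$(awk -F: '/memory/ {print $3}' /proc/$$/cgroup)
        cat "$root${rel}/memory.limit_in_bytes"
    fi
}
```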
Hi there: Has this been looked into, please? Cheers, Tony
Regarding this minor aspect of the reported behaviour:

> Only 3 jobs start at once, the 4th waits 30 seconds, consistent with
> planning for 25 GB per job, and counting running jobs as consuming 25 GB,
> but testing for 50 GB free when trying to launch.

When we remove CR_ONE_TASK_PER_CORE this changes to a less surprising 4 jobs starting at once, consistent with 25 GB being used at all stages. The odd behaviour with CR_ONE_TASK_PER_CORE present probably belongs on its own ticket (or ticket 5562) rather than this one.

More importantly: with or without CR_ONE_TASK_PER_CORE, OTHER_CONS_RES, or cons_tres, the cgroup memory limits get set twice as high as they should be when --threads-per-core=1.
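The factor of two matches the node's hyperthreading. A sketch of the suspected arithmetic, purely an assumption inferred from the observed 25 GB requested versus 50 GB applied: the limit appears to be sized from the hardware threads of the allocated core rather than from the CPUs the job asked for, so --threads-per-core=1 on a 2-thread core doubles it:

```shell
# Suspected (not confirmed) sizing of the cgroup limit on a hyperthreaded
# node when the job requests --threads-per-core=1.
mem_per_cpu_gb=25
cpus_per_task=1
hw_threads_per_core=2   # hyperthreading enabled on the node

requested_gb=$((mem_per_cpu_gb * cpus_per_task))
# Limit sized from both hardware threads of the allocated core:
applied_gb=$((mem_per_cpu_gb * cpus_per_task * hw_threads_per_core))
echo "requested: ${requested_gb} GB, cgroup limit applied: ${applied_gb} GB"
```

With the observed numbers this reproduces 25 GB requested and a 50 GB limit applied.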
I can reproduce the issue and I'm working on the fix. cheers, Marcin
The value applied to memory cgroups should be fixed by e8b7ec47f9ce803e5 [1]. That will be released in Slurm 20.11.6.

I'm closing this report as fixed now.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/e8b7ec47f9ce803e5d810cb3f1f22452886575f5