Created attachment 14658 [details]
configuration files

I have a node with 4 GPUs, 64 threads, and 768GB RAM. I start a job with, e.g.:

  sbatch --gpus=1 --mem-per-gpu 191000M job_test.sh

and the output of "scontrol show jobid -d JOBID" looks correct. It shows 1 GPU, 16 cores, and 191000M of RAM:

  TRES=cpu=16,mem=191000M,node=1,billing=8,gres/gpu=1
  JOB_GRES=gpu:1
    Nodes=alvis1-13 CPU_IDs=0-15 Mem=191000 GRES=gpu:1(IDX:0)

However, going into the node and checking the actual limit in the cgroup:

  $ cat /sys/fs/cgroup/memory/slurm/uid_156653/job_101/memory.limit_in_bytes
  810009231360

which is 772485MB, i.e. the whole node's memory -- so the job is clearly not limited to 191000MB. Confirmed via the slurmd log:

  [2020-06-12T22:21:40.806] [107.extern] task/cgroup: /slurm/uid_156653/job_107: alloc=0MB mem.limit=772485MB memsw.limit=772485MB
  [2020-06-12T22:21:40.812] [107.extern] task/cgroup: /slurm/uid_156653/job_107/step_extern: alloc=0MB mem.limit=772485MB memsw.limit=772485MB

I use:

  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK

and cgroup.conf enables constraining CPUs, RAM, devices, and swap.

If I use "sbatch --gpus=1 --mem 191000M job_test.sh" instead, the limits are set correctly. This seems like a bug to me, or have I misconfigured something?
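For anyone checking the same symptom: the mismatch is easy to confirm with a byte-to-MiB conversion of the cgroup value. A minimal sketch, with the numbers hard-coded from my node above (on another system you would read the value from your own job's cgroup path instead):

```shell
# Sketch: compare the observed cgroup limit against the requested --mem-per-gpu.
# Values hard-coded from the report above; normally the first one comes from
#   cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.limit_in_bytes
observed_bytes=810009231360   # cgroup memory.limit_in_bytes
requested_mb=191000           # sbatch --gpus=1 --mem-per-gpu 191000M

observed_mb=$(( observed_bytes / 1024 / 1024 ))
echo "observed limit:  ${observed_mb} MB"    # 772485 MB = whole-node RAM
echo "requested limit: ${requested_mb} MB"
if [ "$observed_mb" -ne "$requested_mb" ]; then
  echo "MISMATCH: cgroup limit does not reflect --mem-per-gpu"
fi
```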
I can reproduce this. It definitely looks like a bug. Here's my reproducer with a few notes:

  $ sbatch --gpus=1 --mem-per-gpu=233 -Dtmp --wrap='srun whereami 6000'
  Submitted batch job 160
  $ squeue
    JOBID PARTITION  NAME      USER  ST  TIME  NODES NODELIST(REASON)
      160     debug  wrap  marshall   R  0:01      1 n1-1
  $ scontrol -d show job 160
  JobId=160 JobName=wrap
     JobState=RUNNING Reason=None Dependency=(null)
     NodeList=n1-1
     BatchHost=n1-1
     NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
     TRES=cpu=2,mem=233M,node=1,billing=2
     JOB_GRES=gpu:1
       Nodes=n1-1 CPU_IDs=0-1 Mem=233 GRES=gpu:1(IDX:0)
     MemPerTres=gpu:233
     TresPerJob=gpu:1

(only showing the relevant parts of the scontrol output)

  marshall@voyager:/sys/fs/cgroup/memory/slurm_n1-1/uid_1017/job_160$ cat memory.limit_in_bytes
  16648241152

That looks very close to my actual system's physical memory (16 GB):

  $ free -b
          total  used  free  shared  buff/cache  available
  Mem:    16649220096

Even though I've configured my node to be at 8000 MB in slurm.conf:

  $ scontrol show nodes n1-1 | grep -i memory
     RealMemory=8000 AllocMem=466 FreeMem=1637 Sockets=1 Boards=1

  # slurm.conf
  NodeName=DEFAULT RealMemory=8000 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 \

Here I use --mem (memory per node) instead, and Slurm sets the cgroup memory limit correctly:

  $ sbatch --mem=233 -Dtmp --wrap='srun whereami 6000'
  Submitted batch job 161
  $ squeue
    JOBID PARTITION  NAME      USER  ST  TIME  NODES NODELIST(REASON)
      161     debug  wrap  marshall   R  0:42      1 n1-1
      160     debug  wrap  marshall   R  2:02      1 n1-1
  $ scontrol -d show job 161
  JobId=161 JobName=wrap
     JobState=RUNNING Reason=None Dependency=(null)
     NodeList=n1-1
     BatchHost=n1-1
     NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
     TRES=cpu=2,mem=233M,node=1,billing=2
     JOB_GRES=(null)
       Nodes=n1-1 CPU_IDs=2-3 Mem=233 GRES=
     MinCPUsNode=1 MinMemoryNode=233M MinTmpDiskNode=0

  marshall@voyager:/sys/fs/cgroup/memory/slurm_n1-1/uid_1017/job_161$ cat memory.limit_in_bytes
  244318208
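The same byte-to-MiB arithmetic applied to the two jobs above makes the difference plain. A small sketch, with the byte values copied from the outputs above:

```shell
# Sketch comparing the cgroup limits of job 160 (--mem-per-gpu=233) and
# job 161 (--mem=233), using the memory.limit_in_bytes values shown above.
mem_per_gpu_job_bytes=16648241152   # job 160, submitted with --mem-per-gpu=233
mem_job_bytes=244318208             # job 161, submitted with --mem=233

echo "job 160 limit: $(( mem_per_gpu_job_bytes / 1024 / 1024 )) MiB"  # 15877 MiB, ~ all physical RAM
echo "job 161 limit: $(( mem_job_bytes / 1024 / 1024 )) MiB"          # 233 MiB, exactly as requested
```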
As I've been researching this issue I've discovered this problem will be more complicated than I initially thought. For now I recommend not using --mem-per-gpu.
Just so you know, I haven't forgotten about this. I'm hoping to get this done by the time we release 20.11 (though I'd like to aim for a fix in both 20.02 and 20.11).
Update - I've made good progress on this bug. I have a set of patches that appear to be working. They fix a few issues, including enforcement of memory limits. There are a lot of changes (23 files so far) and I still have several more things to fix, and then the patches still need to go through peer review and QA. Unfortunately, this definitely won't make it into 20.11 by the time it's released. Thanks for your patience on this one.
*** Ticket 8248 has been marked as a duplicate of this ticket. ***
*** Ticket 9950 has been marked as a duplicate of this ticket. ***
Mikael,

We've finally pushed a number of commits to fix --mem-per-gpu. These fixes are all in 21.08. There were two classes of fixes:

(1) Fixing --mem-per-gpu for the job. We fixed this at the same time as other issues where the memory cgroup was incorrect for multi-node jobs. This was also problematic when using --mem-per-cpu under some circumstances. We did this by restructuring how we pass the job memory allocation from slurmctld to the various slurmd's for the job. This was done in bug 11367, which is an internal bug, so I'm copying the main fixes here:

  Commits b393a7ab6..b6f93b76eb
  Commit 26eadbee8368

(2) Fixing --mem-per-gpu for the step. This was done in commits d92a1e73f05d48..399566f785657 (where commit d92a1e73f05d48 is the main commit that fixes things).

Thanks so much for your patience on this one. It involved quite a lot of restructuring of how we handle GRES, and was done on top of a lot of other GRES fixes and refactoring in 21.08. Because of this it won't be possible to backport the fixes to 20.11 or any earlier version.

Feel free to use --mem-per-gpu in 21.08 and onward, and please submit new bugs for any issues you find with it moving forward.

Thanks again for filing this ticket. I'm closing this as resolved/fixed.
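For anyone verifying the fix on 21.08: with per-GPU memory, the job's cgroup limit should be mem-per-gpu multiplied by the number of allocated GPUs. A hedged sketch of that expectation (the request values here are hypothetical, not from this ticket):

```shell
# Sketch of the expected cgroup limit on 21.08+ for a hypothetical request:
#   sbatch --gpus=2 --mem-per-gpu=191000M job_test.sh
gpus=2
mem_per_gpu_mb=191000

expected_mb=$(( gpus * mem_per_gpu_mb ))
expected_bytes=$(( expected_mb * 1024 * 1024 ))
echo "expected memory.limit_in_bytes: ${expected_bytes}"  # 400556032000
# Compare against the live value on the allocated node:
#   cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.limit_in_bytes
```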
*** Ticket 12683 has been marked as a duplicate of this ticket. ***