Ticket 9229

Summary: cgroup not matching --mem-per-gpu
Product: Slurm Reporter: Mikael Öhman <mikael.ohman>
Component: GPU Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: --- CC: alex, cinek, x.huang
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=11148
https://bugs.schedmd.com/show_bug.cgi?id=11367
https://bugs.schedmd.com/show_bug.cgi?id=11834
https://bugs.schedmd.com/show_bug.cgi?id=11835
https://bugs.schedmd.com/show_bug.cgi?id=11837
https://bugs.schedmd.com/show_bug.cgi?id=11949
https://bugs.schedmd.com/show_bug.cgi?id=12203
Site: SNIC
SNIC sites: C3SE
Linux Distro: CentOS
Version Fixed: 21.08.0rc2
Ticket Blocks: 11837    
Attachments: configuration files

Description Mikael Öhman 2020-06-12 14:26:52 MDT
Created attachment 14658 [details]
configuration files

I have a node with 4 GPUs, 64 threads, and 768GB RAM.
I start a job with, e.g.
  sbatch --gpus=1 --mem-per-gpu 191000M job_test.sh
and 
  scontrol show jobid -d JOBID
looks correct. It shows 1 GPU, 16 cores, and 191000M RAM.

   TRES=cpu=16,mem=191000M,node=1,billing=8,gres/gpu=1
   JOB_GRES=gpu:1
     Nodes=alvis1-13 CPU_IDs=0-15 Mem=191000 GRES=gpu:1(IDX:0)

Going into the node and checking the actual limit in the cgroup:

$ cat /sys/fs/cgroup/memory/slurm/uid_156653/job_101/memory.limit_in_bytes
810009231360

which is 772485 MB, so clearly not limited to 191000 MB.
Confirmed via slurmd log:
[2020-06-12T22:21:40.806] [107.extern] task/cgroup: /slurm/uid_156653/job_107: alloc=0MB mem.limit=772485MB memsw.limit=772485MB
[2020-06-12T22:21:40.812] [107.extern] task/cgroup: /slurm/uid_156653/job_107/step_extern: alloc=0MB mem.limit=772485MB memsw.limit=772485MB
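For reference, the byte value from the cgroup can be converted back to MB to confirm the mismatch. A minimal shell sketch, using the numbers from this report (the paths and values are specific to the job above, not universal):

```shell
# Sketch: convert the cgroup limit (bytes) to MB and compare it against
# the requested --mem-per-gpu value. Numbers taken from the report above.
requested_mb=191000
limit_bytes=810009231360          # cat .../job_101/memory.limit_in_bytes
limit_mb=$(( limit_bytes / 1024 / 1024 ))
echo "cgroup limit: ${limit_mb} MB, requested: ${requested_mb} MB"
if [ "$limit_mb" -gt "$requested_mb" ]; then
  echo "memory limit NOT enforced"
fi
```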

I use:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
and cgroup.conf enables constraints cpus, ram, devices, swap. 


If I use
  sbatch --gpus=1 --mem 191000M job_test.sh
instead, the limits are set correctly.
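Since --mem (memory per node) is enforced correctly, the per-node value can be derived from the desired per-GPU memory by hand. A hypothetical wrapper sketch (GPUS, MB_PER_GPU, and job_test.sh are illustrative placeholders, not Slurm options):

```shell
# Workaround sketch: compute an equivalent --mem request from the
# intended per-GPU memory, then submit with --mem instead of --mem-per-gpu.
GPUS=1
MB_PER_GPU=191000
MEM_MB=$(( GPUS * MB_PER_GPU ))
# Echoed rather than executed here; drop the echo to actually submit.
echo sbatch --gpus="$GPUS" --mem="${MEM_MB}M" job_test.sh
```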

This seems like a bug to me, or have I misconfigured something?
Comment 1 Marshall Garey 2020-06-15 15:23:52 MDT
I can reproduce this. It definitely looks like a bug. Here's my reproducer with a few notes:


$ sbatch --gpus=1 --mem-per-gpu=233 -Dtmp --wrap='srun whereami 6000'   
Submitted batch job 160
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               160     debug     wrap marshall  R       0:01      1 n1-1 

$ scontrol -d show job 160
JobId=160 JobName=wrap
   JobState=RUNNING Reason=None Dependency=(null)
   NodeList=n1-1
   BatchHost=n1-1
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=233M,node=1,billing=2
   JOB_GRES=gpu:1
     Nodes=n1-1 CPU_IDs=0-1 Mem=233 GRES=gpu:1(IDX:0)
   MemPerTres=gpu:233
   TresPerJob=gpu:1

(only showing relevant parts of the scontrol output)

marshall@voyager:/sys/fs/cgroup/memory/slurm_n1-1/uid_1017/job_160$ cat memory.limit_in_bytes 
16648241152

That looks very close to my actual system's physical memory (16 GB):

$ free -b
              total        used        free      shared  buff/cache   available
Mem:    16649220096


Even though I've configured my node to be at 8000 MB in slurm.conf:

$ scontrol show nodes n1-1 | grep -i memory
   RealMemory=8000 AllocMem=466 FreeMem=1637 Sockets=1 Boards=1

# slurm.conf
NodeName=DEFAULT RealMemory=8000 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 \
Here I use --mem (memory per node) and Slurm sets the cgroup memory limit correctly.

$ sbatch --mem=233 -Dtmp --wrap='srun whereami 6000'
Submitted batch job 161
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               161     debug     wrap marshall  R       0:42      1 n1-1 
               160     debug     wrap marshall  R       2:02      1 n1-1 

$ scontrol -d show job 161
JobId=161 JobName=wrap
   JobState=RUNNING Reason=None Dependency=(null)
   NodeList=n1-1
   BatchHost=n1-1
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=233M,node=1,billing=2
   JOB_GRES=(null)
     Nodes=n1-1 CPU_IDs=2-3 Mem=233 GRES=
   MinCPUsNode=1 MinMemoryNode=233M MinTmpDiskNode=0

marshall@voyager:/sys/fs/cgroup/memory/slurm_n1-1/uid_1017/job_161$ cat memory.limit_in_bytes                                                                
244318208
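As a sanity check (not part of the original output), the enforced value in the --mem case is exactly the requested 233 MB expressed in bytes:

```shell
# 233 * 1024 * 1024 = 244318208 bytes, matching memory.limit_in_bytes.
limit_bytes=244318208
echo "$(( limit_bytes / 1024 / 1024 )) MB"
```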
Comment 4 Marshall Garey 2020-06-16 16:26:18 MDT
As I've been researching this issue, I've discovered that the problem is more complicated than I initially thought. For now, I recommend not using --mem-per-gpu.
Comment 6 Marshall Garey 2020-10-08 11:35:17 MDT
Just so you know, I haven't forgotten about this. I'm hoping to get this done by the time we release 20.11 (though I'd like to aim for a fix in both 20.02 and 20.11).
Comment 8 Marshall Garey 2020-11-06 17:15:12 MST
Update -

I've made good progress on this bug. I have a set of patches that appear to be working. They fix a few issues, including enforcement of memory limits. There are a lot of changes (23 files so far) and I still have several more things to fix. Then these patches still need to go through peer review and QA. Unfortunately, this definitely won't make it into 20.11 by the time it's released.

Thanks for your patience on this one.
Comment 11 Marshall Garey 2020-11-10 10:24:04 MST
*** Ticket 8248 has been marked as a duplicate of this ticket. ***
Comment 13 Marshall Garey 2021-02-12 09:40:20 MST
*** Ticket 9950 has been marked as a duplicate of this ticket. ***
Comment 39 Marshall Garey 2021-08-03 13:48:31 MDT
Mikael,

We've finally pushed a number of commits to fix --mem-per-gpu. These fixes are all in 21.08. There were two classes of fixes:

(1) Fixing --mem-per-gpu for the job. We fixed this at the same time as other issues where the memory cgroup was incorrect for multi-node jobs. This was also problematic when using --mem-per-cpu under some circumstances. We fixed it by restructuring how the job memory allocation is passed from slurmctld to the various slurmd's for the job. This was done in bug 11367, which is an internal ticket, so I'm copying the main fixes here:

Commits b393a7ab6..b6f93b76eb
Commit 26eadbee8368

(2) Fixing --mem-per-gpu for the step. This was done in commits d92a1e73f05d48..399566f785657 (where commit d92a1e73f05d48 is the main commit that fixes things).


Thanks so much for your patience on this one. It involved quite a lot of restructuring of how we handle GRES, and was done on top of a lot of other fixes and refactoring of GRES in 21.08. Because of this, it won't be possible to backport the fixes to 20.11 or any earlier version.

Feel free to use --mem-per-gpu in 21.08 and onward, and please submit new bugs for any issues you find with it moving forward.

Thanks again for filing this ticket. I'm closing this as resolved/fixed.
Comment 40 Marshall Garey 2021-10-18 11:25:50 MDT
*** Ticket 12683 has been marked as a duplicate of this ticket. ***