Ticket 17393

Summary: CPU jobs on GPU nodes
Product: Slurm    Reporter: Wei Feinstein <wfeinstein>
Component: Heterogeneous Jobs    Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 22.05.6   
Hardware: Linux   
OS: Linux   
Site: LBNL - Lawrence Berkeley National Laboratory

Description Wei Feinstein 2023-08-09 14:42:17 MDT
Dear support team,

I have users running jobs on the GPU partition, which is configured as shared.

[wfeinstein@n0000 ~]$ grep -i partition /etc/slurm/slurm.conf  |grep es1
PartitionName=es1           Nodes=n00[00-05,12-52].es[1]                Oversubscribe=FORCE        DefMemPerCPU=8000     LLN=Yes 

Below is the qos definition. 

es_normal|1000|00:00:00|es_lowprio||cluster|||1.000000|||||||node=64|||3-00:00:00|||||||cpu=2,gres/gpu=1|

Typically the flags --gres and --cpus-per-task are required to request GPU card(s) as below:
srun -A scs -p es1 -q es_normal --gres=gpu:A40:1 --cpus-per-task=2 --pty bash 

However, with --exclusive, users can request an entire node without passing the above parameters.

Is there a way to enforce the use of --gres on the GPU partition?

Thank you,
Wei
Comment 1 Jason Booth 2023-08-09 14:59:48 MDT
Do all nodes in the partition "es1" have GPUs or only a select few?
Comment 2 Wei Feinstein 2023-08-09 15:03:59 MDT
Hi Jason,

The es1 partition contains only GPU nodes, including V100, A40, etc.

Thanks,

Wei
Comment 3 Ben Roberts 2023-08-10 10:55:37 MDT
Hi Wei,

It sounds like you should be able to do what you want with the MaxTRESPerNode setting.  If I understand correctly, you want to prevent users from requesting an entire node, allowing them to use only some of the CPUs when they are going to use a GPU.  If that's right, then you can define a maximum number of CPUs per node that a job can request.  This is something you can set on a QOS.

Here's a quick example that only allows up to 12 CPUs per node to be used in my GPU partition.  I configured the 'gpu' partition to also use the 'gpu' QOS.
  PartitionName=gpu   Default=NO  Nodes=node[07-08] MaxTime=5:00:00 State=UP QOS=gpu

I set the MaxTRESPerNode to 12 CPUs.

$ sacctmgr show qos gpu format=name,maxtrespernode%20
      Name       MaxTRESPerNode 
---------- -------------------- 
       gpu               cpu=12 
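For completeness, a limit like this is set with sacctmgr; the following is a sketch assuming the 'gpu' QOS already exists:

```shell
# Cap the CPUs a single job may use on any one node under this QOS at 12
$ sacctmgr modify qos gpu set MaxTRESPerNode=cpu=12
```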

When I run a job that requests 12 CPUs it runs fine, but when I request exclusive access to a node the job doesn't run.

$ srun -pgpu --ntasks 1 --cpus-per-task=12 --gpus-per-task=1 hostname
kitt

$ srun -pgpu -N1 --exclusive hostname
srun: job 9176 queued and waiting for resources

You can see that the Reason for the job is QOSMaxCpuPerNode.

$ scontrol show jobs 9176 | grep Reason
   JobState=PENDING Reason=QOSMaxCpuPerNode Dependency=(null)

Let me know if this looks like what you're looking for or if I'm not quite getting your use case.

Thanks,
Ben
Comment 4 Wei Feinstein 2023-08-14 13:41:04 MDT
Hi Ben,

Thank you for the suggestion and sorry for the late response. 

This suggestion would partially work.  Below are the GPU nodes in slurm.conf.

[root@perceus-00 ~]# grep es1 /etc/slurm/slurm.conf
##es1 Nodes
## NodeName=n00[00-11].es[1] NodeAddr=10.0.43.[0-11]      CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_1080ti,es1     Weight=1  Gres=gpu:GTX1080TI:4  RealMemory=64318
NodeName=n00[12-13].es[1] NodeAddr=10.0.44.[12-13]    CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_v100,es1       Weight=3  Gres=gpu:V100:2       RealMemory=64318
NodeName=n00[14-23].es[1] NodeAddr=10.0.43.[14-23]    CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_v100,es1       Weight=4  Gres=gpu:V100:2       RealMemory=192094
NodeName=n00[24-31].es[1] NodeAddr=10.0.43.[24-31]    CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_2080ti,es1     Weight=1  Gres=gpu:GRTX2080TI:4 RealMemory=96236
NodeName=n00[32].es[1]    NodeAddr=10.0.43.[32]       CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_v100,es1       Weight=4  Gres=gpu:V100:2       RealMemory=192086
NodeName=n00[33-34].es[1] NodeAddr=10.0.43.[33-34]    CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_2080ti,es1     Weight=1  Gres=gpu:GRTX2080TI:4 RealMemory=95228
NodeName=n00[35-38].es[1] NodeAddr=10.0.43.[35-38]    CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_2080ti,es1     Weight=2  Gres=gpu:GRTX2080TI:4 RealMemory=191996
NodeName=n00[39-40].es[1] NodeAddr=10.0.43.[39-40]    CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_2080ti,es1     Weight=1  Gres=gpu:GRTX2080TI:4 RealMemory=95228
NodeName=n0041.es[1]      NodeAddr=10.0.43.41         CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_2080ti,es1     Weight=2  Gres=gpu:GRTX2080TI:4 RealMemory=191996
NodeName=n00[42].es[1]    NodeAddr=10.0.43.42         CPUS=8  Sockets=2 CoresPerSocket=4  Feature=es1_2080ti,es1     Weight=1  Gres=gpu:GRTX2080TI:4 RealMemory=95228
NodeName=n00[43-44].es[1] NodeAddr=10.0.43.[43-44]    CPUS=16 Sockets=2 CoresPerSocket=8  Feature=es1_v100,es1,c16   Weight=5  Gres=gpu:V100:2       RealMemory=192093
NodeName=n00[00-05].es[1] NodeAddr=10.0.43.[0-5]      CPUS=64 Sockets=1  CoresPerSocket=64 Feature=es1_a40,es1       Weight=6  Gres=gpu:A40:4        RealMemory=515865
NodeName=n00[45-52].es[1] NodeAddr=10.0.43.[45-52]    CPUS=64 Sockets=1  CoresPerSocket=64 Feature=es1_a40,es1       Weight=6  Gres=gpu:A40:4        RealMemory=515865
PartitionName=es1           Nodes=n00[00-05,12-52].es[1]                Oversubscribe=FORCE        DefMemPerCPU=8000     LLN=Yes 

And 

es_normal|1000|00:00:00|es_lowprio||cluster|||1.000000|||||||node=64|||3-00:00:00|||||||cpu=2,gres/gpu=1|

With --exclusive, a user can get an entire node without using any GPU cards. Yet given the heterogeneity of the es1 GPU partition, MaxTRESPerNode is difficult to set. 

Is there any way to prevent CPU-only jobs on these GPU nodes?

Thank you,
Wei
Comment 5 Ben Roberts 2023-08-14 14:37:13 MDT
Hi Wei,

If you want to exclude CPU-only jobs from running on this partition, there are a couple of options that should work.  A QOS can set a minimum amount of generic resources that a job must request, which lets you prevent jobs that don't request a GPU from running.  Here's a quick example of how that might look.

$ sacctmgr show qos member format=name,mintresperjob
      Name       MinTRES 
---------- ------------- 
    member    gres/gpu=1 

$ sbatch -qmember -n12 --wrap='srun sleep 120'
Submitted batch job 9240

$ sbatch -qmember -n12 --gpus=1 --wrap='srun sleep 120'
Submitted batch job 9241

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9240     debug     wrap      ben PD       0:00      1 (QOSMinGRES)
              9241     debug     wrap      ben  R       0:01      1 node01

In this example I just requested the QOS directly, but you could tie the QOS to the partition so that the limit is enforced on the partition, regardless of which QOS the user specifies.
https://slurm.schedmd.com/resource_limits.html#qos_mintresperjob
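Concretely, that combination might look like the following sketch (the QOS name 'member' and the partition line are illustrative, not taken from your config):

```shell
# Require at least one GPU per job on the QOS
$ sacctmgr modify qos member set MinTRESPerJob=gres/gpu=1

# slurm.conf: force every job in the partition through that QOS
PartitionName=gpu Nodes=node[01-08] QOS=member State=UP
```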

Another possible approach would be to create a submit filter that looks for jobs requesting this partition.  If a job doesn't request a GPU, you could either add a GPU request to the job for the user, or reject the job with a message of your choice to make it clear why it was rejected.  You can read more about submit filters here:
https://slurm.schedmd.com/job_submit_plugins.html
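As a rough, untested illustration of that second approach (the partition name es1 comes from your config; the field names and messages are a sketch of the Lua job_submit API, not a verified implementation):

```lua
-- job_submit.lua (sketch): reject jobs on the es1 partition that request no GPU
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.partition == "es1" then
        -- GPU requests can arrive through several fields; each is nil when unset
        local gres = job_desc.tres_per_job or job_desc.tres_per_node
                     or job_desc.tres_per_task
        if gres == nil or string.find(gres, "gpu") == nil then
            slurm.log_user("Jobs in es1 must request a GPU (e.g. --gres=gpu:1)")
            return slurm.ERROR
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```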

Let me know if these don't sound like they would work.

Thanks,
Ben
Comment 6 Wei Feinstein 2023-08-14 14:47:40 MDT
Hi Ben,

es_normal|1000|00:00:00|es_lowprio||cluster|||1.000000|||||||node=64|||3-00:00:00|||||||cpu=2,gres/gpu=1|

As you can see, MinTRES is configured above, which works as expected until the flag --exclusive is used. 

I know a Slurm submit plugin is another way to check jobs at the submission stage, though it involves a lot more coding.  We do have several plugins in place that work with our account management portal.

Thanks,
Wei
Comment 7 Ben Roberts 2023-08-16 10:13:36 MDT
Hi Wei,

Ah, I didn't realize that you had MinTRES configured.  It's true that if a user specifies --exclusive, the job will be allowed to run, because the exclusive flag allocates all the resources on the node (with the exception of memory) to the job.  Since this includes a GPU, the job satisfies the MinTRES limit.  Here's an example of how this looks with the same QOS configuration I showed in my previous example.

$ sbatch -qmember -n12 --exclusive --wrap='srun sleep 30'
Submitted batch job 9243

$ scontrol show jobs 9243 | grep TRES
   ReqTRES=cpu=12,mem=2400M,node=1,billing=13
   AllocTRES=cpu=24,mem=4800M,node=1,billing=34,gres/gpu=4,gres/gpu:tesla=4

I don't have a working example of a submit filter that would reject a job that doesn't request a GPU, but here's an example of how you can look for one.

    -- tres_per_job is nil when the job didn't request GPUs, so guard before matching
    if (job_desc.tres_per_job ~= nil and string.find(job_desc.tres_per_job, "gpu")) then
        slurm.log_user("Matched gpu on job")
    end

Let me know if you have any questions about this.

Thanks,
Ben
Comment 8 Ben Roberts 2023-09-26 12:56:07 MDT
Hi Wei,

I wanted to check in with you to see if you have any additional questions about this or if we can close the ticket.

Thanks,
Ben
Comment 9 Wei Feinstein 2023-09-26 13:46:37 MDT
Thank you Ben.
Comment 10 Ben Roberts 2023-09-26 13:56:59 MDT
No problem, closing now.