| Summary: | tying gpu allocation to reserved cpus | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Naveed Near-Ansari <naveed> |
| Component: | Configuration | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | albert.gil |
| Version: | 17.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6734 | ||
| Site: | Caltech | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | current slurm.conf | ||
Hi Naveed,
I think you may be able to accomplish what you want with overlapping reservations. You can create a reservation that covers the whole node for your general cpu jobs, and an overlapping reservation that holds back a few cpus to go with the gpus. I set up example reservations on a test node; in my example I simply made them for different users:
$ scontrol create reservation starttime=now duration=2:00:00 nodes=node01 user=user1 tres=cpu=4
Reservation created: user1_28
$ scontrol create reservation starttime=now duration=2:00:00 nodes=node01 user=ben tres=cpu=2
Reservation created: ben_29
$ scontrol show reservation
ReservationName=user1_28 StartTime=2019-02-20T16:45:06 EndTime=2019-02-20T18:45:06 Duration=02:00:00
Nodes=node01 NodeCnt=1 CoreCnt=4 Features=(null) PartitionName=debug Flags=SPEC_NODES
NodeName=node01 CoreIDs=0-3
TRES=cpu=8
Users=user1 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
ReservationName=ben_29 StartTime=2019-02-20T16:45:15 EndTime=2019-02-20T18:45:15 Duration=02:00:00
Nodes=node01 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=debug Flags=SPEC_NODES
NodeName=node01 CoreIDs=4-5
TRES=cpu=4
Users=ben Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
With those two reservations in place, I submitted jobs to the first reservation as 'user1' to verify that it could use 8 cpus but no more:
user1@ben-XPS-15-9570:~$ sbatch -n4 -wnode01 --reservation=user1_28 --wrap="sleep 300"
Submitted batch job 2178
user1@ben-XPS-15-9570:~$ sbatch -n4 -wnode01 --reservation=user1_28 --wrap="sleep 300"
Submitted batch job 2179
user1@ben-XPS-15-9570:~$ sbatch -n4 -wnode01 --reservation=user1_28 --wrap="sleep 300"
Submitted batch job 2180
user1@ben-XPS-15-9570:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2180 debug wrap user1 PD 0:00 1 (Resources)
2179 debug wrap user1 R 0:01 1 node01
2178 debug wrap user1 R 0:04 1 node01
I then submitted jobs to the other reservation as 'ben', requesting gpus, and verified that they occupied the remaining cpus on the node:
ben@ben-XPS-15-9570:~/slurm$ sbatch -n2 -wnode01 --reservation=ben_29 --gres=gpu:2 --wrap="sleep 300"
Submitted batch job 2181
ben@ben-XPS-15-9570:~/slurm$ sbatch -n2 -wnode01 --reservation=ben_29 --gres=gpu:2 --wrap="sleep 300"
Submitted batch job 2182
ben@ben-XPS-15-9570:~/slurm$ sbatch -n2 -wnode01 --reservation=ben_29 --gres=gpu:2 --wrap="sleep 300"
Submitted batch job 2183
ben@ben-XPS-15-9570:~/slurm$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2183 debug wrap ben PD 0:00 1 (Resources)
2180 debug wrap user1 PD 0:00 1 (ReqNodeNotAvail, UnavailableNodes:node01)
2182 debug wrap ben R 0:04 1 node01
2181 debug wrap ben R 0:06 1 node01
2179 debug wrap user1 R 0:57 1 node01
2178 debug wrap user1 R 1:00 1 node01
You mention that you have all your nodes in a single partition. Have you thought about creating a partition for your gpu nodes? You could leave the nodes in the primary partition as well, but separate partitions would make it easier to keep the reservations apart.
Let me know if you have questions about this.
Thanks,
Ben
Thank you. That wasn't exactly what I was looking for. Overlapping reservations would make things more complicated for my users, and I am trying to make this simpler. I was hoping there would be a way to tie a particular core to a particular gpu so that one couldn't be requested without the other. That would ensure a core is always available when a gpu is requested, while still allowing use of the rest of the node, since we have very heavy cpu users. Any thoughts on other ways to accomplish this? Do you know of other sites that do something similar? Can part of a node be in one partition and the rest in another? I assume not, but perhaps there is some other way to accomplish it.
Naveed
Hi Naveed,
Yes, I was trying to come up with a way to do this in a single partition initially, but it sounds like you are open to creating a separate partition for the gpu jobs. If this is the case you can use the MaxCPUsPerNode parameter to define how many cpus each partition can use. Here's the description of this parameter from the documentation:
MaxCPUsPerNode
Maximum number of CPUs on any node available to all jobs from this partition. This can be especially useful to schedule GPUs. For example a node can be associated with two Slurm partitions (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be limited to only a subset of the node's CPUs, ensuring that one or more CPUs would be available to jobs in the "gpu" partition/queue.
You would define this in the slurm.conf on the partition line, so it looked something like this:
PartitionName=cpu Nodes=node[01-10] Default=YES MaxTime=INFINITE MaxCPUsPerNode=20 State=UP
PartitionName=gpu Nodes=node[01-10] Default=NO MaxTime=INFINITE MaxCPUsPerNode=4 State=UP
You can find the information about this parameter from the documentation here:
https://slurm.schedmd.com/slurm.conf.html
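Applied to the numbers from the original report (schedule up to 24 cores on a gpu node for cpu jobs, keep 4 for gpu requests), a sketch could look like the following. The node names and counts here are illustrative assumptions, not values taken from the attached slurm.conf:

```
# Hypothetical example: 28-CPU nodes with 4 gpus each.
# The "any" partition may use at most 24 CPUs per node, so 4 CPUs
# always remain for jobs submitted to the "gpu" partition.
NodeName=gpunode[01-04] CPUs=28 Gres=gpu:4 State=UNKNOWN
PartitionName=any Nodes=gpunode[01-04] Default=YES MaxCPUsPerNode=24 MaxTime=14-0 State=UP
PartitionName=gpu Nodes=gpunode[01-04] Default=NO MaxCPUsPerNode=4 MaxTime=14-0 State=UP
```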
Along with this you can associate GPUs with certain CPUs in your gres.conf file. It could look something like this:
Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
Name=gpu Type=gtx560 File=/dev/nvidia1 COREs=0,1
Name=gpu Type=tesla File=/dev/nvidia2 COREs=2,3
Name=gpu Type=tesla File=/dev/nvidia3 COREs=2,3
This example comes from the gres.conf documentation:
https://slurm.schedmd.com/gres.conf.html
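For a node with four gpus, the same idea might be written with one entry per device; the node name and core IDs below are assumptions, not values from the attached config. Note that MaxCPUsPerNode only caps how many CPUs the cpu partition may use per node, not which core IDs it is given, so the Cores= binding here mainly controls locality between each gpu and the cpus scheduled with it:

```
# Hypothetical gres.conf sketch: one entry per gpu, each bound to a pair of cores
NodeName=gpunode01 Name=gpu File=/dev/nvidia0 Cores=0,1
NodeName=gpunode01 Name=gpu File=/dev/nvidia1 Cores=2,3
NodeName=gpunode01 Name=gpu File=/dev/nvidia2 Cores=4,5
NodeName=gpunode01 Name=gpu File=/dev/nvidia3 Cores=6,7
```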
I hope this is closer to what you are looking for. Let me know if you have questions about it.
Thanks,
Ben
That looks closer, but not quite there. Can you set MaxCPUsPerNode per node within the partition? I ask because both cpu and gpu nodes would be in the "any" queue, and they have different core counts. Something like this, perhaps?
PartitionName=any Nodes=hpc-22-[07-24],hpc-23-[07-24],hpc-24-[07-24],hpc-25-[03-10],hpc-89-[03-26],hpc-90-[03-26,29-30],hpc-91-[09-21,24-25],hpc-92-[03-26,29-30],hpc-93-[03-26,29-30] MaxCPUsPerNode=32 Nodes=hpc-22-[28,30,32,34,36,38],hpc-23-[28,30,32,34,36,38],hpc-24-[28,30,32,34,36,38],hpc-25-[14-15,17-18,20-21,23-24],hpc-26-[14-15,17-18,20-21,23-24],hpc-89-[35-38],hpc-90-[35-38],hpc-91-[32-33],hpc-92-[36-38],hpc-93-[36-38] MaxCPUsPerNode=20 MaxTime=14-0 State=UP
Or possibly a QOS tied to a partition that has maximums? I'm not sure of the best way or the other choices. This is definitely a step in the right direction as far as simplicity for the end user goes.
Naveed
"cpu" and "gpu") and the partition/queue "cpu" could be limited to only a subset of the node's CPUs, ensuring that one or more CPUs would be available to jobs in the "gpu" partition/queue. You would define this in the slurm.conf on the partition line, so it looked something like this: PartitionName=cpu Nodes=node[01-10] Default=YES MaxTime=INFINITE MaxCPUsPerNode=20 State=UP PartitionName=gpu Nodes=node[01-10] Default=NO MaxTime=INFINITE MaxCPUsPerNode=4 State=UP You can find the information about this parameter from the documentation here: https://slurm.schedmd.com/slurm.conf.html Along with this you can associate GPUs with certain CPUs in your gres.conf file. It could look something like this: Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1 Name=gpu Type=gtx560 File=/dev/nvidia1 COREs=0,1 Name=gpu Type=tesla File=/dev/nvidia2 COREs=2,3 Name=gpu Type=tesla File=/dev/nvidia3 COREs=2,3 This example comes from the gres.conf documentation: https://slurm.schedmd.com/gres.conf.html I hope this is closer to what you are looking for. Let me know if you have questions about it. Thanks, Ben ________________________________ You are receiving this mail because: * You reported the bug. Hi Naveed, Unfortunately you can't have multiple 'Nodes=' definitions in a single partition. Slurm will complain about the syntax and just use the latest value in the config line, ignoring the first. I was looking at whether you could configure something at the qos level to enforce this, but there isn't an equivalent setting for a qos. The closest is MaxTRESPerNode, but this enforces the maximum number of resources a single job can request per node, but if individual jobs are below this limit and there's room for multiple jobs in the same partition on a node it will consume all the available resources on the node. What you can do is define unique partitions for the cpu and gpu nodes so you can define the MaxCPUsPerNode appropriate for each type of node. 
Then users can ask for their jobs to run in either partition at submit time, like this:
sbatch -p debug,gpu test.job
This allows the job to go to whichever partition can run it first. It does require the user to specify both partitions at submission time, but that should be easier than requesting a reservation at submit time. Let me know if this looks like it will work.
Thanks,
Ben
Hi Naveed,
I wanted to follow up and see if you had additional questions about the information I sent. Does using multiple partitions work for you? Let me know if you still need help with this ticket.
Thanks,
Ben
Hi Naveed,
I have a colleague with a similar request, and we discussed how best to address a situation where you want certain CPUs/cores reserved for your gpus. The idea we decided would probably work best is to create another gres that you associate with the CPUs not associated with your GPU gres. You can then use a submit plugin to assign the "nongpu" gres unless the user requests the "gpu" gres. Here is an example configuration I used for testing. In my slurm.conf these are the relevant lines (I used node10 for this test):
...
TaskPlugin=task/affinity,task/cgroup
GresTypes=gpu,test
NodeName=node10 Gres=gpu:1,test:1 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2
...
In my gres.conf I needed to specify which cores are bound to each gres:
NodeName=node10 Name=gpu Count=1 File=/dev/nvidia0 Cores=0
NodeName=node10 Name=test Count=1 File=/dev/zero Cores=2
Then, when I run a job that requests one of those gres, I can see that I'm limited to the CPUs associated with the cores I tied the gres to:
$ srun -n1 --gres=test:1 cat /proc/self/status | grep Cpus_allowed
Cpus_allowed: 082
Cpus_allowed_list: 1,7
$ srun -n1 --gres=gpu:1 cat /proc/self/status | grep Cpus_allowed
Cpus_allowed: 041
Cpus_allowed_list: 0,6
One thing to note is that I had to have a 'File=' defined in gres.conf for this to work properly. If I only specified a count of generic resources, jobs could go anywhere on the cluster. There is also a feature request open to allow you to accomplish the same thing without creating a second gres. You can follow its status here:
https://bugs.schedmd.com/show_bug.cgi?id=6734
Let me know if you have questions about implementing a secondary gres.
Thanks,
Ben
Hi Naveed,
I wanted to follow up and make sure you were able to use the suggestion I sent of using the gres.conf file to tie the gres's to certain cores on the system. Let me know if you still need help with this ticket or if it's ok to close.
Thanks,
Ben
Hi Naveed,
I haven't heard from you in a while on this ticket, and I think the information I sent should have addressed your question. I'm going to close this ticket, but feel free to let me know if you have additional questions.
Thanks,
Ben
This seems like a potential solution in the short term. Do you have an example of the submit script you used with it? In this example, the node has 12 cores, correct? I see it has a gres for the gpus and a gres for the cpus, but how are they associated? Is this done in the script?
Thanks,
Naveed
PS – sorry I took so long to get back in touch with you.
Hi Naveed,
I didn't use a submit script, but here are the steps I took to accomplish this. These are the relevant lines from my slurm.conf:
...
TaskPlugin=task/affinity,task/cgroup
GresTypes=gpu,test
NodeName=node10 Gres=gpu:1,test:1 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2
...
In my gres.conf I needed to specify which cores are bound to each gres:
NodeName=node10 Name=gpu Count=1 File=/dev/nvidia0 Cores=0
NodeName=node10 Name=test Count=1 File=/dev/zero Cores=2
Then, when I run a job that requests one of those gres, I can see that I'm limited to the CPUs associated with the cores I tied the gres to:
$ srun -n1 --gres=test:1 cat /proc/self/status | grep Cpus_allowed
Cpus_allowed: 082
Cpus_allowed_list: 1,7
$ srun -n1 --gres=gpu:1 cat /proc/self/status | grep Cpus_allowed
Cpus_allowed: 041
Cpus_allowed_list: 0,6
The association of the gres with the CPUs is done in gres.conf: you specify the Cores you would like to associate each gres with. You can use lstopo to see the layout of the processors on your machine. There is more information in the gres.conf documentation:
https://slurm.schedmd.com/gres.conf.html#OPT_Cores
Let me know if you have additional questions about this.
Thanks,
Ben
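Ben notes he didn't use a submit plugin in his test. For completeness, here is an unverified sketch of what the job_submit/lua piece he described could look like, assuming the gres names from his example ("gpu" and "test") and the job_submit/lua interface as of Slurm 17.11, where the requested gres is exposed as the job_desc.gres string. Treat this as a starting point, not a tested implementation; note also that with Count=1 on the "test" gres, only one non-gpu job could hold it per node at a time, so the counts would need tuning for real use:

```lua
-- Hypothetical job_submit.lua sketch (untested).
-- Jobs that do not request a gpu are forced to carry the "test"
-- (non-gpu) gres, confining them to the cores bound to it in gres.conf.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.gres == nil or not string.find(job_desc.gres, "gpu") then
        if job_desc.gres == nil or job_desc.gres == "" then
            job_desc.gres = "test:1"
        else
            -- keep whatever other gres the user asked for
            job_desc.gres = job_desc.gres .. ",test:1"
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```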
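As an aside, the Cpus_allowed value in the srun output above is a hexadecimal CPU bitmask, and Cpus_allowed_list is simply its decoded form. A quick way to check the decoding by hand, with no Slurm needed (plain shell):

```shell
# Decode a cpuset hex mask into a CPU list.
# 0x082 = binary 1000 0010, so bits 1 and 7 are set -> CPUs 1,7
mask=$((0x082))
list=""
i=0
while [ "$mask" -ne 0 ]; do
  if [ $((mask & 1)) -eq 1 ]; then
    list="${list:+$list,}$i"   # append this CPU id
  fi
  mask=$((mask >> 1))
  i=$((i + 1))
done
echo "$list"   # prints: 1,7
```

Running the same loop on the gpu job's mask, 041, gives 0,6, matching the Cores=0 binding on a node with 6 cores and 2 threads per core (core 0's two hardware threads are CPUs 0 and 6).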
Created attachment 9230 [details] current slurm.conf
We have a mixture of gpu nodes and cpu nodes. We currently allow cpu jobs on any machine, including those with gpus, as gpu adoption has been slower than expected. We do, however, want to make sure the gpus are available whenever someone needs them, if not in use. We are also using the gpus in Open OnDemand requests, so we want them to be usable there as well (but that is on our side of things).
Is it possible to keep a core or two available, tied to the allocation of a gpu on the node? Basically we want to be able to schedule up to 24 cores on a gpu machine for cpu jobs, but keep 4 around for allocation when someone requests a gpu. If this is possible, how would I accomplish it? All nodes are in a single partition (any), and resources are allocated by request rather than within a partition.