| Summary: | tying gpu allocation to reserved cpus | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Naveed Near-Ansari <naveed> |
| Component: | Configuration | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | albert.gil |
| Version: | 17.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6734 | ||
| Site: | Caltech | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | current slurm.conf | ||
Hi Naveed,
I think you may be able to accomplish what you want with overlapping reservations. You can create a reservation that covers the whole node for your general cpu jobs, and an overlapping reservation that holds back a few cpus to go with the gpus. I set up example reservations on a test node; in my example I simply made them for different users:
$ scontrol create reservation starttime=now duration=2:00:00 nodes=node01 user=user1 tres=cpu=4
Reservation created: user1_28
$ scontrol create reservation starttime=now duration=2:00:00 nodes=node01 user=ben tres=cpu=2
Reservation created: ben_29
$ scontrol show reservation
ReservationName=user1_28 StartTime=2019-02-20T16:45:06 EndTime=2019-02-20T18:45:06 Duration=02:00:00
Nodes=node01 NodeCnt=1 CoreCnt=4 Features=(null) PartitionName=debug Flags=SPEC_NODES
NodeName=node01 CoreIDs=0-3
TRES=cpu=8
Users=user1 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
ReservationName=ben_29 StartTime=2019-02-20T16:45:15 EndTime=2019-02-20T18:45:15 Duration=02:00:00
Nodes=node01 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=debug Flags=SPEC_NODES
NodeName=node01 CoreIDs=4-5
TRES=cpu=4
Users=ben Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
With those two reservations in place, I submitted jobs to the first reservation as 'user1' to verify that it could use 8 cpus but no more:
user1@ben-XPS-15-9570:~$ sbatch -n4 -wnode01 --reservation=user1_28 --wrap="sleep 300"
Submitted batch job 2178
user1@ben-XPS-15-9570:~$ sbatch -n4 -wnode01 --reservation=user1_28 --wrap="sleep 300"
Submitted batch job 2179
user1@ben-XPS-15-9570:~$ sbatch -n4 -wnode01 --reservation=user1_28 --wrap="sleep 300"
Submitted batch job 2180
user1@ben-XPS-15-9570:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2180 debug wrap user1 PD 0:00 1 (Resources)
2179 debug wrap user1 R 0:01 1 node01
2178 debug wrap user1 R 0:04 1 node01
I then submitted jobs to the other reservation as 'ben', requesting gpus, and verified that they occupied the remaining cpus on the node:
ben@ben-XPS-15-9570:~/slurm$ sbatch -n2 -wnode01 --reservation=ben_29 --gres=gpu:2 --wrap="sleep 300"
Submitted batch job 2181
ben@ben-XPS-15-9570:~/slurm$ sbatch -n2 -wnode01 --reservation=ben_29 --gres=gpu:2 --wrap="sleep 300"
Submitted batch job 2182
ben@ben-XPS-15-9570:~/slurm$ sbatch -n2 -wnode01 --reservation=ben_29 --gres=gpu:2 --wrap="sleep 300"
Submitted batch job 2183
ben@ben-XPS-15-9570:~/slurm$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2183 debug wrap ben PD 0:00 1 (Resources)
2180 debug wrap user1 PD 0:00 1 (ReqNodeNotAvail, UnavailableNodes:node01)
2182 debug wrap ben R 0:04 1 node01
2181 debug wrap ben R 0:06 1 node01
2179 debug wrap user1 R 0:57 1 node01
2178 debug wrap user1 R 1:00 1 node01
You mention that you have all your nodes in a single partition. Have you thought about creating a partition for your gpu nodes? You could leave the nodes in the primary partition as well, but separate partitions would make it easier to keep the reservations apart.
Let me know if you have questions about this.
Thanks,
Ben
Thank you. That wasn't exactly what I was looking for. Overlapping reservations would make things more complicated for my users, and I am trying to make this simpler. I was hoping there would be a way to tie a particular core to a particular gpu so that one couldn't be requested without the other. That would ensure a core is always available when a gpu is requested, while still allowing use of the rest of the node, since we have very heavy cpu users. Any thoughts on other ways to accomplish this? Do you know of other sites that do something similar? Can part of a node be in one partition and the rest in another? I assume not, but perhaps there is some other way to accomplish it.
Naveed
Hi Naveed,
Yes, I was trying to come up with a way to do this in a single partition initially, but it sounds like you are open to creating a separate partition for the gpu jobs. If this is the case you can use the MaxCPUsPerNode parameter to define how many cpus each partition can use. Here's the description of this parameter from the documentation:
MaxCPUsPerNode
Maximum number of CPUs on any node available to all jobs from this partition. This can be especially useful to schedule GPUs. For example a node can be associated with two Slurm partitions (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be limited to only a subset of the node's CPUs, ensuring that one or more CPUs would be available to jobs in the "gpu" partition/queue.
You would define this in the slurm.conf on the partition line, so it looked something like this:
PartitionName=cpu Nodes=node[01-10] Default=YES MaxTime=INFINITE MaxCPUsPerNode=20 State=UP
PartitionName=gpu Nodes=node[01-10] Default=NO MaxTime=INFINITE MaxCPUsPerNode=4 State=UP
You can find the information about this parameter from the documentation here:
https://slurm.schedmd.com/slurm.conf.html
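Applied to the numbers from the original report (schedule up to 24 cores on a gpu node for cpu jobs, keep 4 for gpu requests), a sketch could look like the following. The node names and counts here are illustrative assumptions, not values taken from the attached slurm.conf:

```
# Hypothetical example: 28-CPU nodes with 4 gpus each.
# The "any" partition may use at most 24 CPUs per node, so 4 CPUs
# always remain for jobs submitted to the "gpu" partition.
NodeName=gpunode[01-04] CPUs=28 Gres=gpu:4 State=UNKNOWN
PartitionName=any Nodes=gpunode[01-04] Default=YES MaxCPUsPerNode=24 MaxTime=14-0 State=UP
PartitionName=gpu Nodes=gpunode[01-04] Default=NO MaxCPUsPerNode=4 MaxTime=14-0 State=UP
```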
Along with this you can associate GPUs with certain CPUs in your gres.conf file. It could look something like this:
Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
Name=gpu Type=gtx560 File=/dev/nvidia1 COREs=0,1
Name=gpu Type=tesla File=/dev/nvidia2 COREs=2,3
Name=gpu Type=tesla File=/dev/nvidia3 COREs=2,3
This example comes from the gres.conf documentation:
https://slurm.schedmd.com/gres.conf.html
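For a node with four gpus, the same idea might be written with one entry per device; the node name and core IDs below are assumptions, not values from the attached config. Note that MaxCPUsPerNode only caps how many CPUs the cpu partition may use per node, not which core IDs it is given, so the Cores= binding here mainly controls locality between each gpu and the cpus scheduled with it:

```
# Hypothetical gres.conf sketch: one entry per gpu, each bound to a pair of cores
NodeName=gpunode01 Name=gpu File=/dev/nvidia0 Cores=0,1
NodeName=gpunode01 Name=gpu File=/dev/nvidia1 Cores=2,3
NodeName=gpunode01 Name=gpu File=/dev/nvidia2 Cores=4,5
NodeName=gpunode01 Name=gpu File=/dev/nvidia3 Cores=6,7
```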
I hope this is closer to what you are looking for. Let me know if you have questions about it.
Thanks,
Ben
That looks closer, but not quite there. Can you set MaxCPUsPerNode per node within the partition? I ask because both cpu and gpu nodes would be in the "any" queue, and they have different core counts. Something like this, perhaps?
PartitionName=any Nodes=hpc-22-[07-24],hpc-23-[07-24],hpc-24-[07-24],hpc-25-[03-10],hpc-89-[03-26],hpc-90-[03-26,29-30],hpc-91-[09-21,24-25],hpc-92-[03-26,29-30],hpc-93-[03-26,29-30] MaxCPUsPerNode=32 Nodes=hpc-22-[28,30,32,34,36,38],hpc-23-[28,30,32,34,36,38],hpc-24-[28,30,32,34,36,38],hpc-25-[14-15,17-18,20-21,23-24],hpc-26-[14-15,17-18,20-21,23-24],hpc-89-[35-38],hpc-90-[35-38],hpc-91-[32-33],hpc-92-[36-38],hpc-93-[36-38] MaxCPUsPerNode=20 MaxTime=14-0 State=UP
Or possibly a QOS tied to a partition that has maximums? I'm not sure of the best way or the other choices. This is definitely a step in the right direction as far as simplicity for the end user goes.
Naveed
"cpu" and "gpu") and the partition/queue "cpu" could be limited to only a subset of the node's CPUs, ensuring that one or more CPUs would be available to jobs in the "gpu" partition/queue. You would define this in the slurm.conf on the partition line, so it looked something like this: PartitionName=cpu Nodes=node[01-10] Default=YES MaxTime=INFINITE MaxCPUsPerNode=20 State=UP PartitionName=gpu Nodes=node[01-10] Default=NO MaxTime=INFINITE MaxCPUsPerNode=4 State=UP You can find the information about this parameter from the documentation here: https://slurm.schedmd.com/slurm.conf.html Along with this you can associate GPUs with certain CPUs in your gres.conf file. It could look something like this: Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1 Name=gpu Type=gtx560 File=/dev/nvidia1 COREs=0,1 Name=gpu Type=tesla File=/dev/nvidia2 COREs=2,3 Name=gpu Type=tesla File=/dev/nvidia3 COREs=2,3 This example comes from the gres.conf documentation: https://slurm.schedmd.com/gres.conf.html I hope this is closer to what you are looking for. Let me know if you have questions about it. Thanks, Ben ________________________________ You are receiving this mail because: * You reported the bug. Hi Naveed, Unfortunately you can't have multiple 'Nodes=' definitions in a single partition. Slurm will complain about the syntax and just use the latest value in the config line, ignoring the first. I was looking at whether you could configure something at the qos level to enforce this, but there isn't an equivalent setting for a qos. The closest is MaxTRESPerNode, but this enforces the maximum number of resources a single job can request per node, but if individual jobs are below this limit and there's room for multiple jobs in the same partition on a node it will consume all the available resources on the node. What you can do is define unique partitions for the cpu and gpu nodes so you can define the MaxCPUsPerNode appropriate for each type of node. 
Then users can ask for their jobs to run in either partition at submit time, like this:
sbatch -p debug,gpu test.job
This allows the job to go to whichever partition can run it first. It does require the user to specify both partitions at submission time, but that should be easier than requesting a reservation at submit time. Let me know if this looks like it will work.
Thanks,
Ben
Hi Naveed,
I wanted to follow up and see if you had additional questions about the information I sent. Does using multiple partitions work for you? Let me know if you still need help with this ticket.
Thanks,
Ben
Hi Naveed,
I have a colleague with a similar request, and we discussed how best to address a situation where you want certain CPUs/cores reserved for your gpus. The idea we decided would probably work best is to create another gres that you associate with the CPUs not associated with your GPU gres. You can then use a submit plugin to assign the "nongpu" gres unless the user requests the "gpu" gres. Here is an example configuration I used for testing. In my slurm.conf these are the relevant lines (I used node10 for this test):
...
TaskPlugin=task/affinity,task/cgroup
GresTypes=gpu,test
NodeName=node10 Gres=gpu:1,test:1 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2
...
In my gres.conf I needed to specify which cores are bound to each gres:
NodeName=node10 Name=gpu Count=1 File=/dev/nvidia0 Cores=0
NodeName=node10 Name=test Count=1 File=/dev/zero Cores=2
Then, when I run a job that requests one of those gres, I can see that I'm limited to the CPUs associated with the cores I tied the gres to:
$ srun -n1 --gres=test:1 cat /proc/self/status | grep Cpus_allowed
Cpus_allowed: 082
Cpus_allowed_list: 1,7
$ srun -n1 --gres=gpu:1 cat /proc/self/status | grep Cpus_allowed
Cpus_allowed: 041
Cpus_allowed_list: 0,6
One thing to note is that I had to have a 'File=' defined in gres.conf for this to work properly. If I only specified a count of generic resources, jobs could go anywhere on the cluster. There is also a feature request open to allow you to accomplish the same thing without creating a second gres. You can follow its status here:
https://bugs.schedmd.com/show_bug.cgi?id=6734
Let me know if you have questions about implementing a secondary gres.
Thanks,
Ben
Hi Naveed,
I wanted to follow up and make sure you were able to use the suggestion I sent of using the gres.conf file to tie the gres's to certain cores on the system. Let me know if you still need help with this ticket or if it's ok to close.
Thanks,
Ben
Hi Naveed,
I haven't heard from you in a while on this ticket, and I think the information I sent should have addressed your question. I'm going to close this ticket, but feel free to let me know if you have additional questions.
Thanks,
Ben
This seems like a potential solution in the short term. Do you have an example of the submit script you used with it? In this example, the node has 12 cores, correct? I see it has a gres for the gpus and a gres for the cpus, but how are they associated? Is this done in the script?
Thanks,
Naveed
PS – sorry I took so long to get back in touch with you.
Hi Naveed,
I didn't use a submit script, but here are the steps I took to accomplish this. These are the relevant lines from my slurm.conf:
...
TaskPlugin=task/affinity,task/cgroup
GresTypes=gpu,test
NodeName=node10 Gres=gpu:1,test:1 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2
...
In my gres.conf I needed to specify which cores are bound to each gres:
NodeName=node10 Name=gpu Count=1 File=/dev/nvidia0 Cores=0
NodeName=node10 Name=test Count=1 File=/dev/zero Cores=2
Then, when I run a job that requests one of those gres, I can see that I'm limited to the CPUs associated with the cores I tied the gres to:
$ srun -n1 --gres=test:1 cat /proc/self/status | grep Cpus_allowed
Cpus_allowed: 082
Cpus_allowed_list: 1,7
$ srun -n1 --gres=gpu:1 cat /proc/self/status | grep Cpus_allowed
Cpus_allowed: 041
Cpus_allowed_list: 0,6
The association of the gres with the CPUs is done in gres.conf: you specify the Cores you would like to associate each gres with. You can use lstopo to see the layout of the processors on your machine. There is more information in the gres.conf documentation:
https://slurm.schedmd.com/gres.conf.html#OPT_Cores
Let me know if you have additional questions about this.
Thanks,
Ben
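Ben notes he didn't use a submit plugin in his test. For completeness, here is an unverified sketch of what the job_submit/lua piece he described could look like, assuming the gres names from his example ("gpu" and "test") and the job_submit/lua interface as of Slurm 17.11, where the requested gres is exposed as the job_desc.gres string. Treat this as a starting point, not a tested implementation; note also that with Count=1 on the "test" gres, only one non-gpu job could hold it per node at a time, so the counts would need tuning for real use:

```lua
-- Hypothetical job_submit.lua sketch (untested).
-- Jobs that do not request a gpu are forced to carry the "test"
-- (non-gpu) gres, confining them to the cores bound to it in gres.conf.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.gres == nil or not string.find(job_desc.gres, "gpu") then
        if job_desc.gres == nil or job_desc.gres == "" then
            job_desc.gres = "test:1"
        else
            -- keep whatever other gres the user asked for
            job_desc.gres = job_desc.gres .. ",test:1"
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```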
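As an aside, the Cpus_allowed value in the srun output above is a hexadecimal CPU bitmask, and Cpus_allowed_list is simply its decoded form. A quick way to check the decoding by hand, with no Slurm needed (plain shell):

```shell
# Decode a cpuset hex mask into a CPU list.
# 0x082 = binary 1000 0010, so bits 1 and 7 are set -> CPUs 1,7
mask=$((0x082))
list=""
i=0
while [ "$mask" -ne 0 ]; do
  if [ $((mask & 1)) -eq 1 ]; then
    list="${list:+$list,}$i"   # append this CPU id
  fi
  mask=$((mask >> 1))
  i=$((i + 1))
done
echo "$list"   # prints: 1,7
```

Running the same loop on the gpu job's mask, 041, gives 0,6, matching the Cores=0 binding on a node with 6 cores and 2 threads per core (core 0's two hardware threads are CPUs 0 and 6).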
Created attachment 9230 [details] current slurm.conf
We have a mixture of gpu nodes and cpu nodes. We currently allow cpu jobs on any machine, including those with gpus, as gpu adoption has been slower than expected. We do, however, want to make sure the gpus are available whenever someone needs them, if not in use. We are also using the gpus in Open OnDemand requests, so we want them to be usable there as well (but that is on our side of things).
Is it possible to keep a core or two available, tied to the allocation of a gpu on the node? Basically we want to be able to schedule up to 24 cores on a gpu machine for cpu jobs, but keep 4 around for allocation when someone requests a gpu. If this is possible, how would I accomplish it? All nodes are in a single partition (any), and resources are allocated by request rather than within a partition.