Ticket 6693 - How to allow CPU-only jobs to run on a GPU node while maintaining CPU-GPU affinity for GPU jobs
Summary: How to allow CPU-only jobs to run on a GPU node while maintaining CPU-GPU affinity for GPU jobs
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 18.08.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Albert Gil
 
Reported: 2019-03-14 08:54 MDT by NYU HPC Team
Modified: 2019-03-22 10:13 MDT

See Also:
Site: NYU


Description NYU HPC Team 2019-03-14 08:54:41 MDT
Hi Experts,

We have a compute node configured as follows:
$ grep gpu-90 /opt/slurm/etc/slurm.conf
NodeName=gpu-90 Gres=gpu:v100:2 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 MemSpecLimit=1500 RealMemory=192080 TmpDisk=108033 State=UNKNOWN

$ grep 90 /opt/slurm/etc/gres.conf 
NodeName=gpu-90 Name=gpu Type=v100  File=/dev/nvidia0 CPUs=[0-19]
NodeName=gpu-90 Name=gpu Type=v100  File=/dev/nvidia1 CPUs=[20-39]

The question is: how can we schedule CPU-only jobs on this node while GPU jobs keep CPU-GPU affinity? Thanks very much!


Best Regards,
Wensheng
Comment 1 Albert Gil 2019-03-14 12:52:00 MDT
Hi Wensheng,

I think that this explanation from gres.conf should help you:

If the Cores configuration option is specified and a job is submitted with the --gres-flags=enforce-binding option, then only the identified cores can be allocated with each generic resource; this will tend to improve the performance of jobs, but slow the allocation of resources to them. If specified and a job is not submitted with the --gres-flags=enforce-binding option, the identified cores will be preferred for scheduling with each generic resource. If --gres-flags=disable-binding is specified, then any core can be used with the resources, which also increases the speed of Slurm's scheduling algorithm but can degrade the application performance. The --gres-flags=disable-binding option is currently required to use more CPUs than are bound to a GRES.
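For illustration only (the job command and GRES count below are assumptions, not from this ticket), the three modes described above map to srun options roughly like this:

```shell
# Only the cores listed in gres.conf for the allocated GPU may be used:
srun --gres=gpu:1 --gres-flags=enforce-binding ./my_gpu_job

# Default (no flag): the listed cores are preferred, but not required.
srun --gres=gpu:1 ./my_gpu_job

# Any core may be used with the GPU (faster scheduling, possibly
# worse application performance):
srun --gres=gpu:1 --gres-flags=disable-binding ./my_gpu_job
```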


Does it answer your question?


Albert
Comment 2 NYU HPC Team 2019-03-15 11:44:51 MDT
Hi Albert,

Thank you for the answer. We already use the --gres-flags=enforce-binding option, enforced via our Lua submit plugin, to maintain CPU-GPU affinity. Currently we do not schedule CPU-only jobs on the GPU node.

What we are seeking now is, in addition, to run CPU-only jobs on the example node gpu-90 without impacting the performance of GPU jobs. How do we configure Slurm to achieve this, if it is doable?


Best Regards,
Wensheng
Comment 3 Albert Gil 2019-03-18 09:01:05 MDT
Hi Wensheng,

> Thank you for the answer. We already use the --gres-flags=enforce-binding
> option, enforced via our Lua submit plugin, to maintain CPU-GPU affinity.
> Currently we do not schedule CPU-only jobs on the GPU node.

Good!

> What we are seeking now is, in addition, to run CPU-only jobs on the
> example node gpu-90 without impacting the performance of GPU jobs. How do
> we configure Slurm to achieve this, if it is doable?

I'm not certain whether this is currently doable.
Let me check and I'll come back to you.


Albert
Comment 5 Albert Gil 2019-03-18 10:27:30 MDT
Hi Wensheng,

> What we are seeking now is: 
> as an addition, put CPU-only jobs on the example node gpu-90, without
> impacting GPU jobs' performance. How do we config to achieve the goal if
> that is doable?

Looking again at your config, I've realized that you are binding *all* CPUs to one GPU or the other.
So what exactly do you mean by "without impacting GPU jobs' performance"?

Do you mean that if, for example, a GPU job is running on /dev/nvidia0, a new CPU-only job should go to CPU IDs > 19?
And that if both GPUs are in use, no new jobs should be allowed, to avoid any performance penalty for running jobs?

Or are you planning to change the node configuration so that some CPUs are not bound to any GPU, and then limit/encourage CPU-only jobs to those "free CPUs"? For example, limiting/encouraging CPU-only jobs to CPUs 15-19,35-39 by changing your gres.conf to:

NodeName=gpu-90 Name=gpu Type=v100  File=/dev/nvidia0 CPUs=[0-14]
NodeName=gpu-90 Name=gpu Type=v100  File=/dev/nvidia1 CPUs=[20-34]


Or even something different?
Comment 8 NYU HPC Team 2019-03-18 18:05:17 MDT
Hi Albert,

> Or are you planning to change the node configuration so that some CPUs are
> not bound to any GPU, and then limit/encourage CPU-only jobs to those
> "free CPUs"? For example, limiting/encouraging CPU-only jobs to CPUs
> 15-19,35-39 by changing your gres.conf to:
> 
> NodeName=gpu-90 Name=gpu Type=v100  File=/dev/nvidia0 CPUs=[0-14]
> NodeName=gpu-90 Name=gpu Type=v100  File=/dev/nvidia1 CPUs=[20-34]
> 

Yes, this is what we are considering. Is there a way to achieve this? Or how can we create a partition that includes only these "free CPUs" 15-19,35-39? Thank you.
Comment 9 Albert Gil 2019-03-19 02:40:33 MDT
Hi,

> Yes, this is what we are considering. Is there a way to achieve this? Or
> how can we create a partition that includes only these "free CPUs"
> 15-19,35-39? Thank you.

In 18.08 the closest recommended setup is the one explained in the slurm.conf documentation of the MaxCPUsPerNode parameter:

MaxCPUsPerNode
Maximum number of CPUs on any node available to all jobs from this partition. This can be especially useful to schedule GPUs. For example a node can be associated with two Slurm partitions (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be limited to only a subset of the node's CPUs, ensuring that one or more CPUs would be available to jobs in the "gpu" partition/queue.
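For illustration (a sketch; the partition names and the MaxCPUsPerNode value below are assumptions, not from this ticket), such a two-partition setup might look like:

```
# slurm.conf (sketch): gpu-90 has 40 cores in total.
# The "cpu" partition may use at most 10 of them, so at least
# 30 CPUs always remain available to jobs in the "gpu" partition.
PartitionName=cpu Nodes=gpu-90 MaxCPUsPerNode=10
PartitionName=gpu Nodes=gpu-90
```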


But please note that this is not related to affinity; it is just a CPU count.
And this is probably not good enough for you, right?

I need to double-check whether the work in progress for 19.05 includes some way to add affinity to that setup.
Maybe adding some binding options to the job_submit/lua plugin, and/or new options to the -m,--distribution or --hint parameters based on the CPUs defined in gres.conf?

Right now, for 18.08, I don't see a way to achieve it, but please let me double-check internally.
If I'm right, and it's OK with you, we will move this ticket to a 5-Enhancement.
Comment 10 NYU HPC Team 2019-03-19 05:24:14 MDT
(In reply to Albert Gil from comment #9)

Hi Albert,

All great points. We will try what you suggested. We look forward to hearing what you find out after checking with the team, and also regarding 19.05, as that is directly related to our new procurement. Thanks very much!

Best,
Wensheng
Comment 12 NYU HPC Team 2019-03-20 11:27:35 MDT
Hi Albert,

Trying to use 'srun' to send a one-core job to one of these "free CPUs" 15-19,35-39 on the example node gpu-90, I am having trouble using --cpu-bind with mask_cpu or map_cpu; could you help?

$ srun --cpu-bind=mask_cpu:????  --cpu-bind=verbose sh -c 'cat /sys/fs/cgroup/cpuset/slurm/uid_$UID/job_$SLURM_JOB_ID/cpuset.cpus'

$ srun --cpu-bind=map_cpu:????  --cpu-bind=verbose sh -c 'cat /sys/fs/cgroup/cpuset/slurm/uid_$UID/job_$SLURM_JOB_ID/cpuset.cpus'
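As background for the mask_cpu syntax: the mask is a hexadecimal bitmask with one bit per CPU ID, so the mask covering the "free" CPUs 15-19 and 35-39 discussed above can be computed like this (a sketch in plain shell; the CPU list is the one from this ticket):

```shell
# Build a hex CPU mask from a list of CPU IDs (here 15-19 and 35-39).
mask=0
for cpu in 15 16 17 18 19 35 36 37 38 39; do
    mask=$(( mask | (1 << cpu) ))
done
printf '0x%X\n' "$mask"   # prints 0xF8000F8000
```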
Comment 14 Albert Gil 2019-03-21 10:34:26 MDT
Hi Wensheng,

As you can see in these slides from SLUG18, we have been working hard to make GPUs first-class citizens in Slurm:
https://slurm.schedmd.com/SLUG18/cons_tres.pdf

In 19.05 we will have several new options related to GPUs:
       --cpus-per-gpu=<ncpus>
       -G, --gpus=[<type>:]<number>
       --gpu-bind=<type>
       --gpu-freq=[<type]=value>[,<type=value>][,verbose]
       --gpus-per-node=[<type>:]<number>
       --gpus-per-socket=[<type>:]<number>

Unfortunately, none of them actually solve your request.

A workaround that was raised internally to provide the functionality you are asking for is this:

- Create a fake GRES type, e.g. "nongpu"
- Assign this nongpu GRES to the CPUs you want for CPU-only jobs
- Ensure that users ask for either --gres=gpu or --gres=nongpu via a submit plugin

I think that this should provide what you were asking for, right?

Finally, if it's OK with you, I'm closing this ticket as infogiven, but I'm also opening a new one on your behalf as an enhancement request, to discuss an actual enhancement further.

Hope that helps,
Albert
Comment 15 NYU HPC Team 2019-03-21 11:58:30 MDT
Hi Albert,

This sounds very interesting! We will try it. Yes, please do as you proposed, and keep us in the loop. Thank you very much!

Regards,
Wensheng
Comment 16 Albert Gil 2019-03-22 10:13:47 MDT
Hi Wensheng,

> A workaround that was raised internally to provide the functionality you
> are asking for is this:
> 
> - Create a fake GRES type, e.g. "nongpu"
> - Assign this nongpu GRES to the CPUs you want for CPU-only jobs
> - Ensure that users ask for either --gres=gpu or --gres=nongpu via a
> submit plugin
> 
> I think that this should provide what you were asking for, right?


Just in case you try it, please note that "File" and "no_consume" are important parameters:
- File: it's also necessary for the fake GRES to make the binding work (I'm not certain whether this is expected; at least it's not well documented, and I'll work on that).
- no_consume: allows multiple jobs to run on the "free CPUs", the ones not associated with the real GRES.

This is an example of such a setup (note that here the fake GRES is simply named "cpu" rather than "nongpu"):

# slurm.conf
GresTypes=gpu,cpu
NodeName=c1 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=4096 Gres=gpu:2,cpu:no_consume:1 # note the no_consume

# gres.conf
NodeName=c1 Name=gpu File=/dev/nvidia0 Cores=0,1 
NodeName=c1 Name=cpu File=/dev/zero    Cores=2,3 # File is necessary even though it is not used; no_consume also keeps it accessible to multiple jobs


To use them:

$ srun --gres=gpu -c1 whereami
   0 c1 - Cpus_allowed: 1       Cpus_allowed_list:      0
$ srun --gres=gpu -c2 whereami
   0 c1 - Cpus_allowed: 3       Cpus_allowed_list:      0-1

$ srun --gres=cpu  whereami
   0 c1 - Cpus_allowed: 4       Cpus_allowed_list:      2
$ srun --gres=cpu -c2  whereami
   0 c1 - Cpus_allowed: c       Cpus_allowed_list:      2-3
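As a side note, the Cpus_allowed values shown here are hexadecimal bitmasks with one bit per CPU ID, so "4" means CPU 2 and "c" (binary 1100) means CPUs 2-3. A small helper (a sketch, not part of Slurm) can decode them:

```shell
# Decode a Cpus_allowed hex mask into a space-separated CPU ID list.
decode_mask() {
    mask=$(( 0x$1 ))
    list=""
    cpu=0
    while [ "$mask" -ne 0 ]; do
        [ $(( mask & 1 )) -eq 1 ] && list="$list $cpu"
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    printf '%s\n' "${list# }"
}

decode_mask c   # prints "2 3"
decode_mask 7   # prints "0 1 2"
```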


Now you probably want a Lua submit plugin that forces one GRES or the other.
Also note that, by default, we cannot ask for more CPUs than those assigned to the GRES:

$ srun --gres=gpu -c3 whereami
srun: error: Unable to allocate resources: Requested node configuration is not available
$ srun --gres=cpu -c3 whereami
srun: error: Unable to allocate resources: Requested node configuration is not available


If we want that, we have to disable the binding entirely with --gres-flags:

$ srun --gres=gpu -c3 --gres-flags=disable-binding whereami
   0 c1 - Cpus_allowed: 7       Cpus_allowed_list:      0-2   # -> binding happens to be right (lucky)
$ srun --gres=cpu -c3 --gres-flags=disable-binding whereami
   0 c1 - Cpus_allowed: 7       Cpus_allowed_list:      0-2   # -> wrong binding


Hope that helps,
Albert