| Summary: | Oversubscribe gpu gres (maybe with --overlap?) | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Chrysovalantis Paschoulas <c.paschoulas> |
| Component: | GPU | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 21.08.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Jülich | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Chrysovalantis Paschoulas
2021-12-15 07:47:32 MST
Comment 2
Marshall Garey

As far as I know, MPS is already an obsolete technology and never really worked well.

MIG[1][2] is the newer technology, and should actually work better than the proposed --overlap=force option. With MIG, you can partition a GPU to make it appear as multiple GPUs. You can partition a GPU up to 7 ways. Once that is done, the GPU is simply treated like multiple GPUs.

Currently, --overlap only allows sharing CPUs. However, a site is sponsoring us to add this functionality to allow job steps to overlap not just CPUs but all resources (CPUs, memory, GRES). So for 22.05 we're adding a new option for --overlap (--overlap=force) which will allow this to happen. However, the performance of doing this will be worse than using MIG and cgroups (ConstrainDevices=yes), since with MIG plus cgroups different jobs/steps won't be contending for the same resources on the GPU.

The --gres=gpu:0 workaround only works if you don't constrain devices via cgroups (see ConstrainDevices in cgroup.conf). If that is what you want, then that is fine, but be warned - without constraining devices, *any* job can use *any* GPU on its compute node(s), even if the job was not allocated those GPUs.

Have I answered your question? Do you have any follow-up questions?

[1] https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#running-with-mig
[2] https://slurm.schedmd.com/gres.html#MIG_Management

Comment 3
Chrysovalantis Paschoulas

(In reply to Marshall Garey from comment #2)
> As far as I know, MPS is already an obsolete technology and never really
> worked well.
>
> MIG[1][2] is the newer technology, and should actually work better than the
> proposed --overlap=force option. With MIG, you can partition a GPU to make
> it appear as multiple GPUs. You can partition a GPU up to 7 ways. Once that
> is done, the GPU is simply treated like multiple GPUs.

I don't know exactly the users' requirements, but it looks to me as though MIG improves the situation but doesn't solve the problem of oversubscription.
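For reference, the static seven-way partitioning mentioned above is done on the GPU side with `nvidia-smi`, outside of Slurm. A minimal sketch follows; the profile ID (19, i.e. the `1g.5gb` profile on an A100-40GB) is an assumption for illustration and varies by GPU model:

```shell
# Sketch: enable MIG mode on GPU 0 (admin-only; may require a GPU reset)
nvidia-smi -i 0 -mig 1

# Partition GPU 0 seven ways with the smallest profile and create the
# matching compute instances (-C). Profile ID 19 assumes an A100-40GB.
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# Slurm can then pick up the resulting MIG devices automatically via NVML,
# e.g. with "AutoDetect=nvml" in gres.conf (see [2] above).
```

After this, each MIG instance is scheduled like an ordinary GPU GRES, which is why the number of concurrent sharers is capped at seven per physical GPU.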
I mean, what if they want to run more tasks than the maximum number of MIG partitions?

> Currently, --overlap only allows sharing CPUs. However, a site is sponsoring
> us to add this functionality to allow job steps to overlap not just CPUs but
> all resources (CPUs, memory, GRES). So for 22.05 we're adding a new option
> for --overlap (--overlap=force) which will allow this to happen.

That is great news, and it seems that this '--overlap=force' is exactly what I am asking for! :)

> However, the performance of doing this will be worse than using MIG and
> cgroups (ConstrainDevices=yes), since with MIG plus cgroups different
> jobs/steps won't be contending for the same resources on the GPU.
>
> The --gres=gpu:0 workaround only works if you don't constrain devices via
> cgroups (see ConstrainDevices in cgroup.conf). If that is what you want,
> then that is fine, but be warned - without constraining devices, *any* job
> can use *any* GPU on its compute node(s), even if the job was not allocated
> those GPUs.

Yes, it is true that without constraining devices the users can use any GPUs. This is fine for us because we don't have node sharing on those nodes, so the users can manage themselves how they allocate the available GPUs. With node sharing we will see what we will do. I have another question here: with ConstrainDevices, does Slurm constrain devices at the job-step level or at the job level?

> Have I answered your question? Do you have any follow-up questions?
>
> [1] https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#running-with-mig
> [2] https://slurm.schedmd.com/gres.html#MIG_Management

Thanks for your detailed feedback ;)

Comment 4
Marshall Garey

(In reply to Chrysovalantis Paschoulas from comment #3)
> I don't know exactly the users' requirements but for me it looks like the
> MIG improves the situation but doesn't solve the problem of
> oversubscription. I mean, what if they want to run more tasks than the max
> number of MIG partitions?
In that case, you would need the new --overlap=force option.

> I have another question here: with ConstrainDevices, does Slurm constrain
> devices at the job-step level or at the job level?

Both.

> Thanks for your detailed feedback ;)

You're welcome! I actually do need to correct myself on MPS, though: MPS is not an obsolete technology. That was a wrong choice of words - it can be useful, although there are limitations (just as there are limitations to MIG). Instead, I should explain what MPS and MIG are, the limitations of each, and the further limitations in Slurm.

In my understanding, the current status and main limitations are:

* MPS: a single user can share a dynamic percentage of a GPU. In Slurm, only a single GPU can be shared among N jobs (the jobs decide their %). We (SchedMD) aren't currently planning to add support for sharing N GPUs simultaneously. And again, this is only for jobs (not steps). This is why it is very limiting in Slurm - steps can't share a GPU with MPS. (And even outside of Slurm, only a single user can share a GPU at a time.) With this knowledge, hopefully you can decide if MPS will be of use to you. For sharing a GPU between steps, --overlap=force or MIG are your options.
* MIG: GPUs can be statically configured like vGPUs, each with a static % of the actual GPU, and jobs can select full static vGPUs (like normal GPUs). A GPU can only be "partitioned" up to seven ways right now. For Slurm, there is no plan (yet) to allow jobs to dynamically reconfigure GPUs into vGPUs, i.e. to dynamically change the number of vGPUs on a node, although this is in theory possible to do.
* MPS and MIG can be used simultaneously: a vGPU (from MIG) can also be shared with MPS.

Hopefully that clears things up! I'll close this as infogiven for now. Please re-open this ticket if you have more questions.
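To make the step-oversubscription use case concrete, here is a minimal job-script sketch of what the thread describes. The `--overlap=force` spelling is the syntax proposed in this ticket for 22.05; the flag name and semantics in the released version may differ, and `./task_a`/`./task_b` are hypothetical binaries:

```shell
#!/bin/bash
# Sketch: oversubscribe one allocated GPU across concurrent job steps.
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=gpu:1

# With the proposed --overlap=force, each step may claim the full
# allocation (CPUs, memory, and GRES), so both steps see the same GPU
# and run at the same time instead of queuing for the GRES.
srun --overlap=force --gres=gpu:1 -n1 ./task_a &
srun --overlap=force --gres=gpu:1 -n1 ./task_b &
wait
```

Without the overlap option, the second `srun` would block until the first step released the GPU GRES, which is the queuing behavior this ticket is about.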
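Since the thread hinges on whether devices are constrained, a sketch of the relevant configuration may help. This is a fragment, not a complete cgroup.conf:

```shell
# cgroup.conf (fragment)
# With this set, Slurm uses the cgroup devices controller to block access
# to GPUs that were not allocated - enforced at both the job and the step
# level, per comment 4. It also defeats the --gres=gpu:0 workaround,
# because a job allocated zero GPUs is then denied all GPU device nodes.
ConstrainDevices=yes
```

Conversely, leaving ConstrainDevices unset (or `no`) restores the workaround but, as warned above, lets any job on the node open any GPU regardless of its allocation.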