| Summary: | Oversubscribe gpu gres (maybe with --overlap?) | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Chrysovalantis Paschoulas <c.paschoulas> |
| Component: | GPU | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 21.08.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Jülich | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Chrysovalantis Paschoulas
2021-12-15 07:47:32 MST
Comment 2
Marshall Garey

As far as I know, MPS is already an obsolete technology and never really worked well.

MIG[1][2] is the newer technology, and should actually work better than the proposed --overlap=force option. With MIG, you can partition a GPU to make it appear as multiple GPUs. You can partition a GPU up to 7 ways. Once that is done, the GPU is simply treated like multiple GPUs.

Currently, --overlap only allows sharing CPUs. However, a site is sponsoring us to add this functionality to allow job steps to overlap not just CPUs but all resources (CPUs, memory, GRES). So for 22.05 we're adding a new option for --overlap (--overlap=force) which will allow this to happen. However, the performance of doing this will be worse than using MIG and cgroups (ConstrainDevices=yes), since with MIG plus cgroups different jobs/steps won't be contending for the same resources on the GPU.

The --gres=gpu:0 workaround only works if you don't constrain devices via cgroups (see ConstrainDevices in cgroup.conf). If that is what you want, then that is fine, but be warned - without constraining devices, *any* job can use *any* GPU on its compute node(s), even if the job was not allocated those GPUs.

Have I answered your question? Do you have any follow-up questions?

[1] https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#running-with-mig
[2] https://slurm.schedmd.com/gres.html#MIG_Management

Comment 3
Chrysovalantis Paschoulas

(In reply to Marshall Garey from comment #2)
> As far as I know, MPS is already an obsolete technology and never really
> worked well.
>
> MIG[1][2] is the newer technology, and should actually work better than the
> proposed --overlap=force option. With MIG, you can partition a GPU to make
> it appear as multiple GPUs. You can partition a GPU up to 7 ways. Once that
> is done, the GPU is simply treated like multiple GPUs.

I don't know exactly the users' requirements, but it looks to me as though MIG improves the situation but doesn't solve the problem of oversubscription.
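For reference, the static seven-way partitioning mentioned above is done on the GPU side with `nvidia-smi`, outside of Slurm. A minimal sketch follows; the profile ID (19, i.e. the `1g.5gb` profile on an A100-40GB) is an assumption for illustration and varies by GPU model:

```shell
# Sketch: enable MIG mode on GPU 0 (admin-only; may require a GPU reset)
nvidia-smi -i 0 -mig 1

# Partition GPU 0 seven ways with the smallest profile and create the
# matching compute instances (-C). Profile ID 19 assumes an A100-40GB.
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# Slurm can then pick up the resulting MIG devices automatically via NVML,
# e.g. with "AutoDetect=nvml" in gres.conf (see [2] above).
```

After this, each MIG instance is scheduled like an ordinary GPU GRES, which is why the number of concurrent sharers is capped at seven per physical GPU.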
I mean, what if they want to run more tasks than the maximum number of MIG partitions?

> Currently, --overlap only allows sharing CPUs. However, a site is sponsoring
> us to add this functionality to allow job steps to overlap not just CPUs but
> all resources (CPUs, memory, GRES). So for 22.05 we're adding a new option
> for --overlap (--overlap=force) which will allow this to happen.

That is great news, and it seems that this '--overlap=force' is exactly what I am asking for! :)

> However, the performance of doing this will be worse than using MIG and
> cgroups (ConstrainDevices=yes), since with MIG plus cgroups different
> jobs/steps won't be contending for the same resources on the GPU.
>
> The --gres=gpu:0 workaround only works if you don't constrain devices via
> cgroups (see ConstrainDevices in cgroup.conf). If that is what you want,
> then that is fine, but be warned - without constraining devices, *any* job
> can use *any* GPU on its compute node(s), even if the job was not allocated
> those GPUs.

Yes, it is true that without constraining devices the users can use any GPUs. This is fine for us because we don't have node sharing on those nodes, so the users can manage themselves how they allocate the available GPUs. With node sharing we will see what we will do. I have another question here: with ConstrainDevices, does Slurm constrain devices at the job-step level or at the job level?

> Have I answered your question? Do you have any follow-up questions?
>
> [1] https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#running-with-mig
> [2] https://slurm.schedmd.com/gres.html#MIG_Management

Thanks for your detailed feedback ;)

Comment 4
Marshall Garey

(In reply to Chrysovalantis Paschoulas from comment #3)
> I don't know exactly the users' requirements but for me it looks like the
> MIG improves the situation but doesn't solve the problem of
> oversubscription. I mean, what if they want to run more tasks than the max
> number of MIG partitions?
In that case, you would need the new --overlap=force option.

> I have another question here: with ConstrainDevices, does Slurm constrain
> devices at the job-step level or at the job level?

Both.

> Thanks for your detailed feedback ;)

You're welcome! I actually do need to correct myself on MPS, though: MPS is not an obsolete technology. That was a wrong choice of words - it can be useful, although there are limitations (just as there are limitations to MIG). Instead, I should explain what MPS and MIG are, the limitations of each, and the further limitations in Slurm.

In my understanding, the current status and main limitations are:

* MPS: a single user can share a dynamic percentage of a GPU. In Slurm, only a single GPU can be shared among N jobs (the jobs decide their %). We (SchedMD) aren't currently planning to add support for sharing N GPUs simultaneously. And again, this is only for jobs (not steps). This is why it is very limiting in Slurm - steps can't share a GPU with MPS. (And even outside of Slurm, only a single user can share a GPU at a time.) With this knowledge, hopefully you can decide if MPS will be of use to you. For sharing a GPU between steps, --overlap=force or MIG are your options.
* MIG: GPUs can be statically configured like vGPUs, each with a static % of the actual GPU, and jobs can select full static vGPUs (like normal GPUs). A GPU can only be "partitioned" up to seven ways right now. For Slurm, there is no plan (yet) to allow jobs to dynamically reconfigure GPUs into vGPUs, i.e. to dynamically change the number of vGPUs on a node, although this is in theory possible to do.
* MPS and MIG can be used simultaneously: a vGPU (from MIG) can also be shared with MPS.

Hopefully that clears things up! I'll close this as infogiven for now. Please re-open this ticket if you have more questions.
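To make the step-oversubscription use case concrete, here is a minimal job-script sketch of what the thread describes. The `--overlap=force` spelling is the syntax proposed in this ticket for 22.05; the flag name and semantics in the released version may differ, and `./task_a`/`./task_b` are hypothetical binaries:

```shell
#!/bin/bash
# Sketch: oversubscribe one allocated GPU across concurrent job steps.
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=gpu:1

# With the proposed --overlap=force, each step may claim the full
# allocation (CPUs, memory, and GRES), so both steps see the same GPU
# and run at the same time instead of queuing for the GRES.
srun --overlap=force --gres=gpu:1 -n1 ./task_a &
srun --overlap=force --gres=gpu:1 -n1 ./task_b &
wait
```

Without the overlap option, the second `srun` would block until the first step released the GPU GRES, which is the queuing behavior this ticket is about.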
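Since the thread hinges on whether devices are constrained, a sketch of the relevant configuration may help. This is a fragment, not a complete cgroup.conf:

```shell
# cgroup.conf (fragment)
# With this set, Slurm uses the cgroup devices controller to block access
# to GPUs that were not allocated - enforced at both the job and the step
# level, per comment 4. It also defeats the --gres=gpu:0 workaround,
# because a job allocated zero GPUs is then denied all GPU device nodes.
ConstrainDevices=yes
```

Conversely, leaving ConstrainDevices unset (or `no`) restores the workaround but, as warned above, lets any job on the node open any GPU regardless of its allocation.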