Ticket 17070

Summary: Options for splitting a GPU
Product: Slurm Reporter: Aravind <aravind.padmanabhan>
Component: GPU    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: Sick Kids Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Aravind 2023-06-28 10:15:11 MDT
Hi,
Thank you for your support on previous cases.
We were wondering if it would be possible to
1. Split a GPU
2. Request GPU cores/GPU memory rather than a full GPU, so that the resource can be shared effectively across jobs.
Thank you
Comment 1 Caden Ellis 2023-06-28 17:04:26 MDT
This link is a good place to start:

https://slurm.schedmd.com/gres.html

Your options are MPS, MIG, and shards.

MIG: If your GPUs are NVIDIA and new enough, you may have the option of enabling MIG. Here are the supported GPUs:

https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus

You can effectively separate each GPU into up to 7 instances and schedule them independently. MIG also has the advantage of fencing processes to keep users from using more than their share, similar to cgroups.
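As a rough sketch of what scheduling MIG instances can look like (based on the gres.html page above; the node name and the MIG profile name here are examples, not your actual hardware):

```
# gres.conf - let Slurm detect the MIG instances through NVML
AutoDetect=nvml

# slurm.conf - node definition; with AutoDetect the MIG instances
# show up as typed gpu GRES, e.g. two 3g.20gb slices of an A100
NodeName=node01 Gres=gpu:nvidia_a100_3g.20gb:2 ...

# Jobs then request a MIG instance like any other typed GPU:
#   srun --gres=gpu:nvidia_a100_3g.20gb:1 ...
```

Each MIG instance is scheduled as its own GPU, so two jobs on the same physical card are hardware-isolated from each other.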

Shards: Very similar to MIG, but an abstraction in Slurm rather than a hardware partition. Shards only allow the GPU to be shared; they do not fence processes, so this works best if your workload is homogeneous. For example, you can split a GPU into 4 shards, but one job could still consume all of the GPU's resources if it wanted to. You are saying there are 4 shares of the GPU, not fencing it into 4 equal parts. So a job requesting 4 shards isn't getting more resources than a 1-shard job; it just uses up all the shares so no other job can use that GPU. Assuming 4 users knew how to work together, they could each use 15%, 35%, 40%, and 5% of the GPU. If you only had 4 shards configured, a 5th user could not run on the remaining 5%.
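A minimal sharding sketch, again following the gres.html page (node name and device path are placeholders for your site):

```
# slurm.conf
GresTypes=gpu,shard
NodeName=node01 Gres=gpu:1,shard:4 ...

# gres.conf - one physical GPU split into 4 shards
Name=gpu  File=/dev/nvidia0
Name=shard Count=4

# Jobs request shards instead of whole GPUs:
#   srun --gres=shard:1 ...
```

Note the shard count is just an accounting limit on how many jobs can land on the GPU at once; nothing stops a 1-shard job from using the whole card.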

MPS: https://slurm.schedmd.com/gres.html#MPS_Management
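For completeness, an MPS configuration sketch from that same page (node name and device path are placeholders; the Count is conventionally 100 so requests read as a percentage of the GPU):

```
# slurm.conf
GresTypes=gpu,mps
NodeName=node01 Gres=gpu:1,mps:100 ...

# gres.conf - 100 MPS "percent" units on one GPU
Name=gpu File=/dev/nvidia0
Name=mps Count=100

# A job asking for half the GPU's compute capacity:
#   srun --gres=mps:50 ...
```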

Does one of these options work for you?

Caden
Comment 2 Aravind 2023-07-04 12:44:53 MDT
Hi Caden,
Thank you so much for the info.
Just curious: does the MPS method need any extra licensing?
Also, aside from the vendor documentation, wondering if SchedMD has any guides for us.
Thanks
Comment 3 Caden Ellis 2023-07-06 10:57:51 MDT
We have the link I shared: https://slurm.schedmd.com/gres.html. Other than that, you'd need the vendor documentation.

There is no other licensing for MPS. You just need the correct hardware.

Caden
Comment 4 Caden Ellis 2023-07-10 10:44:06 MDT
Do you have any other related questions? If not I will close this out.

Caden
Comment 5 Caden Ellis 2023-07-17 10:47:36 MDT
Closing