Ticket 17070

Summary: Options for splitting a GPU
Product: Slurm Reporter: Aravind <aravind.padmanabhan>
Component: GPU    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: Sick Kids Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Aravind 2023-06-28 10:15:11 MDT
Hi,
Thank you for your support on previous cases.
We were wondering if it would be possible to
1. Split a GPU
2. Request GPU cores/GPU memory rather than a full GPU, so that the resource can be shared effectively across jobs.
Thank you
Comment 1 Caden Ellis 2023-06-28 17:04:26 MDT
This link is a good place to start:

https://slurm.schedmd.com/gres.html

Your options are MPS, MIG, and shards.

MIG: If your GPUs are NVIDIA and new enough, you may have the option of enabling MIG. Here are the supported GPUs:

https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus

You can effectively separate each GPU into up to 7 instances and schedule them independently. MIG also has the advantage of fencing processes to keep users from using more than their share, similar to cgroups.
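As a rough sketch of what scheduling MIG instances can look like (based on the gres.html page above; the node name and the MIG profile name here are examples, not your actual hardware):

```
# gres.conf - let Slurm detect the MIG instances through NVML
AutoDetect=nvml

# slurm.conf - node definition; with AutoDetect the MIG instances
# show up as typed gpu GRES, e.g. two 3g.20gb slices of an A100
NodeName=node01 Gres=gpu:nvidia_a100_3g.20gb:2 ...

# Jobs then request a MIG instance like any other typed GPU:
#   srun --gres=gpu:nvidia_a100_3g.20gb:1 ...
```

Each MIG instance is scheduled as its own GPU, so two jobs on the same physical card are hardware-isolated from each other.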

Shards: Very similar to MIG, but an abstraction in Slurm rather than a hardware partition. Shards only allow the GPU to be shared; they do not fence processes, so this works best if your workload is homogeneous. For example, you can split a GPU into 4 shards, but one job could still consume all of the GPU's resources if it wanted to. You are saying there are 4 shares of the GPU, not fencing it into 4 equal parts. So a job requesting 4 shards isn't getting more resources than a 1-shard job; it just uses up all the shares so no other job can use that GPU. Assuming 4 users knew how to work together, they could each use 15%, 35%, 40%, and 5% of the GPU. If you only had 4 shards configured, a 5th user could not run on the remaining 5%.
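A minimal sharding sketch, again following the gres.html page (node name and device path are placeholders for your site):

```
# slurm.conf
GresTypes=gpu,shard
NodeName=node01 Gres=gpu:1,shard:4 ...

# gres.conf - one physical GPU split into 4 shards
Name=gpu  File=/dev/nvidia0
Name=shard Count=4

# Jobs request shards instead of whole GPUs:
#   srun --gres=shard:1 ...
```

Note the shard count is just an accounting limit on how many jobs can land on the GPU at once; nothing stops a 1-shard job from using the whole card.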

MPS: https://slurm.schedmd.com/gres.html#MPS_Management
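For completeness, an MPS configuration sketch from that same page (node name and device path are placeholders; the Count is conventionally 100 so requests read as a percentage of the GPU):

```
# slurm.conf
GresTypes=gpu,mps
NodeName=node01 Gres=gpu:1,mps:100 ...

# gres.conf - 100 MPS "percent" units on one GPU
Name=gpu File=/dev/nvidia0
Name=mps Count=100

# A job asking for half the GPU's compute capacity:
#   srun --gres=mps:50 ...
```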

Does one of these options work for you?

Caden
Comment 2 Aravind 2023-07-04 12:44:53 MDT
Hi Caden,
Thank you so much for the info.
Just curious: does the MPS method need any extra licensing?
Also, aside from the vendor documentation, wondering if SchedMD has any guides for us.
Thanks
Comment 3 Caden Ellis 2023-07-06 10:57:51 MDT
We have the link I shared: https://slurm.schedmd.com/gres.html. Other than that, you'd need the vendor documentation.

There is no other licensing for MPS. You just need the correct hardware.

Caden
Comment 4 Caden Ellis 2023-07-10 10:44:06 MDT
Do you have any other related questions? If not I will close this out.

Caden
Comment 5 Caden Ellis 2023-07-17 10:47:36 MDT
Closing