| Summary: | Options for splitting a GPU | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Aravind <aravind.padmanabhan> |
| Component: | GPU | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 22.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Sick Kids | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
|
Description
Aravind 2023-06-28 10:15:11 MDT
This link is a good place to start: https://slurm.schedmd.com/gres.html

Your options are MPS, MIG, and shards.

MIG: If your GPUs are NVIDIA and new enough, you may have the option of enabling MIG. Here are the supported GPUs: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus You can effectively partition each GPU into up to 7 instances and schedule them independently. MIG also has the advantage of fencing processes to keep users from using more than their share, much like cgroups.

Shards: Very similar to MIG, but shards are an abstraction. They only allow the GPU to be shared; they do not fence processes. This works best if your workload is homogeneous. For example, you can split a GPU into 4 shards, but any one job could still take all of the GPU's resources if it wanted to. You are declaring that there are 4 shares of a GPU, not fencing the GPU into 4 equal parts. So a job requesting 4 shards isn't getting more resources than a 1-shard job; it just uses up all the shares so that no other job can use that GPU. Assuming 4 users knew how to work together, they could each use, say, 15%, 35%, 40%, and 5% of the GPU. If you only had 4 shards configured, a 5th user could not run on the remaining 5%.

MPS: https://slurm.schedmd.com/gres.html#MPS_Management

Do any of these options work for you?

Caden

---

Aravind:
Hi Caden, thank you so much for the info. Just curious: does the MPS method need any extra licensing? Also, aside from the vendor documentation, I'm wondering if SchedMD has any guides for us. Thanks.

Caden:
We have the link I shared, https://slurm.schedmd.com/gres.html; other than that you'd need the vendor documentation. There is no extra licensing for MPS. You just need the correct hardware.

Caden:
Do you have any other related questions? If not, I will close this out.

Caden:
Closing
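For reference, a shard setup along the lines described in the thread might look like this in the Slurm configuration. This is a minimal sketch based on the gres.html documentation linked above; the node name, device path, and counts are illustrative assumptions, not values from this site:

```
# slurm.conf -- register the GRES types and advertise them on the node
# (node name "gpunode01" is a placeholder)
GresTypes=gpu,shard
NodeName=gpunode01 Gres=gpu:1,shard:4

# gres.conf on gpunode01 -- 4 shards backed by the single physical GPU
# (device path assumed to be /dev/nvidia0)
Name=gpu File=/dev/nvidia0
Name=shard Count=4 File=/dev/nvidia0
```

A job would then request one share with something like `srun --gres=shard:1 ...`. MPS is configured in a similar fashion, with a `Name=mps Count=...` line in gres.conf instead of `shard` (the count representing the percentage of the GPU handed out in total); see the MPS_Management link above for the details.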