| Summary: | Allocate pairs of GPUs with NVLINK | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Prabhjyot Saluja <prabhjyot_saluja> |
| Component: | GPU | Assignee: | Ben Glines <ben.glines> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | ben.glines |
| Version: | 22.05.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Brown Univ | | |
| Attachments: | gres.conf | | |
Hi,

Slurm will try to schedule GPUs that are linked together first, but if there aren't any, it will schedule whatever GPUs are available. For example, if a job needs 2 GPUs but the only GPUs available aren't linked together, Slurm will schedule 2 non-linked GPUs anyway. This could be the reason why you saw two GPUs allocated that weren't actually linked together. If you have a job that requests 2 GPUs, there are 2 GPUs available that are linked together, and Slurm still chooses a different pair, then that is an issue. If you can reproduce such a case, please reply with the steps to reproduce.

> 1. Make it mandatory to allocate pairs of GPUs that are connected via NVLINK
> when requested.

This isn't currently possible. The best you could do is use job_submit to ensure that only pairs of GPUs are allocated. If all jobs request pairs of GPUs, then Slurm should only schedule GPUs that are linked together. When jobs requesting single GPUs come into play, that can result in available GPUs that are not linked together. You could make it mandatory to only request pairs of GPUs on these nodes using job_submit (if a job requests an odd number of GPUs, make it even, e.g. --gpus=1 -> --gpus=2), which should then ensure that the GPUs on a job are connected via NVLink. This would mean, though, that any job requesting an odd number of GPUs would be allocated an extra GPU.

> 2. Ensure that all allocated GPUs are on the same CPU socket

This can be achieved with the --gres-flags=enforce-binding option.
https://slurm.schedmd.com/sbatch.html#OPT_enforce-binding

(In reply to Prabhjyot Saluja from comment #0)
> Created attachment 28786 [details]
> gres.conf
>
> Hi -
> We have a combination of GPU nodes with pairs of NVLINK between 2-GPUs. Each
> node has 8-GPUs

Looking closer at your gres.conf, I am noticing that the nodes you are talking about appear to be DGX nodes. Is that right? If so, you can ignore the part of my reply about using job_submit to force jobs to allocate an even number of GPUs. If you are requesting GPUs on the same socket on a DGX node (with --gres-flags=enforce-binding), then all of the GPUs should be connected via NVLink, so you shouldn't have to worry about requiring the GPUs to be connected via NVLink; being on the same socket should be enough. The only time they wouldn't be connected is when they are on different sockets, and even then, some are still connected across the sockets.

Have you been able to do any testing with the --gres-flags=enforce-binding option to see if this solves your issue?
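Both suggestions above can be expressed directly on a batch request. The following is only a minimal sketch (the job name, output file, and time limit are illustrative placeholders, not values from this ticket):

```bash
#!/bin/bash
# Minimal sketch: request an even number of GPUs and restrict them to the
# socket(s) whose CPUs are bound to the job. Job name and output file are
# illustrative placeholders.
#SBATCH --job-name=nvlink-pair
#SBATCH --output=nvlink-pair.out
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=00:30:00

# Show how the GPUs allocated to this job are connected:
# NV# indicates an NVLink connection, SYS indicates PCIe/inter-socket traversal.
nvidia-smi topo -m
```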
Hi Ben,

Apologies for the late reply. We are still having this issue on the non-DGX nodes. Here is what I did:

1. Request an interactive session with two GPUs (it did NOT allocate a pair connected via NVLINK):

```
salloc -J interact -N 1-1 -n 1 --time=30:00 --gres=gpu:2 --mem=4g -p gpu-he --gres-flags=enforce-binding srun --pty bash

psaluja@gpu1504:~ $ nvidia-smi topo -m
        GPU0    GPU1    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity  NUMA Affinity
GPU0     X      SYS     SYS     SYS     SYS     SYS     0             0-1
GPU1    SYS      X      SYS     SYS     SYS     SYS     0             0-1
mlx5_0  SYS     SYS      X      PIX     SYS     SYS
mlx5_1  SYS     SYS     PIX      X      SYS     SYS
mlx5_2  SYS     SYS     SYS     SYS      X      PIX
mlx5_3  SYS     SYS     SYS     SYS     PIX      X
```

2. If I ssh into the node without a SLURM allocation, nvidia-smi topo -m shows:

```
psaluja@gpu1504:~ $ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity  NUMA Affinity
GPU0     X    NV4   NODE  NODE  SYS   SYS   SYS   SYS   NODE    NODE    SYS     SYS     0-23          0
GPU1    NV4    X    NODE  NODE  SYS   SYS   SYS   SYS   PHB     PHB     SYS     SYS     0-23          0
GPU2    NODE  NODE   X    NV4   SYS   SYS   SYS   SYS   NODE    NODE    SYS     SYS     0-23          0
GPU3    NODE  NODE  NV4    X    SYS   SYS   SYS   SYS   NODE    NODE    SYS     SYS     0-23          0
GPU4    SYS   SYS   SYS   SYS    X    NV4   NODE  NODE  SYS     SYS     NODE    NODE    24-47         1
GPU5    SYS   SYS   SYS   SYS   NV4    X    NODE  NODE  SYS     SYS     NODE    NODE    24-47         1
GPU6    SYS   SYS   SYS   SYS   NODE  NODE   X    NV4   SYS     SYS     PHB     PHB     24-47         1
GPU7    SYS   SYS   SYS   SYS   NODE  NODE  NV4    X    SYS     SYS     NODE    NODE    24-47         1
```

Please let me know if you need any additional details.

Does this happen every time you request a job with 2 GPUs on those nodes, or only sometimes? When that job was running, were there any other jobs already running on that node? If so, could you try running the job by itself? Could you also try requesting all 8 GPUs, as well as 4 GPUs, on the node and replying with the nvidia-smi topo -m output?

So in summary:
1. nvidia-smi topo -m output with only one job requesting 2 GPUs on the node
2. nvidia-smi topo -m output with only one job requesting 4 GPUs on the node
3. nvidia-smi topo -m output with only one job requesting 8 GPUs on the node

Also with --gres-flags=enforce-binding on each of those jobs.
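As an illustration, the three checks requested above could be scripted roughly as follows. This is only a sketch: the gpu-he partition is taken from the salloc example earlier in this ticket, the node is assumed to be otherwise idle, and the output file names are invented for the example.

```bash
#!/bin/bash
# Sketch: with the node otherwise idle, capture the topology seen by one job
# at each GPU count. Partition (gpu-he) comes from the reporter's example;
# output file names are illustrative.
for n in 2 4 8; do
    srun -p gpu-he -N 1 -n 1 --time=00:10:00 \
         --gres=gpu:${n} --gres-flags=enforce-binding \
         nvidia-smi topo -m > topo_${n}gpu.txt
done
```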
Hi Ben,

It only happens when there is already another job running on the node. When requesting 4 or 8 GPUs, SLURM always seems to allocate GPUs with NVLINK enabled, and that makes sense. This ticket should be all set. Thank you so much for all your help.

Singh

(In reply to Prabhjyot Saluja from comment #6)
> Hi Ben,
>
> It only happens when there is already another job running on the node. When
> requesting 4 or 8 GPUs, SLURM always seems to allocate GPUs with NVLINK
> enabled, and that makes sense. This ticket should be all set. Thank you so
> much for all your help.
>
> Singh

Sounds good! Closing this now.

Original report (comment #0):

Created attachment 28786 [details]
gres.conf

Hi -

We have a combination of GPU nodes with NVLINK pairs between 2 GPUs. Each node has 8 GPUs.

Issue: Running nvidia-smi topo -m without a SLURM allocation reports that (GPU0, GPU1) have NVLINK connectivity, but when requesting resources via SLURM, nvidia-smi topo -m reports SYS (a connection traversing PCIe) instead of NVLINK. We compared the individual GPU UUIDs (via nvidia-smi -L), and sometimes the allocated GPU pairs aren't even on the same CPU socket. We have AutoDetect=nvml enabled in gres.conf (attached).

We would like to know if it is possible to:
1. Make it mandatory to allocate pairs of GPUs that are connected via NVLINK when requested.
2. Ensure that all allocated GPUs are on the same CPU socket.

Please let me know if you need any further information.

Singh
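The diagnosis in the report relies on two commands: nvidia-smi -L for the UUIDs of the GPUs a job actually received, and nvidia-smi topo -m for how they are wired. A small script run inside an allocation can capture both at once. This is only an illustrative sketch, not something from the ticket; the output file name is made up.

```bash
#!/bin/bash
# Illustrative sketch: run inside a Slurm GPU allocation to record which
# physical GPUs the job received and how they are interconnected.
# The output file name is made up for this example.
out="gpu_alloc_check_$(hostname).txt"
{
    echo "== GPUs visible to this job (UUIDs, via nvidia-smi -L) =="
    nvidia-smi -L
    echo
    echo "== Interconnect between the visible GPUs (nvidia-smi topo -m) =="
    # NV# between a pair means NVLink; SYS means the path crosses PCIe and
    # possibly the inter-socket link, which is the symptom reported here.
    nvidia-smi topo -m
} > "$out"
echo "Wrote $out"
```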