Ticket 15995

Summary: Allocate pairs of GPUs with NVLINK
Product: Slurm
Reporter: Prabhjyot Saluja <prabhjyot_saluja>
Component: GPU
Assignee: Ben Glines <ben.glines>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: ben.glines
Version: 22.05.7
Hardware: Linux
OS: Linux
Site: Brown Univ
Attachments: gres.conf

Description Prabhjyot Saluja 2023-02-09 13:31:56 MST
Created attachment 28786 [details]
gres.conf

Hi -
We have a set of GPU nodes where the GPUs are connected in NVLink pairs (NVLink between two GPUs). Each node has 8 GPUs.

Issue:
Running 'nvidia-smi topo -m' on a node without a SLURM allocation reports that (GPU0, GPU1) have NVLINK connectivity. But when we request resources via SLURM, 'nvidia-smi topo -m' inside the allocation reports SYS (connection traversing PCIe) instead of NVLINK. We compared the individual GPU UUIDs (via 'nvidia-smi -L') and sometimes the allocated GPU pair isn't even on the same CPU socket. We have AutoDetect=nvml enabled in gres.conf (attached).
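
Roughly, this is the comparison we are doing (a sketch; the salloc options are just an example):

# Inside a 2-GPU allocation: topology and UUIDs of the GPUs we were given.
salloc -N 1 -n 1 --gres=gpu:2 srun --pty bash
nvidia-smi topo -m
nvidia-smi -L

# Directly on the node, outside of SLURM: full 8-GPU topology and UUID list.
nvidia-smi topo -m
nvidia-smi -L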

We would like to know if it is possible to:
1. Make it mandatory to allocate pairs of GPUs that are connected via NVLINK when requested.
2. Ensure that all allocated GPUs are on the same CPU socket

Please let me know if you need any further information

Singh
Comment 1 Ben Glines 2023-02-13 13:46:10 MST
Hi,

Slurm will try to schedule gpus that are linked together first, but if there aren't any, it will schedule whatever gpus are available. For example, if a job needs 2 gpus, but the only gpus available aren't linked together, Slurm will schedule 2 non-linked gpus anyway. This could be the reason why you saw two gpus allocated that weren't actually linked together.

If you have a job that requests 2 gpus, and there are 2 gpus available that are linked together, but then Slurm chooses a different pair of gpus, then that is an issue. If you can reproduce such a case, please reply with the steps to reproduce.


> 1. Make it mandatory to allocate pairs of GPUs that are connected via NVLINK
> when requested.
This isn't currently possible. The closest you can get is to use a job_submit plugin to ensure that only pairs of gpus are requested.

If all jobs request pairs of gpus, then Slurm should only schedule gpus that are linked together. When jobs requesting single gpus come into play, that can leave available gpus that are not linked together. You could use job_submit to make it mandatory that jobs only request pairs of gpus (if a job requests an odd number of gpus, round it up to an even number, e.g. --gpus=1 -> --gpus=2), which should then ensure that the gpus in a job are connected via NVLink. The tradeoff is that any job requesting an odd number of gpus would be allocated an extra gpu.
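
For illustration only, a minimal job_submit/lua sketch of that idea. The job_desc field that carries the GPU request and its exact string format depend on how gpus are requested (--gpus, --gres, --gpus-per-node, ...) and on the Slurm version, so treat the tres_per_job field and the pattern matching below as assumptions to verify on your site:

-- job_submit.lua (sketch): round odd GPU requests up to an even count
-- so that allocations can be built from NVLink pairs.
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Assumption: --gpus=N shows up in job_desc.tres_per_job as "...gpu:N".
    local tres = job_desc.tres_per_job
    if tres ~= nil then
        local count = tonumber(string.match(tres, "gpu:(%d+)") or "")
        if count ~= nil and count % 2 == 1 then
            local new_count = count + 1
            job_desc.tres_per_job = string.gsub(tres, "gpu:%d+", "gpu:" .. new_count)
            slurm.log_info("job_submit/lua: rounded odd GPU request up to " .. new_count)
        end
    end
    return slurm.SUCCESS
end

-- The lua plugin also requires slurm_job_modify to be defined.
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end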


> 2. Ensure that all allocated GPUs are on the same CPU socket
This can be achieved with the --gres-flags=enforce-binding option.
https://slurm.schedmd.com/sbatch.html#OPT_enforce-binding
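
For example (a sketch; the node/task counts are just placeholders), a 2-gpu job with that flag would look like:

# enforce-binding requires the allocated CPUs to match the Cores= binding of
# the selected GPUs in gres.conf, so both GPUs should land on one socket.
sbatch -N 1 -n 1 --gres=gpu:2 --gres-flags=enforce-binding \
       --wrap="nvidia-smi topo -m"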
Comment 2 Ben Glines 2023-02-14 09:16:50 MST
(In reply to Prabhjyot Saluja from comment #0)
> Created attachment 28786 [details]
> gres.conf
> 
> Hi -
> We have a set of GPU nodes where the GPUs are connected in NVLink pairs
> (NVLink between two GPUs). Each node has 8 GPUs.

I am noticing now, after looking closer at your gres.conf, that the nodes you are talking about are DGX nodes? If so, you can ignore the part of my reply about using job_submit to force jobs to allocate an even number of gpus.

If you are requesting gpus on the same socket on a DGX node (with --gres-flags=enforce-binding), then all of the gpus should be connected via NVLink. So you shouldn't have to worry about requiring the gpus to be connected via NVLink since being on the same socket should be enough. The only time that they wouldn't be connected is when they are on different sockets, and even then, some are still connected across the sockets.
Comment 3 Ben Glines 2023-02-23 09:38:51 MST
Have you been able to do any testing with the --gres-flags=enforce-binding option to see if this solves your issue?
Comment 4 Prabhjyot Saluja 2023-02-23 09:59:27 MST
Hi Ben,

Apologies for the late reply. We are still having this issue on the non-DGX nodes. Here is what I did:

1. Request an interactive session with two GPUs (it did NOT allocate a pair with NVLINK connected)
salloc -J interact -N 1-1 -n 1 --time=30:00 --gres=gpu:2 --mem=4g -p gpu-he --gres-flags=enforce-binding srun --pty bash
psaluja@gpu1504:~ $ nvidia-smi topo -m
	GPU0	GPU1	mlx5_0	mlx5_1	mlx5_2	mlx5_3	CPU Affinity	NUMA Affinity
GPU0	 X 	SYS	SYS	SYS	SYS	SYS	0	0-1
GPU1	SYS	 X 	SYS	SYS	SYS	SYS	0	0-1
mlx5_0	SYS	SYS	 X 	PIX	SYS	SYS
mlx5_1	SYS	SYS	PIX	 X 	SYS	SYS
mlx5_2	SYS	SYS	SYS	SYS	 X 	PIX
mlx5_3	SYS	SYS	SYS	SYS	PIX	 X

2. If I ssh into the node outside of SLURM, then 'nvidia-smi topo -m' shows
npsaluja@gpu1504:~ $ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_1	mlx5_2	mlx5_3	CPU Affinity	NUMA Affinity
GPU0	 X 	NV4	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	SYS	SYS	0-23	0
GPU1	NV4	 X 	NODE	NODE	SYS	SYS	SYS	SYS	PHB	PHB	SYS	SYS	0-23	0
GPU2	NODE	NODE	 X 	NV4	SYS	SYS	SYS	SYS	NODE	NODE	SYS	SYS	0-23	0
GPU3	NODE	NODE	NV4	 X 	SYS	SYS	SYS	SYS	NODE	NODE	SYS	SYS	0-23	0
GPU4	SYS	SYS	SYS	SYS	 X 	NV4	NODE	NODE	SYS	SYS	NODE	NODE	24-47	1
GPU5	SYS	SYS	SYS	SYS	NV4	 X 	NODE	NODE	SYS	SYS	NODE	NODE	24-47	1
GPU6	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NV4	SYS	SYS	PHB	PHB	24-47	1
GPU7	SYS	SYS	SYS	SYS	NODE	NODE	NV4	 X 	SYS	SYS	NODE	NODE	24-47	1

Please let me know if you need any additional details.
Comment 5 Ben Glines 2023-02-24 10:44:43 MST
Does this happen every time you request a job w/ 2 gpus on those nodes, or only sometimes?

When that job was running, were there any other jobs already running on that node? If so, could you try running the job by itself?

Could you also try requesting all 8 gpus as well as 4 gpus on the node and reply with the 'nvidia-smi topo -m' output?

So in summary:
1. nvidia-smi topo -m output w/ only one job requesting 2 gpus on the node
2. nvidia-smi topo -m output w/ only one job requesting 4 gpus on the node
3. nvidia-smi topo -m output w/ only one job requesting 8 gpus on the node

Also use --gres-flags=enforce-binding on each of those jobs.
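
A sketch of how those three runs could be collected, assuming the node from your example (gpu1504) and that nothing else is running on it; the output file names are just placeholders:

# One allocation at a time on gpu1504; save the topology each job sees.
for n in 2 4 8; do
    srun -w gpu1504 -N 1 -n 1 --gres=gpu:$n --gres-flags=enforce-binding \
         nvidia-smi topo -m > topo_${n}gpu.txt
done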
Comment 6 Prabhjyot Saluja 2023-02-28 10:47:45 MST
Hi Ben,

It only happens when there is already another job running on the node; when requesting 4 or 8 GPUs, SLURM always seems to allocate GPUs with NVLINK enabled, and that makes sense. This ticket should be all set. Thank you so much for all your help.

Singh
Comment 7 Ben Glines 2023-02-28 11:33:20 MST
(In reply to Prabhjyot Saluja from comment #6)
> Hi Ben,
> 
> It only happens when there is already another job running on the node; when
> requesting 4 or 8 GPUs, SLURM always seems to allocate GPUs with NVLINK
> enabled, and that makes sense. This ticket should be all set. Thank you so
> much for all your help.
> 
> Singh

Sounds good! Closing this now.