Ticket 12921

Summary: Allocating pairs of GPUs tied by NVLinks
Product: Slurm Reporter: Jeff Haferman <jlhaferm>
Component: GPU Assignee: Tim McMullan <mcmullan>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 20.02.5   
Hardware: Linux   
OS: Linux   
Site: NPS HPC

Description Jeff Haferman 2021-11-29 10:44:15 MST
Hi -
We have a couple of GPU nodes that have 8 Quadro RTX 8000 GPUs. Each pair of GPUs is linked together by an NVLink bridge. So, as the system ("nvidia-smi") sees them, the following pairs are physically adjacent and linked by a bridge: (0,1), (2,3), (3,4), (5,6), (7,8).

I see a related issue at https://lists.schedmd.com/pipermail/slurm-users/2019-June/003645.html, but I haven't found very clear documentation.

Our "gres.conf" for those nodes looks like:
NodeName=compute-8-5 Name=gpu Type=rtx8000 file=/dev/nvidia[0-7]
NodeName=compute-9-5 Name=gpu Type=rtx8000 file=/dev/nvidia[0-7]

and a GPU allocation can be requested using something like
--gres=gpu:2

What I want to know is: is there a way to tell Slurm that pairs of these GPUs are linked together? Or, at the very least, is there a way for a user (who knows that pairs are linked together) to request a pair that is linked by an NVLink?

This gets a bit complicated if a user already has a 1-GPU job running on one of these nodes (or any job holding an odd number of GPUs).

Does my question make sense, and if so, can you assist or at least point to some documentation that would help?
Comment 1 Jeff Haferman 2021-11-29 10:49:15 MST
I can't edit my previous comment, but the following pairs are linked by a bridge: (0,1), (2,3), (4,5), (6,7).
Comment 4 Tim McMullan 2021-11-29 12:42:39 MST
There are a couple of ways to make Slurm aware of the NVLink configuration.

The easiest way would be to use autodetect:

> NodeName=compute-8-5 Autodetect=nvml
> NodeName=compute-9-5 Autodetect=nvml

This should pick up the links and set that information for you.  You can set SlurmdDebug=debug2 and check the log as slurmd starts up to verify that what it detected is correct.

Alternatively, you could use the manual "Links" option in gres.conf; however, it is a little more tedious.
From the gres.conf man page (https://slurm.schedmd.com/gres.conf.html#OPT_Links):

> Links  A comma-delimited list of numbers identifying the number of connections between this device and other devices to allow coscheduling of better connected devices. This is an ordered list in which the number of connections this specific device has to device number 0 would be in the first position, the number of connections it has to device number 1 in the second position, etc. A -1 indicates the device itself and a 0 indicates no connection. If specified, then this line can only contain a single GRES device (i.e. can only contain a single file via File).
>        This is an optional value and is usually automatically determined if AutoDetect is enabled. A typical use case would be to identify GPUs having NVLink connectivity. Note that for GPUs, the minor number assigned by the OS and used in the device file (i.e. the X in /dev/nvidiaX) is not necessarily the same as the device number/index. The device number is created by sorting the GPUs by PCI bus ID and then numbering them starting from the smallest bus ID. See https://slurm.schedmd.com/gres.html#GPU_Management
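The minor-number vs. device-number distinction above is easy to trip over, so here is a minimal sketch of the sorting rule the man page describes. The function and the example bus IDs are hypothetical, not taken from your nodes:

```python
# Sketch of how Slurm derives GRES device numbers, per the gres.conf
# man page: sort GPUs by PCI bus ID, then number from the smallest.
# The bus IDs below are made-up examples, not from a real node.

def device_numbers_by_bus_id(minor_to_bus_id):
    """Map each OS minor number (the X in /dev/nvidiaX) to the
    device number Slurm assigns after sorting by PCI bus ID."""
    ordered = sorted(minor_to_bus_id, key=lambda m: minor_to_bus_id[m])
    return {minor: index for index, minor in enumerate(ordered)}

# Hypothetical node where the OS enumerated GPUs out of bus-ID order,
# so minor 1 (smallest bus ID) becomes device number 0:
example = {
    0: "0000:3B:00.0",
    1: "0000:1A:00.0",
    2: "0000:5E:00.0",
    3: "0000:88:00.0",
}
print(device_numbers_by_bus_id(example))
```

If the OS happened to enumerate the GPUs in bus-ID order already, the mapping is the identity and the minor numbers can be used directly in the Links list.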

Note that you would need to determine the correct ID, and then make entries in the gres.conf like:
> NodeName=compute-8-5 Name=gpu Type=rtx8000 file=/dev/nvidia0 Links=-1,1,0,0,0,0,0,0
> NodeName=compute-8-5 Name=gpu Type=rtx8000 file=/dev/nvidia1 Links=1,-1,0,0,0,0,0,0
> NodeName=compute-8-5 Name=gpu Type=rtx8000 file=/dev/nvidia2 Links=0,0,-1,1,0,0,0,0
> NodeName=compute-8-5 Name=gpu Type=rtx8000 file=/dev/nvidia3 Links=0,0,1,-1,0,0,0,0
> NodeName=compute-8-5 Name=gpu Type=rtx8000 file=/dev/nvidia4 Links=0,0,0,0,-1,1,0,0
> NodeName=compute-8-5 Name=gpu Type=rtx8000 file=/dev/nvidia5 Links=0,0,0,0,1,-1,0,0
> NodeName=compute-8-5 Name=gpu Type=rtx8000 file=/dev/nvidia6 Links=0,0,0,0,0,0,-1,1
> NodeName=compute-8-5 Name=gpu Type=rtx8000 file=/dev/nvidia7 Links=0,0,0,0,0,0,1,-1
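Since the Links rows above follow a simple pattern (-1 on the diagonal, 1 for the bridged partner, 0 elsewhere), they can be generated rather than typed by hand. This is just a convenience sketch, assuming one NVLink connection per bridged pair and that the device numbers match the /dev/nvidiaX minor numbers (verify that on the node first, as noted above):

```python
# Sketch: generate gres.conf "Links" values for NVLink-bridged GPU pairs.
# Assumes one connection per pair and that device numbers match the
# /dev/nvidiaX minor numbers -- verify both on the actual node.

def links_for_pairs(num_gpus, pairs):
    """Return one Links string per device: -1 for the device itself,
    1 for its NVLink partner, 0 for unconnected devices."""
    rows = []
    for dev in range(num_gpus):
        row = [0] * num_gpus
        row[dev] = -1
        for a, b in pairs:
            if dev == a:
                row[b] = 1
            elif dev == b:
                row[a] = 1
        rows.append(",".join(str(v) for v in row))
    return rows

for dev, links in enumerate(links_for_pairs(8, [(0, 1), (2, 3), (4, 5), (6, 7)])):
    print(f"NodeName=compute-8-5 Name=gpu Type=rtx8000 "
          f"File=/dev/nvidia{dev} Links={links}")
```

Running it reproduces the eight gres.conf lines shown above.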

If autodetect works for you, I would really suggest that method since it should be less finicky to set up.

Let me know if this helps!
--Tim
Comment 6 Jeff Haferman 2021-11-29 14:11:42 MST
Thank you Tim, I'm going to try out the "Autodetect" approach right now and keep my fingers crossed that it works as expected.

I appreciate the quick and detailed response!
Comment 7 Jeff Haferman 2021-11-29 14:15:29 MST
Oh, I guess I do have one additional question: is it sufficient for NVML to be installed on these nodes, or would Slurm have had to link against it at build time?
Comment 8 Tim McMullan 2021-11-30 05:59:05 MST
(In reply to Jeff Haferman from comment #7)
> OH, I guess I do have one additional question: is it sufficient for NVML to
> be installed on these nodes, or would slurm have had to have linked it at
> build time?

To make use of it, we need to link against it at build time.  It doesn't need to be present on nodes you don't use "Autodetect=nvml" for (as long as you don't set Autodetect=nvml globally).  Is your CUDA installation in the default location?

Thanks!
--Tim
Comment 9 Jeff Haferman 2021-11-30 09:08:24 MST
Got it. We're planning on upgrading to 21.08.4 over winter break, so I think I'll just do the rebuild with NVML at that time and make sure the CUDA versions are consistent across our GPU nodes.

I imagine that even for the "Links" option, Slurm will have to be built against the NVML library?
Comment 10 Tim McMullan 2021-11-30 10:06:55 MST
(In reply to Jeff Haferman from comment #9)
> Got it. We're planning on upgrading to 21.08.4 over winter break, so I think
> I'll just do the rebuild with NVML at that time and make sure the cuda
> versions are consistent across our GPU nodes.

Sounds good!

> I imagine that even for the "Links" option, slurm will have to be built
> against the NVML library?

It shouldn't actually require it.  We only hook into NVML to gather the relevant GPU information when autodetect is used; if you provide the "Links" option explicitly, Slurm should just trust that and use it.
Comment 11 Tim McMullan 2021-12-03 06:25:24 MST
Hey!  I just wanted to check in and see if you were planning on trying this now or waiting.  Also wanted to see if you needed any other information on this!

Thanks!
--Tim
Comment 12 Jeff Haferman 2021-12-03 08:12:24 MST
Thanks Tim,
We're going to wait until our Winter Maintenance in January, but I think I have enough info now, and you can close the ticket. Thank you!
Comment 13 Tim McMullan 2021-12-03 12:02:55 MST
(In reply to Jeff Haferman from comment #12)
> Thanks Tim,
> We're going to wait until our Winter Maintenance in January, but I think I
> have enough info now, and you can close the ticket. Thank you!

Thank you!  Sounds good, I'll close this for now but let us know if you need anything else!

--Tim