Ticket 12903

Summary: Question on hwloc 2.x and its CUDA support
Product: Slurm Reporter: Richard Lefebvre <richard.lefebvre>
Component: slurmd Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: kaizaad
Version: 21.08.2   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10679
Site: Calcul Quebec McGill

Description Richard Lefebvre 2021-11-23 12:57:33 MST
Hi,

Does slurmd take advantage of hwloc's CUDA support if hwloc 2.x is compiled with it?

configure --with-cuda=...

We are currently using Rocky Linux 8.4 with its own hwloc 2.4.1-3, which is not linked against CUDA, and Slurm 21.08.x is compiled against that version. Are we missing any extra features by using this setup?

AutoDetect=nvml seems to be working fine with the current setup.

Richard
Comment 2 Marcin Stolarek 2021-11-24 01:20:26 MST
Richard,

Looking at cudart.h[1] or the interoperability doc[2], the functions declared there are:
- hwloc_cudart_get_device_pcidev()
- hwloc_cudart_get_device_pci_ids()
- hwloc_cudart_get_device_osdev_by_index()
- hwloc_cudart_get_device_cpuset()

None of these is currently used by Slurm. The hwloc v2 support we introduced only handles a breaking change between v1 and v2, which moved NUMA nodes out of the main hierarchy tree (for reference, see 529edc1ed6[3]).
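
To illustrate why that hierarchy change matters, here is a minimal toy model in Python. It does not use hwloc or Slurm code: the `Obj` class, the `memory_children` field name, and the two example layouts are assumptions made up for this sketch. It only shows that a walk over the main tree finds NUMA nodes in a v1-style layout but misses them in a v2-style layout, where they hang off a separate memory-children list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Obj:
    """Toy stand-in for a topology object (hypothetical, not the hwloc struct)."""
    type: str
    children: List["Obj"] = field(default_factory=list)         # main hierarchy
    memory_children: List["Obj"] = field(default_factory=list)  # v2-style side list


def count_in_main_tree(obj: Obj, wanted: str) -> int:
    """Count objects of a type by walking only the main hierarchy."""
    total = int(obj.type == wanted)
    for child in obj.children:
        total += count_in_main_tree(child, wanted)
    return total


def count_everywhere(obj: Obj, wanted: str) -> int:
    """Count objects of a type, also descending into memory children."""
    total = int(obj.type == wanted)
    for child in obj.children + obj.memory_children:
        total += count_everywhere(child, wanted)
    return total


def cores(n: int) -> List[Obj]:
    return [Obj("Core") for _ in range(n)]


# v1-style layout (assumed example): NUMA nodes sit inside the main tree.
v1 = Obj("Machine", children=[
    Obj("Package", children=[Obj("NUMANode", children=cores(4))]),
    Obj("Package", children=[Obj("NUMANode", children=cores(4))]),
])

# v2-style layout (assumed example): NUMA nodes are attached as memory children.
v2 = Obj("Machine", children=[
    Obj("Package", children=cores(4), memory_children=[Obj("NUMANode")]),
    Obj("Package", children=cores(4), memory_children=[Obj("NUMANode")]),
])

print(count_in_main_tree(v1, "NUMANode"))  # 2
print(count_in_main_tree(v2, "NUMANode"))  # 0: the main-tree walk misses them
print(count_everywhere(v2, "NUMANode"))    # 2
```

Code written against the v1 layout therefore has to be adapted for v2, which is independent of whether hwloc was built with CUDA support.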

Let me know if you have any remaining questions.

cheers,
Marcin
[1] https://www.open-mpi.org/projects/hwloc/doc/v2.4.1/a00125_source.php
[2] https://www.open-mpi.org/projects/hwloc/doc/v2.4.1/a00190.php#gad8b701d9a34923e34bd58defd4c1e704
[3] https://github.com/open-mpi/hwloc/commit/529edc1ed63d0198f634ae5d951f0ef568e39ead
Comment 3 Marcin Stolarek 2021-11-29 05:56:21 MST
I'll go ahead and close this as information given.

Should you have any questions, please reopen.

cheers,
Marcin