Ticket 11529 - Set CUDA_DEVICE_ORDER when AutoDetect=nvml is used
Summary: Set CUDA_DEVICE_ORDER when AutoDetect=nvml is used
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 21.08.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-05-04 15:02 MDT by Michael Hinton
Modified: 2022-01-26 16:42 MST
CC List: 1 user

See Also:
Site: SchedMD
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Michael Hinton 2021-05-04 15:02:13 MDT
It might be good if Slurm automatically set CUDA_DEVICE_ORDER=PCI_BUS_ID as a convenience, so that CUDA applications are guaranteed to see the same GPU order as NVML/nvidia-smi (and eventually Slurm, pending progress in bug 10933).

Things we still need to think through:

1) Should setting CUDA_DEVICE_ORDER=PCI_BUS_ID be done automatically whenever any GPU is requested? Or only when AutoDetect=nvml is specified? I think the former makes the most sense.

2) Would there be any case where a user would want to override CUDA_DEVICE_ORDER to *not* be PCI_BUS_ID? If so, we would probably need to check whether it is already set before blindly overwriting it, and emit a warning if it is set to something else (a rough sketch of that check follows below).

See bug 10827 comment 83 for more context.
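
For illustration, a minimal C sketch of the check described in item 2. The helper name and the use of fprintf for the warning are assumptions for the sketch, not actual Slurm internals; real Slurm code would go through its own logging and environment-handling paths.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, called somewhere on the task launch path:
 * only force CUDA_DEVICE_ORDER when the user has not already chosen
 * a value, and warn if they picked something other than PCI_BUS_ID. */
static void maybe_set_cuda_device_order(void)
{
	const char *cur = getenv("CUDA_DEVICE_ORDER");

	if (!cur || cur[0] == '\0') {
		/* Unset: default to PCI bus ordering so CUDA matches NVML/nvidia-smi */
		setenv("CUDA_DEVICE_ORDER", "PCI_BUS_ID", 1);
	} else if (strcmp(cur, "PCI_BUS_ID") != 0) {
		/* User explicitly asked for a different order (e.g. FASTEST_FIRST);
		 * leave it alone, but note that CUDA device indices may then differ
		 * from the NVML ordering Slurm uses. */
		fprintf(stderr,
			"warning: CUDA_DEVICE_ORDER=%s; CUDA device indices may not match NVML ordering\n",
			cur);
	}
}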
Comment 1 Michael Hinton 2022-01-26 16:42:47 MST
Hey Kilian,

We are going to go ahead and leave CUDA_DEVICE_ORDER alone. How this is set probably won't matter in most cases, and in the cases where it could matter, we have this documented:

"For this numbering to match the numbering reported by CUDA, the CUDA_DEVICE_ORDER environmental variable must be set to CUDA_DEVICE_ORDER=PCI_BUS_ID." 

The CUDA documentation also states that there are two possible values for CUDA_DEVICE_ORDER - FASTEST_FIRST and PCI_BUS_ID - and that the default is FASTEST_FIRST. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars.

So we are going to err on the side of flexibility and backwards compatibility and leave it up to the CUDA application developer to change CUDA_DEVICE_ORDER. Of course, if you have a compelling counterpoint, feel free to elaborate.
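
For reference, an application (or wrapper) that wants the NVML/nvidia-smi ordering can opt in on its own. Below is a minimal C sketch using the CUDA runtime API, assuming the variable is picked up when the CUDA runtime initializes, so it has to be set in the process (or exported in the job script) before the first CUDA call; compile with nvcc.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
	/* Must happen before any CUDA runtime call in this process;
	 * alternatively, export CUDA_DEVICE_ORDER=PCI_BUS_ID in the job script. */
	setenv("CUDA_DEVICE_ORDER", "PCI_BUS_ID", 1);

	int count = 0;
	cudaError_t err = cudaGetDeviceCount(&count);
	if (err != cudaSuccess) {
		fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
		return 1;
	}
	/* Device indices 0..count-1 should now follow PCI bus order,
	 * matching what nvidia-smi reports. */
	printf("%d visible GPU(s), ordered by PCI bus ID\n", count);
	return 0;
}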

Thanks!
-Michael