Ticket 11529 - Set CUDA_DEVICE_ORDER when AutoDetect=nvml is used
Summary: Set CUDA_DEVICE_ORDER when AutoDetect=nvml is used
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 21.08.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-05-04 15:02 MDT by Michael Hinton
Modified: 2022-01-26 16:42 MST
CC List: 1 user

See Also:
Site: SchedMD
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Michael Hinton 2021-05-04 15:02:13 MDT
It might be good if Slurm automatically set CUDA_DEVICE_ORDER=PCI_BUS_ID as a convenience, so that CUDA applications are guaranteed to see the same GPU order as NVML/nvidia-smi (and eventually Slurm, pending progress in bug 10933).

Things we still need to think through:

1) Should setting CUDA_DEVICE_ORDER=PCI_BUS_ID be done automatically whenever any GPU is requested? Or only when AutoDetect=nvml is specified? I think the former makes the most sense.

2) Would there be any case where a user would want to override CUDA_DEVICE_ORDER to *not* be PCI_BUS_ID? If so, we would probably need to check whether it is already set before blindly overwriting it, and emit a warning if it is set to something else (a rough sketch of that check follows below).

See bug 10827 comment 83 for more context.
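
For illustration, a minimal C sketch of the check described in item 2. The helper name and the use of fprintf for the warning are assumptions for the sketch, not actual Slurm internals; real Slurm code would go through its own logging and environment-handling paths.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, called somewhere on the task launch path:
 * only force CUDA_DEVICE_ORDER when the user has not already chosen
 * a value, and warn if they picked something other than PCI_BUS_ID. */
static void maybe_set_cuda_device_order(void)
{
	const char *cur = getenv("CUDA_DEVICE_ORDER");

	if (!cur || cur[0] == '\0') {
		/* Unset: default to PCI bus ordering so CUDA matches NVML/nvidia-smi */
		setenv("CUDA_DEVICE_ORDER", "PCI_BUS_ID", 1);
	} else if (strcmp(cur, "PCI_BUS_ID") != 0) {
		/* User explicitly asked for a different order (e.g. FASTEST_FIRST);
		 * leave it alone, but note that CUDA device indices may then differ
		 * from the NVML ordering Slurm uses. */
		fprintf(stderr,
			"warning: CUDA_DEVICE_ORDER=%s; CUDA device indices may not match NVML ordering\n",
			cur);
	}
}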
Comment 1 Michael Hinton 2022-01-26 16:42:47 MST
Hey Kilian,

We are going to go ahead and leave CUDA_DEVICE_ORDER alone. How this is set probably won't matter in most cases, and in the cases where it could matter, we have this documented:

"For this numbering to match the numbering reported by CUDA, the CUDA_DEVICE_ORDER environmental variable must be set to CUDA_DEVICE_ORDER=PCI_BUS_ID." 

The CUDA documentation also states that there are two possible values for CUDA_DEVICE_ORDER - FASTEST_FIRST and PCI_BUS_ID - and that the default is FASTEST_FIRST. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars.

So we are going to err on the side of flexibility and backwards compatibility and leave it up to the CUDA application developer to change CUDA_DEVICE_ORDER. Of course, if you have a compelling counterpoint, feel free to elaborate.
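
For reference, an application (or wrapper) that wants the NVML/nvidia-smi ordering can opt in on its own. Below is a minimal C sketch using the CUDA runtime API, assuming the variable is picked up when the CUDA runtime initializes, so it has to be set in the process (or exported in the job script) before the first CUDA call; compile with nvcc.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
	/* Must happen before any CUDA runtime call in this process;
	 * alternatively, export CUDA_DEVICE_ORDER=PCI_BUS_ID in the job script. */
	setenv("CUDA_DEVICE_ORDER", "PCI_BUS_ID", 1);

	int count = 0;
	cudaError_t err = cudaGetDeviceCount(&count);
	if (err != cudaSuccess) {
		fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
		return 1;
	}
	/* Device indices 0..count-1 should now follow PCI bus order,
	 * matching what nvidia-smi reports. */
	printf("%d visible GPU(s), ordered by PCI bus ID\n", count);
	return 0;
}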

Thanks!
-Michael