It might be good if Slurm automatically set CUDA_DEVICE_ORDER=PCI_BUS_ID as a convenience, so that CUDA applications are guaranteed to see the same GPU order as NVML/nvidia-smi (and eventually Slurm, pending progress in bug 10933). Things we still need to think through:

1) Should setting CUDA_DEVICE_ORDER=PCI_BUS_ID be done automatically whenever any GPU is requested, or only when AutoDetect=nvml is specified? I think the former makes the most sense.

2) Would there be any case where a user would want to override CUDA_DEVICE_ORDER to *not* be PCI_BUS_ID? If so, we would probably need to check whether it is already set to something else before we blindly set it, and perhaps emit a warning if it is (see the sketch below).

See bug 10827 comment 83 for more context.
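For point 2, here is a rough sketch of what that check could look like. The helper name and warning text are purely illustrative, not actual Slurm code; it just shows the "respect an existing user value, otherwise default to PCI_BUS_ID" behavior being discussed:

/* Hypothetical sketch only, not Slurm source. Default CUDA_DEVICE_ORDER
 * to PCI_BUS_ID unless the user already set it, and warn if their value
 * means CUDA order may not match NVML/nvidia-smi order. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void maybe_set_cuda_device_order(void)
{
	const char *val = getenv("CUDA_DEVICE_ORDER");

	if (!val) {
		/* Not set by the user; default it so CUDA enumeration
		 * matches NVML/nvidia-smi ordering. */
		setenv("CUDA_DEVICE_ORDER", "PCI_BUS_ID", 0);
	} else if (strcmp(val, "PCI_BUS_ID")) {
		/* Respect the user's value, but note the mismatch. */
		fprintf(stderr,
			"warning: CUDA_DEVICE_ORDER=%s; CUDA device order may not match NVML/nvidia-smi\n",
			val);
	}
}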
Hey Kilian,

We are going to go ahead and leave CUDA_DEVICE_ORDER alone. How this is set probably won't matter in most cases, and in the cases where it could matter, we have this documented: "For this numbering to match the numbering reported by CUDA, the CUDA_DEVICE_ORDER environmental variable must be set to CUDA_DEVICE_ORDER=PCI_BUS_ID."

The CUDA documentation also states that there are two possible values for CUDA_DEVICE_ORDER - FASTEST_FIRST and PCI_BUS_ID - and that the default is FASTEST_FIRST. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars.

So we are going to err on the side of flexibility and backwards compatibility and leave it up to the CUDA application developer to change CUDA_DEVICE_ORDER. Of course, if you have a compelling counterpoint, feel free to elaborate.

Thanks!
-Michael
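As an aside, if you want to see whether the ordering actually differs on a given node, a small CUDA runtime program (just a sketch, not something we ship) can print the enumeration order for comparison with nvidia-smi. Run it once with CUDA_DEVICE_ORDER unset (FASTEST_FIRST) and once with CUDA_DEVICE_ORDER=PCI_BUS_ID:

/* Print each CUDA device index with its PCI bus ID so the output can be
 * compared against the NVML/nvidia-smi ordering. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
	int count = 0;

	if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
		fprintf(stderr, "no CUDA devices found\n");
		return 1;
	}
	for (int i = 0; i < count; i++) {
		char bus_id[64];
		cudaDeviceGetPCIBusId(bus_id, sizeof(bus_id), i);
		printf("CUDA device %d -> %s\n", i, bus_id);
	}
	return 0;
}

Something like "nvcc order_check.cu -o order_check && CUDA_DEVICE_ORDER=PCI_BUS_ID ./order_check" (file name is arbitrary) would show the PCI_BUS_ID ordering.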