Created attachment 34257 [details]
Excerpt of slurmd.log for the 3 jobs described above

After updating to 23.11.1, we are experiencing issues where some GPUs are inaccessible (i.e. they cannot be used by CUDA code, in our case PyTorch) when the allocated GPUs are not "contiguous" and "aligned". For example, if a new job requests 4 GPUs and receives GPUs 1, 2, 3, and 7, then GPU 7 is inaccessible. Another allocation that does not work is 2-5. However, if the job is allocated GPUs 0-3 or 4-7, all of them are accessible, and allocating all 8 GPUs in one job also works.

In cases where GPUs are inaccessible, I noticed odd values for the CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, and SLURM_STEP_GPUS environment variables. For example, with the following srun command, I see the following values for all tasks:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus-per-task=1 --gpu-bind=none --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=1,2,3
GPU_DEVICE_ORDINAL=1,2,3
SLURM_STEP_GPUS=2,3,7

As you can see, local device 0 (GPU 1 on the host) is missing from the listings. However, as mentioned before, in this case it is device 3 (GPU 7) that cannot be accessed.

If I remove --gpu-bind=none, then each task can access exactly 1 GPU, and they are all different. We don't want to use this configuration, though, as it interferes with NCCL.

Finally, simply specifying 4 GPUs and 4 tasks works as expected:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus=4 --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=0,1,2,3
GPU_DEVICE_ORDINAL=0,1,2,3
SLURM_STEP_GPUS=1,2,3,7

Oddly, nvidia-smi still lists the correct GPUs in all cases (either all 4 or just 1). I cross-referenced the PCI IDs with the GPU IDX reported by scontrol to confirm this.

For completeness, here is the gres.conf for this machine. I'm attaching an excerpt of the slurmd.log that spans the 3 jobs run with the commands mentioned above. I'll also attach the slurm.conf in a comment.

Name=gpu File=/dev/nvidia3 Cores=0-23
Name=gpu File=/dev/nvidia2 Cores=0-23
Name=gpu File=/dev/nvidia1 Cores=0-23
Name=gpu File=/dev/nvidia0 Cores=0-23
Name=gpu File=/dev/nvidia7 Cores=24-47
Name=gpu File=/dev/nvidia6 Cores=24-47
Name=gpu File=/dev/nvidia5 Cores=24-47
Name=gpu File=/dev/nvidia4 Cores=24-47

(This is a slightly unusual machine; the odd ordering of the GPUs makes them line up with nvidia-smi and our monitoring.)
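For context, the exact contents of our test.sh are not shown here; the following is only a minimal sketch of the kind of per-task check it performs (script name and exact output format are illustrative), printing the GPU-related environment variables and then trying to actually initialize every device that CUDA reports as visible, which is where the "inaccessible GPU" failure shows up:

# check_gpus.py (hypothetical stand-in for what test.sh runs per task)
import os
import torch

for var in ("CUDA_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL", "SLURM_STEP_GPUS"):
    print(f"{var}={os.environ.get(var, '<unset>')}")

count = torch.cuda.device_count()
print(f"torch.cuda.device_count() = {count}")

for i in range(count):
    try:
        # Allocating a tensor forces CUDA to initialize the device.
        torch.ones(1, device=f"cuda:{i}")
        print(f"device {i} ({torch.cuda.get_device_name(i)}): OK")
    except RuntimeError as e:
        print(f"device {i}: FAILED ({e})")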
Created attachment 34258 [details] slurm.conf used for testing
Created attachment 34259 [details] Output of test jobs in case I missed something
The issue persists after upgrading to 23.11.2.