| Summary: | Some GPUs inaccessible with --gpus-per-task and --gpu-bind=none | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Joachim Folz <joachim.folz> |
| Component: | Regression | Assignee: | Jacob Jenson <jacob> |
| Status: | OPEN | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | | |
| Version: | 23.11.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | -Other- | | |

Attachments:

- Excerpt of slurmd.log for the 3 jobs described above
- slurm.conf used for testing
- Output of test jobs in case I missed something

Created attachment 34258 [details]
slurm.conf used for testing
Created attachment 34259 [details]
Output of test jobs in case I missed something
The issue persists after upgrading to 23.11.2.
Created attachment 34257 [details]
Excerpt of slurmd.log for the 3 jobs described above

After updating to 23.11.1, we are experiencing issues where some GPUs are inaccessible (i.e. cannot be used by CUDA code, in our case PyTorch) whenever the allocated GPUs are not "contiguous" and "aligned". For example, if a new job requests 4 GPUs and receives 1, 2, 3, and 7, then GPU 7 is inaccessible. Another scenario that does not work is GPUs 2-5. However, if the job is allocated GPUs 0-3 or 4-7, all of them are accessible, and requesting all 8 GPUs in one job also works.

When GPUs are inaccessible, I noticed odd values in the CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, and SLURM_STEP_GPUS environment variables. E.g. with the following srun command, I see these values in all tasks:

    srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus-per-task=1 --gpu-bind=none --cpus-per-task=8 test.sh

    CUDA_VISIBLE_DEVICES=1,2,3
    GPU_DEVICE_ORDINAL=1,2,3
    SLURM_STEP_GPUS=2,3,7

As you can see, the local device 0 (GPU 1 on the host) is missing from the listings. However, as mentioned before, it is device 3 (GPU 7) that cannot be accessed in this case.

If I remove --gpu-bind=none, then each task can access exactly 1 GPU, and they are all different. We don't want to use this configuration, though, as it interferes with NCCL.

Finally, simply specifying 4 GPUs and 4 tasks works as expected:

    srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus=4 --cpus-per-task=8 test.sh

    CUDA_VISIBLE_DEVICES=0,1,2,3
    GPU_DEVICE_ORDINAL=0,1,2,3
    SLURM_STEP_GPUS=1,2,3,7

Weirdly, nvidia-smi still lists the correct GPUs in all cases (either all 4 or just 1). I cross-referenced the PCI IDs with the GPU IDX reported by scontrol to confirm this.

I'm attaching an excerpt of the slurmd.log that spans 3 jobs run with the commands above, and I'll also attach the slurm.conf in a comment. For completeness, here is the gres.conf for this machine:

    Name=gpu File=/dev/nvidia3 Cores=0-23
    Name=gpu File=/dev/nvidia2 Cores=0-23
    Name=gpu File=/dev/nvidia1 Cores=0-23
    Name=gpu File=/dev/nvidia0 Cores=0-23
    Name=gpu File=/dev/nvidia7 Cores=24-47
    Name=gpu File=/dev/nvidia6 Cores=24-47
    Name=gpu File=/dev/nvidia5 Cores=24-47
    Name=gpu File=/dev/nvidia4 Cores=24-47

(This is a slightly weird machine; the odd ordering of the GPUs makes them line up with nvidia-smi and our monitoring.)
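
For reference, here is a minimal sketch of the kind of check test.sh performs (an approximation, not the exact script): it prints the GPU-related environment variables and then tries to create a CUDA context on every device PyTorch reports, which is where the inaccessible GPUs fail.

    #!/usr/bin/env python3
    # Sketch of a GPU accessibility probe (approximation of what test.sh checks).
    import os
    import torch

    # Show the GPU-related environment variables set for the step.
    for var in ("CUDA_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL", "SLURM_STEP_GPUS"):
        print(f"{var}={os.environ.get(var)}")

    print(f"torch.cuda.device_count()={torch.cuda.device_count()}")

    # Touch each visible device; allocating a tensor forces CUDA context creation.
    for i in range(torch.cuda.device_count()):
        try:
            torch.ones(1, device=f"cuda:{i}")
            print(f"cuda:{i} OK ({torch.cuda.get_device_name(i)})")
        except RuntimeError as err:
            print(f"cuda:{i} FAILED: {err}")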
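
The nvidia-smi/scontrol cross-check mentioned above was along these lines (command sketch; <jobid> is a placeholder for the test job's ID):

    # Inside the job step: list the GPUs the driver exposes and their PCI bus IDs.
    nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv

    # Outside the job: show the GPU indices Slurm allocated (IDX in the GRES field).
    scontrol -d show job <jobid> | grep -i gres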