Ticket 18756

Summary: Some GPUs inaccessible with --gpus-per-task and --gpu-bind=none
Product: Slurm    Reporter: Joachim Folz <joachim.folz>
Component: Regression    Assignee: Jacob Jenson <jacob>
Status: OPEN    QA Contact: ---
Severity: 6 - No support contract    
Priority: ---    
Version: 23.11.2   
Hardware: Linux   
OS: Linux   
Attachments: Excerpt of slurmd.log for the 3 jobs described above
slurm.conf used for testing
Output of test jobs in case I missed something

Description Joachim Folz 2024-01-23 11:19:47 MST
Created attachment 34257 [details]
Excerpt of slurmd.log for the 3 jobs described above

After updating to 23.11.1, we are experiencing an issue where some GPUs are inaccessible (i.e. they cannot be used by CUDA code, in our case PyTorch) when the allocated GPUs are not "contiguous" and "aligned".

For example, if a new job requests 4 GPUs and receives GPUs 1, 2, 3, and 7, then GPU 7 is inaccessible.
Another allocation that does not work is GPUs 2-5.
However, if the job is allocated GPUs 0-3 or 4-7, all of them are accessible, and a job with all 8 GPUs also works.

When GPUs are inaccessible, I noticed odd values for the CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, and SLURM_STEP_GPUS environment variables.

For example, with the following srun command, all tasks see the following values:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus-per-task=1 --gpu-bind=none --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=1,2,3
GPU_DEVICE_ORDINAL=1,2,3
SLURM_STEP_GPUS=2,3,7

As you can see, local device 0 (GPU 1 on the host) is missing from these listings. However, as mentioned above, the device that actually cannot be accessed in this case is device 3 (GPU 7).
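
For reference, a minimal probe along the following lines demonstrates the symptom by trying to touch every device CUDA reports. This is only a sketch assuming PyTorch is available, not the exact contents of test.sh, and the name probe_gpus.py is illustrative:

# probe_gpus.py - sketch only, assumes PyTorch; not the attached test.sh
import os
import torch

# Print the environment variables involved in device selection.
for var in ("CUDA_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL", "SLURM_STEP_GPUS"):
    print(f"{var}={os.environ.get(var)}")

# Try to allocate a small tensor on every device PyTorch can see.
# On the affected allocations, host GPU 7 either does not show up here
# or fails to initialize, matching the behaviour described above.
for i in range(torch.cuda.device_count()):
    try:
        torch.ones(1, device=f"cuda:{i}")
        print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): OK")
    except RuntimeError as err:
        print(f"cuda:{i}: FAILED ({err})")

Run in place of test.sh with the srun commands shown here, it prints one OK/FAILED line per visible device in each task.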

If I remove --gpu-bind=none, every task can access exactly 1 GPU, and they are all different. We don't want to use this configuration, though, as it interferes with NCCL.

Finally, simply specifying 4 GPUs and 4 tasks works as expected:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus=4 --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=0,1,2,3
GPU_DEVICE_ORDINAL=0,1,2,3
SLURM_STEP_GPUS=1,2,3,7


Oddly, nvidia-smi still lists the correct GPUs in all cases (either all 4 or just 1). I cross-referenced the PCI IDs with the GPU IDX reported by scontrol to confirm this.
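
As a sketch of that cross-check (assuming nvidia-smi and scontrol are on PATH inside the job; the exact scontrol output layout may vary, and the name pci_check.py is illustrative):

# pci_check.py - sketch only: compare nvidia-smi's index -> PCI bus ID
# mapping with the GPU IDX values scontrol reports for this job.
import os
import subprocess

# Index and PCI bus ID for every GPU the driver exposes to this step.
smi = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print("nvidia-smi index -> PCI bus ID:")
print(smi.stdout)

# Detailed job view; with -d the per-node GRES line includes the
# allocated GPU IDX values to compare against the listing above.
job_id = os.environ["SLURM_JOB_ID"]
ctl = subprocess.run(
    ["scontrol", "-d", "show", "job", job_id],
    capture_output=True, text=True, check=True,
)
print(ctl.stdout)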

For completeness, here is the gres.conf for this machine. I'm attaching an excerpt of the slurmd.log that spans 3 jobs with the mentioned commands. I'll also attach the slurm.conf in a comment.

Name=gpu File=/dev/nvidia3 Cores=0-23
Name=gpu File=/dev/nvidia2 Cores=0-23
Name=gpu File=/dev/nvidia1 Cores=0-23
Name=gpu File=/dev/nvidia0 Cores=0-23
Name=gpu File=/dev/nvidia7 Cores=24-47
Name=gpu File=/dev/nvidia6 Cores=24-47
Name=gpu File=/dev/nvidia5 Cores=24-47
Name=gpu File=/dev/nvidia4 Cores=24-47

(This is a slightly unusual machine; the odd ordering of the GPUs in gres.conf makes them line up with nvidia-smi and our monitoring.)
Comment 1 Joachim Folz 2024-01-23 11:20:49 MST
Created attachment 34258 [details]
slurm.conf used for testing
Comment 2 Joachim Folz 2024-01-23 11:21:43 MST
Created attachment 34259 [details]
Output of test jobs in case I missed something
Comment 3 Joachim Folz 2024-01-24 04:59:52 MST
The issue persists after upgrading to 23.11.2.