Ticket 18756 - Some GPUs inaccessible with --gpus-per-task and --gpu-bind=none
Summary: Some GPUs inaccessible with --gpus-per-task and --gpu-bind=none
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Regression
Version: 23.11.2
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-01-23 11:19 MST by Joachim Folz
Modified: 2024-01-24 05:01 MST

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Excerpt of slurmd.log for the 3 jobs described above (78.06 KB, application/x-gzip), 2024-01-23 11:19 MST, Joachim Folz
slurm.conf used for testing (2.10 KB, text/plain), 2024-01-23 11:20 MST, Joachim Folz
Output of test jobs in case I missed something (39.06 KB, text/plain), 2024-01-23 11:21 MST, Joachim Folz

Description Joachim Folz 2024-01-23 11:19:47 MST
Created attachment 34257 [details]
Excerpt of slurmd.log for the 3 jobs described above

After updating to 23.11.1, we are experiencing issues where some GPUs are inaccessible (i.e., they cannot be used by CUDA code, in our case PyTorch) when the allocated GPUs are not "contiguous" and "aligned".

For example, if a new job requests 4 GPUs and receives 1, 2, 3, and 7, then GPU 7 is inaccessible.
Another allocation that does not work is GPUs 2-5.
However, if the job is allocated GPUs 0-3 or 4-7, all of them are accessible, and requesting all 8 GPUs in one job also works.

When GPUs are inaccessible, I noticed odd values for the CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, and SLURM_STEP_GPUS environment variables.

For example, with the following srun command, I see these values in all tasks:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus-per-task=1 --gpu-bind=none --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=1,2,3
GPU_DEVICE_ORDINAL=1,2,3
SLURM_STEP_GPUS=2,3,7

As you can see, local device 0 (GPU 1 on the host) is missing from these listings, yet, as mentioned above, it is device 3 (GPU 7) that cannot be accessed.

If I remove --gpu-bind=none, every task can access exactly one GPU, and they are all different. We don't want to use that configuration, though, as the binding interferes with NCCL.
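For reference, test.sh in these runs is essentially a per-task GPU probe. A minimal sketch (not the exact script attached to this ticket, and assuming python3 with PyTorch is available inside the job):

#!/bin/bash
# Sketch of a per-task GPU check (assumed; the real test.sh is not shown in this ticket).
echo "task=${SLURM_PROCID} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
echo "task=${SLURM_PROCID} GPU_DEVICE_ORDINAL=${GPU_DEVICE_ORDINAL:-unset}"
echo "task=${SLURM_PROCID} SLURM_STEP_GPUS=${SLURM_STEP_GPUS:-unset}"

# Try to allocate a tensor on every device CUDA reports; inaccessible GPUs
# fail here even though nvidia-smi still lists them.
python3 - <<'EOF'
import torch
for i in range(torch.cuda.device_count()):
    try:
        torch.zeros(1, device=f"cuda:{i}")
        print(f"cuda:{i} OK ({torch.cuda.get_device_name(i)})")
    except RuntimeError as e:
        print(f"cuda:{i} FAILED: {e}")
EOF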

Finally, simply specifying 4 GPUs and 4 tasks works as expected:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus=4 --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=0,1,2,3
GPU_DEVICE_ORDINAL=0,1,2,3
SLURM_STEP_GPUS=1,2,3,7


Oddly, nvidia-smi still lists the correct GPUs in all cases (either all 4 or just 1). I cross-referenced the PCI IDs with the GPU IDX values reported by scontrol to confirm this.
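The cross-check itself can be done with standard tooling; a sketch of the commands (with <jobid> as a placeholder):

# On the node: map nvidia-smi indices to PCI bus IDs.
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv

# Show the GPU IDX values Slurm assigned to the job.
scontrol show job -d <jobid> | grep -i gres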

For completeness, here is the gres.conf for this machine. I'm attaching an excerpt of the slurmd.log that spans 3 jobs with the mentioned commands. I'll also attach the slurm.conf in a comment.

Name=gpu File=/dev/nvidia3 Cores=0-23
Name=gpu File=/dev/nvidia2 Cores=0-23
Name=gpu File=/dev/nvidia1 Cores=0-23
Name=gpu File=/dev/nvidia0 Cores=0-23
Name=gpu File=/dev/nvidia7 Cores=24-47
Name=gpu File=/dev/nvidia6 Cores=24-47
Name=gpu File=/dev/nvidia5 Cores=24-47
Name=gpu File=/dev/nvidia4 Cores=24-47

(This is a slightly unusual machine; the non-sequential device ordering in gres.conf makes the GPU indices line up with nvidia-smi and our monitoring.)
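A quick way to confirm that ordering (a sketch, assuming the driver exposes the minor_number query field):

# minor_number N corresponds to /dev/nvidiaN, so this shows how the
# gres.conf device files map onto nvidia-smi indices.
nvidia-smi --query-gpu=index,minor_number,pci.bus_id --format=csv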
Comment 1 Joachim Folz 2024-01-23 11:20:49 MST
Created attachment 34258 [details]
slurm.conf used for testing
Comment 2 Joachim Folz 2024-01-23 11:21:43 MST
Created attachment 34259 [details]
Output of test jobs in case I missed something
Comment 3 Joachim Folz 2024-01-24 04:59:52 MST
The issue persists after upgrading to 23.11.2.