Ticket 18756 - Some GPUs inaccessible with --gpus-per-task and --gpu-bind=none
Summary: Some GPUs inaccessible with --gpus-per-task and --gpu-bind=none
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Regression
Version: 23.11.2
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-01-23 11:19 MST by Joachim Folz
Modified: 2024-01-24 05:01 MST

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Excerpt of slurmd.log for the 3 jobs described above (78.06 KB, application/x-gzip), 2024-01-23 11:19 MST, Joachim Folz
slurm.conf used for testing (2.10 KB, text/plain), 2024-01-23 11:20 MST, Joachim Folz
Output of test jobs in case I missed something (39.06 KB, text/plain), 2024-01-23 11:21 MST, Joachim Folz

Description Joachim Folz 2024-01-23 11:19:47 MST
Created attachment 34257 [details]
Excerpt of slurmd.log for the 3 jobs described above

After updating to 23.11.1, we are experiencing issues where some GPUs are inaccessible (i.e., they cannot be used by CUDA code, in our case PyTorch) when the allocated GPUs are not "contiguous" and "aligned".

For example, if a new job requests 4 GPUs and receives 1, 2, 3, and 7, then GPU 7 is inaccessible.
Another allocation that does not work is GPUs 2-5.
However, if the job is allocated GPUs 0-3 or 4-7, all of them are accessible, and requesting all 8 GPUs in one job also works.

When GPUs are inaccessible, I noticed odd values for the CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, and SLURM_STEP_GPUS environment variables.

For example, with the following srun command, I see these values in all tasks:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus-per-task=1 --gpu-bind=none --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=1,2,3
GPU_DEVICE_ORDINAL=1,2,3
SLURM_STEP_GPUS=2,3,7

As you can see, local device 0 (GPU 1 on the host) is missing from these listings, yet, as mentioned above, it is device 3 (GPU 7) that cannot be accessed.

If I remove --gpu-bind=none, every task can access exactly one GPU, and they are all different. We don't want to use that configuration, though, as the binding interferes with NCCL.
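For reference, test.sh in these runs is essentially a per-task GPU probe. A minimal sketch (not the exact script attached to this ticket, and assuming python3 with PyTorch is available inside the job):

#!/bin/bash
# Sketch of a per-task GPU check (assumed; the real test.sh is not shown in this ticket).
echo "task=${SLURM_PROCID} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
echo "task=${SLURM_PROCID} GPU_DEVICE_ORDINAL=${GPU_DEVICE_ORDINAL:-unset}"
echo "task=${SLURM_PROCID} SLURM_STEP_GPUS=${SLURM_STEP_GPUS:-unset}"

# Try to allocate a tensor on every device CUDA reports; inaccessible GPUs
# fail here even though nvidia-smi still lists them.
python3 - <<'EOF'
import torch
for i in range(torch.cuda.device_count()):
    try:
        torch.zeros(1, device=f"cuda:{i}")
        print(f"cuda:{i} OK ({torch.cuda.get_device_name(i)})")
    except RuntimeError as e:
        print(f"cuda:{i} FAILED: {e}")
EOF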

Finally, simply specifying 4 GPUs and 4 tasks works as expected:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus=4 --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=0,1,2,3
GPU_DEVICE_ORDINAL=0,1,2,3
SLURM_STEP_GPUS=1,2,3,7


Oddly, nvidia-smi still lists the correct GPUs in all cases (either all 4 or just 1). I cross-referenced the PCI IDs with the GPU IDX values reported by scontrol to confirm this.
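The cross-check itself can be done with standard tooling; a sketch of the commands (with <jobid> as a placeholder):

# On the node: map nvidia-smi indices to PCI bus IDs.
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv

# Show the GPU IDX values Slurm assigned to the job.
scontrol show job -d <jobid> | grep -i gres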

For completeness, here is the gres.conf for this machine. I'm attaching an excerpt of the slurmd.log that spans 3 jobs with the mentioned commands. I'll also attach the slurm.conf in a comment.

Name=gpu File=/dev/nvidia3 Cores=0-23
Name=gpu File=/dev/nvidia2 Cores=0-23
Name=gpu File=/dev/nvidia1 Cores=0-23
Name=gpu File=/dev/nvidia0 Cores=0-23
Name=gpu File=/dev/nvidia7 Cores=24-47
Name=gpu File=/dev/nvidia6 Cores=24-47
Name=gpu File=/dev/nvidia5 Cores=24-47
Name=gpu File=/dev/nvidia4 Cores=24-47

(This is a slightly unusual machine; the non-sequential device ordering in gres.conf makes the GPU indices line up with nvidia-smi and our monitoring.)
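A quick way to confirm that ordering (a sketch, assuming the driver exposes the minor_number query field):

# minor_number N corresponds to /dev/nvidiaN, so this shows how the
# gres.conf device files map onto nvidia-smi indices.
nvidia-smi --query-gpu=index,minor_number,pci.bus_id --format=csv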
Comment 1 Joachim Folz 2024-01-23 11:20:49 MST
Created attachment 34258 [details]
slurm.conf used for testing
Comment 2 Joachim Folz 2024-01-23 11:21:43 MST
Created attachment 34259 [details]
Output of test jobs in case I missed something
Comment 3 Joachim Folz 2024-01-24 04:59:52 MST
The issue persists after upgrading to 23.11.2.