Ticket 18756

Summary: Some GPUs inaccessible with --gpus-per-task and --gpu-bind=none
Product: Slurm    Reporter: Joachim Folz <joachim.folz>
Component: Regression    Assignee: Jacob Jenson <jacob>
Status: OPEN    QA Contact: ---
Severity: 6 - No support contract    
Priority: ---    
Version: 23.11.2   
Hardware: Linux   
OS: Linux   
Attachments: Excerpt of slurmd.log for the 3 jobs described above
slurm.conf used for testing
Output of test jobs in case I missed something

Description Joachim Folz 2024-01-23 11:19:47 MST
Created attachment 34257 [details]
Excerpt of slurmd.log for the 3 jobs described above

After updating to 23.11.1, we are experiencing an issue where some GPUs are inaccessible (i.e. they cannot be used by CUDA code, in our case PyTorch) when the allocated GPUs are not "contiguous" and "aligned".

For example, if a new job requests 4 GPUs and receives GPUs 1, 2, 3, and 7, then GPU 7 is inaccessible.
Another allocation that does not work is GPUs 2-5.
However, if the job is allocated GPUs 0-3 or 4-7, all of them are accessible, and a job with all 8 GPUs also works.

When GPUs are inaccessible, I noticed odd values for the CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, and SLURM_STEP_GPUS environment variables.

For example, with the following srun command, all tasks see the following values:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus-per-task=1 --gpu-bind=none --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=1,2,3
GPU_DEVICE_ORDINAL=1,2,3
SLURM_STEP_GPUS=2,3,7

As you can see, local device 0 (GPU 1 on the host) is missing from these listings. However, as mentioned above, the device that actually cannot be accessed in this case is device 3 (GPU 7).
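
For reference, a minimal probe along the following lines demonstrates the symptom by trying to touch every device CUDA reports. This is only a sketch assuming PyTorch is available, not the exact contents of test.sh, and the name probe_gpus.py is illustrative:

# probe_gpus.py - sketch only, assumes PyTorch; not the attached test.sh
import os
import torch

# Print the environment variables involved in device selection.
for var in ("CUDA_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL", "SLURM_STEP_GPUS"):
    print(f"{var}={os.environ.get(var)}")

# Try to allocate a small tensor on every device PyTorch can see.
# On the affected allocations, host GPU 7 either does not show up here
# or fails to initialize, matching the behaviour described above.
for i in range(torch.cuda.device_count()):
    try:
        torch.ones(1, device=f"cuda:{i}")
        print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): OK")
    except RuntimeError as err:
        print(f"cuda:{i}: FAILED ({err})")

Run in place of test.sh with the srun commands shown here, it prints one OK/FAILED line per visible device in each task.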

If I remove --gpu-bind=none, every task can access exactly 1 GPU, and they are all different. We don't want to use this configuration, though, as it interferes with NCCL.

Finally, simply specifying 4 GPUs and 4 tasks works as expected:

srun -K --partition=RTXA6000 --nodes=1 --ntasks=4 --gpus=4 --cpus-per-task=8 test.sh

CUDA_VISIBLE_DEVICES=0,1,2,3
GPU_DEVICE_ORDINAL=0,1,2,3
SLURM_STEP_GPUS=1,2,3,7


Oddly, nvidia-smi still lists the correct GPUs in all cases (either all 4 or just 1). I cross-referenced the PCI IDs with the GPU IDX reported by scontrol to confirm this.
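
As a sketch of that cross-check (assuming nvidia-smi and scontrol are on PATH inside the job; the exact scontrol output layout may vary, and the name pci_check.py is illustrative):

# pci_check.py - sketch only: compare nvidia-smi's index -> PCI bus ID
# mapping with the GPU IDX values scontrol reports for this job.
import os
import subprocess

# Index and PCI bus ID for every GPU the driver exposes to this step.
smi = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print("nvidia-smi index -> PCI bus ID:")
print(smi.stdout)

# Detailed job view; with -d the per-node GRES line includes the
# allocated GPU IDX values to compare against the listing above.
job_id = os.environ["SLURM_JOB_ID"]
ctl = subprocess.run(
    ["scontrol", "-d", "show", "job", job_id],
    capture_output=True, text=True, check=True,
)
print(ctl.stdout)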

For completeness, here is the gres.conf for this machine. I'm attaching an excerpt of the slurmd.log that spans 3 jobs with the mentioned commands. I'll also attach the slurm.conf in a comment.

Name=gpu File=/dev/nvidia3 Cores=0-23
Name=gpu File=/dev/nvidia2 Cores=0-23
Name=gpu File=/dev/nvidia1 Cores=0-23
Name=gpu File=/dev/nvidia0 Cores=0-23
Name=gpu File=/dev/nvidia7 Cores=24-47
Name=gpu File=/dev/nvidia6 Cores=24-47
Name=gpu File=/dev/nvidia5 Cores=24-47
Name=gpu File=/dev/nvidia4 Cores=24-47

(This is a slightly unusual machine; the odd ordering of the GPUs in gres.conf makes them line up with nvidia-smi and our monitoring.)
Comment 1 Joachim Folz 2024-01-23 11:20:49 MST
Created attachment 34258 [details]
slurm.conf used for testing
Comment 2 Joachim Folz 2024-01-23 11:21:43 MST
Created attachment 34259 [details]
Output of test jobs in case I missed something
Comment 3 Joachim Folz 2024-01-24 04:59:52 MST
The issue persists after upgrading to 23.11.2.