Ticket 12975 - --gpu-bind=closest appears to result in wrong bindings for our hardware unless I lie to Slurm in gres.conf
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 20.11.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-12-06 22:47 MST by Chris Samuel (NERSC)
Modified: 2021-12-15 22:46 MST

See Also:
Site: NERSC


Attachments
20.11 v1 (25.35 KB, patch)
2021-12-14 11:14 MST, Michael Hinton

Description Chris Samuel (NERSC) 2021-12-06 22:47:16 MST
Hi there,

We've got an odd issue reported by users on our new systems: Slurm appears to get the GPU binding wrong whether I let it detect the GPUs via NVML or hard-code the information exactly as reported by `slurmd -G`. The users reported that the ordering was backwards from what they expected, and that they could force the correct binding with `--gpu-bind=map_gpu:3,2,1,0`.
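
(For reference, the `slurmd -G` view mentioned above comes from running slurmd with the -G flag directly on a compute node as root; it prints the GRES configuration slurmd detects and then exits. The output itself isn't reproduced here.)

muller:nid001032:~ # slurmd -G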

NOTE: the output here is scavenged from the last 24 hours of debugging, so some arguments to commands may not appear consistent across the many runs.

So this was backwards:

> srun --gpus-per-task=1 -C gpu -A nstaff_g -N 1 -n 4 --ntasks-per-node=4 -c 32 --cpu-bind=cores  --gpu-bind=closest  -l ./gpus_for_tasks 2>&1 | sort
0: mpi=0 CUDA_VISIBLE_DEVICES=0 gpu=0000:03:00.0 cpu=74 core_affinity=0-15,64-79
1: mpi=1 CUDA_VISIBLE_DEVICES=1 gpu=0000:41:00.0 cpu=88 core_affinity=16-31,80-95
2: mpi=2 CUDA_VISIBLE_DEVICES=2 gpu=0000:81:00.0 cpu=41 core_affinity=32-47,96-111
3: mpi=3 CUDA_VISIBLE_DEVICES=3 gpu=0000:C1:00.0 cpu=54 core_affinity=48-63,112-127

Whereas this is the expected ordering:

> srun --gpus-per-task=1 -C gpu -A nstaff_g -N 1 -n 4 --ntasks-per-node=4 -c 32 --cpu-bind=cores  --gpu-bind=map_gpu:3,2,1,0  -l ./gpus_for_tasks 2>&1 | sort
0: mpi=0 CUDA_VISIBLE_DEVICES=3 gpu=0000:C1:00.0 cpu=64 core_affinity=0-15,64-79
1: mpi=1 CUDA_VISIBLE_DEVICES=2 gpu=0000:81:00.0 cpu=89 core_affinity=16-31,80-95
2: mpi=2 CUDA_VISIBLE_DEVICES=1 gpu=0000:41:00.0 cpu=44 core_affinity=32-47,96-111
3: mpi=3 CUDA_VISIBLE_DEVICES=0 gpu=0000:03:00.0 cpu=54 core_affinity=48-63,112-127

Based on this layout:

muller:nid001032:~ # lstopo -p | egrep 'NUMA|3D|PU'
      NUMANode P#0 (62GB)
          PU P#0
          PU P#64
[...]
          PU P#15
          PU P#79
          PCI c1:00.0 (3D)
      NUMANode P#1 (63GB)
          PU P#16
          PU P#80
[..]
          PU P#31
          PU P#95
          PCI 81:00.0 (3D)
      NUMANode P#2 (63GB)
          PU P#32
          PU P#96
[...]
          PU P#47
          PU P#111
          PCI 41:00.0 (3D)
      NUMANode P#3 (63GB)
          PU P#48
          PU P#112
[...]
          PU P#63
          PU P#127
          PCI 03:00.0 (3D)

From what I can tell, the cause appears to be that the minor device ordering is the reverse of what you would expect: /dev/nvidia0 is GPU 3, /dev/nvidia1 is GPU 2, /dev/nvidia2 is GPU 1, and /dev/nvidia3 is GPU 0.

You can see that like this:

muller:nid001032:~ # nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-f6737b69-921c-f96d-9979-fe5e670e09f6)
GPU 1: A100-SXM4-40GB (UUID: GPU-32765db8-5655-53b9-a7bd-74a2e7b922a9)
GPU 2: A100-SXM4-40GB (UUID: GPU-e4fb74a0-04d0-6aad-cc93-2c370974ddc4)
GPU 3: A100-SXM4-40GB (UUID: GPU-ae3fe0da-7887-b12a-4ba0-e68f1ab60178)

muller:nid001032:~ # nvidia-smi -q | egrep 'Minor|^GPU|GPU-'
GPU 00000000:03:00.0
    GPU UUID                              : GPU-f6737b69-921c-f96d-9979-fe5e670e09f6
    Minor Number                          : 3
GPU 00000000:41:00.0
    GPU UUID                              : GPU-32765db8-5655-53b9-a7bd-74a2e7b922a9
    Minor Number                          : 2
GPU 00000000:81:00.0
    GPU UUID                              : GPU-e4fb74a0-04d0-6aad-cc93-2c370974ddc4
    Minor Number                          : 1
GPU 00000000:C1:00.0
    GPU UUID                              : GPU-ae3fe0da-7887-b12a-4ba0-e68f1ab60178
    Minor Number                          : 0


This is what slurmd found on startup (with SlurmdDebug=debug3):

[2021-12-06T06:27:35.980] debug:  gres/gpu: init: loaded
[2021-12-06T06:27:35.980] debug:  gpu/nvml: init: init: GPU NVML plugin loaded
[2021-12-06T06:27:35.983] debug2: gpu/nvml: _nvml_init: Successfully initialized NVML
[2021-12-06T06:27:35.983] debug:  gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 450.162
[2021-12-06T06:27:35.983] debug:  gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.450.162
[2021-12-06T06:27:35.983] debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 128
[2021-12-06T06:27:35.983] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 4
[2021-12-06T06:27:36.056] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
[2021-12-06T06:27:36.056] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: a100-sxm4-40gb
[2021-12-06T06:27:36.056] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-6ffb943c-acff-44ef-7c83-22f8cf492985
[2021-12-06T06:27:36.056] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:2:0
[2021-12-06T06:27:36.056] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:02:00.0
[2021-12-06T06:27:36.056] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1,4,4,4
[2021-12-06T06:27:36.057] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia3
[2021-12-06T06:27:36.057] debug:  gpu/nvml: _get_system_gpu_list_nvml: Note: GPU index 0 is different from minor number 3
[2021-12-06T06:27:36.057] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 48-63,112-127
[2021-12-06T06:27:36.057] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 48-63
[2021-12-06T06:27:36.057] debug2: Possible GPU Memory Frequencies (1):
[2021-12-06T06:27:36.057] debug2: -------------------------------
[2021-12-06T06:27:36.057] debug2:     *1215 MHz [0]
[2021-12-06T06:27:36.057] debug2:         Possible GPU Graphics Frequencies (81):
[2021-12-06T06:27:36.057] debug2:         ---------------------------------
[2021-12-06T06:27:36.057] debug2:           *1410 MHz [0]
[2021-12-06T06:27:36.057] debug2:           *1395 MHz [1]
[2021-12-06T06:27:36.057] debug2:           ...
[2021-12-06T06:27:36.057] debug2:           *810 MHz [40]
[2021-12-06T06:27:36.057] debug2:           ...
[2021-12-06T06:27:36.057] debug2:           *225 MHz [79]
[2021-12-06T06:27:36.057] debug2:           *210 MHz [80]
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 1:
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: a100-sxm4-40gb
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-db2e1169-e39a-7ff4-8be6-be7f3e48d4e5
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:65:0
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:41:00.0
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 4,-1,4,4
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia2
[2021-12-06T06:27:36.073] debug:  gpu/nvml: _get_system_gpu_list_nvml: Note: GPU index 1 is different from minor number 2
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 32-47,96-111
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 32-47
[2021-12-06T06:27:36.074] debug2: Possible GPU Memory Frequencies (1):
[2021-12-06T06:27:36.074] debug2: -------------------------------
[2021-12-06T06:27:36.074] debug2:     *1215 MHz [0]
[2021-12-06T06:27:36.074] debug2:         Possible GPU Graphics Frequencies (81):
[2021-12-06T06:27:36.074] debug2:         ---------------------------------
[2021-12-06T06:27:36.074] debug2:           *1410 MHz [0]
[2021-12-06T06:27:36.074] debug2:           *1395 MHz [1]
[2021-12-06T06:27:36.074] debug2:           ...
[2021-12-06T06:27:36.074] debug2:           *810 MHz [40]
[2021-12-06T06:27:36.074] debug2:           ...
[2021-12-06T06:27:36.074] debug2:           *225 MHz [79]
[2021-12-06T06:27:36.074] debug2:           *210 MHz [80]
[2021-12-06T06:27:36.090] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 2:
[2021-12-06T06:27:36.090] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: a100-sxm4-40gb
[2021-12-06T06:27:36.091] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-f0625f4a-057c-ef58-f1f7-0e948229bd56
[2021-12-06T06:27:36.091] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:129:0
[2021-12-06T06:27:36.091] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:81:00.0
[2021-12-06T06:27:36.091] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 4,4,-1,4
[2021-12-06T06:27:36.091] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia1
[2021-12-06T06:27:36.091] debug:  gpu/nvml: _get_system_gpu_list_nvml: Note: GPU index 2 is different from minor number 1
[2021-12-06T06:27:36.091] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 16-31,80-95
[2021-12-06T06:27:36.091] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 16-31
[2021-12-06T06:27:36.091] debug2: Possible GPU Memory Frequencies (1):
[2021-12-06T06:27:36.091] debug2: -------------------------------
[2021-12-06T06:27:36.091] debug2:     *1215 MHz [0]
[2021-12-06T06:27:36.091] debug2:         Possible GPU Graphics Frequencies (81):
[2021-12-06T06:27:36.091] debug2:         ---------------------------------
[2021-12-06T06:27:36.091] debug2:           *1410 MHz [0]
[2021-12-06T06:27:36.091] debug2:           *1395 MHz [1]
[2021-12-06T06:27:36.091] debug2:           ...
[2021-12-06T06:27:36.091] debug2:           *810 MHz [40]
[2021-12-06T06:27:36.091] debug2:           ...
[2021-12-06T06:27:36.091] debug2:           *225 MHz [79]
[2021-12-06T06:27:36.091] debug2:           *210 MHz [80]
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 3:
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: a100-sxm4-40gb
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-9095f99e-231d-e235-0d16-12b74b7d4776
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:193:0
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:C1:00.0
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 4,4,4,-1
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia0
[2021-12-06T06:27:36.107] debug:  gpu/nvml: _get_system_gpu_list_nvml: Note: GPU index 3 is different from minor number 0
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 0-15,64-79
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 0-15
[2021-12-06T06:27:36.108] debug2: Possible GPU Memory Frequencies (1):
[2021-12-06T06:27:36.108] debug2: -------------------------------
[2021-12-06T06:27:36.108] debug2:     *1215 MHz [0]
[2021-12-06T06:27:36.108] debug2:         Possible GPU Graphics Frequencies (81):
[2021-12-06T06:27:36.108] debug2:         ---------------------------------
[2021-12-06T06:27:36.108] debug2:           *1410 MHz [0]
[2021-12-06T06:27:36.108] debug2:           *1395 MHz [1]
[2021-12-06T06:27:36.108] debug2:           ...
[2021-12-06T06:27:36.108] debug2:           *810 MHz [40]
[2021-12-06T06:27:36.108] debug2:           ...
[2021-12-06T06:27:36.108] debug2:           *225 MHz [79]
[2021-12-06T06:27:36.108] debug2:           *210 MHz [80]
[2021-12-06T06:27:36.108] debug2: gpu/nvml: _nvml_shutdown: Successfully shut down NVML
[2021-12-06T06:27:36.108] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2021-12-06T06:27:36.108] debug:  Gres GPU plugin: Normalizing gres.conf with system GPUs
[2021-12-06T06:27:36.108] debug2: gres/gpu: _normalize_gres_conf: gres_list_conf:
[2021-12-06T06:27:36.108] debug2:     GRES[gpu] Type:a100 Count:1 Cores(128):(null)  Links:(null) Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-06T06:27:36.108] debug2:     GRES[gpu] Type:a100 Count:1 Cores(128):(null)  Links:(null) Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-12-06T06:27:36.108] debug2:     GRES[gpu] Type:a100 Count:1 Cores(128):(null)  Links:(null) Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-12-06T06:27:36.108] debug2:     GRES[gpu] Type:a100 Count:1 Cores(128):(null)  Links:(null) Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-12-06T06:27:36.108] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-12-06T06:27:36.108] debug:      GRES[gpu] Type:a100 Count:1 Cores(128):48-63  Links:-1,4,4,4 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-12-06T06:27:36.108] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-12-06T06:27:36.108] debug:      GRES[gpu] Type:a100 Count:1 Cores(128):32-47  Links:4,-1,4,4 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-12-06T06:27:36.108] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-12-06T06:27:36.108] debug:      GRES[gpu] Type:a100 Count:1 Cores(128):16-31  Links:4,4,-1,4 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-12-06T06:27:36.108] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-12-06T06:27:36.108] debug:      GRES[gpu] Type:a100 Count:1 Cores(128):0-15  Links:4,4,4,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-06T06:27:36.108] debug2: gres/gpu: _normalize_gres_conf: gres_list_gpu
[2021-12-06T06:27:36.108] debug2:     GRES[gpu] Type:a100 Count:1 Cores(128):0-15  Links:4,4,4,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-06T06:27:36.108] debug2:     GRES[gpu] Type:a100 Count:1 Cores(128):16-31  Links:4,4,-1,4 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-12-06T06:27:36.108] debug2:     GRES[gpu] Type:a100 Count:1 Cores(128):32-47  Links:4,-1,4,4 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-12-06T06:27:36.108] debug2:     GRES[gpu] Type:a100 Count:1 Cores(128):48-63  Links:-1,4,4,4 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-12-06T06:27:36.108] debug:  Gres GPU plugin: Final normalized gres.conf list:
[2021-12-06T06:27:36.108] debug:      GRES[gpu] Type:a100 Count:1 Cores(128):0-15  Links:4,4,4,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-06T06:27:36.108] debug:      GRES[gpu] Type:a100 Count:1 Cores(128):16-31  Links:4,4,-1,4 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-12-06T06:27:36.108] debug:      GRES[gpu] Type:a100 Count:1 Cores(128):32-47  Links:4,-1,4,4 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-12-06T06:27:36.108] debug:      GRES[gpu] Type:a100 Count:1 Cores(128):48-63  Links:-1,4,4,4 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-12-06T06:27:36.108] Gres Name=gpu Type=a100 Count=1
[2021-12-06T06:27:36.108] Gres Name=gpu Type=a100 Count=1
[2021-12-06T06:27:36.108] Gres Name=gpu Type=a100 Count=1
[2021-12-06T06:27:36.108] Gres Name=gpu Type=a100 Count=1


slurmd does seem to spot that something is odd, but doesn't seem overly concerned:

muller:nid001053:~ # fgrep 'GPU index' /var/spool/slurmd/nid001053.log
[2021-12-06T06:27:36.056] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
[2021-12-06T06:27:36.057] debug:  gpu/nvml: _get_system_gpu_list_nvml: Note: GPU index 0 is different from minor number 3
[2021-12-06T06:27:36.073] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 1:
[2021-12-06T06:27:36.073] debug:  gpu/nvml: _get_system_gpu_list_nvml: Note: GPU index 1 is different from minor number 2
[2021-12-06T06:27:36.090] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 2:
[2021-12-06T06:27:36.091] debug:  gpu/nvml: _get_system_gpu_list_nvml: Note: GPU index 2 is different from minor number 1
[2021-12-06T06:27:36.107] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 3:
[2021-12-06T06:27:36.107] debug:  gpu/nvml: _get_system_gpu_list_nvml: Note: GPU index 3 is different from minor number 0


Digging into what was being set in the environment showed:

> srun  --gpus-per-task=1 -C gpu -A nstaff_g -N 1 -n 4 --ntasks-per-node=4 -c 1 --cpu-bind=cores  --gpu-bind=closest  -l env 2>&1 | sort | egrep 'SLURM_STEP_GPUS|CUDA_VISIBLE'
0: CUDA_VISIBLE_DEVICES=0
0: SLURM_STEP_GPUS=0
1: CUDA_VISIBLE_DEVICES=1
1: SLURM_STEP_GPUS=1
2: CUDA_VISIBLE_DEVICES=2
2: SLURM_STEP_GPUS=2
3: CUDA_VISIBLE_DEVICES=3
3: SLURM_STEP_GPUS=3


Wondering if perhaps something was getting confused when reporting back to slurmctld, I hardwired the information that `slurmd -G` reported into gres.conf (disabling NVML autodetection) with the following, but that resulted in the same behaviour:

NodeName=$NODES Name=gpu Type=a100 File=/dev/nvidia0 Cores=0-15 Links=4,4,4,-1
NodeName=$NODES Name=gpu Type=a100 File=/dev/nvidia1 Cores=16-31 Links=4,4,-1,4
NodeName=$NODES Name=gpu Type=a100 File=/dev/nvidia2 Cores=32-47 Links=4,-1,4,4
NodeName=$NODES Name=gpu Type=a100 File=/dev/nvidia3 Cores=48-63 Links=-1,4,4,4

I got partway to a solution by lying to Slurm and reversing the numbering of the device files:

NodeName=$NODES Name=gpu Type=a100 File=/dev/nvidia3 Cores=0-15 Links=4,4,4,-1
NodeName=$NODES Name=gpu Type=a100 File=/dev/nvidia2 Cores=16-31 Links=4,4,-1,4
NodeName=$NODES Name=gpu Type=a100 File=/dev/nvidia1 Cores=32-47 Links=4,-1,4,4
NodeName=$NODES Name=gpu Type=a100 File=/dev/nvidia0 Cores=48-63 Links=-1,4,4,4

That got the SLURM_STEP_GPUS value correct, but the CUDA_VISIBLE_DEVICES values were still going the wrong way:

> srun  --reservation=nid001053 --gpus-per-task=1 -C gpu -A nstaff_g -N 1 -n 4 --ntasks-per-node=4 -c 32 --cpu-bind=cores  --gpu-bind=closest  -l env 2>&1 | sort | egrep 'SLURM_STEP_GPUS|CUDA_VISIBLE'
0: CUDA_VISIBLE_DEVICES=0
0: SLURM_STEP_GPUS=3
1: CUDA_VISIBLE_DEVICES=1
1: SLURM_STEP_GPUS=2
2: CUDA_VISIBLE_DEVICES=2
2: SLURM_STEP_GPUS=1
3: CUDA_VISIBLE_DEVICES=3
3: SLURM_STEP_GPUS=0

So I ended up overriding those values via the task prolog by inserting:

# Make CUDA_VISIBLE_DEVICES match SLURM_STEP_GPUS
if [ -n "${SLURM_STEP_GPUS}" ]; then
	echo export CUDA_VISIBLE_DEVICES=${SLURM_STEP_GPUS}
fi

which finally resulted in:

> srun  --gpus-per-task=1 -C gpu -A nstaff_g -N 1 -n 4 --ntasks-per-node=4 -c 32 --cpu-bind=cores  --gpu-bind=closest  -l env 2>&1 | sort | egrep 'SLURM_STEP_GPUS|CUDA_VISIBLE'
0: CUDA_VISIBLE_DEVICES=3
0: SLURM_STEP_GPUS=3
1: CUDA_VISIBLE_DEVICES=2
1: SLURM_STEP_GPUS=2
2: CUDA_VISIBLE_DEVICES=1
2: SLURM_STEP_GPUS=1
3: CUDA_VISIBLE_DEVICES=0
3: SLURM_STEP_GPUS=0

and then the test application supplied by the user seemed to concur:

> srun --gpus-per-task=1 -C gpu -A nstaff_g -N 1 -n 4 --ntasks-per-node=4 -c 32 --cpu-bind=cores --gpu-bind=closest -l ./gpus_for_tasks 2>&1 | sort
0: mpi=0 CUDA_VISIBLE_DEVICES=3 gpu=0000:C1:00.0 cpu=64 core_affinity=0-15,64-79
1: mpi=1 CUDA_VISIBLE_DEVICES=2 gpu=0000:81:00.0 cpu=89 core_affinity=16-31,80-95
2: mpi=2 CUDA_VISIBLE_DEVICES=1 gpu=0000:41:00.0 cpu=44 core_affinity=32-47,96-111
3: mpi=3 CUDA_VISIBLE_DEVICES=0 gpu=0000:02:00.0 cpu=54 core_affinity=48-63,112-127


So I think I have a workaround (which will go into production tomorrow), but I thought I should report the issue to get it looked at!

All the best,
Chris
Comment 1 Michael Hinton 2021-12-07 10:15:38 MST
Hi Chris,

(In reply to Chris Samuel (NERSC) from comment #0)
> From what I can tell the cause appears to be because the minor device
> ordering is the opposite of what you would expect, in that /dev/nvidia0 is
> GPU 3, /dev/nvidia1 is GPU 2, /dev/nvidia2 is GPU 1 and /dev/nvidia3 is GPU
> 0.
I believe this is a known set of issues with 20.11 and earlier that were fixed in 21.08 and beyond with commits https://github.com/SchedMD/slurm/commit/0ebfd37834 and https://github.com/SchedMD/slurm/commit/f589b480d8. I worked with Kilian on these issues in bugs 10827 and 10933 - bug 10827 is public, but bug 10933 is private.

Slurm's GPU code has always assumed that the minor numbering is in the same order as the device order as detected by NVML (i.e. PCI bus ID order; nvidia-smi shows this NVML order). This assumption is what the "Note: GPU index X is different from minor number Y" warning was alluding to. (Note that this NVML device order will also match the CUDA device order if CUDA_DEVICE_ORDER=PCI_BUS_ID.)

However, sometimes the NVML device order and the minor number order are not the same, as you have seen. This seems to frequently be the case on newer AMD systems with NVIDIA GPUs, for whatever reason. AutoDetect exacerbated this issue, since it does a bunch of internal sorting, causing the GPU order in Slurm to be changed in unexpected ways.
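
If it helps when checking other node types, a quick way (using only nvidia-smi, much as in your description) to see whether the NVML/PCI-bus-ID order and the minor numbers agree on a node is something like:

nvidia-smi -q | egrep '^GPU |Minor Number'

If the minor numbers don't increase in step with the PCI bus IDs shown on the "GPU" lines, that node has the ordering mismatch described above.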

But these issues should now all be fixed with the commits above in 21.08. Would you be willing to see if things are fixed for you on 21.08?

Thanks!
-Michael

P.S. Slurm further assumed that the trailing number in the device filename was equivalent to the minor number (e.g. X in /dev/nvidiaX). This is a bad assumption with AMD GPUs, so that issue was fixed as well with the above commits, since the device order was decoupled from the minor number and the device filename.
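
A purely illustrative aside (these device names are the typical DRM render-node paths on an AMD GPU node, not anything taken from this ticket):

ls /dev/dri/renderD*
/dev/dri/renderD128  /dev/dri/renderD129

Here the trailing number is neither a 0-based GPU index nor an NVML-style minor number, which is why that assumption had to go.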
Comment 2 Chris Samuel (NERSC) 2021-12-10 15:03:20 MST
Hi Michael,

Thanks for the info, that's really useful. Sadly, due to time pressure on us, we're not going to have an opportunity to go to 21.08 this year, but I'm hoping to make that jump, on Perlmutter at least, very early next year.

In the meantime I'll see if I can backport these two commits. Do you think that's feasible, or do they rely on too many other changes to work? I can tell that at least the first one doesn't apply cleanly to 20.11. :-)

All the best,
Chris
Comment 3 Michael Hinton 2021-12-10 15:29:00 MST
You aren't using any AMD GPUs, right?
Comment 4 Michael Hinton 2021-12-10 15:31:57 MST
(In reply to Chris Samuel (NERSC) from comment #2)
> In the meantime I'll see if I can backport these 2 commits, do you think
> that's feasible or do they rely on too many other changes to work?  I can
> tell at least the first one doesn't apply cleanly to 20.11. :-)
I can do that for you. I think it's possible to backport to 20.11, but it apparently needs more than those two commits, as I am finding out.
Comment 5 Chris Samuel (NERSC) 2021-12-10 15:46:23 MST
(In reply to Michael Hinton from comment #4)

> I can do that for you. I think it's possible to backport to 20.11, but it
> apparently needs more than those two commits, as I am finding out.

Oh fantastic, thank you so much!

All the best,
Chris
Comment 6 Michael Hinton 2021-12-14 11:14:31 MST
Created attachment 22673 [details]
20.11 v1

Chris, can you try out 20.11 v1 and see if it solves the issue? I believe it should, but I might have missed something. Thanks!
Comment 7 Chris Samuel (NERSC) 2021-12-14 12:33:25 MST
(In reply to Michael Hinton from comment #6)

> Chris, can you try out 20.11 v1 and see if it solves the issue? I believe it
> should, but I might have missed something. Thanks!

Thanks Michael!

Will get some RPMs built later this afternoon, much obliged!
Comment 8 Michael Hinton 2021-12-14 13:18:39 MST
(In reply to Chris Samuel (NERSC) from comment #0)
> So I think I have a workaround (which will go into production tomorrow) but
> I thought I should report the issue to get it looked at!
What was your workaround, btw? Did it work?
Comment 9 Chris Samuel (NERSC) 2021-12-14 17:14:56 MST
(In reply to Michael Hinton from comment #8)
> (In reply to Chris Samuel (NERSC) from comment #0)
> > So I think I have a workaround (which will go into production tomorrow) but
> > I thought I should report the issue to get it looked at!
> What was your workaround, btw? Did it work?

Oh sorry! No, it didn't. The workaround was in the description, and whilst it did fix the test case I got from the user, when I ran our reframe tests to confirm it didn't have wider-reaching impacts things exploded messily, so I had to back it out. :-(

All the best,
Chris
Comment 10 Chris Samuel (NERSC) 2021-12-14 18:28:25 MST
(In reply to Michael Hinton from comment #6)

> Chris, can you try out 20.11 v1 and see if it solves the issue? I believe it
> should, but I might have missed something. Thanks!

Looks good Michael, thank you!

Binding appears good.

> srun --gpus-per-task=1 -C gpu -A nstaff_g -N 1 -n 4 --ntasks-per-node=4 -c 32 --cpu-bind=cores --gpu-bind=closest -l ./gpus_for_tasks 2>&1 | sort
0: mpi=0 CUDA_VISIBLE_DEVICES=3 gpu=0000:C1:00.0 cpu=14 core_affinity=0-15,64-79
1: mpi=1 CUDA_VISIBLE_DEVICES=2 gpu=0000:81:00.0 cpu=24 core_affinity=16-31,80-95
2: mpi=2 CUDA_VISIBLE_DEVICES=1 gpu=0000:41:00.0 cpu=107 core_affinity=32-47,96-111
3: mpi=3 CUDA_VISIBLE_DEVICES=0 gpu=0000:03:00.0 cpu=115 core_affinity=48-63,112-127

Our reframe tests pass:

[2021-12-14T17:19:25-08:00] [  PASSED  ] Ran 47/47 test case(s) from 33 check(s) (0 failure(s), 0 skipped)


Much obliged!
Comment 11 Michael Hinton 2021-12-15 10:05:27 MST
(In reply to Chris Samuel (NERSC) from comment #10)
> Looks good Michael, thank you!
> 
> Binding appears good.
> 
> > srun --gpus-per-task=1 -C gpu -A nstaff_g -N 1 -n 4 --ntasks-per-node=4 -c 32 --cpu-bind=cores --gpu-bind=closest -l ./gpus_for_tasks 2>&1 | sort
> 0: mpi=0 CUDA_VISIBLE_DEVICES=3 gpu=0000:C1:00.0 cpu=14
> core_affinity=0-15,64-79
> 1: mpi=1 CUDA_VISIBLE_DEVICES=2 gpu=0000:81:00.0 cpu=24
> core_affinity=16-31,80-95
> 2: mpi=2 CUDA_VISIBLE_DEVICES=1 gpu=0000:41:00.0 cpu=107
> core_affinity=32-47,96-111
> 3: mpi=3 CUDA_VISIBLE_DEVICES=0 gpu=0000:03:00.0 cpu=115
> core_affinity=48-63,112-127
> 
> Our reframe tests pass:
> 
> [2021-12-14T17:19:25-08:00] [  PASSED  ] Ran 47/47 test case(s) from 33
> check(s) (0 failure(s), 0 skipped)
> 
> 
> Much obliged!
Excellent! I'm glad it's working, and that we have a patch for anyone else running into this issue on 20.11.

I'll go ahead and close this out. Thanks!
-Michael
Comment 12 Chris Samuel (NERSC) 2021-12-15 22:46:52 MST
Thanks Michael!  This should go on to Perlmutter next week.