Ticket 15614

Summary: Missing CPU cores in the autodetected GPU config
Product: Slurm    Reporter: hpc-ops
Component: slurmd    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---    CC: cinek
Version: 22.05.6
Hardware: Linux
OS: Linux
Site: Ghent
Attachments: slurm.conf
slurmd log (debug4)

Description hpc-ops 2022-12-13 07:21:40 MST
Hi,


I'm looking at the GPU configuration output that slurmd shows, as we're moving from AutoDetect=nvml to a static configuration, since autodetection sometimes seems to fail after we update the GPU drivers.


I get


[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: nvidia_a100-sxm4-80gb
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-d09a4dcf-6935-5ebe-3f1a-53f21c0b3377
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:1:0
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:01:00.0
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1,4,4,4
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia0
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 18-23
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 18-23
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml:     MIG mode: disabled


[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 1:
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: nvidia_a100-sxm4-80gb
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-d9c5a758-9ba5-dbef-2414-959b143e17b3
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:65:0
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:41:00.0
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 4,-1,4,4
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia1
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 6-11
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 6-11
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     MIG mode: disabled


[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 2:
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: nvidia_a100-sxm4-80gb
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-cea97538-e5c2-7045-5da9-6568a583441b
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:129:0
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:81:00.0
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 4,4,-1,4
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia2
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 42-47
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 42-47
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml:     MIG mode: disabled




[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 3:
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: nvidia_a100-sxm4-80gb
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-8c83fe78-bda0-38c1-e2c3-9e2ea8bd33d8
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:193:0
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:C1:00.0
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 4,4,4,-1
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia3
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 30-35
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 30-35
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml:     MIG mode: disabled


Any idea why we only see part of the total number of cores listed in the range?


We have 48 cores in the machine, but somehow Slurm only seems to see 24 of them for GPU assignment.


Kind regards,
-- Andy
Comment 1 hpc-ops 2022-12-13 08:51:55 MST
Hi,

As a follow-up question, on a different GPU cluster we have 4 GPUs per node and get:


GPU 0:

[2022-12-13T16:00:02.070] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
[2022-12-13T16:00:02.070] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 0-15

GPU 1:

[2022-12-13T16:00:02.071] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
[2022-12-13T16:00:02.071] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 0-15

GPU 2:

[2022-12-13T16:00:02.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
[2022-12-13T16:00:02.073] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 16-31

GPU 3:

[2022-12-13T16:00:02.074] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
[2022-12-13T16:00:02.074] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 16-31


So here we have 16 cores per GPU, instead of the expected 8. 

How should we interpret this to generate gres.conf?


-- Andy
Comment 2 Marcin Stolarek 2022-12-13 09:25:32 MST
Could you please share the output of `lstopo-no-graphics`?

>How should we interpret this to generate gres.conf?
You can rely on autodetect, without the need to list the GPUs manually in gres.conf.

cheers,
Marcin
Comment 3 hpc-ops 2022-12-13 23:58:05 MST
Hi,


The problem is that with autodetect we had these in the logs:

[2022-12-13T09:00:01.362] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL
[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL
[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL
[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL


And we could not get jobs to find the GPUs.

This is the output for the first cluster (accelgor), i.e. the first original question:

Machine (503GB total)
  Package L#0
    L3 L#0 (32MB)
      NUMANode L#0 (P#0 62GB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      HostBridge
        PCIBridge
          PCIBridge
            PCI 62:00.0 (VGA)
    L3 L#1 (32MB)
      NUMANode L#1 (P#1 63GB)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      HostBridge
        PCIBridge
          PCI 41:00.0 (3D)
    L3 L#2 (32MB)
      NUMANode L#2 (P#2 63GB)
      L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      HostBridge
        PCIBridge
          PCI 21:00.0 (InfiniBand)
            Net "ib0"
            OpenFabrics "mlx5_0"
    L3 L#3 (32MB)
      NUMANode L#3 (P#3 63GB)
      L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      HostBridge
        PCIBridge
          PCI 01:00.0 (3D)
  Package L#1
    L3 L#4 (32MB)
      NUMANode L#4 (P#4 63GB)
      L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
      HostBridge
        PCIBridge
          PCI e2:00.0 (RAID)
            Block(Disk) "sda"
        PCIBridge
          PCI e1:00.0 (Ethernet)
            Net "em1"
          PCI e1:00.1 (Ethernet)
            Net "em2"
    L3 L#5 (32MB)
      NUMANode L#5 (P#5 63GB)
      L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (512KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (512KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (512KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (512KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
      HostBridge
        PCIBridge
          PCI c1:00.0 (3D)
        PCIBridge
          PCI c4:00.0 (SATA)
    L3 L#6 (32MB)
      NUMANode L#6 (P#6 63GB)
      L2 L#36 (512KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (512KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (512KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (512KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (512KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (512KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
    L3 L#7 (32MB)
      NUMANode L#7 (P#7 63GB)
      L2 L#42 (512KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (512KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (512KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (512KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (512KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (512KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
      HostBridge
        PCIBridge
          PCI 81:00.0 (3D)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)

This is the output for the second cluster (joltik), i.e. the follow-up question:

Machine (376GB total)
  Package L#0
    NUMANode L#0 (P#0 187GB)
    L3 L#0 (22MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#16)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#18)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#20)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#22)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#24)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#26)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#28)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#30)
    HostBridge
      PCI 00:11.5 (SATA)
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "em3"
        PCI 02:00.1 (Ethernet)
          Net "em4"
      PCIBridge
        PCIBridge
          PCI 05:00.0 (VGA)
      PCIBridge
        PCI 06:00.0 (SATA)
          Block(Disk) "sdb"
          Block(Disk) "sda"
      PCIBridge
        PCI 01:00.0 (Ethernet)
          Net "em1"
        PCI 01:00.1 (Ethernet)
          Net "em2"
    HostBridge
      PCIBridge
        PCI 18:00.0 (3D)
    HostBridge
      PCIBridge
        PCI 3b:00.0 (3D)
    HostBridge
      PCIBridge
        PCI 5e:00.0 (InfiniBand)
          Net "ib0"
          OpenFabrics "mlx5_0"
  Package L#1
    NUMANode L#1 (P#1 189GB)
    L3 L#1 (22MB)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#1)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#3)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#5)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#7)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#9)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#11)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#13)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#15)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#17)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#19)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#21)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#23)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#25)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#27)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#29)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
    HostBridge
      PCIBridge
        PCI 86:00.0 (3D)
    HostBridge
      PCIBridge
        PCI af:00.0 (3D)
    HostBridge
      PCIBridge
        PCI d8:00.0 (InfiniBand)
          Net "ib1"
          OpenFabrics "mlx5_1"
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
Comment 4 Marcin Stolarek 2022-12-14 00:59:34 MST
>[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL

How is the device configured in slurm.conf? Autodetect requires the device name to match the name returned by NVML, as documented[1]:
>NOTE: If using autodetect functionality and defining the Type in your gres.conf file, the Type specified should match or be a substring of the value that is detected, using an underscore in lieu of any spaces.
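For the `tesla_v100-sxm2-32gb` devices from your log, a static type declaration compatible with autodetect might look like this (the File range is an assumption on my side, matching a 4-GPU node):

```
# gres.conf (sketch): Type must match, or be a substring of, the NVML name
AutoDetect=nvml
Name=gpu Type=tesla_v100-sxm2-32gb File=/dev/nvidia[0-3]
```

The corresponding slurm.conf node definition would then carry a matching typed GRES, e.g. Gres=gpu:tesla_v100-sxm2-32gb:4.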

The result of lstopo is consistent with the cores shown by autodetect. Take a look at "3D" devices, like:
>PCI 41:00.0 (3D)
the closest cores are:
>      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)

which matches:
>[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia1
>[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 6-11
>[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 6-11
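Equivalently, if you do move to a fully static gres.conf on the accelgor nodes, the debug2 output above translates roughly into the following (node name taken from your later shell output; double-check against your hardware before deploying):

```
# gres.conf (sketch derived from the autodetect output)
NodeName=node3903 Name=gpu Type=nvidia_a100-sxm4-80gb File=/dev/nvidia0 Cores=18-23
NodeName=node3903 Name=gpu Type=nvidia_a100-sxm4-80gb File=/dev/nvidia1 Cores=6-11
NodeName=node3903 Name=gpu Type=nvidia_a100-sxm4-80gb File=/dev/nvidia2 Cores=42-47
NodeName=node3903 Name=gpu Type=nvidia_a100-sxm4-80gb File=/dev/nvidia3 Cores=30-35
```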

Is that clearer for you now?

cheers,
Marcin
[1]https://slurm.schedmd.com/gres.conf.html#OPT_Type
Comment 5 Marcin Stolarek 2022-12-20 09:40:59 MST
Is there anything else I can help you with in the ticket?
Comment 6 hpc-ops 2022-12-20 09:48:44 MST
Hi Marcin,

We are back to using autodetect and checking what we get core-wise in a job, using the latest slurm-22.05 branch. 

So far, we seem to be getting the wrong cores. I'll update the ticket as we learn more.


-- Andy
Comment 7 hpc-ops 2022-12-20 09:57:28 MST
Hi,


Looking at what we get when asking for 1, 2, 3, and 4 GPUs:

For clarity, the qsub wrapper generates this:


/usr/bin/salloc --reservation=maintenance2022Q4 --account=gvo00002 --cpus-per-gpu=48 --gres=gpu:1 --job-name=INTERACTIVE --mail-type=NONE --nodes=1 --ntasks-per-node=48 --ntasks=48 --time=01:00:00 /usr/bin/srun --chdir=/user/gent/400/vsc40075 --cpu-bind=none --export=USER,HOME,TERM --mem=0 --mpi=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l

when asking for 1 GPU

and ... 


/usr/bin/salloc --reservation=maintenance2022Q4 --account=gvo00002 --cpus-per-gpu=12 --gres=gpu:4 --job-name=INTERACTIVE --mail-type=NONE --nodes=1 --ntasks-per-node=48 --ntasks=48 --time=01:00:00 /usr/bin/srun --chdir=/user/gent/400/vsc40075 --cpu-bind=none --export=USER,HOME,TERM --mem=0 --mpi=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l

when asking for 4 GPUs




vsc40075 in < gligar07.gastly.os > ~ via 🐍 v3.6.8 accelgor took 6m4s
➜ qsub -I -l nodes=1:ppn=all:gpus=1 -A gvo00002 --pass=reservation=maintenance2022Q4
salloc: Granted job allocation 15054539
salloc: Waiting for resource configuration
salloc: Nodes node3903.accelgor.os are ready for job

< node3903.accelgor.os > ~ via 🐍 v3.6.8 accelgor
➜ taskset -cp $$
pid 273822's current affinity list: 0-11,18-23




➜ qsub -I -l nodes=1:ppn=all:gpus=2 -A gvo00002 --pass=reservation=maintenance2022Q4
salloc: Granted job allocation 15054540
salloc: Waiting for resource configuration
salloc: Nodes node3903.accelgor.os are ready for job

< node3903.accelgor.os > ~ via 🐍 v3.6.8 accelgor
➜ taskset -cp $$
pid 274345's current affinity list: 18-23,42-47



➜ qsub -I -l nodes=1:ppn=all:gpus=3 -A gvo00002 --pass=reservation=maintenance2022Q4
salloc: Granted job allocation 15054541
salloc: Waiting for resource configuration
salloc: Nodes node3903.accelgor.os are ready for job

< node3903.accelgor.os > ~ via 🐍 v3.6.8 accelgor
➜ taskset -cp $$
pid 274683's current affinity list: 6-11,18-23,42-47




❯ qsub -I -l nodes=1:ppn=all:gpus=4 -A gvo00002 --pass=reservation=maintenance2022Q4
salloc: Granted job allocation 15054542
salloc: Waiting for resource configuration
salloc: Nodes node3903.accelgor.os are ready for job

< node3903.accelgor.os > ~ via 🐍 v3.6.8 accelgor
➜ taskset -cp $$
pid 275136's current affinity list: 6-11,18-23,30-35,42-47
Comment 8 Marcin Stolarek 2022-12-20 11:49:30 MST
Please upload your slurm.conf. 
cheers,
Marcin
Comment 9 hpc-ops 2022-12-20 13:28:09 MST
Created attachment 28259 [details]
slurm.conf
Comment 10 hpc-ops 2022-12-20 13:28:38 MST
Hi,

Additionally:

[root@node3903 ~]# cat /etc/slurm/gres.conf
AutoDetect=nvml


Kind regards,
-- Andy
Comment 11 hpc-ops 2022-12-21 01:19:11 MST
Hi,

Not sure if this helps, but looking at the cgroups we see (asking for 4 GPUs):


[root@node3903 job_15054547]# cat cpuset.cpus
0-47
[root@node3903 job_15054547]# cat step_0/cpuset.cpus
6-11,18-23,30-35,42-47
[root@node3903 job_15054547]# cat step_extern/cpuset.cpus
0-47


-- Andy
Comment 12 hpc-ops 2022-12-21 01:24:09 MST
Created attachment 28266 [details]
slurmd log (debug4)

This is the log from starting job 15054547.
Comment 13 Marcin Stolarek 2022-12-21 02:24:57 MST
Two major things I see, looking at the configuration and node topology:
1) By default Slurm treats NUMA nodes as sockets. You can change that behavior using Ignore_NUMA[1] if you run an OS with hwloc prior to 2.0. For systems with hwloc >= 2.0 you can use different SlurmdParameters[2].
2) If you want salloc to result in an "interactive" step running on the node, the recommended way in Slurm is use_interactive_step[3].

I'd recommend switching to those and verifying whether that works as expected for you. On the other hand, I'm not sure why/how you end up with the following set of options (I left only the options affecting allocation, steps, and binding):

>salloc --cpus-per-gpu=48 --gres=gpu:1 --nodes=1 --ntasks-per-node=48 --ntasks=48 \
>srun --cpu-bind=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l
What is the reason to ask salloc for 48 tasks and srun for only 1 task? If you want the whole node allocated, the more appropriate way to do that is --exclusive[4]; then, if the goal is to share resources between steps in the job allocation, use --overlap[5]. Finally (I hope I understood the goals correctly) I'd recommend:
-) add Ignore_NUMA to SchedulerParameters
-) set LaunchParameters=use_interactive_step 
-) use salloc --exclusive -N1 to get an interactive step with all resources of the node or if you want that to be enforced for every job set Oversubscribe=EXCLUSIVE[6] on the partition.
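Putting the recommendations together, the relevant configuration fragment would look roughly like this (the partition line is illustrative; option spellings per slurm.conf(5)):

```
# slurm.conf (sketch)
SchedulerParameters=Ignore_NUMA          # only needed with hwloc < 2.0
# SlurmdParameters=numa_node_as_socket   # alternative knob for hwloc >= 2.0
LaunchParameters=use_interactive_step
PartitionName=gpu Nodes=node[3901-3910] OverSubscribe=EXCLUSIVE
```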

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_Ignore_NUMA
[2]https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdParameters
[3]https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step
[4]https://slurm.schedmd.com/salloc.html#OPT_exclusive
[5]https://slurm.schedmd.com/srun.html#OPT_overlap
[6]https://slurm.schedmd.com/slurm.conf.html#OPT_EXCLUSIVE
Comment 14 Marcin Stolarek 2023-01-01 22:17:22 MST
Is there anything else I can help you with in the case?
Comment 15 hpc-ops 2023-01-11 08:40:06 MST
Hi Marcin,

We updated to 22.05.7 and set numa_node_as_socket in SlurmdParameters. This fixes the issue when running via a job.

The interactive-job behaviour is still mind-bogglingly complex. I think I understand the mechanisms behind it, but it is still very annoying that users have to choose between exclusive or overlap upfront.

We as admins (and also most users) would like some form of interactive-job option combo that gives us the same environment as a regular job script, typically to debug job scripts (setting overlap is hardly the same, imho). For that we now, e.g., start tmux via a job and connect to the tmux session, but that is a tad complicated as well.

stijn
Comment 16 Marcin Stolarek 2023-01-12 02:00:57 MST
Stijn,

Could you please open a separate ticket to discuss the details of "interactive job" - do you mean LaunchParameters=use_interactive_step?

It looks to me like it goes in a different direction than comment 0, and we'd like to keep the case focused on a single topic. This way it's easier to review the case if a fix/change is required, and we can make sure appropriate resources are assigned.

cheers,
Marcin
Comment 17 hpc-ops 2023-01-13 04:42:51 MST
OK, I'll open a new ticket. You can close this one.