Hi,

I'm looking at the GPU config output slurmd shows, as we're moving from AutoDetect=nvml to a static configuration, since autodetect sometimes seems to fail after we update the GPU drivers. I get:

[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-d09a4dcf-6935-5ebe-3f1a-53f21c0b3377
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:1:0
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:01:00.0
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,4,4,4
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia0
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 18-23
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 18-23
[2022-12-13T15:00:54.833] debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 1:
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-d9c5a758-9ba5-dbef-2414-959b143e17b3
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:65:0
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:41:00.0
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 4,-1,4,4
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia1
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 6-11
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 6-11
[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 2:
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-cea97538-e5c2-7045-5da9-6568a583441b
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:129:0
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:81:00.0
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 4,4,-1,4
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia2
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 42-47
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 42-47
[2022-12-13T15:00:54.896] debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 3:
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-8c83fe78-bda0-38c1-e2c3-9e2ea8bd33d8
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:193:0
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:C1:00.0
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 4,4,4,-1
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia3
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 30-35
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 30-35
[2022-12-13T15:00:54.928] debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled

Any idea why we only see part of the total number of cores listed in the ranges? We have 48 cores in the machine, but somehow Slurm only seems to see 24 (4 x 6) for GPU assignment.
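For reference, the static gres.conf I would derive from this output is something along these lines (a sketch: Cores= is copied from the abstract core ranges reported per GPU, and I'm assuming the device-file to GPU mapping stays stable):

  # hypothetical hand-written gres.conf for these nodes
  Name=gpu Type=nvidia_a100-sxm4-80gb File=/dev/nvidia0 Cores=18-23
  Name=gpu Type=nvidia_a100-sxm4-80gb File=/dev/nvidia1 Cores=6-11
  Name=gpu Type=nvidia_a100-sxm4-80gb File=/dev/nvidia2 Cores=42-47
  Name=gpu Type=nvidia_a100-sxm4-80gb File=/dev/nvidia3 Cores=30-35

Kind regards,
-- Andy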
Hi,

As a followup question, on a different GPU cluster with 4 GPUs per node we get:

GPU 0:
[2022-12-13T16:00:02.070] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
[2022-12-13T16:00:02.070] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 0-15
GPU 1:
[2022-12-13T16:00:02.071] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
[2022-12-13T16:00:02.071] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 0-15
GPU 2:
[2022-12-13T16:00:02.073] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
[2022-12-13T16:00:02.073] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 16-31
GPU 3:
[2022-12-13T16:00:02.074] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
[2022-12-13T16:00:02.074] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 16-31

So here we have 16 cores per GPU instead of the expected 8. How should we interpret this to generate gres.conf?
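If I read it right, the static equivalent here would presumably be (a sketch: device-file order and the GPU-to-socket mapping are assumptions on my part):

  # hypothetical gres.conf: two GPUs per socket, sharing that socket's cores
  Name=gpu File=/dev/nvidia[0-1] Cores=0-15
  Name=gpu File=/dev/nvidia[2-3] Cores=16-31

-- Andy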
Could you please share the output of `lstopo-no-graphics`?

>How should we interpret this to generate gres.conf?

You can rely on autodetect, without the need to list the GPUs manually in gres.conf.

cheers,
Marcin
Hi,

The problem is that with autodetect we had these in the logs:

[2022-12-13T09:00:01.362] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL
[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL
[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL
[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL

and we could not get jobs to find the GPUs.

This is the output for the first cluster (accelgor), i.e. the original question:

Machine (503GB total)
  Package L#0
    L3 L#0 (32MB)
      NUMANode L#0 (P#0 62GB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      HostBridge
        PCIBridge
          PCIBridge
            PCI 62:00.0 (VGA)
    L3 L#1 (32MB)
      NUMANode L#1 (P#1 63GB)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      HostBridge
        PCIBridge
          PCI 41:00.0 (3D)
    L3 L#2 (32MB)
      NUMANode L#2 (P#2 63GB)
      L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      HostBridge
        PCIBridge
          PCI 21:00.0 (InfiniBand)
            Net "ib0"
            OpenFabrics "mlx5_0"
    L3 L#3 (32MB)
      NUMANode L#3 (P#3 63GB)
      L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      HostBridge
        PCIBridge
          PCI 01:00.0 (3D)
  Package L#1
    L3 L#4 (32MB)
      NUMANode L#4 (P#4 63GB)
      L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
      HostBridge
        PCIBridge
          PCI e2:00.0 (RAID)
            Block(Disk) "sda"
        PCIBridge
          PCI e1:00.0 (Ethernet)
            Net "em1"
          PCI e1:00.1 (Ethernet)
            Net "em2"
    L3 L#5 (32MB)
      NUMANode L#5 (P#5 63GB)
      L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (512KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (512KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (512KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (512KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
      HostBridge
        PCIBridge
          PCI c1:00.0 (3D)
        PCIBridge
          PCI c4:00.0 (SATA)
    L3 L#6 (32MB)
      NUMANode L#6 (P#6 63GB)
      L2 L#36 (512KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (512KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (512KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (512KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (512KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (512KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
    L3 L#7 (32MB)
      NUMANode L#7 (P#7 63GB)
      L2 L#42 (512KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (512KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (512KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (512KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (512KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (512KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
      HostBridge
        PCIBridge
          PCI 81:00.0 (3D)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)

This is the output for the second cluster (joltik), i.e. the followup question:

Machine (376GB total)
  Package L#0
    NUMANode L#0 (P#0 187GB)
    L3 L#0 (22MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#16)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#18)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#20)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#22)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#24)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#26)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#28)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#30)
    HostBridge
      PCI 00:11.5 (SATA)
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "em3"
        PCI 02:00.1 (Ethernet)
          Net "em4"
      PCIBridge
        PCIBridge
          PCI 05:00.0 (VGA)
      PCIBridge
        PCI 06:00.0 (SATA)
          Block(Disk) "sdb"
          Block(Disk) "sda"
      PCIBridge
        PCI 01:00.0 (Ethernet)
          Net "em1"
        PCI 01:00.1 (Ethernet)
          Net "em2"
    HostBridge
      PCIBridge
        PCI 18:00.0 (3D)
    HostBridge
      PCIBridge
        PCI 3b:00.0 (3D)
    HostBridge
      PCIBridge
        PCI 5e:00.0 (InfiniBand)
          Net "ib0"
          OpenFabrics "mlx5_0"
  Package L#1
    NUMANode L#1 (P#1 189GB)
    L3 L#1 (22MB)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#1)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#3)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#5)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#7)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#9)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#11)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#13)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#15)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#17)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#19)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#21)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#23)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#25)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#27)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#29)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
    HostBridge
      PCIBridge
        PCI 86:00.0 (3D)
    HostBridge
      PCIBridge
        PCI af:00.0 (3D)
    HostBridge
      PCIBridge
        PCI d8:00.0 (InfiniBand)
          Net "ib1"
          OpenFabrics "mlx5_1"
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
>[2022-12-13T09:00:01.362] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `tesla_v100-sxm2-32gb`. Setting system GRES type to NULL

How is the device configured in slurm.conf? Autodetect requires the device name to match the name returned by NVML, as documented[1]:

>NOTE: If using autodetect functionality and defining the Type in your gres.conf
>file, the Type specified should match or be a substring of the value that is
>detected, using an underscore in lieu of any spaces.

The result of lstopo is consistent with the cores shown by autodetect. Take a look at the "3D" devices, like:

>PCI 41:00.0 (3D)

The closest cores are:

> L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
> L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
> L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
> L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
> L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
> L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)

which matches:

>[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia1
>[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 6-11
>[2022-12-13T15:00:54.865] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 6-11
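Regarding the type mismatch above, a minimal sketch of a matching configuration for those V100 nodes would be (the NodeName pattern is a placeholder; the point is that the configured type must match or be a substring of the NVML-detected name):

  # slurm.conf -- "tesla_v100" is a substring of the detected
  # "tesla_v100-sxm2-32gb", so autodetect can pair them up
  NodeName=node[001-004] Gres=gpu:tesla_v100:4

  # gres.conf
  AutoDetect=nvml

Is that clearer now?

cheers,
Marcin

[1]https://slurm.schedmd.com/gres.conf.html#OPT_Type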
Is there anything else I can help you with in the ticket?
Hi Marcin,

We are back to using autodetect and checking what we get core-wise in a job, using the latest slurm-22.05 branch. So far, we seem to be getting the wrong cores. I'll update the ticket as we learn more.

-- Andy
Hi,

Looking at what we get when asking for 1, 2, 3 and 4 GPUs. For clarity, the qsub wrapper generates this when asking for 1 GPU:

/usr/bin/salloc --reservation=maintenance2022Q4 --account=gvo00002 --cpus-per-gpu=48 --gres=gpu:1 --job-name=INTERACTIVE --mail-type=NONE --nodes=1 --ntasks-per-node=48 --ntasks=48 --time=01:00:00 /usr/bin/srun --chdir=/user/gent/400/vsc40075 --cpu-bind=none --export=USER,HOME,TERM --mem=0 --mpi=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l

and this when asking for 4 GPUs:

/usr/bin/salloc --reservation=maintenance2022Q4 --account=gvo00002 --cpus-per-gpu=12 --gres=gpu:4 --job-name=INTERACTIVE --mail-type=NONE --nodes=1 --ntasks-per-node=48 --ntasks=48 --time=01:00:00 /usr/bin/srun --chdir=/user/gent/400/vsc40075 --cpu-bind=none --export=USER,HOME,TERM --mem=0 --mpi=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l

The resulting sessions:

vsc40075 in < gligar07.gastly.os > ~ via 🐍 v3.6.8 accelgor took 6m4s
➜ qsub -I -l nodes=1:ppn=all:gpus=1 -A gvo00002 --pass=reservation=maintenance2022Q4
salloc: Granted job allocation 15054539
salloc: Waiting for resource configuration
salloc: Nodes node3903.accelgor.os are ready for job

< node3903.accelgor.os > ~ via 🐍 v3.6.8 accelgor
➜ taskset -cp $$
pid 273822's current affinity list: 0-11,18-23

➜ qsub -I -l nodes=1:ppn=all:gpus=2 -A gvo00002 --pass=reservation=maintenance2022Q4
salloc: Granted job allocation 15054540
salloc: Waiting for resource configuration
salloc: Nodes node3903.accelgor.os are ready for job

< node3903.accelgor.os > ~ via 🐍 v3.6.8 accelgor
➜ taskset -cp $$
pid 274345's current affinity list: 18-23,42-47

➜ qsub -I -l nodes=1:ppn=all:gpus=3 -A gvo00002 --pass=reservation=maintenance2022Q4
salloc: Granted job allocation 15054541
salloc: Waiting for resource configuration
salloc: Nodes node3903.accelgor.os are ready for job

< node3903.accelgor.os > ~ via 🐍 v3.6.8 accelgor
➜ taskset -cp $$
pid 274683's current affinity list: 6-11,18-23,42-47

❯ qsub -I -l nodes=1:ppn=all:gpus=4 -A gvo00002 --pass=reservation=maintenance2022Q4
salloc: Granted job allocation 15054542
salloc: Waiting for resource configuration
salloc: Nodes node3903.accelgor.os are ready for job

< node3903.accelgor.os > ~ via 🐍 v3.6.8 accelgor
➜ taskset -cp $$
pid 275136's current affinity list: 6-11,18-23,30-35,42-47
Please upload your slurm.conf.

cheers,
Marcin
Created attachment 28259: slurm.conf
Hi,

Additionally:

[root@node3903 ~]# cat /etc/slurm/gres.conf
AutoDetect=nvml

Kind regards,
-- Andy
Hi,

Not sure if this helps, but looking at the cgroups (when asking for 4 GPUs) we see:

[root@node3903 job_15054547]# cat cpuset.cpus
0-47
[root@node3903 job_15054547]# cat step_0/cpuset.cpus
6-11,18-23,30-35,42-47
[root@node3903 job_15054547]# cat step_extern/cpuset.cpus
0-47
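For completeness, we walked the job's cgroup with a loop along these lines (a sketch assuming the usual cgroup v1 cpuset hierarchy; the exact path may differ per setup):

  # print cpuset.cpus for the job itself and for each of its steps
  for f in /sys/fs/cgroup/cpuset/slurm/uid_*/job_15054547/{,*/}cpuset.cpus; do
      echo "$f: $(cat "$f")"
  done

-- Andy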
Created attachment 28266: slurmd log (debug4)

This is the log from starting job 15054547.
Two major things I see looking at the configuration and node topology:

1) By default Slurm treats NUMA nodes as sockets. You can change that behavior using Ignore_NUMA[1] if you run an OS with hwloc prior to 2.0. For systems with hwloc >= 2.0 you can use the corresponding SlurmdParameters[2].

2) If you want salloc to result in an "interactive" step running on the node, the recommended way in Slurm is use_interactive_step[3].

I'd recommend switching to those and verifying that it works as expected for you.

On the other hand, I'm not sure why/how you end up with the following set of options (I left only the options affecting allocation, steps and binding):

>salloc --cpus-per-gpu=48 --gres=gpu:1 --nodes=1 --ntasks-per-node=48 --ntasks=48 \
>srun --cpu-bind=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l

What is the reason to ask salloc for 48 tasks and srun for only 1 task? If you want to get the whole node allocated, the more appropriate way to do that is --exclusive[4]; then, if the goal is to share resources between steps in the job allocation, use --overlap[5].

Finally (I hope I understood the goals correctly) I'd recommend the following; see the sketch after this list:
-) add Ignore_NUMA to SchedulerParameters
-) set LaunchParameters=use_interactive_step
-) use salloc --exclusive -N1 to get an interactive step with all the resources of the node, or, if you want that to be enforced for every job, set Oversubscribe=EXCLUSIVE[6] on the partition.
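Put together, the slurm.conf side would look roughly like this (a sketch: merge these with any existing values of the same parameters rather than replacing them):

  # slurm.conf
  SchedulerParameters=Ignore_NUMA        # hwloc < 2.0
  # with hwloc >= 2.0, use the SlurmdParameters equivalent instead, e.g.:
  # SlurmdParameters=numa_node_as_socket
  LaunchParameters=use_interactive_step

and on the user side:

  salloc --exclusive -N1 --gres=gpu:4

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_Ignore_NUMA
[2]https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdParameters
[3]https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step
[4]https://slurm.schedmd.com/salloc.html#OPT_exclusive
[5]https://slurm.schedmd.com/srun.html#OPT_overlap
[6]https://slurm.schedmd.com/slurm.conf.html#OPT_EXCLUSIVE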
Is there anything else I can help you with in the case?
Hi Marcin,

We updated to 22.05.7 and set numa_node_as_socket in SlurmdParameters; this fixes the issue when running via a job.

The interactive-job behaviour is still mindbogglingly complex. I think I understand the mechanisms behind it, but it is still very annoying that users have to choose between exclusive or overlap upfront. We as admins (and also most users) would like some interactive-job option combo that gives us the same environment as a regular job script, typically to debug job scripts (setting overlap is hardly the same, imho). For that we now e.g. start tmux via a job and connect to the tmux session, but that is a tad complicated as well.
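For reference, our tmux workaround is roughly the sketch below (the session naming is our own convention, not anything Slurm provides):

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --exclusive
  # start a detached tmux server from the batch step, so anything
  # attached to it runs inside the job's cgroup (cores/GPUs/memory)
  session="interactive-${SLURM_JOB_ID}"
  tmux new-session -d -s "$session"
  # keep the job alive for as long as the tmux session exists
  while tmux has-session -t "$session" 2>/dev/null; do
      sleep 30
  done

and then connecting with something like

  srun --jobid=<jobid> --overlap --pty tmux attach -t interactive-<jobid>

which is exactly the kind of ceremony we'd like to avoid.

Stijn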
Stijn,

Could you please open a separate ticket to discuss the details of "interactive jobs" - do you mean LaunchParameters=use_interactive_step? It looks to me like this goes in a different direction than comment 0, and we'd like to keep the case focused on a single topic. This way it's easier to review the case if a fix/change is required, and we can make sure appropriate resources are assigned.

cheers,
Marcin
OK, I'll open a new ticket. You can close this one.