| Summary: | GRES cores doesn't match socket boundaries | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Yann <yann.sagon> |
| Component: | GPU | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | OPEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | jvilarru, niccolo.tosato, ricard, shai.haim, tal.friedman |
| Version: | 24.11.1 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://support.schedmd.com/show_bug.cgi?id=22741 | | |
| Site: | Université de Genève | | |
| Version Fixed: | 25.05.0 | | |
Description
Yann
2025-04-03 05:51:53 MDT
I'm able to resume the node if I restart slurmd.

Hi Yann,

Does this happen when you issue an scontrol reconfigure, and on nodes where you have either CpuSpecList or CoreSpecCount? Regards.

I tried:

(baobab)-[root@admin1 users] (master)$ scontrol reconfigure gpu012
(baobab)-[root@admin1 users] (master)$ scontrol show node gpu012
NodeName=gpu012 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=10 CPUEfctv=22 CPUTot=24 CPULoad=1.89
   AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   ActiveFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   Gres=gpu:nvidia_geforce_rtx_2080_ti:8(S:0-1),VramPerGpu:no_consume:11G
   NodeAddr=gpu012 NodeHostName=gpu012 Version=24.11.1
   OS=Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024
   RealMemory=257000 AllocMem=122880 FreeMem=182572 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=11,23
   State=MIXED+DRAIN+INVALID_REG ThreadsPerCore=1 TmpDisk=300000 Weight=30 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu,private-dpnc-gpu
   BootTime=2025-03-24T17:21:12 SlurmdStartTime=2025-04-03T15:48:07
   LastBusyTime=2025-04-03T15:48:07 ResumeAfterTime=None
   CfgTRES=cpu=22,mem=257000M,billing=108,gres/gpu=8,gres/gpu:nvidia_geforce_rtx_2080_ti=8
   AllocTRES=cpu=10,mem=120G,gres/gpu=1,gres/gpu:nvidia_geforce_rtx_2080_ti=1
   CurrentWatts=0 AveWatts=0
   Reason=gres/gpu GRES core specification 0-10 doesn't match socket boundaries. (Socket 0 is cores 0-12) [slurm@2025-04-03T15:48:07]

Sorry, wrong copy-paste earlier: so yes, this does trigger the issue, and yes, we have CoreSpecCount=2 on our GPU nodes.

Good, we are already working on a solution for that. What is happening here is the following:

- slurmd starts normally after, for example, a systemctl restart slurmd.
- slurmd gets the "GPU to core" relationship based on the CPUs it sees; at this point of the initialization that is all of them.
- slurmd then applies the CoreSpecCount restriction to itself, or more precisely to the cgroup it is in.
- slurmd receives an "scontrol reconfigure" event from the controller and spawns a copy of itself inside the same cgroup.
- the new slurmd tries to get the socket-to-core information as its parent did, but unlike its parent it is not in a fresh cgroup: its CPU visibility is already limited by its parent. The information reported by NVML, which can see all the cores, is therefore not consistent with what the new slurmd sees (the view limited by CoreSpecCount), and the new slurmd process fails.

I will keep you updated on the progress of the solution; at this moment there is a patch in the review phase. Regards.

Hello Yann,

We have included the fix for cgroup/v2 for this issue in the following commits:

1e5795ba - cgroup/v2 - _unset_cpuset_mem_limits do not reset untouched limits
14d789f7 - cgroup/v2 - xfree a missing field in common_cgroup_ns_destroy
440a22f3 - cgroup/v2 - Store the init cgroup path in the cgroup namespace
cb10bc46 - cgroup/v2 - Fix slurmd reconfig not working when removing CoreSpecLimits
71d3ab39 - cgroup/v2 - Add log flag to reset memory.max limits
47bd81ab - cgroup/v2 - Fix slurmd reconfig not resetting CoreSpec limits with systemd
9a173446 - Merge branch 'cherrypick-995-24.11' into 'slurm-24.11'

These will ship in the next slurm 24.11 release. In case you are using cgroup/v1, we are still working on the fix for it. Regards.
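To see the mismatch described above on a live node before issuing a reconfigure, one can compare the CPU set exposed inside slurmd's cgroup with the per-GPU affinity NVML reports. This is only a diagnostic sketch; the cgroup path assumes a systemd-managed slurmd under the unified (v2) hierarchy and may differ on other systems.

```bash
# CPUs actually visible inside slurmd's cgroup (narrowed once CoreSpecCount /
# CpuSpecList limits have been applied). The path is an assumption for a
# systemd-managed slurmd on cgroup/v2:
cat /sys/fs/cgroup/system.slice/slurmd.service/cpuset.cpus.effective

# Per-GPU CPU affinity as NVML / nvidia-smi sees it (always the full set):
nvidia-smi topo -m

# If the cgroup set above is narrower than the per-GPU affinities, a slurmd
# re-exec triggered by "scontrol reconfigure" rebuilds its GPU topology from
# the narrowed view and can then trip the socket-boundary check.
```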
Hi,

We are experiencing a similar issue with cgroup v1, and seeing this case we would like to join the ticket. Is there an expected timeline for a fix for cgroup v1? It is currently blocking our upgrade to 24.11.x, so we would appreciate anything you could do to push for a solution. Regards, Tal

Dear all,

We want to join this ticket because we encountered similar problems when updating from version 24.05.X to version 24.11.4. In particular, the error message is the following:

Reason=gres/gpu GRES autodetected core affinity 48-63 on node dgx001 doesn't match socket boundaries. (Socket 0 is cores 0-63). Consider setting SlurmdParameters=l3cache_as_socket (recommended) or override this by manually specifying core affinity in gres.conf. [slurm@2025-04-23T22:20:20]

It is thrown by the _check_core_range_matches_sock function in src/interfaces/gres.c, which was introduced as part of the last minor update. We looked at the code: the offending function checks that the affinity provided by NVML is aligned with the socket boundaries. We believe this alignment is unlikely to hold on several modern hardware topologies. For instance, a DGX A100 has two sockets with 64 cores each and multi-threading enabled. The 16 L3 caches of each socket are shared among groups of 4 cores. We configured the system (as per default) with eight NUMA regions, four per socket. The eight GPUs are bound to 16 cores each (32 hardware threads), as reported by the nvidia-smi utility, according to this table:

| GPU | Physical cores | NUMA |
|------|-----------------|------|
| GPU0 | 48-63,176-191 | 3 |
| GPU1 | 48-63,176-191 | 3 |
| GPU2 | 16-31,144-159 | 1 |
| GPU3 | 16-31,144-159 | 1 |
| GPU4 | 112-127,240-255 | 7 |
| GPU5 | 112-127,240-255 | 7 |
| GPU6 | 80-95,208-223 | 5 |
| GPU7 | 80-95,208-223 | 5 |

As can be seen, the condition you are checking:

    int first = i * rebuild_topo->cores_per_sock;
    int last = (i + 1) * rebuild_topo->cores_per_sock;
    /* bit_set_count_range
     * Count the number of bits set in a range of a bitstring.
     *   b (IN)     bitstring to check
     *   start (IN) first bit to check
     *   end (IN)   last bit to check+1
     * RETURN count of set bits
     */
    int core_cnt = bit_set_count_range(tmp_bitmap, first, last);
    if (core_cnt && (core_cnt != rebuild_topo->cores_per_sock)) {....}

will always be met, making the check fail, because core_cnt will always differ from cores_per_sock; with our current configuration we expect it to always be lower (16 != 64).

Regarding the suggested workaround, setting the l3cache_as_socket flag seems unreasonable. At a glance the flag applies to the whole Slurm cluster, and if it does what the name suggests it changes the definition of a socket from the package (as per hwloc) to an "L3 cache region", which will confuse users. We could not test it, but in this specific case we expect it to split each socket into 16 fake sockets of 4 CPUs each (as suggested by the lscpu output reported below). In that case we expect the condition above to evaluate true again (16 != 4), still triggering the error.
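A back-of-the-envelope illustration of why the check fires on this topology. This is a sketch only, using the GPU0 numbers from the table above and plain shell arithmetic instead of Slurm's bitstring API:

```bash
#!/bin/bash
# Socket 0 spans cores 0-63 on the DGX A100, but NVML reports that GPU0 is
# local to physical cores 48-63 only.
cores_per_sock=64
affinity_first=48
affinity_last=63

# Equivalent of bit_set_count_range() over socket 0's core range: the number
# of GPU-affine cores falling inside cores 0-63.
core_cnt=$(( affinity_last - affinity_first + 1 ))   # = 16

# Mirrors the condition quoted above: non-zero but not a whole socket => error.
if (( core_cnt != 0 && core_cnt != cores_per_sock )); then
    echo "core_cnt=$core_cnt != cores_per_sock=$cores_per_sock -> node drained with INVALID_REG"
fi
```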
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       43 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Vendor ID:           AuthenticAMD
Model name:          AMD EPYC 7742 64-Core Processor
CPU family:          23
Model:               49
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
Caches (sum of all):
  L1d:               4 MiB (128 instances)
  L1i:               4 MiB (128 instances)
  L2:                64 MiB (128 instances)
  L3:                512 MiB (32 instances)
NUMA:
  NUMA node(s):      8
  NUMA node0 CPU(s): 0-15,128-143
  NUMA node1 CPU(s): 16-31,144-159
  NUMA node2 CPU(s): 32-47,160-175
  NUMA node3 CPU(s): 48-63,176-191
  NUMA node4 CPU(s): 64-79,192-207
  NUMA node5 CPU(s): 80-95,208-223
  NUMA node6 CPU(s): 96-111,224-239
  NUMA node7 CPU(s): 112-127,240-255

For now, we rolled out the latest update by just patching the function on the slurmctld server so that it returns success regardless of the input, but we would like this to be addressed correctly.

Hello Tal,

The fix for cgroup/v1 is in progress. I cannot give a date for when it will land, but we are working on it. If you can, I would suggest migrating to cgroup/v2, as some Slurm features in upcoming versions will only be available with cgroup/v2.

Niccolo, would you please open a separate ticket for that? This way we can keep the issues separated from one another. Regards.

Hi all,

We have also fixed this issue for cgroup/v1 in this commit: 7e9871de54. It is available in current master's HEAD (thus in the 25.05.0 release). I'm closing this ticket as resolved/fixed.

Thanks @schedmd!

Dear team,

I would like to have this ticket reopened, please. We are using slurm 24.11.5 and we still have the issue, or a different one. "scontrol reconfigure gpu00X" doesn't trigger the issue anymore, which is very good, but I have a GPU node that I'm unable to resume.
(baobab)-[root@gpu030 ~]$ slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: Gres Name=gpu Type=nvidia_a100-pcie-40gb Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=48-63 CoreCnt=64 Links=-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=nvidia_a100-pcie-40gb Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=32-47 CoreCnt=64 Links=0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=nvidia_a100-pcie-40gb Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=16-31 CoreCnt=64 Links=0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=nvidia_a100-pcie-40gb Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-15 CoreCnt=64 Links=0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=VramPerGpu Type=(null) Count=42949672960 ID=3033812246 Links=(null) Flags=CountOnly

(baobab)-[root@gpu030 ~]$ scontrol show node gpu030
NodeName=gpu030 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=0 CPUEfctv=62 CPUTot=64 CPULoad=0.84
   AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_0,COMPUTE_TYPE_AMPERE,DOUBLE_PRECISION_GPU,COMPUTE_MODEL_A100_40G
   ActiveFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_0,COMPUTE_TYPE_AMPERE,DOUBLE_PRECISION_GPU,COMPUTE_MODEL_A100_40G
   Gres=gpu:1,VramPerGpu:no_consume:40G
   NodeAddr=gpu030 NodeHostName=gpu030 Version=24.11.5
   OS=Linux 5.14.0-503.40.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 30 17:38:54 UTC 2025
   RealMemory=256000 AllocMem=0 FreeMem=252650 Sockets=1 Boards=1
   CoreSpecCount=2 CPUSpecList=62-63
   State=IDLE+DRAIN+INVALID_REG ThreadsPerCore=1 TmpDisk=1500000 Weight=60 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu,private-kruse-gpu
   BootTime=2025-06-10T08:54:00 SlurmdStartTime=2025-06-25T15:12:15
   LastBusyTime=2025-06-25T15:12:15 ResumeAfterTime=None
   CfgTRES=cpu=62,mem=250G,billing=148,gres/gpu=4,gres/gpu:nvidia_a100-pcie-40gb=4
   AllocTRES=
   CurrentWatts=0 AveWatts=0
   Reason=gres/gpu GRES autodetected core affinity 48-63 on node gpu030 doesn't match socket boundaries. (Socket 0 is cores 0-63). Consider setting SlurmdParameters=l3cache_as_socket (recommended) or override this by manually specifying core affinity in gres.conf. [slurm@2025-06-06T12:15:17]

(baobab)-[root@gpu030 ~]$ scontrol update node=gpu030 state=resume
slurm_update error: Invalid node state specified

(baobab)-[root@gpu030 ~]$ rpm -qa | grep slurm
slurm-24.11.5-1.unige.el9.x86_64
slurm-contribs-24.11.5-1.unige.el9.x86_64
slurm-libpmi-24.11.5-1.unige.el9.x86_64
slurm-perlapi-24.11.5-1.unige.el9.x86_64
slurm-example-configs-24.11.5-1.unige.el9.x86_64
slurm-slurmd-24.11.5-1.unige.el9.x86_64
slurm-pam_slurm-24.11.5-1.unige.el9.x86_64

Log in slurmctld:

[2025-06-25T15:16:13.687] Invalid node state transition requested for node gpu030 from=INVAL to=RESUME
[2025-06-25T15:16:13.687] _slurm_rpc_update_node for gpu030: Invalid node state specified

Hi Yann,

The fix for cgroup/v1 is in slurm 25.05, not 24.11.5, so I guess you are using cgroup/v2. If that is the case, then please also attach your gres.conf. Thanks.

I think we are using cgroup/v2. We didn't set CgroupPlugin, but by default it is "autodetect". To be sure, I've now set it explicitly to cgroup/v2 on the slurmd side. Extract of cgroup.conf: [...]
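For anyone unsure which cgroup hierarchy a node is actually running, a quick check (a sketch; the config location assumes Slurm's files live under /etc/slurm, adjust to your install):

```bash
# "cgroup2fs" means the unified cgroup/v2 hierarchy; "tmpfs" means legacy cgroup/v1.
stat -fc %T /sys/fs/cgroup

# What Slurm is configured to use; CgroupPlugin defaults to autodetect if unset.
grep -i '^CgroupPlugin' /etc/slurm/cgroup.conf
```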
gres.conf (extract):

NodeName=gpu009 Name=gpu AutoDetect=nvml
NodeName=gpu010 Name=gpu AutoDetect=nvml
NodeName=gpu011 Name=gpu AutoDetect=nvml File=/dev/nvidia[0-1]
NodeName=gpu012 Name=gpu AutoDetect=nvml
NodeName=gpu013 Name=gpu AutoDetect=nvml
NodeName=gpu014 Name=gpu AutoDetect=nvml
NodeName=gpu015 Name=gpu AutoDetect=nvml
NodeName=gpu016 Name=gpu AutoDetect=nvml
NodeName=gpu017 Name=gpu AutoDetect=nvml
NodeName=gpu018 Name=gpu AutoDetect=nvml
NodeName=gpu019 Name=gpu AutoDetect=nvml
NodeName=gpu020 Name=gpu AutoDetect=nvml
NodeName=gpu021 Name=gpu AutoDetect=nvml
NodeName=gpu022 Name=gpu AutoDetect=nvml
NodeName=gpu023 Name=gpu AutoDetect=nvml
NodeName=gpu024 Name=gpu AutoDetect=nvml
NodeName=gpu025 Name=gpu AutoDetect=nvml
NodeName=gpu026 Name=gpu AutoDetect=nvml
NodeName=gpu027 Name=gpu AutoDetect=nvml File=/dev/nvidia0
NodeName=gpu027 Name=gpu AutoDetect=nvml File=/dev/nvidia[1-2]
NodeName=gpu028 Name=gpu AutoDetect=nvml
NodeName=gpu029 Name=gpu AutoDetect=nvml
NodeName=gpu030 Name=gpu AutoDetect=nvml File=/dev/nvidia[0-3]
NodeName=gpu031 Name=gpu AutoDetect=nvml
[...]

Maybe worth noticing: we have the issue only on gpu030 and gpu011. I've tried removing the File directive from gpu030 and restarting slurmctld and slurmd: same issue. I've tried "scontrol reconfigure gpu027" and that node stays idle. Another interesting thing I just figured out: gpu027 and gpu030 don't have the same number of NUMA nodes even though they carry the same GPUs. gpu030 has 4 NUMA nodes and gpu027 has one. I don't know why; maybe this is something that can be changed in the BIOS, as both nodes are strictly identical regarding the OS.

(baobab)-[root@login1 ~]$ diff -u <(ssh gpu030 lscpu) <(ssh gpu027 lscpu)
--- /dev/fd/63 2025-06-25 22:18:25.661150407 +0200
+++ /dev/fd/62 2025-06-25 22:18:25.661150407 +0200
@@ -18,18 +18,15 @@
 CPU(s) scaling MHz:  100%
 CPU max MHz:         2250.0000
 CPU min MHz:         1500.0000
-BogoMIPS:            4499.95
+BogoMIPS:            4499.53
 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
 Virtualization:      AMD-V
 L1d cache:           2 MiB (64 instances)
 L1i cache:           2 MiB (64 instances)
 L2 cache:            32 MiB (64 instances)
 L3 cache:            256 MiB (16 instances)
-NUMA node(s):        4
-NUMA node0 CPU(s):   0-15
-NUMA node1 CPU(s):   16-31
-NUMA node2 CPU(s):   32-47
-NUMA node3 CPU(s):   48-63
+NUMA node(s):        1
+NUMA node0 CPU(s):   0-63
 Vulnerability Gather data sampling: Not affected
 Vulnerability Itlb multihit:        Not affected
 Vulnerability L1tf:                 Not affected

I've modified the NPS parameter from 4 to 1 and I can now resume the node. By the way, do you have a performance guide for such things (the NPS parameter and numa_node_as_socket)?
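The drain reason itself offers a second workaround besides l3cache_as_socket: specifying the core affinity by hand in gres.conf. Below is a hypothetical override for gpu030 that reuses the Cores= values printed by slurmd -G above; whether these lines can coexist with AutoDetect=nvml on that node, or need autodetection disabled for it, is worth verifying against the gres.conf man page.

```
# Hypothetical gres.conf entries for gpu030. The Cores= values are the ones
# NVML itself reported via "slurmd -G", written out explicitly so the node's
# registration no longer depends on autodetected affinity matching socket
# boundaries.
NodeName=gpu030 Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia0 Cores=48-63
NodeName=gpu030 Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia1 Cores=32-47
NodeName=gpu030 Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia2 Cores=16-31
NodeName=gpu030 Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia3 Cores=0-15
```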
As we have a lot of AMD compute nodes, we would like to try the numa_node_as_socket parameter, but as this is a global flag we'll try that during our next maintenance.

Hello, anybody out there? I still have the issue on gpu011, and on this node it isn't possible to set NPS=1; anyway, that probably isn't what we want to do. Please reopen the issue.

Sorry, I wasn't aware I could reopen the issue myself; this is done now.

Hello Yann,

Sorry for not answering earlier: I saw the ticket as closed and that is why I did not see your last update. By NPS do you mean MPS, or are you referring to something different? Let me see whether there is a way to avoid specifying numa_node_as_socket globally. Normally what "slurmd -C" shows on the command line is what we recommend putting in the node definition, but in the AMD case, as you have seen, things change; let me see if we have something on that. Regards.

By NPS I mean NPS :) => Nodes Per Socket (NPS).

Hi Yann,

I have been looking and only found these links:

https://slurm.schedmd.com/mc_support.html
https://slurm.schedmd.com/cpu_management.html

What we generally recommend is to execute "slurmd -C" on the compute node and use that to populate slurm.conf. Would you mind sending me the output of lstopo-no-graphics, so that I can add a note to those pages about when to include this type of parameter (numa_node_as_socket)? In your case you need it for sure, as all the AMD EPYC systems we have seen so far need it. At this moment there is no way to set this option for a specific node. Regards.

Here is the output of lstopo-no-graphics:

Machine (252GB total)
  Package L#0
    Die L#0
      NUMANode L#0 (P#0 31GB)
      L3 L#0 (8192KB)
        L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
        L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
        L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
        L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
      L3 L#1 (8192KB)
        L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
        L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
        L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
        L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
      HostBridge
        PCIBridge
          PCI 01:00.0 (Ethernet)
            Net "enp1s0f0"
          PCI 01:00.1 (Ethernet)
            Net "enp1s0f1"
        PCIBridge
          PCIBridge
            PCI 04:00.0 (VGA)
          PCIBridge
            PCI 06:00.2 (SATA)
              Block(Disk) "sda"
    Die L#1
      NUMANode L#1 (P#1 31GB)
      L3 L#2 (8192KB)
        L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
        L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
        L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
        L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
      L3 L#3 (8192KB)
        L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
        L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
        L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
        L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
      HostBridge
        PCIBridge
          PCI 12:00.2 (SATA)
    Die L#2
      NUMANode L#2 (P#2 31GB)
      L3 L#4 (8192KB)
        L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
        L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
        L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
        L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
      L3 L#5 (8192KB)
        L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
        L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
        L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
        L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
      HostBridge
        PCIBridge
          PCI 21:00.0 (VGA)
    Die L#3
      NUMANode L#3 (P#3 31GB)
      L3 L#6 (8192KB)
        L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
        L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
        L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
        L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
      L3 L#7 (8192KB)
        L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28)
        L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29)
        L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30)
        L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31)
  Package L#1
    Die L#4
      NUMANode L#4 (P#4 31GB)
      L3 L#8 (8192KB)
        L2 L#32 (512KB) + L1d L#32 (32KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32)
        L2 L#33 (512KB) + L1d L#33 (32KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33)
        L2 L#34 (512KB) + L1d L#34 (32KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34)
        L2 L#35 (512KB) + L1d L#35 (32KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35)
      L3 L#9 (8192KB)
        L2 L#36 (512KB) + L1d L#36 (32KB) + L1i L#36 (64KB) + Core L#36 + PU L#36 (P#36)
        L2 L#37 (512KB) + L1d L#37 (32KB) + L1i L#37 (64KB) + Core L#37 + PU L#37 (P#37)
        L2 L#38 (512KB) + L1d L#38 (32KB) + L1i L#38 (64KB) + Core L#38 + PU L#38 (P#38)
        L2 L#39 (512KB) + L1d L#39 (32KB) + L1i L#39 (64KB) + Core L#39 + PU L#39 (P#39)
      HostBridge
        PCIBridge
          PCI 42:00.2 (SATA)
    Die L#5
      NUMANode L#5 (P#5 31GB)
      L3 L#10 (8192KB)
        L2 L#40 (512KB) + L1d L#40 (32KB) + L1i L#40 (64KB) + Core L#40 + PU L#40 (P#40)
        L2 L#41 (512KB) + L1d L#41 (32KB) + L1i L#41 (64KB) + Core L#41 + PU L#41 (P#41)
        L2 L#42 (512KB) + L1d L#42 (32KB) + L1i L#42 (64KB) + Core L#42 + PU L#42 (P#42)
        L2 L#43 (512KB) + L1d L#43 (32KB) + L1i L#43 (64KB) + Core L#43 + PU L#43 (P#43)
      L3 L#11 (8192KB)
        L2 L#44 (512KB) + L1d L#44 (32KB) + L1i L#44 (64KB) + Core L#44 + PU L#44 (P#44)
        L2 L#45 (512KB) + L1d L#45 (32KB) + L1i L#45 (64KB) + Core L#45 + PU L#45 (P#45)
        L2 L#46 (512KB) + L1d L#46 (32KB) + L1i L#46 (64KB) + Core L#46 + PU L#46 (P#46)
        L2 L#47 (512KB) + L1d L#47 (32KB) + L1i L#47 (64KB) + Core L#47 + PU L#47 (P#47)
    Die L#6
      NUMANode L#6 (P#6 31GB)
      L3 L#12 (8192KB)
        L2 L#48 (512KB) + L1d L#48 (32KB) + L1i L#48 (64KB) + Core L#48 + PU L#48 (P#48)
        L2 L#49 (512KB) + L1d L#49 (32KB) + L1i L#49 (64KB) + Core L#49 + PU L#49 (P#49)
        L2 L#50 (512KB) + L1d L#50 (32KB) + L1i L#50 (64KB) + Core L#50 + PU L#50 (P#50)
        L2 L#51 (512KB) + L1d L#51 (32KB) + L1i L#51 (64KB) + Core L#51 + PU L#51 (P#51)
      L3 L#13 (8192KB)
        L2 L#52 (512KB) + L1d L#52 (32KB) + L1i L#52 (64KB) + Core L#52 + PU L#52 (P#52)
        L2 L#53 (512KB) + L1d L#53 (32KB) + L1i L#53 (64KB) + Core L#53 + PU L#53 (P#53)
        L2 L#54 (512KB) + L1d L#54 (32KB) + L1i L#54 (64KB) + Core L#54 + PU L#54 (P#54)
        L2 L#55 (512KB) + L1d L#55 (32KB) + L1i L#55 (64KB) + Core L#55 + PU L#55 (P#55)
      HostBridge
        PCIBridge
          PCI 61:00.0 (InfiniBand)
            Net "ibp97s0"
            OpenFabrics "mlx5_0"
    Die L#7
      NUMANode L#7 (P#7 31GB)
      L3 L#14 (8192KB)
        L2 L#56 (512KB) + L1d L#56 (32KB) + L1i L#56 (64KB) + Core L#56 + PU L#56 (P#56)
        L2 L#57 (512KB) + L1d L#57 (32KB) + L1i L#57 (64KB) + Core L#57 + PU L#57 (P#57)
        L2 L#58 (512KB) + L1d L#58 (32KB) + L1i L#58 (64KB) + Core L#58 + PU L#58 (P#58)
        L2 L#59 (512KB) + L1d L#59 (32KB) + L1i L#59 (64KB) + Core L#59 + PU L#59 (P#59)
      L3 L#15 (8192KB)
        L2 L#60 (512KB) + L1d L#60 (32KB) + L1i L#60 (64KB) + Core L#60 + PU L#60 (P#60)
        L2 L#61 (512KB) + L1d L#61 (32KB) + L1i L#61 (64KB) + Core L#61 + PU L#61 (P#61)
        L2 L#62 (512KB) + L1d L#62 (32KB) + L1i L#62 (64KB) + Core L#62 + PU L#62 (P#62)
        L2 L#63 (512KB) + L1d L#63 (32KB) + L1i L#63 (64KB) + Core L#63 + PU L#63 (P#63)
      HostBridge
        PCIBridge
          PCI 71:00.0 (VGA)

This is the output of slurmd -C:

(baobab)-[root@gpu011 ~]$ slurmd -C
NodeName=gpu011 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=257796 Gres=gpu:nvidia_geforce_rtx_2080_ti:2
Found gpu:nvidia_geforce_rtx_2080_ti:2 with Autodetect=nvml (Substring of gpu name may be used instead)
UpTime=0-22:38:18

Is it possible to tell Slurm that the compute node has 8 sockets and 8 cores per socket? I've tried to modify the node definition, but this is what appears in slurmd.log when I start the node:

Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=8:32(hw)

If that isn't possible, it seems my next step would be to enable numa_node_as_socket globally. I've seen this Slurm option: https://slurm.schedmd.com/slurm.conf.html#OPT_Ignore_NUMA I'm not sure I understand it correctly: does Slurm consider NUMA nodes as sockets for some hardware by default, and can we override that with this parameter? If yes, to which hardware does it apply?

Hi Yann,

This option only applies when your hwloc is older than 2.0; otherwise it cannot be used. In most modern Linux distributions the included hwloc is already newer than 2.0, so there is a good chance this does not apply to you. Before hwloc 2 this was the only way to get the behaviour that the l3cache_as_socket or numa_node_as_socket SlurmdParameters provide now. Thanks for sharing your lstopo information; I will use it to improve the pages I linked above, to better explain on which hardware this parameter is needed. Regards
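For reference, a minimal slurm.conf sketch of the global flag discussed above. This is illustrative only: since SlurmdParameters is cluster-wide, it is best applied during a maintenance window, and the node definitions then need to match what slurmd -C reports once the flag is active.

```
# slurm.conf (cluster-wide): build node topology treating each NUMA node as a
# socket. On gpu011 (2 packages x 4 NUMA nodes x 8 cores) this should yield
# the 8-sockets-by-8-cores layout asked about above.
SlurmdParameters=numa_node_as_socket

# Hypothetical matching node definition for gpu011 under that flag:
NodeName=gpu011 Sockets=8 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=257796 Gres=gpu:nvidia_geforce_rtx_2080_ti:2
```

As a side note, "lstopo-no-graphics --version" prints the installed hwloc version, which is an easy way to confirm whether the Ignore_NUMA path mentioned above could even apply.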