Dear team,

From time to time we have GPU nodes going into DRAIN state with the reason "gres/gpu GRES core specification 16-23 doesn't match socket boundaries. (Socket 0 is cores 0-32)". The GPU node was working fine and it went into drain immediately after we restarted slurmctld. We have the same issue on one other GPU node, and no issue on any of the remaining GPU nodes. There is nothing relevant in the slurmd logs.

On slurmctld we have these entries related to the issue:

[2025-04-02T15:47:44.078] error: _foreach_rebuild_topo: gres/gpu GRES core specification 16-23 doesn't match socket boundaries. (Socket 0 is cores 0-32)
[2025-04-02T15:47:44.078] error: Setting node gpu011 state to INVAL with reason:gres/gpu GRES core specification 16-23 doesn't match socket boundaries. (Socket 0 is cores 0-32)
[2025-04-02T15:47:44.078] drain_nodes: node gpu011 state set to DRAIN
[2025-04-02T15:47:44.078] error: _slurm_rpc_node_registration node=gpu011: Invalid argument

This is the output of slurmd -G -D:

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 2 GPU system device(s) detected
slurmd: Gres Name=gpu Type=nvidia_geforce_rtx_2080_ti Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=16-23 CoreCnt=64 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=nvidia_geforce_rtx_2080_ti Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=56-63 CoreCnt=64 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=VramPerGpu Type=(null) Count=11811160064 ID=3033812246 Links=(null) Flags=CountOnly

This is the output of lscpu:

(baobab)-[root@gpu011 ~]$ lscpu
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           43 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  64
On-line CPU(s) list:     0-63
Vendor ID:               AuthenticAMD
BIOS Vendor ID:          Advanced Micro Devices, Inc.
Model name:              AMD EPYC 7601 32-Core Processor
BIOS Model name:         AMD EPYC 7601 32-Core Processor
CPU family:              23
Model:                   1
Thread(s) per core:      1
Core(s) per socket:      32
Socket(s):               2
Stepping:                2
Frequency boost:         enabled
CPU(s) scaling MHz:      100%
CPU max MHz:             2200.0000
CPU min MHz:             1200.0000
BogoMIPS:                4399.73
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   2 MiB (64 instances)
  L1i:                   4 MiB (64 instances)
  L2:                    32 MiB (64 instances)
  L3:                    128 MiB (16 instances)
NUMA:
  NUMA node(s):          8
  NUMA node0 CPU(s):     0-7
  NUMA node1 CPU(s):     8-15
  NUMA node2 CPU(s):     16-23
  NUMA node3 CPU(s):     24-31
  NUMA node4 CPU(s):     32-39
  NUMA node5 CPU(s):     40-47
  NUMA node6 CPU(s):     48-55
  NUMA node7 CPU(s):     56-63
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Reg file data sampling: Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT disabled
  Spec rstack overflow:  Mitigation; SMT disabled
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

We can't resume the GPU node.
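[Editor's note: for context, the error above comes from a consistency test between the core ranges NVML reports for each GPU and the node's socket layout. The snippet below is a minimal, hypothetical Python sketch of that kind of check, not Slurm's actual implementation; the core ranges and the 2 x 32-core layout are taken from the output above.]

```python
# Hypothetical illustration of the kind of check behind the
# "GRES core specification ... doesn't match socket boundaries" error.
# This is NOT Slurm source code, just a sketch of the idea.

def parse_range(spec):
    """Expand a core range like '16-23' into a set of core IDs."""
    start, _, end = spec.partition("-")
    return set(range(int(start), int(end or start) + 1))

def socket_of(core, cores_per_socket):
    """Return the socket a physical core belongs to."""
    return core // cores_per_socket

def check_gres_cores(gres_cores, cores_per_socket):
    """A GRES core specification is expected to stay within one socket."""
    sockets = {socket_of(c, cores_per_socket)
               for c in parse_range(gres_cores)}
    if len(sockets) != 1:
        raise ValueError(f"gres cores {gres_cores} span sockets {sorted(sockets)}")
    return sockets.pop()

# Values from gpu011: 2 sockets x 32 cores, GPUs bound to cores 16-23 and 56-63.
for spec in ("16-23", "56-63"):
    print(spec, "-> socket", check_gres_cores(spec, cores_per_socket=32))
```

With all 64 cores visible, both GPU core ranges land inside a single socket, which is consistent with the node registering cleanly after a fresh slurmd restart.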
I'm able to resume the node if I restart slurmd.
Hi Yann,

Does this happen when you issue an "scontrol reconfigure", and on nodes where you have either CpuSpecList or CoreSpecCount set?

Regards.
I tried:

(baobab)-[root@admin1 users] (master)$ scontrol reconfigure gpu012
(baobab)-[root@admin1 users] (master)$ scontrol show node gpu012
NodeName=gpu012 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=10 CPUEfctv=22 CPUTot=24 CPULoad=1.89
   AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   ActiveFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   Gres=gpu:nvidia_geforce_rtx_2080_ti:8(S:0-1),VramPerGpu:no_consume:11G
   NodeAddr=gpu012 NodeHostName=gpu012 Version=24.11.1
   OS=Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024
   RealMemory=257000 AllocMem=122880 FreeMem=182572 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=11,23
   State=MIXED+DRAIN+INVALID_REG ThreadsPerCore=1 TmpDisk=300000 Weight=30 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu,private-dpnc-gpu
   BootTime=2025-03-24T17:21:12 SlurmdStartTime=2025-04-03T15:48:07
   LastBusyTime=2025-04-03T15:48:07 ResumeAfterTime=None
   CfgTRES=cpu=22,mem=257000M,billing=108,gres/gpu=8,gres/gpu:nvidia_geforce_rtx_2080_ti=8
   AllocTRES=cpu=10,mem=120G,gres/gpu=1,gres/gpu:nvidia_geforce_rtx_2080_ti=1
   CurrentWatts=0 AveWatts=0
   Reason=gres/gpu GRES core specification 0-10 doesn't match socket boundaries. (Socket 0 is cores 0-12) [slurm@2025-04-03T15:48:07]

So yes this does trigger t
Sorry, wrong copy-paste. This does trigger the issue, and yes, we have CoreSpecCount=2 on each GPU node.
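[Editor's note: a quick way to spot the affected nodes is to list every node that has core specialization configured, together with its state. The snippet below is only a convenience sketch around "scontrol -o show node", not an official tool; it assumes the standard key=value one-liner format, and values containing spaces (OS=, Reason=) are deliberately not parsed fully.]

```python
# Convenience sketch: list nodes with core specialization configured,
# along with their state, by parsing "scontrol -o show node" output.
import subprocess

def iter_nodes():
    """Yield one dict of key=value fields per node line."""
    out = subprocess.run(
        ["scontrol", "-o", "show", "node"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if not line.strip():
            continue
        # Values with embedded spaces (e.g. OS= or Reason=) will not parse
        # fully, but the single-token fields used below are fine.
        yield dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

for node in iter_nodes():
    if int(node.get("CoreSpecCount", "0")) > 0:
        print(node.get("NodeName"), node.get("State"),
              "CoreSpecCount=" + node["CoreSpecCount"])
```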
Good, we are already working on a solution for that. What is happening here is the following:

- slurmd starts normally, for example after a "systemctl restart slurmd".
- slurmd gets the GPU-to-core relationship information based on the CPUs it sees, which at this point in the initialization is all of them.
- slurmd applies the CoreSpecCount restriction to itself, or more precisely to the cgroup it is in.
- slurmd receives an "scontrol reconfigure" event from the controller and spawns a copy of itself in the same cgroup it is in.
- The new slurmd tries to get the socket-to-core information, as its parent did, but unlike its parent it is not in a fresh cgroup: its CPU visibility has already been limited by the parent. The information reported by NVML, which can see all the cores, is therefore not consistent with what the new slurmd sees (the view limited by CoreSpecCount), and the new slurmd process fails.

I will keep you updated on the progress of the solution; at the moment there is a patch in the review phase.

Regards.
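[Editor's note: to illustrate the mechanism described above, the sketch below compares the CPUs the kernel exposes with the CPUs allowed by the cgroup the current process lives in. It assumes a cgroup v2 system and that the cpuset controller is enabled for that cgroup; it is a diagnostic sketch, not part of Slurm. A process re-spawned inside an already CoreSpec-restricted cgroup will report a smaller effective cpuset than the hardware has, while NVML keeps reporting GPU affinity over all physical cores.]

```python
# Diagnostic sketch, assuming cgroup v2: compare the machine's online CPUs
# with the effective cpuset of the cgroup this process runs in.
import os

def online_cpus():
    # Full list of online CPUs as the kernel sees them, e.g. "0-63".
    with open("/sys/devices/system/cpu/online") as f:
        return f.read().strip()

def cgroup_effective_cpus():
    # On cgroup v2, /proc/self/cgroup contains a single "0::<path>" line.
    with open("/proc/self/cgroup") as f:
        cg_path = f.read().strip().split("::", 1)[1]
    # cpuset.cpus.effective is only present where the cpuset controller
    # is enabled for the cgroup.
    with open(f"/sys/fs/cgroup{cg_path}/cpuset.cpus.effective") as f:
        return f.read().strip()

if __name__ == "__main__":
    print("online CPUs:            ", online_cpus())
    print("cgroup effective cpuset:", cgroup_effective_cpus())
    print("sched_getaffinity size: ", len(os.sched_getaffinity(0)))
```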
Hello Yann,

We have included the fix for cgroup/v2 for this issue in the following commits:

1e5795ba - cgroup/v2 - _unset_cpuset_mem_limits do not reset untouched limits
14d789f7 - cgroup/v2 - xfree a missing field in common_cgroup_ns_destroy
440a22f3 - cgroup/v2 - Store the init cgroup path in the cgroup namespace
cb10bc46 - cgroup/v2 - Fix slurmd reconfig not working when removing CoreSpecLimits
71d3ab39 - cgroup/v2 - Add log flag to reset memory.max limits
47bd81ab - cgroup/v2 - Fix slurmd reconfig not resetting CoreSpec limits with systemd
9a173446 - Merge branch 'cherrypick-995-24.11' into 'slurm-24.11'

These will be shipped in the next Slurm 24.11 release. In case you are using cgroup/v1, we are still working on a fix for it.

Regards.