Ticket 22498

Summary: GRES cores don't match socket boundaries
Product: Slurm Reporter: Yann <yann.sagon>
Component: GPU    Assignee: Oriol Vilarrubi <jvilarru>
Status: OPEN --- QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: jvilarru, ricard, shai.haim, tal.friedman
Version: 24.11.1   
Hardware: Linux   
OS: Linux   
Site: Université de Genève

Description Yann 2025-04-03 05:51:53 MDT
Dear team,

From time to time we have GPU nodes going into the DRAIN state with the reason "gres/gpu GRES core specification 16-23 doesn't match socket boundaries. (Socket 0 is cores 0-32)".

The GPU node was working fine, and it went into drain immediately when we restarted slurmctld. We have the same issue on another GPU node, and no issue on any of the other GPU nodes.

Nothing relevant appears in the slurmd logs.

On slurmctld we have these entries related to the issue:

[2025-04-02T15:47:44.078] error: _foreach_rebuild_topo: gres/gpu GRES core specification 16-23 doesn't match socket boundaries. (Socket 0 is cores 0-32)
[2025-04-02T15:47:44.078] error: Setting node gpu011 state to INVAL with reason:gres/gpu GRES core specification 16-23 doesn't match socket boundaries. (Socket 0 is cores 0-32)
[2025-04-02T15:47:44.078] drain_nodes: node gpu011 state set to DRAIN
[2025-04-02T15:47:44.078] error: _slurm_rpc_node_registration node=gpu011: Invalid argument
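
For reference, a quick way to list the drained nodes together with their reasons (a generic sketch, not specific to this site):

sinfo -R                                     # drained/down nodes with their reasons
scontrol show node gpu011 | grep -i Reason   # a single node, using gpu011 as an example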

This is the output of slurmd -G -D:

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 2 GPU system device(s) detected
slurmd: Gres Name=gpu Type=nvidia_geforce_rtx_2080_ti Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=16-23 CoreCnt=64 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=nvidia_geforce_rtx_2080_ti Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=56-63 CoreCnt=64 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=VramPerGpu Type=(null) Count=11811160064 ID=3033812246 Links=(null) Flags=CountOnly
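
The Cores= ranges above come from the CPU-affinity information NVML reports for each GPU. One way to cross-check them outside of Slurm (a generic sketch; the exact column names depend on the driver version):

nvidia-smi topo -m    # prints the CPU/NUMA affinity of each GPU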



This is the output of lscpu:


(baobab)-[root@gpu011 ~]$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          43 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   64
  On-line CPU(s) list:    0-63
Vendor ID:                AuthenticAMD
  BIOS Vendor ID:         Advanced Micro Devices, Inc.
  Model name:             AMD EPYC 7601 32-Core Processor
    BIOS Model name:      AMD EPYC 7601 32-Core Processor
    CPU family:           23
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   32
    Socket(s):            2
    Stepping:             2
    Frequency boost:      enabled
    CPU(s) scaling MHz:   100%
    CPU max MHz:          2200.0000
    CPU min MHz:          1200.0000
    BogoMIPS:             4399.73
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm
                          aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext
                           perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm
                          _lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features:
  Virtualization:         AMD-V
Caches (sum of all):
  L1d:                    2 MiB (64 instances)
  L1i:                    4 MiB (64 instances)
  L2:                     32 MiB (64 instances)
  L3:                     128 MiB (16 instances)
NUMA:
  NUMA node(s):           8
  NUMA node0 CPU(s):      0-7
  NUMA node1 CPU(s):      8-15
  NUMA node2 CPU(s):      16-23
  NUMA node3 CPU(s):      24-31
  NUMA node4 CPU(s):      32-39
  NUMA node5 CPU(s):      40-47
  NUMA node6 CPU(s):      48-55
  NUMA node7 CPU(s):      56-63
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Mitigation; untrained return thunk; SMT disabled
  Spec rstack overflow:   Mitigation; SMT disabled
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected


We can't resume the GPU node.
Comment 1 Yann 2025-04-03 05:57:02 MDT
I'm able to resume the node if I restart slurmd.
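
For completeness, the workaround sequence looks roughly like this (a sketch, using gpu011 as an example node name):

systemctl restart slurmd                         # on the drained node itself
scontrol update nodename=gpu011 state=resume     # then clear the drain from the controller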
Comment 2 Oriol Vilarrubi 2025-04-03 07:42:57 MDT
Hi Yann,

Does this happen when you issue an scontrol reconfigure, and on nodes where you have either CpuSpecList or CoreSpecCount configured?

Regards.
Comment 4 Yann 2025-04-03 07:50:22 MDT
I tried:

(baobab)-[root@admin1 users] (master)$ scontrol reconfigure gpu012
(baobab)-[root@admin1 users] (master)$ scontrol show node gpu012
NodeName=gpu012 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=10 CPUEfctv=22 CPUTot=24 CPULoad=1.89
   AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   ActiveFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   Gres=gpu:nvidia_geforce_rtx_2080_ti:8(S:0-1),VramPerGpu:no_consume:11G
   NodeAddr=gpu012 NodeHostName=gpu012 Version=24.11.1
   OS=Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024
   RealMemory=257000 AllocMem=122880 FreeMem=182572 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=11,23
   State=MIXED+DRAIN+INVALID_REG ThreadsPerCore=1 TmpDisk=300000 Weight=30 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu,private-dpnc-gpu
   BootTime=2025-03-24T17:21:12 SlurmdStartTime=2025-04-03T15:48:07
   LastBusyTime=2025-04-03T15:48:07 ResumeAfterTime=None
   CfgTRES=cpu=22,mem=257000M,billing=108,gres/gpu=8,gres/gpu:nvidia_geforce_rtx_2080_ti=8
   AllocTRES=cpu=10,mem=120G,gres/gpu=1,gres/gpu:nvidia_geforce_rtx_2080_ti=1
   CurrentWatts=0 AveWatts=0

   Reason=gres/gpu GRES core specification 0-10 doesn't match socket boundaries. (Socket 0 is cores 0-12) [slurm@2025-04-03T15:48:07]

So yes this does trigger t
Comment 5 Yann 2025-04-03 07:51:25 MDT
Sorry, wrong copy-paste. This does trigger the issue, and yes, we have CoreSpecCount=2 on each GPU node.
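
For context, the node definition presumably looks something like the following (a hypothetical slurm.conf sketch reconstructed from the scontrol output above; the real line will differ):

NodeName=gpu012 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=257000 CoreSpecCount=2 Gres=gpu:nvidia_geforce_rtx_2080_ti:8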
Comment 6 Oriol Vilarrubi 2025-04-03 08:28:23 MDT
Good, we are already working on a solution for that. What is happening here is the following:

- slurmd starts normally, for example after a systemctl restart slurmd.
- slurmd gets the "GPU relationship to core" information based on the CPUs it can see, which at this point of initialization is all of them.
- slurmd applies the CoreSpecCount restriction to itself, or more precisely to the cgroup it is in.
- slurmd receives an "scontrol reconfigure" event from the controller and spawns a copy of itself in the same cgroup it is in.
- The new slurmd tries to gather the socket-to-core information just like its parent did, but unlike its parent it is not in a fresh cgroup: the CPU visibility of that cgroup has already been limited by the parent via CoreSpecCount. The core information reported by NVML, which can still see all the cores, is therefore inconsistent with what the new slurmd sees, and the new slurmd process fails (see the sketch below for one way to observe this).
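
One way to observe the limited CPU visibility described above (a sketch assuming cgroup/v2 and a systemd unit named slurmd.service; the exact cgroup path is an assumption and may differ):

cat /proc/$(pidof -s slurmd)/cgroup                                     # the cgroup slurmd is running in
cat /sys/fs/cgroup/system.slice/slurmd.service/cpuset.cpus.effective    # CPUs that cgroup may actually use
nproc --all                                                             # all CPUs present on the node, for comparison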

I will keep you updated on the progress of the solution; at the moment there is a patch in the review phase.

Regards.
Comment 7 Oriol Vilarrubi 2025-04-24 07:34:42 MDT
Hello Yann,

We have included the fix for cgroup/v2 for this issue in the following commits:

1e5795ba - cgroup/v2 - _unset_cpuset_mem_limits do not reset untouched limits
14d789f7 - cgroup/v2 - xfree a missing field in common_cgroup_ns_destroy
440a22f3 - cgroup/v2 - Store the init cgroup path in the cgroup namespace
cb10bc46 - cgroup/v2 - Fix slurmd reconfig not working when removing CoreSpecLimits
71d3ab39 - cgroup/v2 - Add log flag to reset memory.max limits
47bd81ab - cgroup/v2 - Fix slurmd reconfig not resetting CoreSpec limits with systemd
9a173446 - Merge branch 'cherrypick-995-24.11' into 'slurm-24.11'

These will be shipped in the next Slurm 24.11 release.

In case you are using cgroup/v1, we are still working on the fix for it.
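
If you are unsure which cgroup version a node is running, one quick check (a generic sketch; the cgroup.conf path is an assumption):

stat -fc %T /sys/fs/cgroup                   # cgroup2fs means cgroup/v2, tmpfs means cgroup/v1
grep CgroupPlugin /etc/slurm/cgroup.conf     # what Slurm itself is configured to use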

Regards.
Comment 8 Tal 2025-04-26 09:54:57 MDT
Hi,

We are experiencing a similar issue with cgroup v1 and, having seen this case, we would like to join the ticket.

Is there an expected timeline for a fix for cgroup v1?

It is currently blocking our upgrade to 24.11.x, so we would appreciate anything you could do to push for a solution.

Regards, 
Tal