Dear team,

From time to time we have GPU nodes going into DRAIN state with the reason "gres/gpu GRES core specification 16-23 doesn't match socket boundaries. (Socket 0 is cores 0-32)". The GPU node was working fine and it went into drain immediately after we restarted slurmctld. We have the same issue on one other GPU node, and no issue on any of the remaining GPU nodes. There is nothing relevant in the slurmd logs.

On slurmctld we have these entries related to the issue:

[2025-04-02T15:47:44.078] error: _foreach_rebuild_topo: gres/gpu GRES core specification 16-23 doesn't match socket boundaries. (Socket 0 is cores 0-32)
[2025-04-02T15:47:44.078] error: Setting node gpu011 state to INVAL with reason:gres/gpu GRES core specification 16-23 doesn't match socket boundaries. (Socket 0 is cores 0-32)
[2025-04-02T15:47:44.078] drain_nodes: node gpu011 state set to DRAIN
[2025-04-02T15:47:44.078] error: _slurm_rpc_node_registration node=gpu011: Invalid argument

This is the output of slurmd -G -D:

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 2 GPU system device(s) detected
slurmd: Gres Name=gpu Type=nvidia_geforce_rtx_2080_ti Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=16-23 CoreCnt=64 Links=-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=nvidia_geforce_rtx_2080_ti Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=56-63 CoreCnt=64 Links=0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=VramPerGpu Type=(null) Count=11811160064 ID=3033812246 Links=(null) Flags=CountOnly

This is the output of lscpu:

(baobab)-[root@gpu011 ~]$ lscpu
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           43 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  64
On-line CPU(s) list:     0-63
Vendor ID:               AuthenticAMD
BIOS Vendor ID:          Advanced Micro Devices, Inc.
Model name:              AMD EPYC 7601 32-Core Processor
BIOS Model name:         AMD EPYC 7601 32-Core Processor
CPU family:              23
Model:                   1
Thread(s) per core:      1
Core(s) per socket:      32
Socket(s):               2
Stepping:                2
Frequency boost:         enabled
CPU(s) scaling MHz:      100%
CPU max MHz:             2200.0000
CPU min MHz:             1200.0000
BogoMIPS:                4399.73
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   2 MiB (64 instances)
  L1i:                   4 MiB (64 instances)
  L2:                    32 MiB (64 instances)
  L3:                    128 MiB (16 instances)
NUMA:
  NUMA node(s):          8
  NUMA node0 CPU(s):     0-7
  NUMA node1 CPU(s):     8-15
  NUMA node2 CPU(s):     16-23
  NUMA node3 CPU(s):     24-31
  NUMA node4 CPU(s):     32-39
  NUMA node5 CPU(s):     40-47
  NUMA node6 CPU(s):     48-55
  NUMA node7 CPU(s):     56-63
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Reg file data sampling: Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT disabled
  Spec rstack overflow:  Mitigation; SMT disabled
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

We can't resume the GPU node.
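[Editor's note: for context, the error above comes from a consistency test between the core ranges NVML reports for each GPU and the node's socket layout. The snippet below is a minimal, hypothetical Python sketch of that kind of check, not Slurm's actual implementation; the core ranges and the 2 x 32-core layout are taken from the output above.]

```python
# Hypothetical illustration of the kind of check behind the
# "GRES core specification ... doesn't match socket boundaries" error.
# This is NOT Slurm source code, just a sketch of the idea.

def parse_range(spec):
    """Expand a core range like '16-23' into a set of core IDs."""
    start, _, end = spec.partition("-")
    return set(range(int(start), int(end or start) + 1))

def socket_of(core, cores_per_socket):
    """Return the socket a physical core belongs to."""
    return core // cores_per_socket

def check_gres_cores(gres_cores, cores_per_socket):
    """A GRES core specification is expected to stay within one socket."""
    sockets = {socket_of(c, cores_per_socket)
               for c in parse_range(gres_cores)}
    if len(sockets) != 1:
        raise ValueError(f"gres cores {gres_cores} span sockets {sorted(sockets)}")
    return sockets.pop()

# Values from gpu011: 2 sockets x 32 cores, GPUs bound to cores 16-23 and 56-63.
for spec in ("16-23", "56-63"):
    print(spec, "-> socket", check_gres_cores(spec, cores_per_socket=32))
```

With all 64 cores visible, both GPU core ranges land inside a single socket, which is consistent with the node registering cleanly after a fresh slurmd restart.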
I'm able to resume the node if I restart slurmd.
Hi Yann,

Does this happen when you issue an "scontrol reconfigure", and on nodes where you have either CpuSpecList or CoreSpecCount set?

Regards.
I tried:

(baobab)-[root@admin1 users] (master)$ scontrol reconfigure gpu012
(baobab)-[root@admin1 users] (master)$ scontrol show node gpu012
NodeName=gpu012 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=10 CPUEfctv=22 CPUTot=24 CPULoad=1.89
   AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   ActiveFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   Gres=gpu:nvidia_geforce_rtx_2080_ti:8(S:0-1),VramPerGpu:no_consume:11G
   NodeAddr=gpu012 NodeHostName=gpu012 Version=24.11.1
   OS=Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024
   RealMemory=257000 AllocMem=122880 FreeMem=182572 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=11,23
   State=MIXED+DRAIN+INVALID_REG ThreadsPerCore=1 TmpDisk=300000 Weight=30 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu,private-dpnc-gpu
   BootTime=2025-03-24T17:21:12 SlurmdStartTime=2025-04-03T15:48:07
   LastBusyTime=2025-04-03T15:48:07 ResumeAfterTime=None
   CfgTRES=cpu=22,mem=257000M,billing=108,gres/gpu=8,gres/gpu:nvidia_geforce_rtx_2080_ti=8
   AllocTRES=cpu=10,mem=120G,gres/gpu=1,gres/gpu:nvidia_geforce_rtx_2080_ti=1
   CurrentWatts=0 AveWatts=0
   Reason=gres/gpu GRES core specification 0-10 doesn't match socket boundaries. (Socket 0 is cores 0-12) [slurm@2025-04-03T15:48:07]

So yes this does trigger t
Sorry, wrong copy-paste. This does trigger the issue, and yes, we have CoreSpecCount=2 on each GPU node.
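[Editor's note: a quick way to spot the affected nodes is to list every node that has core specialization configured, together with its state. The snippet below is only a convenience sketch around "scontrol -o show node", not an official tool; it assumes the standard key=value one-liner format, and values containing spaces (OS=, Reason=) are deliberately not parsed fully.]

```python
# Convenience sketch: list nodes with core specialization configured,
# along with their state, by parsing "scontrol -o show node" output.
import subprocess

def iter_nodes():
    """Yield one dict of key=value fields per node line."""
    out = subprocess.run(
        ["scontrol", "-o", "show", "node"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if not line.strip():
            continue
        # Values with embedded spaces (e.g. OS= or Reason=) will not parse
        # fully, but the single-token fields used below are fine.
        yield dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

for node in iter_nodes():
    if int(node.get("CoreSpecCount", "0")) > 0:
        print(node.get("NodeName"), node.get("State"),
              "CoreSpecCount=" + node["CoreSpecCount"])
```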
Good, we are already working on a solution for that. What is happening here is the following:

- slurmd starts normally, for example after a "systemctl restart slurmd".
- slurmd gets the GPU-to-core relationship information based on the CPUs it sees, which at this point in the initialization is all of them.
- slurmd applies the CoreSpecCount restriction to itself, or more precisely to the cgroup it is in.
- slurmd receives an "scontrol reconfigure" event from the controller and spawns a copy of itself in the same cgroup it is in.
- The new slurmd tries to get the socket-to-core information, as its parent did, but unlike its parent it is not in a fresh cgroup: its CPU visibility has already been limited by the parent. The information reported by NVML, which can see all the cores, is therefore not consistent with what the new slurmd sees (the view limited by CoreSpecCount), and the new slurmd process fails.

I will keep you updated on the progress of the solution; at the moment there is a patch in the review phase.

Regards.
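[Editor's note: to illustrate the mechanism described above, the sketch below compares the CPUs the kernel exposes with the CPUs allowed by the cgroup the current process lives in. It assumes a cgroup v2 system and that the cpuset controller is enabled for that cgroup; it is a diagnostic sketch, not part of Slurm. A process re-spawned inside an already CoreSpec-restricted cgroup will report a smaller effective cpuset than the hardware has, while NVML keeps reporting GPU affinity over all physical cores.]

```python
# Diagnostic sketch, assuming cgroup v2: compare the machine's online CPUs
# with the effective cpuset of the cgroup this process runs in.
import os

def online_cpus():
    # Full list of online CPUs as the kernel sees them, e.g. "0-63".
    with open("/sys/devices/system/cpu/online") as f:
        return f.read().strip()

def cgroup_effective_cpus():
    # On cgroup v2, /proc/self/cgroup contains a single "0::<path>" line.
    with open("/proc/self/cgroup") as f:
        cg_path = f.read().strip().split("::", 1)[1]
    # cpuset.cpus.effective is only present where the cpuset controller
    # is enabled for the cgroup.
    with open(f"/sys/fs/cgroup{cg_path}/cpuset.cpus.effective") as f:
        return f.read().strip()

if __name__ == "__main__":
    print("online CPUs:            ", online_cpus())
    print("cgroup effective cpuset:", cgroup_effective_cpus())
    print("sched_getaffinity size: ", len(os.sched_getaffinity(0)))
```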
Hello Yann,

We have included the fix for cgroup/v2 for this issue in the following commits:

1e5795ba - cgroup/v2 - _unset_cpuset_mem_limits do not reset untouched limits
14d789f7 - cgroup/v2 - xfree a missing field in common_cgroup_ns_destroy
440a22f3 - cgroup/v2 - Store the init cgroup path in the cgroup namespace
cb10bc46 - cgroup/v2 - Fix slurmd reconfig not working when removing CoreSpecLimits
71d3ab39 - cgroup/v2 - Add log flag to reset memory.max limits
47bd81ab - cgroup/v2 - Fix slurmd reconfig not resetting CoreSpec limits with systemd
9a173446 - Merge branch 'cherrypick-995-24.11' into 'slurm-24.11'

These will be shipped in the next Slurm 24.11 release. In case you are using cgroup/v1, we are still working on a fix for it.

Regards.