| Summary: | GRES cores doesn't match socket boundaries | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Yann <yann.sagon> |
| Component: | GPU | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | OPEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | jvilarru, ricard, shai.haim, tal.friedman |
| Version: | 24.11.1 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Université de Genève | Alineos Sites: | --- |
Description
Yann
2025-04-03 05:51:53 MDT
I'm able to resume the node if I restart slurmd.

Hi Yann,

Does this happen when you issue an scontrol reconfigure, and on nodes where you have either CpuSpecList or CoreSpecCount set?

Regards.

I tried:

```
(baobab)-[root@admin1 users] (master)$ scontrol reconfigure gpu012
(baobab)-[root@admin1 users] (master)$ scontrol show node gpu012
NodeName=gpu012 Arch=x86_64 CoresPerSocket=12 CPUAlloc=10 CPUEfctv=22 CPUTot=24 CPULoad=1.89
   AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   ActiveFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING,SIMPLE_PRECISION_GPU,COMPUTE_MODEL_RTX_2080_11G
   Gres=gpu:nvidia_geforce_rtx_2080_ti:8(S:0-1),VramPerGpu:no_consume:11G
   NodeAddr=gpu012 NodeHostName=gpu012 Version=24.11.1
   OS=Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024
   RealMemory=257000 AllocMem=122880 FreeMem=182572 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=11,23
   State=MIXED+DRAIN+INVALID_REG ThreadsPerCore=1 TmpDisk=300000 Weight=30 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu,private-dpnc-gpu
   BootTime=2025-03-24T17:21:12 SlurmdStartTime=2025-04-03T15:48:07 LastBusyTime=2025-04-03T15:48:07 ResumeAfterTime=None
   CfgTRES=cpu=22,mem=257000M,billing=108,gres/gpu=8,gres/gpu:nvidia_geforce_rtx_2080_ti=8
   AllocTRES=cpu=10,mem=120G,gres/gpu=1,gres/gpu:nvidia_geforce_rtx_2080_ti=1
   CurrentWatts=0 AveWatts=0
   Reason=gres/gpu GRES core specification 0-10 doesn't match socket boundaries. (Socket 0 is cores 0-12) [slurm@2025-04-03T15:48:07]
```

Sorry, wrong copy-paste. This does trigger the issue, and yes, we have CoreSpecCount=2 on our GPU nodes.

Good, we are already working on a solution for that. What is happening here is the following:

- slurmd starts normally, for example after a systemctl restart slurmd.
- slurmd gets the "GPU relationship to core" information based on the CPUs it sees, which at this point of initialization is all of them.
- slurmd applies the CoreSpecCount restriction to itself, or more precisely to the cgroup it is in.
- slurmd receives an "scontrol reconfigure" event from the controller and spawns a copy of itself in the same cgroup it is in.
- The new slurmd tries to get the socket-to-core information, as its parent did, but unlike its parent it is not in a fresh cgroup: its CPU visibility has already been limited by the parent. The information reported by NVML, which can see all the cores, is therefore not consistent with what the new slurmd sees (the view limited by CoreSpecCount), and the new slurmd process fails.

I will keep you updated on the progress of the solution; at this moment there is a patch in review.

Regards.

Hello Yann,

We have included the fix for cgroup/v2 for this issue in the following commits:

- 1e5795ba - cgroup/v2 - _unset_cpuset_mem_limits do not reset untouched limits
- 14d789f7 - cgroup/v2 - xfree a missing field in common_cgroup_ns_destroy
- 440a22f3 - cgroup/v2 - Store the init cgroup path in the cgroup namespace
- cb10bc46 - cgroup/v2 - Fix slurmd reconfig not working when removing CoreSpecLimits
- 71d3ab39 - cgroup/v2 - Add log flag to reset memory.max limits
- 47bd81ab - cgroup/v2 - Fix slurmd reconfig not resetting CoreSpec limits with systemd
- 9a173446 - Merge branch 'cherrypick-995-24.11' into 'slurm-24.11'

These will be shipped in the next Slurm 24.11 release. In case you are using cgroup/v1, we are still working on the fix for it.

Regards.
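For readers who want to see the mismatch described above on their own nodes, here is a minimal sketch. It assumes cgroup/v2 mounted at /sys/fs/cgroup and slurmd running as a systemd unit named slurmd.service; the exact cgroup path is an assumption and may differ per site.

```sh
# All CPUs the hardware (and NVML) can see, with their socket mapping:
lscpu -p=CPU,CORE,SOCKET | grep -v '^#'

# CPUs the slurmd cgroup is actually allowed to use after the
# CoreSpecCount/CPUSpecList limit has been applied (path assumed):
cat /sys/fs/cgroup/system.slice/slurmd.service/cpuset.cpus.effective

# On a node like gpu012 (CoreSpecCount=2, CPUSpecList=11,23) the second
# command would be expected to show something like 0-10,12-22, while
# lscpu still reports CPUs 0-23. A slurmd re-spawned inside this cgroup
# by "scontrol reconfigure" only sees the restricted set, which is the
# inconsistency behind the "doesn't match socket boundaries" error.
```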
Hi,

We are experiencing a similar issue with cgroup v1, and seeing this case we would like to join the ticket. Is there an expected timeline for a fix for cgroup v1? It is currently blocking our upgrade to 24.11.x, so we would appreciate anything you could do to push for a solution.

Regards,
Tal
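As a side note for anyone following this ticket, here is a quick way to confirm which cgroup version (and therefore which fix) applies to a given node. The cgroup.conf path below is an assumption and may differ per installation.

```sh
# "cgroup2fs" means the node runs cgroup v2; "tmpfs" indicates the
# legacy cgroup v1 hierarchy.
stat -fc %T /sys/fs/cgroup

# The plugin Slurm actually uses is set via CgroupPlugin in cgroup.conf
# (cgroup/v1, cgroup/v2, or autodetect); path assumed:
grep -i CgroupPlugin /etc/slurm/cgroup.conf
```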