We expanded our cluster and added a new partition for testing. The issue is that I can't seem to submit anything to it, even though everything looks like it should be working. Here are the partitions and the simple srun command I am trying to run:

[naveed@head1 quickstart]$ scontrol show part any
PartitionName=any
   AllowGroups=ALL DenyAccounts=sunshine AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=hpc-19-[13-19,21-28,30-37],hpc-20-[13-19,21-28,30-37],hpc-21-[14-16,18-19,22-28,30-37],hpc-22-[07-24,28,30,32,34,36,38],hpc-23-[07-24,28,30,32,34,36,38],hpc-24-[07-24,28,30,32,34,36,38],hpc-25-[03-10,14-15,17-18,20-21,23-24],hpc-26-[14-15,17-18,20-21,23-24],hpc-80-[04-09,11-16,18-23,25-28,33,36],hpc-81-[04-09,11-16,18-23,25-27,33,35,37],hpc-82-[04-09,11-16,18-23,25-27,33,35,37],hpc-83-[04-09,11-16,18-23,25-27,33,35,37],hpc-89-[03-26,32-33,35-39],hpc-90-[03-26,29-30,32-33,35-38],hpc-91-[10-21,24-25,28-30,32-33],hpc-92-[03-26,29-30,32-33,36-38],hpc-93-[03-26,29-30,32-33,36-38]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=16992 TotalNodes=402 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED
   TRES=cpu=16992,mem=131746000M,node=402,billing=16992,gres/gpu=208

[naveed@head1 quickstart]$ scontrol show part expansion
PartitionName=expansion
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=11520 TotalNodes=180 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED
   TRES=cpu=11520,mem=91800000M,node=180,billing=11520

[naveed@head1 quickstart]$ srun -p expansion -t 5:00 uptime
srun: error: Unable to allocate resources: Requested node configuration is not available
[naveed@head1 quickstart]$ srun -p any -t 5:00 uptime
 18:10:06 up 10:47,  0 users,  load average: 23.67, 26.38, 27.87

From slurm.conf:

PartitionName=expansion Default=No Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41] MaxTime=14-0
NodeName=hpc-34-[01-20,23-41],hpc-35-[01-20,23-28],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=510000 Features=icelake weight=20
NodeName=hpc-35-[29-38] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=510000 Features=epyc weight=20

And here is me trying to connect to a specific node, plus the config for that node:

[naveed@head1 quickstart]$ srun -p any -t 5:00 --nodelist=hpc-35-04 uptime
srun: error: Unable to allocate resources: Requested node configuration is not available
[naveed@head1 quickstart]$ scontrol show node hpc-35-04
NodeName=hpc-35-04 Arch=x86_64 CoresPerSocket=32
   CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.05
   AvailableFeatures=icelake
   ActiveFeatures=icelake
   Gres=(null)
   NodeAddr=hpc-35-04 NodeHostName=hpc-35-04 Version=22.05.6
   OS=Linux 5.14.0-162.23.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 23 20:08:28 EDT 2023
   RealMemory=510000 AllocMem=0 FreeMem=510826 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=expansion
   BootTime=2023-09-15T08:29:35 SlurmdStartTime=2023-09-15T17:31:52
   LastBusyTime=2023-09-15T16:29:39
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
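Aside: a minimal diagnostic sketch for this kind of "Requested node configuration is not available" error, comparing the scheduler's view of the nodes against the request. The format string below is one reasonable choice, not something taken from this ticket:

# node-oriented view of the new partition: node name, compact state, admin-set reason
sinfo -N -p expansion -o "%N %t %E"

# full scheduler view of a single node (CPUs, memory, features, state flags)
scontrol show node hpc-35-04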
Created attachment 32284: slurm etc dir
BTW, some of them have a POWERED_DOWN state. How do I clear that? They weren't in the SuspendExcNodes list at first, but are now:

NodeName=hpc-34-41 Arch=x86_64 CoresPerSocket=32
   CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.24
   AvailableFeatures=icelake
   ActiveFeatures=icelake
   Gres=(null)
   NodeAddr=hpc-34-41 NodeHostName=hpc-34-41 Version=22.05.6
   OS=Linux 5.14.0-162.23.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 23 20:08:28 EDT 2023
   RealMemory=510000 AllocMem=0 FreeMem=503572 Sockets=2 Boards=1
   State=IDLE+POWERED_DOWN ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=expansion
   BootTime=2023-09-15T08:28:47 SlurmdStartTime=2023-09-15T17:31:52
   LastBusyTime=Unknown
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
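Aside, a hedged sketch for spotting these nodes: with power saving enabled, sinfo marks powered-down nodes with a "~" suffix on the compact state (e.g. "idle~"), so something like the following should list them. The exact pipeline is an assumption, not from this ticket:

# powered-down nodes carry a "~" suffix in sinfo's compact node state
sinfo -N -p expansion -o "%N %t" | grep '~'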
Naveed,

Looking at the configs you shared, I see that the nodes from the "expansion" partition weren't added to topology.conf; this leaves the SelectType plugin with topology/tree unable to allocate resources on them.[1]

>BTW, some of them have a POWERED_DOWN state. How do I clear that? They weren't in the SuspendExcNodes list at first, but are now:

Did you try:

>scontrol update node=... state=POWER_UP

[2]

Let me know if that helps.

cheers,
Marcin

[1] From slurm.conf:
>TopologyPlugin = topology/tree
>PartitionName=expansion Default=No Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41] MaxTime=14-0
[2] https://slurm.schedmd.com/scontrol.html#OPT_POWER_UP
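To make that concrete: a minimal hedged sketch of the kind of entries the expansion nodes might need in topology.conf. The switch names and the one-leaf-switch-per-rack layout are illustrative assumptions, not taken from the attached configs; the real entries must mirror the actual fabric:

# hypothetical leaf switches, one per expansion rack
SwitchName=leaf34 Nodes=hpc-34-[01-20,23-41]
SwitchName=leaf35 Nodes=hpc-35-[01-20,23-38]
SwitchName=leaf52 Nodes=hpc-52-[01-20,23-41]
SwitchName=leaf53 Nodes=hpc-53-[01-20,23-29]
SwitchName=leaf54 Nodes=hpc-54-[01-20,23-41]
# hypothetical spine tying the new leaves into the existing tree
SwitchName=spine-expansion Switches=leaf[34-35],leaf[52-54]

On 22.05 a slurmctld restart, rather than just scontrol reconfigure, may be needed for topology.conf changes to take effect.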
Thank you. The topology file helped, but I am still having issues getting the nodes out of the POWERING_UP state. I ran the scontrol update command about an hour ago and they are still in POWERING_UP:

NodeName=hpc-34-41 CoresPerSocket=32
   CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=N/A
   AvailableFeatures=icelake
   ActiveFeatures=icelake
   Gres=(null)
   NodeAddr=hpc-34-41 NodeHostName=hpc-34-41 Version=22.05.6
   RealMemory=510000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=UNKNOWN+POWERING_UP ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=expansion
   BootTime=2023-09-15T08:28:48 SlurmdStartTime=2023-09-18T08:08:32
   LastBusyTime=2023-09-18T07:56:45
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

When I restarted Slurm after adding the topology file and trying to fix the SuspendExcNodes, I see this in the log, but I can't determine the reason:

[2023-09-18T07:56:48.237] error: Invalid SuspendExcNodes hpc-19-[13-19,21-28,30-37],hpc-20-[13-19,21-28,30-37],hpc-21-[22-28,30-37],hpc-21-[14-16,18-19],hpc-22-[07-24,28,30,32,34,36,38],hpc-23-[07-24,28,30,32,34,36,38],hpc-24-[07-24,28,30,32,34,36,38],hpc-25-[03-10,14-15,17-18,20-21,23-24],hpc-26-[14-15,17-18,20-21,23-24],hpc-79-11,hpc-80-[04-09,11-16,18-23,25-28,33,36],Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41],hpc-81-[04-09,11-16,18-23,25-27,33,35,37],hpc-82-[04-09,11-16,18-23,25-27,33,35,37],hpc-83-[04-09,11-16,18-23,25-27,33,35,37],hpc-89-[03-26,32-33,35-39],hpc-90-[03-26,29-30,32-33,35-38],hpc-91-[10-21,24-25,28-30,32-33],hpc-92-[03-26,29-30,32-33,36-38],hpc-93-[03-26,29-30,32-33,36-38] ignored
Maybe because of a copy-and-paste mistake: there is a stray "Nodes=" embedded in the SuspendExcNodes value?

>SuspendExcNodes=[...]hpc-80-[04-09,11-16,18-23,25-28,33,36],Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38]
>                                                            ^^^^^

I assume the nodes are up and running slurmd? Could you please try to set them DOWN and then RESUME?

cheers,
Marcin
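For reference, a hedged sketch of the DOWN-then-RESUME cycle suggested above; the node name and reason string are illustrative only:

# mark the stuck node down, with a reason, then return it to service
scontrol update NodeName=hpc-34-41 State=DOWN Reason="clearing stuck POWERING_UP"
scontrol update NodeName=hpc-34-41 State=RESUME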
Thank you. I think we are good on this ticket now. I have a different issue, but I will open a separate ticket for it to keep things clear.
Thanks for confirming!