Ticket 17703

Summary: cannot submit jobs to new partition
Product: Slurm
Reporter: Naveed Near-Ansari <naveed>
Component: Configuration
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN
Severity: 2 - High Impact
CC: cinek
Version: 22.05.6
Hardware: Linux
OS: Linux
Site: Caltech
Attachments: slurm etc dir

Description Naveed Near-Ansari 2023-09-15 19:16:57 MDT
We expanded our cluster and added a new partition, for now just for testing. The issue is that I can't seem to submit anything to it, even though everything looks like it should be working. Here are the partitions and the simple srun command I am trying to run:

[naveed@head1 quickstart]$ scontrol show part any
PartitionName=any
   AllowGroups=ALL DenyAccounts=sunshine AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=hpc-19-[13-19,21-28,30-37],hpc-20-[13-19,21-28,30-37],hpc-21-[14-16,18-19,22-28,30-37],hpc-22-[07-24,28,30,32,34,36,38],hpc-23-[07-24,28,30,32,34,36,38],hpc-24-[07-24,28,30,32,34,36,38],hpc-25-[03-10,14-15,17-18,20-21,23-24],hpc-26-[14-15,17-18,20-21,23-24],hpc-80-[04-09,11-16,18-23,25-28,33,36],hpc-81-[04-09,11-16,18-23,25-27,33,35,37],hpc-82-[04-09,11-16,18-23,25-27,33,35,37],hpc-83-[04-09,11-16,18-23,25-27,33,35,37],hpc-89-[03-26,32-33,35-39],hpc-90-[03-26,29-30,32-33,35-38],hpc-91-[10-21,24-25,28-30,32-33],hpc-92-[03-26,29-30,32-33,36-38],hpc-93-[03-26,29-30,32-33,36-38]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=16992 TotalNodes=402 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED
   TRES=cpu=16992,mem=131746000M,node=402,billing=16992,gres/gpu=208

[naveed@head1 quickstart]$ scontrol show part expansion
PartitionName=expansion
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=11520 TotalNodes=180 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED
   TRES=cpu=11520,mem=91800000M,node=180,billing=11520

[naveed@head1 quickstart]$ srun -p expansion -t 5:00 uptime
srun: error: Unable to allocate resources: Requested node configuration is not available
[naveed@head1 quickstart]$ srun -p any -t 5:00 uptime
 18:10:06 up 10:47,  0 users,  load average: 23.67, 26.38, 27.87

From slurm.conf:
PartitionName=expansion Default=No Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41] MaxTime=14-0

NodeName=hpc-34-[01-20,23-41],hpc-35-[01-20,23-28],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=510000 Features=icelake weight=20
NodeName=hpc-35-[29-38] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=510000 Features=epyc weight=20


And here is an attempt to connect to a specific node, along with the config for that node:

[naveed@head1 quickstart]$ srun -p any -t 5:00 --nodelist=hpc-35-04 uptime
srun: error: Unable to allocate resources: Requested node configuration is not available
[naveed@head1 quickstart]$ scontrol show node hpc-35-04
NodeName=hpc-35-04 Arch=x86_64 CoresPerSocket=32
   CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.05
   AvailableFeatures=icelake
   ActiveFeatures=icelake
   Gres=(null)
   NodeAddr=hpc-35-04 NodeHostName=hpc-35-04 Version=22.05.6
   OS=Linux 5.14.0-162.23.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 23 20:08:28 EDT 2023
   RealMemory=510000 AllocMem=0 FreeMem=510826 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=expansion
   BootTime=2023-09-15T08:29:35 SlurmdStartTime=2023-09-15T17:31:52
   LastBusyTime=2023-09-15T16:29:39
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 1 Naveed Near-Ansari 2023-09-15 19:20:18 MDT
Created attachment 32284 [details]
slurm etc dir
Comment 2 Naveed Near-Ansari 2023-09-15 19:33:01 MDT
BTW - some of the nodes have a POWERED_DOWN state. How do I clear that? They weren't in the SuspendExcNodes list at first, but are now:

NodeName=hpc-34-41 Arch=x86_64 CoresPerSocket=32
   CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.24
   AvailableFeatures=icelake
   ActiveFeatures=icelake
   Gres=(null)
   NodeAddr=hpc-34-41 NodeHostName=hpc-34-41 Version=22.05.6
   OS=Linux 5.14.0-162.23.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 23 20:08:28 EDT 2023
   RealMemory=510000 AllocMem=0 FreeMem=503572 Sockets=2 Boards=1
   State=IDLE+POWERED_DOWN ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=expansion
   BootTime=2023-09-15T08:28:47 SlurmdStartTime=2023-09-15T17:31:52
   LastBusyTime=Unknown
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 3 Marcin Stolarek 2023-09-18 01:30:27 MDT
Naveed,

Looking at the configs you shared, I see that the nodes from the "expansion" partition weren't added to topology.conf. With the topology/tree plugin in use, the select plugin is unable to allocate resources on nodes that are missing from the topology.[1]

>BTW - some of the nodes have a POWERED_DOWN state. How do I clear that? They weren't in the SuspendExcNodes list at first, but are now:
Did you try:
>scontrol update node=... state=POWER_UP[2]

Let me know if that helps.

cheers,
Marcin
[1]From slurm.conf:
>TopologyPlugin          = topology/tree
>PartitionName=expansion Default=No Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41] MaxTime=14-0

[2]https://slurm.schedmd.com/scontrol.html#OPT_POWER_UP
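For reference, a minimal topology.conf sketch covering the new nodes (the switch names and hierarchy below are assumptions, not taken from your attached configs; the point is only that every node in the "expansion" partition has to appear under some switch):
>SwitchName=sw-expansion Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41]
>SwitchName=sw-root Switches=sw-expansion,<your existing switch names>
slurmctld needs to be restarted after editing topology.conf for the change to take effect.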
Comment 4 Naveed Near-Ansari 2023-09-18 09:37:09 MDT
Thank you. The topology file helped, but I am still having issues getting the nodes out of the POWERING_UP state now.

I ran the scontrol update command about an hour ago and they are still in a POWERING_UP state:

NodeName=hpc-34-41 CoresPerSocket=32
   CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=N/A
   AvailableFeatures=icelake
   ActiveFeatures=icelake
   Gres=(null)
   NodeAddr=hpc-34-41 NodeHostName=hpc-34-41 Version=22.05.6
   RealMemory=510000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=UNKNOWN+POWERING_UP ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=expansion
   BootTime=2023-09-15T08:28:48 SlurmdStartTime=2023-09-18T08:08:32
   LastBusyTime=2023-09-18T07:56:45
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

When I restarted Slurm after adding the topology file and trying to fix the SuspendExcNodes, I see this in the log, but can't determine the reason:

[2023-09-18T07:56:48.237] error: Invalid SuspendExcNodes hpc-19-[13-19,21-28,30-37],hpc-20-[13-19,21-28,30-37],hpc-21-[22-28,30-37],hpc-21-[14-16,18-19],hpc-22-[07-24,28,30,32,34,36,38],hpc-23-[07-24,28,30,32,34,36,38],hpc-24-[07-24,28,30,32,34,36,38],hpc-25-[03-10,14-15,17-18,20-21,23-24],hpc-26-[14-15,17-18,20-21,23-24],hpc-79-11,hpc-80-[04-09,11-16,18-23,25-28,33,36],Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41],hpc-81-[04-09,11-16,18-23,25-27,33,35,37],hpc-82-[04-09,11-16,18-23,25-27,33,35,37],hpc-83-[04-09,11-16,18-23,25-27,33,35,37],hpc-89-[03-26,32-33,35-39],hpc-90-[03-26,29-30,32-33,35-38],hpc-91-[10-21,24-25,28-30,32-33],hpc-92-[03-26,29-30,32-33,36-38],hpc-93-[03-26,29-30,32-33,36-38] ignored


Comment 5 Marcin Stolarek 2023-09-18 09:48:50 MDT
Maybe it's because of a copy-and-paste mistake: a stray "Nodes=" ended up inside the SuspendExcNodes value?
>SuspendExcNodes=[...]hpc-80-[04-09,11-16,18-23,25-28,33,36],Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38]
>                                                            ^^^^^
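A corrected line would look roughly like this (a sketch; the full hostlist is abbreviated and should otherwise stay as you have it), i.e. one comma-separated hostlist with the stray "Nodes=" removed:
>SuspendExcNodes=hpc-19-[13-19,21-28,30-37],[...],hpc-80-[04-09,11-16,18-23,25-28,33,36],hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],[...]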

I assume the nodes are up and running slurmd? Could you please try to set them DOWN and then RESUME?
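For example (using hpc-34-41 only as an illustration):
>scontrol update nodename=hpc-34-41 state=DOWN reason="clear stuck power state"
>scontrol update nodename=hpc-34-41 state=RESUME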

cheers,
Marcin
Comment 6 Naveed Near-Ansari 2023-09-19 12:14:31 MDT
Thank you. I think we are good on this ticket now. I have a different issue now, but I will open a separate ticket for it to keep things clear.
Comment 7 Marcin Stolarek 2023-09-20 01:27:23 MDT
Thanks for confirming!