| Summary: | cannot submit jobs to new partition | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Naveed Near-Ansari <naveed> |
| Component: | Configuration | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | CC: | cinek |
| Version: | 22.05.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Caltech | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm etc dir | | |
Description
Naveed Near-Ansari
2023-09-15 19:16:57 MDT
Created attachment 32284 [details]
slurm etc dir
Naveed Near-Ansari:

BTW, some of them have a POWERED_DOWN state. How do I clear that? They weren't in the SuspendExcNodes list at first, but are now:

NodeName=hpc-34-41 Arch=x86_64 CoresPerSocket=32 CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.24
   AvailableFeatures=icelake
   ActiveFeatures=icelake
   Gres=(null)
   NodeAddr=hpc-34-41 NodeHostName=hpc-34-41 Version=22.05.6
   OS=Linux 5.14.0-162.23.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 23 20:08:28 EDT 2023
   RealMemory=510000 AllocMem=0 FreeMem=503572 Sockets=2 Boards=1
   State=IDLE+POWERED_DOWN ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=expansion
   BootTime=2023-09-15T08:28:47 SlurmdStartTime=2023-09-15T17:31:52 LastBusyTime=Unknown
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Marcin Stolarek:

Naveed,

Looking at the configs you shared, I see that the nodes from the "expansion" partition weren't added to topology.conf. This results in a situation where the SelectType plugin with topology/tree is unable to allocate resources.[1]

> BTW, some of them have a POWERED_DOWN state. How do I clear that? They weren't in the SuspendExcNodes list at first, but are now:

Did you try:

> scontrol update node=... state=POWER_UP

[2]

Let me know if that helps.

cheers,
Marcin

[1] From slurm.conf:
> TopologyPlugin = topology/tree
> PartitionName=expansion Default=No Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41] MaxTime=14-0

[2] https://slurm.schedmd.com/scontrol.html#OPT_POWER_UP

Naveed Near-Ansari:

Thank you. The topology file helped, but I am still having an issue getting the nodes out of the POWERING_UP state.
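For context, the fix in [1] amounts to giving every node of the new partition a place in topology.conf; with topology/tree, nodes absent from that file cannot be placed in the switch tree, so the select plugin cannot allocate them. A minimal hypothetical sketch (the switch names `expansion_leaf` and `root` are assumptions, not taken from the site's actual fabric):

```
# topology.conf -- hypothetical layout; merge into the real switch wiring
SwitchName=expansion_leaf Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41]
SwitchName=root Switches=expansion_leaf
```

In this thread the change was picked up by restarting the Slurm daemons after editing the file.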
I ran the scontrol update command about an hour ago and they are still in a POWERING_UP state:

NodeName=hpc-34-41 CoresPerSocket=32 CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=N/A
   AvailableFeatures=icelake
   ActiveFeatures=icelake
   Gres=(null)
   NodeAddr=hpc-34-41 NodeHostName=hpc-34-41 Version=22.05.6
   RealMemory=510000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=UNKNOWN+POWERING_UP ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=expansion
   BootTime=2023-09-15T08:28:48 SlurmdStartTime=2023-09-18T08:08:32 LastBusyTime=2023-09-18T07:56:45
   CfgTRES=cpu=64,mem=510000M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

When I restarted Slurm after adding the topology file and trying to fix the SuspendExcNodes, I see this in the log, but can't determine the reason:

[2023-09-18T07:56:48.237] error: Invalid SuspendExcNodes hpc-19-[13-19,21-28,30-37],hpc-20-[13-19,21-28,30-37],hpc-21-[22-28,30-37],hpc-21-[14-16,18-19],hpc-22-[07-24,28,30,32,34,36,38],hpc-23-[07-24,28,30,32,34,36,38],hpc-24-[07-24,28,30,32,34,36,38],hpc-25-[03-10,14-15,17-18,20-21,23-24],hpc-26-[14-15,17-18,20-21,23-24],hpc-79-11,hpc-80-[04-09,11-16,18-23,25-28,33,36],Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38],hpc-52-[01-20,23-41],hpc-53-[01-20,23-29],hpc-54-[01-20,23-41],hpc-81-[04-09,11-16,18-23,25-27,33,35,37],hpc-82-[04-09,11-16,18-23,25-27,33,35,37],hpc-83-[04-09,11-16,18-23,25-27,33,35,37],hpc-89-[03-26,32-33,35-39],hpc-90-[03-26,29-30,32-33,35-38],hpc-91-[10-21,24-25,28-30,32-33],hpc-92-[03-26,29-30,32-33,36-38],hpc-93-[03-26,29-30,32-33,36-38] ignored

Marcin Stolarek:

Maybe because of the copy & paste mistake of "Nodes=" in the SuspendExcNodes?
>SuspendExcNodes=[...]hpc-80-[04-09,11-16,18-23,25-28,33,36],Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38]
> ^^^^^
I assume the nodes are up and running slurmd? Could you please try to set them DOWN and then RESUME?
cheers,
Marcin
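The stray `Nodes=` key Marcin spotted is also easy to catch mechanically before restarting the daemons. A small standalone sketch (not part of Slurm; the helper names are invented): it splits a hostlist expression on top-level commas only, then flags any token that is not a plain hostname with an optional bracketed range list, which is roughly what slurmctld rejects with "Invalid SuspendExcNodes ... ignored".

```python
import re

def split_hostlist(expr: str) -> list[str]:
    """Split a Slurm hostlist expression on top-level commas.

    Commas inside [...] range lists stay attached to their host token.
    """
    parts, cur, depth = [], [], 0
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    if cur:
        parts.append("".join(cur))
    return parts

# A valid token: a bare node name, optionally followed by one
# bracketed list of numeric ranges, e.g. hpc-34-[01-20,23-41].
TOKEN = re.compile(r"^[A-Za-z0-9_.-]+(\[[0-9,-]+\])?$")

def check_suspend_exc_nodes(value: str) -> list[str]:
    """Return the tokens that would make the whole value invalid."""
    return [t for t in split_hostlist(value) if not TOKEN.match(t)]

# The pasted "Nodes=" key from the log, in miniature:
bad = check_suspend_exc_nodes(
    "hpc-80-[04-09,33,36],Nodes=hpc-34-[01-20,23-41],hpc-35-[01-20,23-38]"
)
print(bad)  # flags the token carrying the stray "Nodes=" key
```

Running this on the SuspendExcNodes value from the log would point straight at the `Nodes=hpc-34-[...]` token while leaving the legitimate ranges alone.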
Naveed Near-Ansari:

Thank you. I think we are good on this ticket now. I have a different issue now, but I will open a separate ticket for it to keep things clear.

Marcin Stolarek:

Thanks for confirming!