| Summary: | move the node other state [similar to drain] | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | RAMYA ERANNA <reranna> |
| Component: | Configuration | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | cinek |
| Version: | 22.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | SLAC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
RAMYA ERANNA
2023-09-28 16:58:50 MDT
Weren't you trying to create a reservation with "MAINT" flag set[1]? For instance:
>[root@slurmctl slurm-23.11]# scontrol create reservation=maintenance nodes=test01 flags=maint start=now duration=infinite users=root
>Reservation created: maintenance
cheers,
Marcin
[1]https://slurm.schedmd.com/scontrol.html#OPT_MAINT

I tried the command below, but the nodes still show as being in Drain state. My main intention is to use a state other than Drain state for nodes that are taken out for maintenance. something like maintenance/invalid/ or any other state

milano up 10-00:00:0 1 drain$ sdfmilan009

NodeName=sdfmilan009 CoresPerSocket=64
   CPUAlloc=0 CPUEfctv=120 CPUTot=128 CPULoad=N/A
   AvailableFeatures=CPU_GEN:RME,CPU_SKU:7713,CPU_FRQ:2.00GHz
   ActiveFeatures=CPU_GEN:RME,CPU_SKU:7713,CPU_FRQ:2.00GHz
   Gres=(null)
   NodeAddr=sdfmilan009 NodeHostName=sdfmilan009
   RealMemory=512000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   CoreSpecCount=8 CPUSpecList=0,1,2,3,4,5,6,7
   State=DOWN+DRAIN+MAINTENANCE+RESERVED+NOT_RESPONDING ThreadsPerCore=1 TmpDisk=0 Weight=476 Owner=N/A MCS_label=N/A
   Partitions=milano
   BootTime=None SlurmdStartTime=None
   LastBusyTime=2023-09-14T14:06:41
   CfgTRES=cpu=120,mem=500G,billing=120
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=no_ssh [root@2023-09-14T13:18:30]

The node is now reserved for maintenance, you can resume it by:
>#scontrol update node=... state=RESUME
The jobs won't be scheduled on the node since it's reserved.
cheers,
Marcin
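
The State= field in the node output above is a "+"-separated list of flags, so DRAIN and MAINTENANCE can be set at the same time. As a rough sketch for checking whether DRAIN is still present after the resume, the flags can be split out of the `scontrol show node` text. The helper names `node_state_flags` and `node_has_flag` are hypothetical and not part of Slurm; they only parse text on stdin and assume the State= field is formatted as shown in this ticket.

```shell
# Hypothetical helper (not part of Slurm): list the flags in a node's State= field.
# Reads `scontrol show node <name>` style text on stdin and prints the flags
# joined by commas, e.g. DOWN,DRAIN,MAINTENANCE.
node_state_flags() {
  tr ' ' '\n' | sed -n 's/^State=//p' | tr '+' '\n' | paste -sd, -
}

# Hypothetical helper: succeed if the given flag is present in the State= field.
# $1 = flag name, stdin = `scontrol show node` output.
node_has_flag() {
  node_state_flags | tr ',' '\n' | grep -qx "$1"
}

# Usage against a live cluster (assumes scontrol is on PATH):
#   scontrol show node sdfmilan009 | node_state_flags
#   scontrol show node sdfmilan009 | node_has_flag DRAIN && echo "still drained"
```

If DRAIN is still listed after `state=RESUME`, the node has presumably not been resumed yet; the MAINTENANCE and RESERVED flags are expected to remain while the reservation exists.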
Is there anything else I can help you with?

Any update from your side? In case of no reply I'll close the ticket as information given.
cheers,
Marcin

Hi,
Thank you for the information. Please close this ticket
Thank you |