Ticket 17808

Summary: move the node to another state [similar to drain]
Product: Slurm Reporter: RAMYA ERANNA <reranna>
Component: Configuration Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: cinek
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: SLAC

Description RAMYA ERANNA 2023-09-28 16:58:50 MDT
Hi,

I would like to move nodes that have hardware problems into a maintenance state in Slurm. I do not want these nodes to be in the drain state; I would prefer some other state, such as maintenance. I was unable to put a node into the Maintenance state; the command I tried and its output are below.
Can you please provide more details?

scontrol update NodeName=sdfmilan049 State=Maint Reason=Maintenance
Invalid input: State=Maint
Request aborted
Valid states are: NoResp DRAIN FAIL FUTURE RESUME POWER_DOWN POWER_UP UNDRAIN
Not all states are valid given a node's prior state
Comment 1 Marcin Stolarek 2023-10-02 06:47:22 MDT
Were you perhaps trying to create a reservation with the "MAINT" flag set [1]?
For instance:
>[root@slurmctl slurm-23.11]# scontrol create reservation=maintenance nodes=test01 flags=maint start=now duration=infinite users=root 
>Reservation created: maintenance

cheers,
Marcin
[1] https://slurm.schedmd.com/scontrol.html#OPT_MAINT
Comment 2 RAMYA ERANNA 2023-10-02 15:38:05 MDT
I tried that command, but the nodes still show as being in the Drain state. My main intention is to use a state other than Drain for nodes that are taken out for maintenance: something like maintenance, invalid, or any other state.

milano       up 10-00:00:0      1 drain$ sdfmilan009


NodeName=sdfmilan009 CoresPerSocket=64 
   CPUAlloc=0 CPUEfctv=120 CPUTot=128 CPULoad=N/A
   AvailableFeatures=CPU_GEN:RME,CPU_SKU:7713,CPU_FRQ:2.00GHz
   ActiveFeatures=CPU_GEN:RME,CPU_SKU:7713,CPU_FRQ:2.00GHz
   Gres=(null)
   NodeAddr=sdfmilan009 NodeHostName=sdfmilan009 
   RealMemory=512000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   CoreSpecCount=8 CPUSpecList=0,1,2,3,4,5,6,7 
   State=DOWN+DRAIN+MAINTENANCE+RESERVED+NOT_RESPONDING ThreadsPerCore=1 TmpDisk=0 Weight=476 Owner=N/A MCS_label=N/A
   Partitions=milano 
   BootTime=None SlurmdStartTime=None
   LastBusyTime=2023-09-14T14:06:41
   CfgTRES=cpu=120,mem=500G,billing=120
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=no_ssh [root@2023-09-14T13:18:30]
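
The State= field above is a base state (DOWN) followed by "+"-separated flags, so DRAIN and MAINTENANCE can both be set on the same node; the "$" suffix in the sinfo line likewise marks a node that is in a maintenance reservation. A minimal sketch of checking such a string for one flag follows. The helper name has_state_flag is hypothetical and not part of Slurm, and the state string is copied from the output above.

```shell
#!/usr/bin/env bash
# Sketch: test whether a Slurm node State string carries a given flag.
# has_state_flag is a hypothetical helper, not a Slurm command.
has_state_flag() {
    local state="$1" flag="$2" part
    local IFS='+'            # split the state string on "+"
    for part in $state; do
        if [ "$part" = "$flag" ]; then
            return 0         # exact match on one component
        fi
    done
    return 1
}

# State string taken from the scontrol output in this comment.
state="DOWN+DRAIN+MAINTENANCE+RESERVED+NOT_RESPONDING"

if has_state_flag "$state" MAINTENANCE; then
    echo "node is in a maintenance reservation"
fi
if has_state_flag "$state" DRAIN; then
    echo "node still carries the DRAIN flag"
fi
```

For this node both messages are printed: the reservation added the MAINTENANCE and RESERVED flags but did not clear the earlier DRAIN flag.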
Comment 3 Marcin Stolarek 2023-10-03 01:46:38 MDT
The node is now reserved for maintenance; you can resume it with:
>#scontrol update node=... state=RESUME

Jobs won't be scheduled on the node since it's reserved.

cheers,
Marcin
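
The sequence from Comments 1 and 3 (reserve the node with the maint flag, then resume it) could be wrapped roughly as below. This is only a sketch: maint_node and the RUN variable are hypothetical names, the reservation options are the ones from Comment 1, and by default the script only prints the commands rather than executing them.

```shell
#!/usr/bin/env bash
# Sketch: reserve a node for maintenance, then clear its drain state.
# maint_node and RUN are hypothetical names used for illustration.
# Default is a dry run; set RUN=1 on a host with Slurm admin rights to execute.
maint_node() {
    node="$1"
    resv="${2:-maintenance}"   # reservation name, as in Comment 1
    if [ "${RUN:-0}" = "1" ]; then run=""; else run="echo"; fi

    # 1. Reserve the node so that no jobs are scheduled on it.
    $run scontrol create reservation="$resv" nodes="$node" flags=maint \
        start=now duration=infinite users=root
    # 2. Resume the node, as suggested in Comment 3.
    $run scontrol update NodeName="$node" State=RESUME
    # 3. Show the node so the resulting State= field can be checked.
    $run scontrol show node "$node"
}

maint_node sdfmilan009
```

Keeping the dry run as the default lets the commands be reviewed before they touch the cluster.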
Comment 4 Marcin Stolarek 2023-10-06 06:34:00 MDT
Is there anything else I can help you with?
Comment 5 Marcin Stolarek 2023-10-17 06:42:01 MDT
Any update from your side? In case of no reply I'll close the ticket as information given.

cheers,
Marcin
Comment 6 RAMYA ERANNA 2023-10-19 14:29:03 MDT
Hi,

Thank you for the information. Please close this ticket.

Thank you