Ticket 6290

Summary: Invalid node state transition requested from=MAINT to=RESUME got (nil)
Product: Slurm
Reporter: Sebastien Varrette <Sebastien.Varrette>
Component: slurmctld
Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Version: 17.11.12
Hardware: Linux
OS: Linux
Site: University of Luxembourg

Description Sebastien Varrette 2018-12-21 16:53:59 MST
Hi,

My question is probably naive, but I'm unable to resume nodes of our cluster now that the maintenance is over - likely because there is still a running reservation (which I deliberately don't wish to delete yet).

Still, the 'got (nil)' part of the error message is surprising.

Can you advise on the best way to proceed, and whether this is normal behavior or a bug introduced by the migration from 17.11.11 to 17.11.12?

```
$> scontrol show config | grep -i version
SLURM_VERSION           = 17.11.12

$> scontrol show reservations
ReservationName=clustermaintenance StartTime=2018-12-18T14:00:00 EndTime=2018-12-24T02:00:00 Duration=5-12:00:00
   Nodes=iris-[001-190] NodeCnt=190 CoreCnt=5656 Features=(null) PartitionName=admin Flags=MAINT,IGNORE_JOBS,SPEC_NODES,PART_NODES
   TRES=cpu=5656
   Users=(null) Accounts=ulhpc Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

$> sinfo -T
RESV_NAME              STATE           START_TIME             END_TIME     DURATION  NODELIST
clustermaintenance    ACTIVE  2018-12-18T14:00:00  2018-12-24T02:00:00   5-12:00:00  iris-[001-190]

# Allocated nodes are jobs running on the admin partition
$> sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
admin          up 5-00:00:00     22  down$ iris-[169-190]
admin          up 5-00:00:00      1   mix$ iris-056
admin          up 5-00:00:00      2 alloc$ iris-[111-112]
admin          up 5-00:00:00    165  maint iris-[001-055,057-110,113-168]
interactive    up    4:00:00      8  maint iris-[001-008]
long           up 30-00:00:0      8  maint iris-[009-016]
batch*         up 5-00:00:00      1   mix$ iris-056
batch*         up 5-00:00:00      2 alloc$ iris-[111-112]
batch*         up 5-00:00:00    149  maint iris-[017-055,057-110,113-168]
gpu            up 5-00:00:00     18  down$ iris-[169-186]
bigmem         up 5-00:00:00      4  down$ iris-[187-190]
```

Now if I try to resume one node:

```
$> scontrol update nodename=iris-001 state=resume
slurm_update error: Invalid node state specified
```

and the error message within the `/var/log/slurm/slurmctld.log`: 

```
[2018-12-22T00:45:57.633] Invalid node state transition requested for node iris-001 from=MAINT to=RESUME
[2018-12-22T00:45:57.633] got (nil)
[2018-12-22T00:45:57.633] _slurm_rpc_update_node for iris-001: Invalid node state specified
```

Trying to change the state to 'idle' raises no error but has no effect:

```
$> scontrol update nodename=iris-001 state=idle
$> sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
admin          up 5-00:00:00     22  down$ iris-[169-190]
admin          up 5-00:00:00      1   mix$ iris-056
admin          up 5-00:00:00      2 alloc$ iris-[111-112]
admin          up 5-00:00:00    165  maint iris-[001-055,057-110,113-168]
interactive    up    4:00:00      8  maint iris-[001-008]
long           up 30-00:00:0      8  maint iris-[009-016]
batch*         up 5-00:00:00      1   mix$ iris-056
batch*         up 5-00:00:00      2 alloc$ iris-[111-112]
batch*         up 5-00:00:00    149  maint iris-[017-055,057-110,113-168]
gpu            up 5-00:00:00     18  down$ iris-[169-186]
bigmem         up 5-00:00:00      4  down$ iris-[187-190]
```

In `/var/log/slurm/slurmctld.log`: 

```
[2018-12-22T00:48:02.571] update_node: node iris-001 state set to IDLE
[2018-12-22T00:48:02.571] got (nil)
```
Comment 1 Marshall Garey 2018-12-21 17:12:14 MST
You can't change a node's state from "MAINT" to idle with `scontrol update`. The MAINT state comes from the reservation and will go away by itself once the reservation ends. Simply delete the maintenance reservation when your maintenance is complete.
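For reference, removing a reservation is done with `scontrol delete`. A sketch using the reservation name from the output above (run as a Slurm administrator on the cluster; not runnable standalone):

```shell
# Delete the maintenance reservation; nodes drop the MAINT state on their own.
scontrol delete ReservationName=clustermaintenance

# Verify: the reservation is gone and nodes return to their normal states.
scontrol show reservations
sinfo -T
```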
Comment 2 Marshall Garey 2019-01-02 09:47:35 MST
Closing as infogiven. Please respond if you have more questions.