Ticket 6290 - Invalid node state transition requested from=MAINT to=RESUME got (nil)
Summary: Invalid node state transition requested from=MAINT to=RESUME got (nil)
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.12
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-12-21 16:53 MST by Sebastien Varrette
Modified: 2019-01-02 09:47 MST

See Also:
Site: University of Luxembourg


Description Sebastien Varrette 2018-12-21 16:53:59 MST
Hi,

My question is probably naive, but I'm unable to resume nodes of our cluster now that the maintenance is over. The reason is probably that there is still a running reservation (which I don't wish to delete at the moment).

Still, the 'got (nil)' part of the error message is surprising.

Can you advise on the best way to proceed? Is this normal behavior, or a bug introduced by the migration from 17.11.11 to 17.11.12?

```
$> scontrol show config | grep -i version
SLURM_VERSION           = 17.11.12

$> scontrol show reservations
ReservationName=clustermaintenance StartTime=2018-12-18T14:00:00 EndTime=2018-12-24T02:00:00 Duration=5-12:00:00
   Nodes=iris-[001-190] NodeCnt=190 CoreCnt=5656 Features=(null) PartitionName=admin Flags=MAINT,IGNORE_JOBS,SPEC_NODES,PART_NODES
   TRES=cpu=5656
   Users=(null) Accounts=ulhpc Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

$> sinfo -T
RESV_NAME              STATE           START_TIME             END_TIME     DURATION  NODELIST
clustermaintenance    ACTIVE  2018-12-18T14:00:00  2018-12-24T02:00:00   5-12:00:00  iris-[001-190]

# Allocated nodes are jobs running on the admin partition
$> sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
admin          up 5-00:00:00     22  down$ iris-[169-190]
admin          up 5-00:00:00      1   mix$ iris-056
admin          up 5-00:00:00      2 alloc$ iris-[111-112]
admin          up 5-00:00:00    165  maint iris-[001-055,057-110,113-168]
interactive    up    4:00:00      8  maint iris-[001-008]
long           up 30-00:00:0      8  maint iris-[009-016]
batch*         up 5-00:00:00      1   mix$ iris-056
batch*         up 5-00:00:00      2 alloc$ iris-[111-112]
batch*         up 5-00:00:00    149  maint iris-[017-055,057-110,113-168]
gpu            up 5-00:00:00     18  down$ iris-[169-186]
bigmem         up 5-00:00:00      4  down$ iris-[187-190]
```

Now if I try to resume one node:

```
$> scontrol update nodename=iris-001 state=resume
slurm_update error: Invalid node state specified
```

and the error message in `/var/log/slurm/slurmctld.log`:

```
[2018-12-22T00:45:57.633] Invalid node state transition requested for node iris-001 from=MAINT to=RESUME
[2018-12-22T00:45:57.633] got (nil)
[2018-12-22T00:45:57.633] _slurm_rpc_update_node for iris-001: Invalid node state specified
```

Trying to change the state to 'idle' raises no error but has no effect:

```
$> scontrol update nodename=iris-001 state=idle
$> sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
admin          up 5-00:00:00     22  down$ iris-[169-190]
admin          up 5-00:00:00      1   mix$ iris-056
admin          up 5-00:00:00      2 alloc$ iris-[111-112]
admin          up 5-00:00:00    165  maint iris-[001-055,057-110,113-168]
interactive    up    4:00:00      8  maint iris-[001-008]
long           up 30-00:00:0      8  maint iris-[009-016]
batch*         up 5-00:00:00      1   mix$ iris-056
batch*         up 5-00:00:00      2 alloc$ iris-[111-112]
batch*         up 5-00:00:00    149  maint iris-[017-055,057-110,113-168]
gpu            up 5-00:00:00     18  down$ iris-[169-186]
bigmem         up 5-00:00:00      4  down$ iris-[187-190]
```

In `/var/log/slurm/slurmctld.log`: 

```
[2018-12-22T00:48:02.571] update_node: node iris-001 state set to IDLE
[2018-12-22T00:48:02.571] got (nil)
```
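For reference, the MAINT flag set by the reservation can be confirmed on an individual node with `scontrol show node`; the node name below is the one from the report above:

```
# Inspect the node's full state; the State= field will include the
# MAINT flag while the maintenance reservation is active.
scontrol show node iris-001 | grep -i state
```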
Comment 1 Marshall Garey 2018-12-21 17:12:14 MST
You can't change a node's state from MAINT to idle directly. The MAINT state comes from the reservation, and it will go away by itself once the reservation ends. Simply delete the maintenance reservation when your maintenance is complete.
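As a concrete sketch (using the reservation name from the original report), deleting the reservation clears the MAINT flag on its nodes:

```
# Delete the maintenance reservation; the MAINT flag is cleared on
# its nodes automatically once the reservation is gone.
scontrol delete ReservationName=clustermaintenance

# Verify that the reservation is removed and nodes left the maint state.
scontrol show reservations
sinfo
```

These commands must run against the live slurmctld with operator/admin privileges; the reservation name is site-specific.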
Comment 2 Marshall Garey 2019-01-02 09:47:35 MST
Closing as infogiven. Please respond if you have more questions.