Hi,

My question is probably naive, but I'm unable to resume the nodes of our cluster now that the maintenance is over. The likely reason is that there is still a running reservation (which I don't wish to delete on purpose for the moment). Still, the error message, completed with the 'got (nil)' line, is surprising. Can you advise on the best way to proceed, and whether this is normal behavior or a bug inherent to the migration from 17.11.11 to 17.11.12?

```
$> scontrol show config | grep -i version
SLURM_VERSION           = 17.11.12

$> scontrol show reservations
ReservationName=clustermaintenance StartTime=2018-12-18T14:00:00 EndTime=2018-12-24T02:00:00 Duration=5-12:00:00
   Nodes=iris-[001-190] NodeCnt=190 CoreCnt=5656 Features=(null) PartitionName=admin
   Flags=MAINT,IGNORE_JOBS,SPEC_NODES,PART_NODES TRES=cpu=5656
   Users=(null) Accounts=ulhpc Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

$> sinfo -T
RESV_NAME           STATE            START_TIME             END_TIME    DURATION        NODELIST
clustermaintenance  ACTIVE  2018-12-18T14:00:00  2018-12-24T02:00:00  5-12:00:00  iris-[001-190]

# Allocated nodes are jobs running on the admin partition
$> sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
admin          up 5-00:00:00     22  down$  iris-[169-190]
admin          up 5-00:00:00      1   mix$  iris-056
admin          up 5-00:00:00      2 alloc$  iris-[111-112]
admin          up 5-00:00:00    165  maint  iris-[001-055,057-110,113-168]
interactive    up    4:00:00      8  maint  iris-[001-008]
long           up 30-00:00:0      8  maint  iris-[009-016]
batch*         up 5-00:00:00      1   mix$  iris-056
batch*         up 5-00:00:00      2 alloc$  iris-[111-112]
batch*         up 5-00:00:00    149  maint  iris-[017-055,057-110,113-168]
gpu            up 5-00:00:00     18  down$  iris-[169-186]
bigmem         up 5-00:00:00      4  down$  iris-[187-190]
```

Now if I try to resume one node:

```
$> scontrol update nodename=iris-001 state=resume
slurm_update error: Invalid node state specified
```

and the error message in `/var/log/slurm/slurmctld.log`:

```
[2018-12-22T00:45:57.633] Invalid node state transition requested for node iris-001 from=MAINT to=RESUME
[2018-12-22T00:45:57.633] got (nil)
[2018-12-22T00:45:57.633] _slurm_rpc_update_node for iris-001: Invalid node state specified
```

Trying to change the state to 'idle' raises no error but has no effect:

```
$> scontrol update nodename=iris-001 state=idle
$> sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
admin          up 5-00:00:00     22  down$  iris-[169-190]
admin          up 5-00:00:00      1   mix$  iris-056
admin          up 5-00:00:00      2 alloc$  iris-[111-112]
admin          up 5-00:00:00    165  maint  iris-[001-055,057-110,113-168]
interactive    up    4:00:00      8  maint  iris-[001-008]
long           up 30-00:00:0      8  maint  iris-[009-016]
batch*         up 5-00:00:00      1   mix$  iris-056
batch*         up 5-00:00:00      2 alloc$  iris-[111-112]
batch*         up 5-00:00:00    149  maint  iris-[017-055,057-110,113-168]
gpu            up 5-00:00:00     18  down$  iris-[169-186]
bigmem         up 5-00:00:00      4  down$  iris-[187-190]
```

In `/var/log/slurm/slurmctld.log`:

```
[2018-12-22T00:48:02.571] update_node: node iris-001 state set to IDLE
[2018-12-22T00:48:02.571] got (nil)
```
You can't change a node's state from MAINT to idle. The MAINT state comes from the reservation and goes away by itself once the reservation ends. Simply delete the maintenance reservation when your maintenance is complete.
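For reference, a minimal sketch of that cleanup step, using the reservation name from the `scontrol show reservations` output above. The `echo` wrapper makes this a dry run so the command can be reviewed first; drop the `echo` to actually delete the reservation:

```shell
#!/bin/sh
# Sketch: end the maintenance by deleting the reservation,
# which clears the MAINT state on its nodes automatically.
# Reservation name taken from the output above.
RESV="clustermaintenance"

# Dry run: print the command instead of executing it.
# Remove the leading `echo` to delete the reservation for real.
echo scontrol delete ReservationName="${RESV}"
```

Once the reservation is gone, `sinfo` should no longer report the nodes as `maint`.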
Closing as infogiven. Please respond if you have more questions.