Ticket 11538

Summary: Cloud Node Reset
Product: Slurm Reporter: Brian Christiansen <brian>
Component: Cloud Assignee: Broderick Gardner <broderick>
Status: RESOLVED FIXED QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: fdm, nick, schedmd-contacts
Version: 21.08.x   
Hardware: Linux   
OS: Linux   
Site: DS9 (PSLA) Slinky Site: ---
CLE Version: Version Fixed: 21.08.0pre1
Target Release: 21.08 DevPrio: ---

Description Brian Christiansen 2021-05-05 10:06:49 MDT
Make it possible to always reset CLOUD nodes with a single RESET state, rather than first setting them to DOWN and then to RESUME.
Comment 4 Brian Christiansen 2021-07-15 22:00:13 MDT
In 21.08:

 -- Added the power_down_asap and power_down_force power down states for scontrol.
e.g. scontrol update nodename=<> state=power_down_asap

power_down - queue up the node to be powered down when the node is free. Jobs can 
             continue to land on the node until it powers down.

power_down_asap - queue up the node to be powered down and put the node in a drain 
                  state. This makes it so no more jobs are scheduled on the node and 
                  the node will power down after the currently running jobs are done.

power_down_force - cancel jobs, requeue them if possible, and power down the node. This state can also be 
                   used to cancel a powering up node and reset it back to powered down.
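
The three states above differ only in the state value passed to scontrol. As a minimal sketch, assuming a hypothetical helper named power_down_command (not part of Slurm) and a placeholder node name "cloud1", a script could build the command line from the example like this:

```python
# Hypothetical helper, not part of Slurm: builds the scontrol argument list
# for one of the three power down states described above.
POWER_DOWN_STATES = {
    "power_down": "power down when the node is free; jobs may still land on it",
    "power_down_asap": "drain the node, power down after running jobs are done",
    "power_down_force": "cancel jobs, requeue if possible, and power down",
}


def power_down_command(nodename, state="power_down"):
    """Return the argv list for `scontrol update nodename=... state=...`."""
    if state not in POWER_DOWN_STATES:
        raise ValueError(f"unknown power down state: {state}")
    if not nodename:
        raise ValueError("nodename must not be empty")
    return ["scontrol", "update", f"nodename={nodename}", f"state={state}"]


if __name__ == "__main__":
    # "cloud1" is a placeholder node name used only for illustration.
    print(" ".join(power_down_command("cloud1", "power_down_asap")))
```

The helper only assembles the arguments. Running them (for example with subprocess.run) requires a working Slurm installation and the appropriate privileges.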


 -- Define and separate node power state transitions. Previously a powering
    down node was in both states, POWERING_OFF and POWERED_OFF. These are now
    separated.
    e.g.
       IDLE+POWERED_OFF (IDLE~)
    -> IDLE+POWERING_UP (IDLE#)   - Manual power up or allocation
    -> IDLE
    -> IDLE+POWER_DOWN (IDLE!)    - Node waiting for power down
    -> IDLE+POWERING_DOWN (IDLE%) - Node powering down
    -> IDLE+POWERED_OFF (IDLE~)   - Powered off
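
The cycle above can be written down as a small lookup table. The following Python sketch is only an illustrative model: the flag names and suffix characters are copied from the list above, while the names POWER_CYCLE, compact_state and next_flag are hypothetical and not part of Slurm.

```python
# Illustrative model only. Flag names and suffix characters come from the
# transition list above; the helper names are assumptions.
POWER_FLAG_SUFFIX = {
    "POWERED_OFF": "~",
    "POWERING_UP": "#",
    "POWER_DOWN": "!",
    "POWERING_DOWN": "%",
}

# One full cycle for a node; None means no power flag is set.
POWER_CYCLE = ["POWERED_OFF", "POWERING_UP", None, "POWER_DOWN", "POWERING_DOWN"]


def compact_state(base, flag=None):
    """Render a base state plus optional power flag, e.g. IDLE + POWER_DOWN -> IDLE!."""
    return base + POWER_FLAG_SUFFIX[flag] if flag else base


def next_flag(flag):
    """Return the power flag that follows `flag` in the normal cycle."""
    i = POWER_CYCLE.index(flag)
    return POWER_CYCLE[(i + 1) % len(POWER_CYCLE)]


if __name__ == "__main__":
    for flag in POWER_CYCLE:
        print(compact_state("IDLE", flag))
```

This models only the normal cycle. It does not cover shortcuts such as power_down_force cancelling a powering up node.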

 -- Some node state flag names have changed. This is noticeable, for example,
    when using a state flag to filter nodes with sinfo.
    e.g.
    POWER_UP -> POWERING_UP
    POWER_DOWN -> POWERED_DOWN
    POWER_DOWN now represents a node pending power down
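
Scripts that filter on the old flag names may need a one-time translation. A minimal Python sketch follows, assuming the state filter is a comma-separated list; the helper name migrate_state_filter is hypothetical, and only the two renames listed above are applied.

```python
# Renames copied from the list above; the helper itself is hypothetical.
RENAMED_FLAGS = {
    "POWER_UP": "POWERING_UP",
    "POWER_DOWN": "POWERED_DOWN",
}


def migrate_state_filter(states):
    """Translate a comma-separated pre-21.08 state list to the 21.08 names."""
    migrated = []
    for name in states.split(","):
        name = name.strip()
        # Each name is looked up exactly once, so a renamed flag is never
        # translated a second time.
        migrated.append(RENAMED_FLAGS.get(name.upper(), name))
    return ",".join(migrated)


if __name__ == "__main__":
    print(migrate_state_filter("idle,power_down"))
```

Apply this only to filters written for the old names. In 21.08, POWER_DOWN is itself a valid flag with a new meaning (a node pending power down), so a filter already written for 21.08 should not be passed through it.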

Let us know if you have any questions.

Thanks,
Brian