Ticket 11538 - Cloud Node Reset
Summary: Cloud Node Reset
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Cloud
Version: 21.08.x
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Broderick Gardner
Reported: 2021-05-05 10:06 MDT by Brian Christiansen
Modified: 2021-07-15 22:00 MDT
Site: DS9 (PSLA)
Version Fixed: 21.08.0pre1
Target Release: 21.08


Description Brian Christiansen 2021-05-05 10:06:49 MDT
Make it possible to always reset CLOUD nodes with a RESET status, rather than first setting them to DOWN and then to RESUME.
Comment 4 Brian Christiansen 2021-07-15 22:00:13 MDT
In 21.08:

 -- Added the power_down_asap and power_down_force power-down states for scontrol.
e.g. scontrol update nodename=<> state=power_down_asap

power_down - queue up the node to be powered down when the node is free. Jobs can 
             continue to land on the node until it powers down.

power_down_asap - queue up the node to be powered down and put the node in a drain 
                  state. This makes it so no more jobs are scheduled on the node and 
                  the node will power down after the currently running jobs are done.

power_down_force - cancel jobs, requeue them if possible, and power down the node. This state 
                   can also be used to cancel a node that is powering up and reset it back 
                   to powered down.
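
As a rough illustration of the three states above, here is a minimal Python sketch that builds the corresponding scontrol command line without running it. The helper name build_power_down_command and the node name cloud001 are hypothetical; only the "scontrol update nodename=<> state=<>" form comes from this comment.

```python
import shlex

# Power-down states named in this comment; descriptions paraphrase the text above.
POWER_DOWN_STATES = {
    "power_down": "power down when the node is free; jobs may still land on it",
    "power_down_asap": "drain the node, then power down after running jobs finish",
    "power_down_force": "cancel jobs (requeue if possible) and power down the node",
}


def build_power_down_command(nodename, state):
    """Return the scontrol argv for a power-down request (nothing is executed)."""
    if state not in POWER_DOWN_STATES:
        raise ValueError(f"unknown power-down state: {state!r}")
    return ["scontrol", "update", f"nodename={nodename}", f"state={state}"]


if __name__ == "__main__":
    # "cloud001" is a hypothetical node name used only for illustration.
    for state, meaning in POWER_DOWN_STATES.items():
        argv = build_power_down_command("cloud001", state)
        print(f"{shlex.join(argv)}   # {meaning}")
```

To actually issue a request, the returned list could be passed to subprocess.run on a host where scontrol is available.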


 -- Define and separate node power state transitions. Previously a powering
    down node was in both states, POWERING_OFF and POWERED_OFF. These are now
    separated.
    e.g.
       IDLE+POWERED_OFF (IDLE~)
    -> IDLE+POWERING_UP (IDLE#)   - Manual power up or allocation
    -> IDLE
    -> IDLE+POWER_DOWN (IDLE!)    - Node waiting for power down
    -> IDLE+POWERING_DOWN (IDLE%) - Node powering down
    -> IDLE+POWERED_OFF (IDLE~)   - Powered off
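
    The sequence above can be sketched in Python as a lookup from the compact
    suffix symbols to the power flag names. This is only an illustration: the
    names POWER_SUFFIXES, LIFECYCLE, split_state and expand are hypothetical,
    and only the four power-related suffixes from this example are handled.

```python
# Suffix symbols and flag names exactly as listed in the example above.
POWER_SUFFIXES = {
    "~": "POWERED_OFF",
    "#": "POWERING_UP",
    "!": "POWER_DOWN",
    "%": "POWERING_DOWN",
}

# The example lifecycle, in order, in compact form.
LIFECYCLE = ["IDLE~", "IDLE#", "IDLE", "IDLE!", "IDLE%", "IDLE~"]


def split_state(compact):
    """Split e.g. 'IDLE#' into ('IDLE', 'POWERING_UP'); the flag is None if absent."""
    if compact and compact[-1] in POWER_SUFFIXES:
        return compact[:-1], POWER_SUFFIXES[compact[-1]]
    return compact, None


def expand(compact):
    """Render a compact state such as 'IDLE!' in the long 'IDLE+POWER_DOWN' form."""
    base, flag = split_state(compact)
    return base if flag is None else f"{base}+{flag}"


if __name__ == "__main__":
    print(" -> ".join(expand(state) for state in LIFECYCLE))
```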

 -- Some node state flag names have changed. These would be noticeable, for
    example, when using a state flag to filter nodes with sinfo.
    e.g.
    POWER_UP -> POWERING_UP
    POWER_DOWN -> POWERED_DOWN
    POWER_DOWN now represents a node pending power down
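
    For scripts that still carry the old flag names, a small translation
    helper might look like the following Python sketch. The function name
    migrate_state_filter and the comma-separated filter format are
    assumptions for illustration; the two renames are the ones listed above.

```python
# Flag renames listed above (old name -> 21.08 name).
RENAMED_FLAGS = {
    "POWER_UP": "POWERING_UP",
    "POWER_DOWN": "POWERED_DOWN",
}


def migrate_state_filter(old_filter):
    """Translate a comma-separated pre-21.08 state filter to the new flag names.

    Names that were not renamed are passed through unchanged.
    """
    names = [name.strip() for name in old_filter.split(",") if name.strip()]
    return ",".join(RENAMED_FLAGS.get(name.upper(), name) for name in names)


if __name__ == "__main__":
    print(migrate_state_filter("idle,power_down"))
```

    Apply this only to filters written before 21.08: POWER_DOWN is still a
    valid name afterwards, but with the new "pending power down" meaning, so
    translating a filter twice or translating a new-style filter would change
    its intent.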

Let us know if you have any questions.

Thanks,
Brian