| Summary: | Support "scontrol update nodename=blah state=POWERED_UP" when node is in POWERING_UP state | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Chrysovalantis Paschoulas <c.paschoulas> |
| Component: | Other | Assignee: | Skyler Malinowski <skyler> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | marshall |
| Version: | 21.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Jülich | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description

**Chrysovalantis Paschoulas** 2022-10-20 10:17:48 MDT
**Chrysovalantis Paschoulas** (in reply to comment #0):

> 1. Do not start SlurmctldProlog before all ResumeProgram instances for a job have exited.

In this case what we really want is to start SlurmctldProlog after all nodes have left their POWERING_UP state, which sometimes does not happen until we reach ResumeTimeout. So this solution is not good, because it makes the whole workflow very slow.

**Skyler Malinowski** (comment #3):

What about using:

```
scontrol update nodename=blah state=POWER_DOWN_FORCE reason="ResumeProgram failed: blah"
```

This will node-fail the given nodes, clean up and requeue the job if allowed, and power down the nodes even if they are powering up. This should also preserve the node reason as whatever you set, unless SuspendTimeout is hit.

Does that work for you?
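Skyler's suggestion could be wrapped in a small helper in a site ResumeProgram's failure path. This is a minimal sketch; the function name, arguments, and reason text are invented for illustration, and only the `scontrol update ... state=POWER_DOWN_FORCE` command itself comes from the ticket.

```shell
#!/bin/sh
# Hypothetical helper for a site ResumeProgram: node-fail and force a
# power-down of nodes whose bring-up failed, with a descriptive reason.
fail_resume() {
    nodelist="$1"   # e.g. "node[01-04]"
    why="$2"        # short description of the failed step
    # Requeues running jobs if allowed, powers the nodes down even while
    # they are POWERING_UP, and keeps this reason unless SuspendTimeout hits.
    scontrol update nodename="$nodelist" state=POWER_DOWN_FORCE \
        reason="ResumeProgram failed: $why"
}
```

A ResumeProgram would call this from the error branch of each provisioning step, e.g. `fail_resume "$SLURM_NODELIST" "image fetch"`.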
**Chrysovalantis Paschoulas** (comment #4, in reply to comment #3):

> What about using:
>
> ```
> scontrol update nodename=blah state=POWER_DOWN_FORCE reason="ResumeProgram failed: blah"
> ```
>
> This will node-fail the given nodes, clean up and requeue the job if allowed, and power down the nodes even if they are powering up. This should also preserve the node reason as whatever you set, unless SuspendTimeout is hit.
>
> Does that work for you?

According to the man page:

```
POWER_DOWN_FORCE
    Will cancel all jobs on the node, power it down, and reset its state to "IDLE".
```

this will cancel the jobs. If we have JobRequeue=1 in slurm.conf, will the jobs be requeued instead of being cancelled? But in that case the node will still be powered down, i.e. the SuspendProgram will be called for that node, right? We really don't want that to happen: we opened another ticket where we explained that we don't want to suspend/power-down problematic/drained nodes, because we want to debug them. We have diskless nodes, so we lose their whole state (except the syslogs, which we ship to other service nodes).

Also, in another ticket SchedMD suggested calling:

```
scontrol update nodename=.. state=RESUME
```

at the end of the SuspendProgram; otherwise the nodes were staying in the POWERING_DOWN state until they reached SuspendTimeout, and then they went into POWERED_DOWN.

Instead of the workaround with the RESUME state, should I change it to POWER_DOWN_ASAP or POWER_DOWN_FORCE? Would that help?

FYI, currently we have two stupid workarounds to make our workflows function in the suspend and resume programs. In ResumeProgram we start slurmd at the very end of the script, but if a previous step fails we drain the node, and we have an extra hook that restarts "slurmd -b" so that the node registers with slurmctld; otherwise the node would stay in POWERING_UP until ResumeTimeout is reached, its state would change to DOWN, and its reason would be overwritten with "ResumeTimeout reached".
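The ResumeProgram workaround described here might look roughly like the sketch below. The `provision_image`, `start_slurmd`, and `restart_slurmd_b` helpers are invented placeholders for site-specific steps; only the drain-with-reason `scontrol` call and the "slurmd -b" re-registration idea come from the ticket.

```shell
#!/bin/sh
# Sketch of the ResumeProgram workaround: slurmd is started last, and a
# failed earlier step drains the node and forces a slurmd registration.
bring_up() {
    node="$1"
    if provision_image "$node"; then
        start_slurmd "$node"            # started last, only on success
    else
        # Drain with a meaningful reason first, then force slurmd to
        # register with slurmctld ("slurmd -b") so the node leaves
        # POWERING_UP before ResumeTimeout overwrites the reason with
        # "ResumeTimeout reached".
        scontrol update nodename="$node" state=DRAIN \
            reason="ResumeProgram failed: provisioning"
        restart_slurmd_b "$node"        # hook that reruns "slurmd -b"
    fi
}
```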
In SuspendProgram we skip powering down nodes that are drained, and for this case we have to save the node's reason, resume the node, and then drain it again with the old reason. If we don't do that, the node stays in the POWERING_DOWN state until SuspendTimeout is reached, and then the reason is overwritten with "SuspendTimeout reached".

In both cases, what we really need is a way to manually change (e.g. with scontrol) node states from POWERING_UP/DOWN to POWERED_UP/DOWN. I know there is no POWERED_UP state, but you get my point, right?

**Skyler Malinowski** (in reply to comment #4):

> According to the man page:
>
> ```
> POWER_DOWN_FORCE
>     Will cancel all jobs on the node, power it down, and reset its state to "IDLE".
> ```
>
> this will cancel the jobs. If we have JobRequeue=1 in slurm.conf, will the jobs be requeued instead of being cancelled?

That description seems incomplete; let me clarify. POWER_DOWN_FORCE marks the node as IDLE+POWER_DOWN and requeues the running jobs on the node (if able and allowed) with reason NODE_FAIL. Moreover, interactive jobs cannot be requeued, and batch jobs can request to be requeued or not. JobRequeue=1 (the default) will requeue a job only if the job is requeueable; otherwise it will never be requeued.

> But in that case the node will still be powered down, i.e. the SuspendProgram will be called for that node, right? We really don't want that to happen: we don't want to suspend/power-down problematic/drained nodes, because we want to debug them. We have diskless nodes, so we lose their whole state (except the syslogs, which we ship to other service nodes).

Nodes with POWER_DOWN set will have SuspendProgram called on them, yes.

> Also, in another ticket SchedMD suggested calling:
>
> ```
> scontrol update nodename=.. state=RESUME
> ```
>
> at the end of the SuspendProgram; otherwise the nodes were staying in the POWERING_DOWN state until they reached SuspendTimeout, and then they went into POWERED_DOWN.
> Instead of the workaround with the RESUME state, should I change it to POWER_DOWN_ASAP or POWER_DOWN_FORCE? Would that help?

POWER_DOWN_ASAP sets the node to DRAIN+POWER_DOWN. POWER_DOWN_FORCE sets the node to IDLE+POWER_DOWN and requeues the running jobs on the node. Given that both set POWER_DOWN, it does not sound like either will help without a workaround to prevent Slurm from destroying the node and changing the reason.

**Skyler Malinowski** (in reply to comment #5):

> In both cases, what we really need is a way to manually change (e.g. with scontrol) node states from POWERING_UP/DOWN to POWERED_UP/DOWN. I know there is no POWERED_UP state, but you get my point, right?

I understand what you are trying to do, and I will look into what we are willing to do for this outside of an enhancement request. I can certainly see the use case in power_save (cloud) environments where debugging bad nodes is wanted but Slurm wants to power down the downed and failed nodes.

Another workaround is to add the bad nodes to SuspendExcNodes in slurm.conf and then reconfigure.

I can see the frustration this deficiency is causing. However, if we were to make changes here, we would rather enhance Slurm to support this use case properly. Simply allowing the admin to assert that a node has powered up when it has not is not a trivial change; hence an RFE will be needed to best fix and support this use case.

All the best,
Skyler
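For reference, the SuspendExcNodes workaround Skyler mentions would look roughly like this slurm.conf fragment (the node names are placeholders); after editing the file, `scontrol reconfigure` applies the change:

```
# slurm.conf fragment: exclude the nodes being debugged from power saving,
# so SuspendProgram is never invoked for them
SuspendExcNodes=node[01-02]
```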