| Summary: | Support "scontrol update nodename=blah state=POWERED_UP" when node is in POWERING_UP state | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Chrysovalantis Paschoulas <c.paschoulas> |
| Component: | Other | Assignee: | Skyler Malinowski <skyler> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | marshall |
| Version: | 21.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Jülich | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description

**Chrysovalantis Paschoulas** 2022-10-20 10:17:48 MDT
**Chrysovalantis Paschoulas** (in reply to comment #0):

> 1. Do not start SlurmctldProlog before all ResumeProgram instances for a job have exited.

In this case what we really want is to start SlurmctldProlog after all nodes have left their POWERING_UP state, which sometimes does not happen until we reach ResumeTimeout. So this solution is not good, because it makes the whole workflow very slow.

**Skyler Malinowski** (comment #3):

What about using:

```
scontrol update nodename=blah state=POWER_DOWN_FORCE reason="ResumeProgram failed: blah"
```

This will node-fail the given nodes, clean up and requeue the job if allowed, and power down the nodes even if they are powering up. This should also preserve the node reason as whatever you set, unless SuspendTimeout is hit.

Does that work for you?
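Skyler's suggestion could be wrapped in a small helper in a site ResumeProgram's failure path. This is a minimal sketch; the function name, arguments, and reason text are invented for illustration, and only the `scontrol update ... state=POWER_DOWN_FORCE` command itself comes from the ticket.

```shell
#!/bin/sh
# Hypothetical helper for a site ResumeProgram: node-fail and force a
# power-down of nodes whose bring-up failed, with a descriptive reason.
fail_resume() {
    nodelist="$1"   # e.g. "node[01-04]"
    why="$2"        # short description of the failed step
    # Requeues running jobs if allowed, powers the nodes down even while
    # they are POWERING_UP, and keeps this reason unless SuspendTimeout hits.
    scontrol update nodename="$nodelist" state=POWER_DOWN_FORCE \
        reason="ResumeProgram failed: $why"
}
```

A ResumeProgram would call this from the error branch of each provisioning step, e.g. `fail_resume "$SLURM_NODELIST" "image fetch"`.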
**Chrysovalantis Paschoulas** (comment #4, in reply to comment #3):

> What about using:
>
> ```
> scontrol update nodename=blah state=POWER_DOWN_FORCE reason="ResumeProgram failed: blah"
> ```
>
> This will node-fail the given nodes, clean up and requeue the job if allowed, and power down the nodes even if they are powering up. This should also preserve the node reason as whatever you set, unless SuspendTimeout is hit.
>
> Does that work for you?

According to the man page:

```
POWER_DOWN_FORCE
    Will cancel all jobs on the node, power it down, and reset its state to "IDLE".
```

this will cancel the jobs. If we have JobRequeue=1 in slurm.conf, will the jobs be requeued instead of being cancelled? But in that case the node will still be powered down, i.e. the SuspendProgram will be called for that node, right? We really don't want that to happen: we opened another ticket where we explained that we don't want to suspend/power-down problematic/drained nodes, because we want to debug them. We have diskless nodes, so we lose their whole state (except the syslogs, which we ship to other service nodes).

Also, in another ticket SchedMD suggested calling:

```
scontrol update nodename=.. state=RESUME
```

at the end of the SuspendProgram; otherwise the nodes were staying in the POWERING_DOWN state until they reached SuspendTimeout, and then they went into POWERED_DOWN.

Instead of the workaround with the RESUME state, should I change it to POWER_DOWN_ASAP or POWER_DOWN_FORCE? Would that help?

FYI, currently we have two stupid workarounds to make our workflows function in the suspend and resume programs. In ResumeProgram we start slurmd at the very end of the script, but if a previous step fails we drain the node, and we have an extra hook that restarts "slurmd -b" so that the node registers with slurmctld; otherwise the node would stay in POWERING_UP until ResumeTimeout is reached, its state would change to DOWN, and its reason would be overwritten with "ResumeTimeout reached".
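The ResumeProgram workaround described here might look roughly like the sketch below. The `provision_image`, `start_slurmd`, and `restart_slurmd_b` helpers are invented placeholders for site-specific steps; only the drain-with-reason `scontrol` call and the "slurmd -b" re-registration idea come from the ticket.

```shell
#!/bin/sh
# Sketch of the ResumeProgram workaround: slurmd is started last, and a
# failed earlier step drains the node and forces a slurmd registration.
bring_up() {
    node="$1"
    if provision_image "$node"; then
        start_slurmd "$node"            # started last, only on success
    else
        # Drain with a meaningful reason first, then force slurmd to
        # register with slurmctld ("slurmd -b") so the node leaves
        # POWERING_UP before ResumeTimeout overwrites the reason with
        # "ResumeTimeout reached".
        scontrol update nodename="$node" state=DRAIN \
            reason="ResumeProgram failed: provisioning"
        restart_slurmd_b "$node"        # hook that reruns "slurmd -b"
    fi
}
```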
In SuspendProgram we skip powering down nodes that are drained, and for this case we have to save the node's reason, resume the node, and then drain it again with the old reason. If we don't do that, the node stays in the POWERING_DOWN state until SuspendTimeout is reached, and then the reason is overwritten with "SuspendTimeout reached".

In both cases, what we really need is a way to manually change (e.g. with scontrol) node states from POWERING_UP/DOWN to POWERED_UP/DOWN. I know there is no POWERED_UP state, but you get my point, right?

**Skyler Malinowski** (in reply to comment #4):

> According to the man page:
>
> ```
> POWER_DOWN_FORCE
>     Will cancel all jobs on the node, power it down, and reset its state to "IDLE".
> ```
>
> this will cancel the jobs. If we have JobRequeue=1 in slurm.conf, will the jobs be requeued instead of being cancelled?

That description seems incomplete; let me clarify. POWER_DOWN_FORCE marks the node as IDLE+POWER_DOWN and requeues the running jobs on the node (if able and allowed) with reason NODE_FAIL. Moreover, interactive jobs cannot be requeued, and batch jobs can request to be requeued or not. JobRequeue=1 (the default) will requeue a job only if the job is requeueable; otherwise it will never be requeued.

> But in that case the node will still be powered down, i.e. the SuspendProgram will be called for that node, right? We really don't want that to happen: we don't want to suspend/power-down problematic/drained nodes, because we want to debug them. We have diskless nodes, so we lose their whole state (except the syslogs, which we ship to other service nodes).

Nodes with POWER_DOWN set will have SuspendProgram called on them, yes.

> Also, in another ticket SchedMD suggested calling:
>
> ```
> scontrol update nodename=.. state=RESUME
> ```
>
> at the end of the SuspendProgram; otherwise the nodes were staying in the POWERING_DOWN state until they reached SuspendTimeout, and then they went into POWERED_DOWN.
> Instead of the workaround with the RESUME state, should I change it to POWER_DOWN_ASAP or POWER_DOWN_FORCE? Would that help?

POWER_DOWN_ASAP sets the node to DRAIN+POWER_DOWN. POWER_DOWN_FORCE sets the node to IDLE+POWER_DOWN and requeues the running jobs on the node. Given that both set POWER_DOWN, it does not sound like either will help without a workaround to prevent Slurm from destroying the node and changing the reason.

**Skyler Malinowski** (in reply to comment #5):

> In both cases, what we really need is a way to manually change (e.g. with scontrol) node states from POWERING_UP/DOWN to POWERED_UP/DOWN. I know there is no POWERED_UP state, but you get my point, right?

I understand what you are trying to do, and I will look into what we are willing to do for this outside of an enhancement request. I can certainly see the use case in power_save (cloud) environments where debugging bad nodes is wanted but Slurm wants to power down the downed and failed nodes.

Another workaround is to add the bad nodes to SuspendExcNodes in slurm.conf and then reconfigure.

I can see the frustration this deficiency is causing. However, if we were to make changes here, we would rather enhance Slurm to support this use case properly. Simply allowing the admin to assert that a node has powered up when it has not is not a trivial change; hence an RFE will be needed to best fix and support this use case.

All the best,
Skyler
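For reference, the SuspendExcNodes workaround Skyler mentions would look roughly like this slurm.conf fragment (the node names are placeholders); after editing the file, `scontrol reconfigure` applies the change:

```
# slurm.conf fragment: exclude the nodes being debugged from power saving,
# so SuspendProgram is never invoked for them
SuspendExcNodes=node[01-02]
```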