Hello, I'm looking at the scontrol man page and have some questions about this paragraph in the Node state section:

"""
State=<state>
Identify the state to be assigned to the node. Possible node states are "NoResp", "ALLOC", "ALLOCATED", "COMPLETING", "DOWN", "DRAIN", "ERROR", "FAIL", "FAILING", "FUTURE", "IDLE", "MAINT", "MIXED", "PERFCTRS/NPC", "RESERVED", "POWER_DOWN", "POWER_UP", "RESUME" or "UNDRAIN". Not all of those states can be set using the scontrol command; only the following can: "NoResp", "DRAIN", "FAIL", "FUTURE", "RESUME", "POWER_DOWN", "POWER_UP" and "UNDRAIN". If a node is in a "MIXED" state it usually means the node is in multiple states. For instance, if only part of the node is "ALLOCATED" and the rest of the node is "IDLE" the state will be "MIXED". If you want to remove a node from service, you typically want to set its state to "DRAIN". "FAILING" is similar to "DRAIN" except that some applications will seek to relinquish those nodes before the job completes. "PERFCTRS/NPC" indicates that Network Performance Counters associated with this node are in use, rendering this node not usable for any other jobs. "RESERVED" indicates the node is in an advanced reservation and not generally available. "RESUME" is not an actual node state, but will change a node state from "DRAINED", "DRAINING", "DOWN", "MAINT" or "REBOOT" to either "IDLE" or "ALLOCATED" state as appropriate. "UNDRAIN" clears the node from being drained (like "RESUME"), but will not change the node's base state (e.g. "DOWN"). Setting a node "DOWN" will cause all running and suspended jobs on that node to be terminated. "POWER_DOWN" and "POWER_UP" will use the configured SuspendProg and ResumeProg programs to explicitly place a node in or out of a power saving mode. If a node is already in the process of being powered up or down, the command will have no effect until the configured ResumeTimeout or SuspendTimeout is reached.
The "NoResp" state will only set the "NoResp" flag for a node without changing its underlying state. While all of the above states are valid, some of them are not valid new node states given their prior state. If the node state code printed is followed by "~", this indicates the node is presently in a power saving mode (typically running at reduced frequency). If the node state code is followed by "#", this indicates the node is presently being powered up or configured. If the node state code is followed by "$", this indicates the node is currently in a reservation with a flag value of "maintenance". If the node state code is followed by "@", this indicates the node is currently scheduled to be rebooted. Generally only "DRAIN", "FAIL" and "RESUME" should be used. NOTE: The scontrol command should not be used to change node state on Cray systems. Use Cray tools such as xtprocadmin instead.
"""

<rant>
First, I heartily disagree with the closing statement that scontrol should be avoided. In fact we disallow anyone from using xtprocadmin -k s to set node states, and have done everything we can to disable the connections between xtprocadmin and slurm, since slurm is a layer on top of the Cray and is the resource manager of the system -- not the other way around. With this arrangement we get independent views of the node states (the HSS system controls xtprocadmin, whereas the poorly defined "AdminDown" state simply obscures that relationship). These recommendations are based on the fact that the NHC and wlm_trans software that Cray provides by default perform large numbers of per-node operations, which can easily overwhelm slurmctld on very large scale systems.
</rant>

My question is about the "NoResp" state. It says it will not change the underlying state of the node. If I set "NoResp" on a node, will slurmctld continue to use the node for overlay communications?
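For reference, the drain/resume cycle the excerpt describes looks roughly like this in practice (the node name here is a made-up example):

```
# Take a node out of service; a Reason is required when draining.
scontrol update NodeName=nid00123 State=DRAIN Reason="bad DIMM"

# Return it to service once repaired (clears the drain, ending up
# in IDLE or ALLOCATED as appropriate).
scontrol update NodeName=nid00123 State=RESUME

# Or clear only the drain flag without touching the base state.
scontrol update NodeName=nid00123 State=UNDRAIN
```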
Related to some other case we have open (or Cray opened for us) about using xtconsumer on the SMW to inform slurm of ec_node_unavailable states: I'm wondering whether something that simply sets the NoResp state (perhaps bundling up messages for 2s or so to prevent the aforementioned per-node updates) would have the desirable traits I'm looking for:
1) not interfering with ec_node_unavailable caused by a knl mode change (or scontrol reboot_nodes)
2) not blowing away existing reasons
3) discontinuing attempts to use the node for intermediate communications and slurmctld pinging.

If so I could code this up in an hour and get massive improvements in system reliability.

Thanks,
Doug
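The bundling idea described above can be sketched as follows. This is only an illustration of the coalescing logic (collect per-node events, flush them at most once per interval as a single batch); the class and callback names are hypothetical and are not Slurm APIs.

```python
import time


class NoRespBatcher:
    """Coalesce per-node 'unavailable' events into periodic batched updates.

    flush_cb receives the whole batch at most once per `interval` seconds,
    instead of being invoked once per node -- avoiding the per-node update
    storm described above. `clock` is injectable to make the logic testable.
    """

    def __init__(self, flush_cb, interval=2.0, clock=time.monotonic):
        self.flush_cb = flush_cb
        self.interval = interval
        self.clock = clock
        self.pending = set()
        self.last_flush = clock()

    def add(self, node):
        """Record one node event and flush if the interval has elapsed."""
        self.pending.add(node)
        self.maybe_flush()

    def maybe_flush(self, force=False):
        """Emit all pending nodes as one batch when due (or forced)."""
        now = self.clock()
        if self.pending and (force or now - self.last_flush >= self.interval):
            batch = sorted(self.pending)
            self.pending.clear()
            self.last_flush = now
            self.flush_cb(batch)
```

In the scenario above, flush_cb would issue a single state update for the whole node list (e.g. one scontrol invocation with a hostlist) rather than one call per node.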
(In reply to Doug Jacobsen from comment #0)

> [quoted man page excerpt and rant trimmed; see comment #0]

That specific comment is rather dated, and would have been appropriate for Cray/ALPS.

> My question is about the "NoResp" state. It says it will not change the
> underlying state of the node. If I set "NoResp" on a node, will it continue
> to use the node for overlay communications?

slurmctld should not initiate any messages destined for those nodes, and as such they would not be involved in any communication trees.

Based on some other comments, I think you may misunderstand how the hierarchical communication works. The set of nodes involved in each broadcast changes dynamically - it is not a fixed pattern. This ensures that traffic related to a given job doesn't disrupt nodes that don't belong to that job. For non-job-related communication, the same dynamic routing is done - only the nodes that need a given message should be involved in that hierarchy.

While I'm not familiar with how NoResp is implemented under the covers, I would expect that NHC / Node Registration / Node Ping all avoid nodes with that setting. If not, we may need to fix that. The one caveat is that if you have jobs running that involve the node you're marking as NoResp, they might still attempt to relay through it - but the job is presumably dead and attempting to clean up at that point.

> Related to some other case we have open (or Cray opened for us) about using
> xtconsumer on the SMW to inform slurm of ec_node_unavailable states, I'm
> wondering if we did something to simply add the NoResp state (perhaps
> bundling up messages for 2s or so to prevent the aforementioned per-node
> updates), if that would have the desirable traits I'm looking for of:
> 1) not interfering with ec_node_unavailable caused by knl mode change (or
> scontrol reboot_nodes)
> 2) not blowing away existing reasons
> 3) discontinuing attempts to use the node for intermediate communications
> and slurmctld pinging.
> If so I could code this up in an hour and get massive improvements in
> system reliability.

Bug 3769.
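As a side note on observing the flag from the command line: in sinfo and scontrol output, a trailing "*" on a node's state code marks a node that slurmctld considers not responding, so a quick way to audit this (node name is a made-up example) is:

```
# List the reasons for nodes that are down, drained, or failing.
sinfo -R

# Show a single node's state; a trailing "*" (e.g. "State=IDLE*")
# indicates the not-responding flag is set.
scontrol show node nid00123 | grep -i state
```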
Tagging resolved/infogiven on this one; given slurmsmwd, you've obviously figured out how this works. :)