Ticket 9455

Summary: node state interactions with "scontrol reboot ASAP" and "ResumeTimeout"
Product: Slurm
Reporter: Lloyd Brown <lloyd_brown>
Component: Scheduling
Assignee: Broderick Gardner <broderick>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.02.1   
Hardware: Linux   
OS: Linux   
Site: BYU - Brigham Young University
Version Fixed: n/a

Description Lloyd Brown 2020-07-22 12:21:41 MDT
We're currently updating and testing our cluster's rolling-reboot-for-OS-updates mechanism to use the "scontrol reboot ASAP" functionality, and it's behaving oddly.  I was hoping you could clarify how the interaction is supposed to work.

Specifically, we've seen some nodes reboot multiple times, though we're not 100% certain whether this is related to "scontrol reboot ASAP".  The documentation (https://slurm.schedmd.com/scontrol.html) says, in part, "A node will be marked "DOWN" if it doesn't reboot within ResumeTimeout", so we're trying to determine whether extending ResumeTimeout would help, but it's having some unintended side effects.
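For context, ResumeTimeout is a slurm.conf parameter, and a RebootProgram must also be configured there for "scontrol reboot" to actually reboot anything.  A minimal illustrative excerpt (the values and the RebootProgram path below are placeholders, not our production settings):

```
# Illustrative slurm.conf excerpt; placeholder values only.
ResumeTimeout=60                              # seconds a rebooting node may
                                              # take before it is marked DOWN
                                              # (60 is the default)
RebootProgram=/usr/local/sbin/reboot-wrapper  # placeholder path
```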

Note that I'm going from memory on some of the node state information below, so it may not be perfectly accurate.


Scenario 1:
- ResumeTimeout is left at default (60 sec)
- Node is already idle (no jobs running)
- Management script on the node calls "scontrol reboot ASAP reason=CUSTOMREASONSTRING $(hostname -s)"
- Node state becomes REBOOT+DRAIN (I think)
- slurmd calls RebootProgram, to do cleanup and reboot the node
- While the node is still rebooting, the ResumeTimeout (60 sec) expires, and the node state becomes "REBOOT+DOWN" (I think)
- Node completes its bootup/provisioning process, and starts slurmd, which checks in with slurmctld, and the node state becomes IDLE
- Node will sometimes reboot again, even multiple times.  Note that we cannot yet *prove* whether this is related to the "scontrol reboot ASAP".
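The management-script step above can be sketched as follows (a hedged example; the reason string "os-update" is a placeholder for our actual CUSTOMREASONSTRING, and the scontrol availability check is only there so the sketch is safe to run anywhere):

```shell
#!/bin/sh
# Request an as-soon-as-possible reboot of this node.
# "os-update" is a placeholder reason string.
NODE="$(hostname -s)"
CMD="scontrol reboot ASAP reason=os-update ${NODE}"
if command -v scontrol >/dev/null 2>&1; then
    ${CMD}          # issue the request to slurmctld
else
    echo "${CMD}"   # no Slurm on this machine; just show the command
fi
```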



Scenario 2:
- ResumeTimeout is set to 1800 sec (30 minutes)
- Node is already idle (no jobs running)
- Management script on the node calls "scontrol reboot ASAP reason=CUSTOMREASONSTRING $(hostname -s)"
- Node state becomes REBOOT+DRAIN (I think)
- slurmd calls RebootProgram, to do cleanup and reboot the node
- Node completes its bootup/provisioning process, and starts slurmd, which checks in with slurmctld, and the node state becomes IDLE+DRAIN.  This occurs before the ResumeTimeout timer expires. The node only reboots once, AFAICT.
- After waiting for several hours, the DRAIN state never gets removed, unless I do it manually using "scontrol update NodeName=HOSTNAME state=resume" or similar.
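The manual cleanup step mentioned above can be sketched like so (hedged; "node01" is a placeholder hostname, and the scontrol check only makes the sketch safe to run on a machine without Slurm):

```shell
#!/bin/sh
# Clear a leftover DRAIN flag by hand; "node01" is a placeholder.
NODE="node01"
CMD="scontrol update NodeName=${NODE} state=resume"
if command -v scontrol >/dev/null 2>&1; then
    ${CMD}          # send the state change to slurmctld
else
    echo "${CMD}"   # no Slurm here; just show the command
fi
```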


So, I suppose I'm asking for help in understanding the following:
- Is it possible for slurmd to trigger multiple reboots in Scenario 1 above?  If not, I'll need to look elsewhere to find what might be triggering it.
- In Scenario 2 above, why would the DRAIN state be maintained?  At what point in the process is that supposed to be cleared?
Comment 1 Lloyd Brown 2020-07-23 09:55:42 MDT
I've been doing some more testing this morning, and I think I have made some progress.

Under Scenario 1 (60sec ResumeTimeout):
- If the node reaches the DOWN state due to a timeout before slurmd starts up, then the node will come up normally
- If slurmd starts while the node is still in REBOOT+DRAIN and has not timed out, it appears that RebootProgram is called again as soon as slurmd is available
- I've seen a lot of variation in the time between the "scontrol reboot ASAP" call and the node being marked DOWN, in the rough ballpark of 5-15 minutes.  Much of that variation might come from the runtime of our RebootProgram tool, which itself varies quite a bit.  I have not yet figured out what other timeouts affect how quickly Slurm marks the node DOWN.



Under Scenario 2 (1800sec ResumeTimeout):
- Nodes seem to come up, but get stuck in the "IDLE+DRAIN" state.
- When I add "nextstate=resume" to the "scontrol reboot ASAP" call, the node seems to come up cleanly and return to service.  I've only got one test case so far, but this looks promising.  I'll continue testing here and let you know if there are any further issues.
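For reference, the adjusted call looks like this (hedged sketch; "os-update" again stands in for our real reason string):

```shell
#!/bin/sh
# Same reboot request as before, but with an explicit nextstate so the
# DRAIN flag is cleared once the node checks back in with slurmctld.
NODE="$(hostname -s)"
CMD="scontrol reboot ASAP nextstate=resume reason=os-update ${NODE}"
if command -v scontrol >/dev/null 2>&1; then
    ${CMD}          # issue the request to slurmctld
else
    echo "${CMD}"   # no Slurm here; just show the command
fi
```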
Comment 2 Lloyd Brown 2020-07-27 08:49:59 MDT
After running over the weekend with the "nextstate=resume" addition, I think this is the solution we're after.  None of the nodes are staying in "IDLE+DRAIN" anymore, and since we're using the longer ResumeTimeout, they're no longer re-triggering the reboot.

I'm calling this solved.