| Summary: | node state interactions with "scontrol reboot ASAP" and "ResumeTimeout" | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Lloyd Brown <lloyd_brown> |
| Component: | Scheduling | Assignee: | Broderick Gardner <broderick> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.02.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | BYU - Brigham Young University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | n/a | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Lloyd Brown
2020-07-22 12:21:41 MDT
I've been doing some more testing this morning, and I think I have made some progress.

Under Scenario 1 (60sec ResumeTimeout):
- If the node reaches the DOWN state due to a timeout, before slurmd starts up, then the node will come up normally.
- If the node starts up slurmd while the node is still in REBOOT+DRAIN, and has not timed out, it appears that the RebootProgram is being called again as soon as slurmd is available again.
- I've seen a lot of variation in the amount of time between the "scontrol reboot ASAP" call and the node being marked DOWN, in the rough ballpark of 5-15 minutes. Much of that variation might be related to the time it takes to run the RebootProgram tool, which currently has quite a bit of variation. I have not yet figured out what other timeouts would be involved in how quickly Slurm marks the node as DOWN.

Under Scenario 2 (1800sec ResumeTimeout):
- Nodes seem to come up, but get stuck in the "IDLE+DRAIN" state.
- When I add a "nextstate=resume" to the "scontrol reboot ASAP", it seems to come up cleanly, and return to service. I've only got one test case so far, but this looks promising.

I'll continue to test here and let you know if there are any further issues.

After running over the weekend with the "nextstate=resume" addition, I think this will be the solution we're after. None of the nodes are staying in "IDLE+DRAIN" anymore, and since we're using the longer ResumeTimeout, they're not re-triggering the reboot anymore. I'm calling this solved.
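
For anyone landing on this ticket later, here is a minimal sketch of the combination that worked above: the longer ResumeTimeout (1800 seconds, set in slurm.conf) together with "nextstate=resume" on the reboot request. The helper name `build_reboot_command` and the node list `node[01-04]` are hypothetical and only for illustration. The function just assembles the `scontrol reboot` argument list and does not run anything, so it can be tried on a machine without Slurm installed.

```python
import shlex


def build_reboot_command(nodelist, asap=True, nextstate="resume", reason=None):
    """Assemble the argv for an `scontrol reboot` request (hypothetical helper).

    nodelist  -- Slurm node expression; "node[01-04]" below is a made-up example
    asap      -- include the ASAP keyword, so the node drains and reboots
                 as soon as its running jobs finish
    nextstate -- "resume" or "down"; "resume" is what kept nodes from
                 staying in IDLE+DRAIN in the report above
    reason    -- optional free-text reason recorded on the node
    """
    if nextstate is not None and nextstate.lower() not in ("resume", "down"):
        raise ValueError("nextstate must be 'resume', 'down', or None")

    argv = ["scontrol", "reboot"]
    if asap:
        argv.append("ASAP")
    if nextstate is not None:
        argv.append(f"nextstate={nextstate}")
    if reason:
        argv.append(f"reason={reason}")
    argv.append(nodelist)  # the node list goes last
    return argv


if __name__ == "__main__":
    # Print a shell-quoted command line for review; nothing is executed.
    print(shlex.join(build_reboot_command("node[01-04]")))
```

To actually issue the request, the returned list could be handed to `subprocess.run(argv, check=True)` on a host with admin access to the cluster. The ResumeTimeout value itself is not part of this command; it is a slurm.conf setting, and per the testing above it needs to be long enough to cover the full reboot so that the RebootProgram is not triggered a second time.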