| Summary: | cray scontrol reboot_nodes never resume automatically | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | Other | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | felip.moll |
| Version: | 17.11.1 | | |
| Hardware: | Cray XC | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4583 | | |
| Site: | NERSC | | |
Description (Doug Jacobsen, 2018-01-05 16:27:48 MST)
(In reply to Doug Jacobsen from comment #0)

> Hello,
>
> I've configured RebootProgram=/usr/sbin/capmc_resume (well, really my
> wrapper around it that pokes our monitoring system first).
>
> ...
>
> slurmctld logs:
>
> ctlnet1:~ # grep nid00021 /var/tmp/slurm/slurmctld.log
> ...
> [2018-01-05T14:20:26.115] reboot request queued for nodes nid00021
> [2018-01-05T14:20:26.494] debug: Queuing reboot request for nodes nid00021
> [2018-01-05T14:20:47.905] update_node: node nid00021 state set to DOWN
> [2018-01-05T14:21:07.908] update_node: node nid00021 state set to DOWN
> [2018-01-05T14:28:32.592] Node nid00021 rebooted 225 secs ago
> [2018-01-05T14:28:32.592] Node nid00021 now responding
>
> ctlnet1:~ # grep nid00022 /var/tmp/slurm/slurmctld.log
> [2018-01-05T14:20:35.893] reboot request queued for nodes nid00022
> [2018-01-05T14:20:36.502] debug: Queuing reboot request for nodes nid00022
> [2018-01-05T14:25:09.918] update_node: node nid00022 state set to DOWN
> [2018-01-05T14:25:29.920] update_node: node nid00022 state set to DOWN
> [2018-01-05T14:32:46.921] Node nid00022 rebooted 217 secs ago
> [2018-01-05T14:32:46.921] Node nid00022 now responding
>
> I think that in the case of a reboot_nodes call when no other reason or down
> state is set to begin with, the node should resume automatically. OR -- at
> least update the reason indicating that it has been rebooted.
>
> Thank you,
> Doug

Hey Doug,

I performed the same operations on my testbed and I am not able to reproduce your situation. Whenever I do a scontrol reboot_nodes, the node state is set to REBOOT, and this flag does not disappear until the node re-registers; the node is then put back online, or returned to the same state it was in before (e.g. DRAIN).
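Doug's setup points RebootProgram at a site wrapper around the Cray capmc_resume helper so monitoring is notified before the node goes down. A minimal sketch of such a wrapper, assuming a hypothetical monitoring endpoint and wrapper path (neither is given in this report):

```shell
#!/bin/bash
# Hypothetical RebootProgram wrapper; referenced from slurm.conf as, e.g.:
#   RebootProgram=/usr/local/sbin/reboot_wrapper
# Slurm runs the RebootProgram when a node is selected for reboot.

# Poke the site monitoring system first (hypothetical endpoint); never
# let a monitoring failure block the actual reboot.
curl -s -X POST "https://monitoring.example.com/api/maintenance" \
     -d "host=$(hostname)" || true

# Hand off to the real Cray power-management reboot helper.
exec /usr/sbin/capmc_resume "$@"
```

This is a configuration/CLI fragment for a live Cray/Slurm environment, not a standalone program; the monitoring call and wrapper path are illustrative only.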
[2018-01-16T16:26:19.841] reboot request queued for nodes moll1
[2018-01-16T16:26:20.110] debug: Queuing reboot request for nodes moll1
[2018-01-16T16:26:20.116] debug: Still waiting for boot of node moll1
[2018-01-16T16:40:07.071] Node moll1 now responding
[2018-01-16T16:40:07.071] node moll1 returned to service

I see a difference between my logs and yours: in yours, the node is set to DOWN while it is booting:

> [2018-01-05T14:20:47.905] update_node: node nid00021 state set to DOWN
> [2018-01-05T14:21:07.908] update_node: node nid00021 state set to DOWN

Note the timestamps. Is it possible that something external put the node DOWN while it was booting?

In fact, my Slurm strictly follows the behavior described in 'man scontrol' -> reboot (see especially the last two lines):

    reboot [ASAP] [NodeList]
        Reboot all nodes in the system when they become idle using the
        RebootProgram as configured in Slurm's slurm.conf file. The option
        "ASAP" prevents initiation of additional jobs so the node can be
        rebooted and returned to service "As Soon As Possible" (i.e. ASAP).
        Accepts an option list of nodes to reboot. By default all nodes are
        rebooted.
        NOTE: This command does not prevent additional jobs from being
        scheduled on these nodes, so many jobs can be executed on the nodes
        prior to them being rebooted. You can explicitly drain the nodes in
        order to reboot nodes as soon as possible, but the nodes must also
        explicitly be returned to service after being rebooted. You can
        alternately create an advanced reservation to prevent additional
        jobs from being initiated on nodes to be rebooted.
        NOTE: Nodes will be placed in a state of "REBOOT" until rebooted
        and returned to service with a normal state. Alternately the node's
        state "REBOOT" may be cleared by using the scontrol command to set
        the node state to "RESUME", which clears the "REBOOT" flag.

Please tell me if you are still experiencing this issue.

Hi Doug,

Did you have a chance to take a look at this issue and my last comment?
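As the man page excerpt notes, an operator can clear a lingering REBOOT flag (or an externally set DOWN state) by hand. A sketch of the commands involved, using node nid00021 from the logs above and assuming a working scontrol on the controller:

```shell
# Queue a reboot for one node; its state gains the REBOOT flag.
scontrol reboot nid00021

# Inspect the state while waiting for the node to re-register.
scontrol show node nid00021 | grep -i state

# If the node came back but the flag persists (or something external
# marked it DOWN during the boot), return it to service explicitly:
scontrol update NodeName=nid00021 State=RESUME
```

These are CLI fragments that require a running Slurm cluster; they are shown here only to illustrate the manual recovery path the man page describes.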
Thank you!

Doug,

I am closing this issue with status 'timed out', since it has been 20 days since the last response. My guess, as I explained in comment 4, is that somebody or something manually put the nodes DOWN, as the log message indicates.

If you happen to try it again and it is reproducible, just reopen this bug and we will look into it more deeply.

Regards,
Felip M