Ticket 4581

Summary: Cray: nodes rebooted via scontrol reboot_nodes never resume automatically
Product: Slurm Reporter: Doug Jacobsen <dmjacobsen>
Component: Other    Assignee: Felip Moll <felip.moll>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: felip.moll
Version: 17.11.1   
Hardware: Cray XC   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4583
Site: NERSC Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Doug Jacobsen 2018-01-05 16:27:48 MST
Hello,

I've configured RebootProgram=/usr/sbin/capmc_resume (well, really my wrapper around it that pokes our monitoring system first).

The nodes reboot, but never resume on their own.
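For reference, a RebootProgram wrapper of the kind described above might look like the sketch below. The monitoring hook and the RESUME_CMD override are hypothetical (the override only exists so the sketch is runnable off a Cray system); on a real XC the handoff would be to /usr/sbin/capmc_resume, which Slurm invokes with the hostlist of nodes to reboot as its argument.

```shell
# Sketch of a RebootProgram wrapper (assumed names). Slurm passes the
# hostlist of nodes to reboot as the first argument.
reboot_wrapper() {
    nodelist="$1"

    # 1. Poke the monitoring system first (placeholder: just print here;
    #    a real site hook would notify its monitoring stack).
    echo "monitoring: reboot pending for $nodelist"

    # 2. Hand off to the real Cray tool. RESUME_CMD is a hypothetical
    #    override so this sketch runs anywhere; on an XC it would default
    #    to /usr/sbin/capmc_resume instead of echo.
    resume_cmd="${RESUME_CMD:-echo capmc_resume}"
    $resume_cmd "$nodelist"
}

reboot_wrapper "nid000[21-22]"
```

With the echo stub in place, running the sketch prints the monitoring line followed by the command that would be executed on a real system.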


Initiating the boot.

boot-gerty:~ # sinfo --node=nid000[21,22]
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
system         up      30:00      2  drain nid000[21-22]
debug*         up      30:00      2  drain nid000[21-22]
regular        up 4-00:00:00      0    n/a
regularx       up 2-00:00:00      2  drain nid000[21-22]
special        up   14:00:00      2  drain nid000[21-22]
benchmark      up 1-12:00:00      2  drain nid000[21-22]
realtime       up   12:00:00      2  drain nid000[21-22]
shared         up 2-00:00:00      2  drain nid000[21-22]
interactive    up    4:00:00      2  drain nid000[21-22]
jgi            up 3-00:00:00      2  drain nid000[21-22]
boot-gerty:~ # sinfo --node=nid000[21,22] -p system
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
system       up      30:00      1 drain* nid00021
system       up      30:00      1   down nid00022
boot-gerty:~ # sinfo --node=nid000[21,22] -p system -R
REASON               USER      TIMESTAMP           NODELIST
Reboot ASAP          root      2018-01-05T14:20:26 nid00021
none                 Unknown   Unknown             nid00022
boot-gerty:~ #



.... Wait awhile ....


boot-gerty:~ # pdsh -w nid000[21,22] uptime
nid00022:  15:22pm  up   0:53,  0 users,  load average: 0.00, 0.00, 0.00
nid00021:  15:22pm  up   0:58,  0 users,  load average: 0.24, 0.05, 0.02
boot-gerty:~ # ssh nid00021 cat /var/spool/slurmd/nid*log
[2018-01-05T14:28:32.550] Message aggregation enabled: WindowMsgs=10, WindowTime=10
[2018-01-05T14:28:32.565] slurmd version 17.11.2 started
[2018-01-05T14:28:32.565] error: No /var/spool/slurmd/job_container_state file for job_container/cncu state recovery
[2018-01-05T14:28:32.565] core_spec/cray: init
[2018-01-05T14:28:32.579] slurmd started on Fri, 05 Jan 2018 14:28:32 -0800
[2018-01-05T14:28:32.579] CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=128807 TmpDisk=64403 Uptime=225 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
boot-gerty:~ # ssh nid00022 cat /var/spool/slurmd/nid*log
[2018-01-05T14:32:46.871] Message aggregation enabled: WindowMsgs=10, WindowTime=10
[2018-01-05T14:32:46.884] slurmd version 17.11.2 started
[2018-01-05T14:32:46.884] error: No /var/spool/slurmd/job_container_state file for job_container/cncu state recovery
[2018-01-05T14:32:46.884] core_spec/cray: init
[2018-01-05T14:32:46.898] slurmd started on Fri, 05 Jan 2018 14:32:46 -0800
[2018-01-05T14:32:46.898] CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=128807 TmpDisk=64403 Uptime=217 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
boot-gerty:~ #


slurmctld logs:

ctlnet1:~ # grep nid00021 /var/tmp/slurm/slurmctld.log
...
[2018-01-05T14:20:26.115] reboot request queued for nodes nid00021
[2018-01-05T14:20:26.494] debug:  Queuing reboot request for nodes nid00021
[2018-01-05T14:20:47.905] update_node: node nid00021 state set to DOWN
[2018-01-05T14:21:07.908] update_node: node nid00021 state set to DOWN
[2018-01-05T14:28:32.592] Node nid00021 rebooted 225 secs ago
[2018-01-05T14:28:32.592] Node nid00021 now responding


ctlnet1:~ # grep nid00022 /var/tmp/slurm/slurmctld.log
[2018-01-05T14:20:35.893] reboot request queued for nodes nid00022
[2018-01-05T14:20:36.502] debug:  Queuing reboot request for nodes nid00022
[2018-01-05T14:25:09.918] update_node: node nid00022 state set to DOWN
[2018-01-05T14:25:29.920] update_node: node nid00022 state set to DOWN
[2018-01-05T14:32:46.921] Node nid00022 rebooted 217 secs ago
[2018-01-05T14:32:46.921] Node nid00022 now responding



I think that in the case of a reboot_nodes call when no other reason or down state is set to begin with, the node should resume automatically.  OR -- at least update the reason indicating that it has been rebooted.

Thank you,
Doug
Comment 4 Felip Moll 2018-01-16 09:00:30 MST
(In reply to Doug Jacobsen from comment #0)
> Hello,
> 
> I've configured RebootProgram=/usr/sbin/capmc_resume (well, really my
> wrapper around it that pokes our monitoring system first).
> 
> ...
>
> I think that in the case of a reboot_nodes call when no other reason or down
> state is set to begin with, the node should resume automatically.  OR -- at
> least update the reason indicating that it has been rebooted.
> 
> Thank you,
> Doug

Hey Doug,

I performed the same operations on my testbed and I am not able to reproduce your situation. Whenever I do a scontrol reboot_nodes, the node state is set to REBOOT, and this flag does not disappear until the node re-registers; the node is then put back online, or into the same state it was in before (e.g. drain).

[2018-01-16T16:26:19.841] reboot request queued for nodes moll1
[2018-01-16T16:26:20.110] debug:  Queuing reboot request for nodes moll1
[2018-01-16T16:26:20.116] debug:  Still waiting for boot of node moll1
[2018-01-16T16:40:07.071] Node moll1 now responding
[2018-01-16T16:40:07.071] node moll1 returned to service

I see a difference between my logs and yours: in yours, the node is set to DOWN while it is booting:

> [2018-01-05T14:20:47.905] update_node: node nid00021 state set to DOWN
> [2018-01-05T14:21:07.908] update_node: node nid00021 state set to DOWN

Note the timestamps. 

Is it possible that something external set the node down while it was booting?

In fact, my Slurm strictly follows the behavior described in 'man scontrol' under reboot, last two paragraphs:

reboot [ASAP] [NodeList]
Reboot all nodes in the system when they become idle using the RebootProgram as configured in Slurm's slurm.conf file.
The option "ASAP" prevents initiation of additional jobs so the node can be rebooted and returned to service "As Soon As Possible" (i.e. ASAP).
Accepts an optional list of nodes to reboot. By default all nodes are rebooted.  NOTE: This command does not prevent additional jobs from being
scheduled on these nodes, so many jobs can be executed on the nodes prior to them being rebooted. You can explicitly drain the nodes in order
to reboot nodes as soon as possible, but the nodes must also explicitly be returned to service after being rebooted. You can alternately
create an advanced reservation to prevent additional jobs from being initiated on nodes to be rebooted.

NOTE: Nodes will be placed in a state of "REBOOT" until rebooted and returned to service with a normal state.
Alternately the node's state "REBOOT" may be cleared by using the scontrol command to set the node state to "RESUME", which clears the "REBOOT" flag.
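Concretely, the manual return-to-service the man page describes would be an scontrol update to State=RESUME on the affected nodes (node list taken from this report). The snippet below only prints the command rather than executing it, so it is safe to run anywhere:

```shell
# Command prescribed by the man page excerpt above to clear the
# REBOOT/DOWN state once the nodes are back up. Run it on a host with
# scontrol access (e.g. the slurmctld host). Printed here, not executed.
cmd='scontrol update NodeName="nid000[21-22]" State=RESUME'
echo "$cmd"
```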


Please tell me if you are still experiencing this issue.
Comment 5 Felip Moll 2018-01-23 08:11:44 MST
Hi Doug,

Did you have any chance to take a look at that issue and my last comment?

Thank you!
Comment 6 Felip Moll 2018-01-26 02:38:04 MST
Doug,

I am closing this issue with status 'timed out' since it has been 20 days since the last response.

My guess as I explained in comment 4 is that somebody or something manually put the nodes DOWN, as the log message indicates.

If you happen to try it again and it is reproducible, just reopen this bug and we will look deeper into it.

Regards,
Felip M