Ticket 3702 - scontrol reboot_nodes leaves nodes in unexpectedly rebooted state
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.02.1
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-04-16 20:52 MDT by Josko Plazonic
Modified: 2017-05-02 09:37 MDT

See Also:
Site: Princeton (PICSciE)


Attachments
Logs and slurm config from a test (101.52 KB, application/octet-stream)
2017-04-23 20:59 MDT, Josko Plazonic
Extra logging patch for manual reboot problem (1.40 KB, patch)
2017-04-25 16:59 MDT, Tim Shaw
Full slurmctld log at debug5 (86.57 KB, application/octet-stream)
2017-04-25 21:08 MDT, Josko Plazonic

Description Josko Plazonic 2017-04-16 20:52:33 MDT
Hi there,

scontrol reboot_nodes is very frequently leaving nodes in

"Node unexpectedly rebooted"

state, but not always.  It also doesn't seem to take effect every time we do it.  E.g. I am trying it out now and I did:

scontrol reboot_nodes tiger-s[001-002]

and tiger-s001 went into state=REBOOT, and rebooted while tiger-s002 stayed idle. Only after another scontrol reboot_nodes tiger-s002 did it reboot as well. After reboot:

NodeName=tiger-s001 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.24
   AvailableFeatures=serial
   ActiveFeatures=serial
   Gres=(null)
   NodeAddr=tiger-s001 NodeHostName=tiger-s001 Version=17.02
   OS=Linux RealMemory=256000 AllocMem=0 FreeMem=256377 Sockets=2 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=serial 
   BootTime=2017-04-16T22:34:09 SlurmdStartTime=2017-04-16T22:35:38
   CfgTRES=cpu=20,mem=250G
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2017-04-16T22:35:38]

tiger-s002 also rebooted and it resumed just fine. In previous test tiger-s001 was OK while tiger-s002 ended up "unexpectedly rebooted".

In both of these cases reboot command is equivalent to:

#!/bin/bash
/sbin/service slurm stop
/sbin/reboot

(actual script is bigger and can, after stopping slurm, update the node and then reboot it but in this instance it just did the above).

Thanks!
Comment 3 Tim Shaw 2017-04-17 10:40:11 MDT
Josko,

Can you attach your whole reboot script to the bug?  I'm wondering if the extra contents of your reboot script could be playing a part in the inconsistent reboot behavior you're seeing.  What if you simplified the script to just be:

#!/bin/bash
/sbin/service slurm stop

sleep 3 

/sbin/reboot

Note that there is a sleep between the slurm stop and the reboot. I'd be interested to see if this simplified reboot script is reliable for you.

If you tried the simplified script but are still seeing inconsistent reboots, please attach the slurmd log from a node that failed to reboot when you submitted the command.  I'll look through it.

Thanks

Tim
Comment 4 Josko Plazonic 2017-04-17 11:57:55 MDT
Hi there,

I just tried that - edited our reboot program, made it what you suggested and issued reboot.  This time they both rebooted but after reboot:

Node unexpectedly re slurm     2017-04-17T13:37:44 tiger-s001
Node unexpectedly re slurm     2017-04-17T13:37:40 tiger-s002

So I don't think it's the script...

Josko
Comment 5 Tim Shaw 2017-04-17 12:23:37 MDT
Can you attach the slurmd log to the bug?  I'll need to go through it and look for clues.

Thanks.

Tim
Comment 6 Josko Plazonic 2017-04-18 09:01:43 MDT
Good morning,

sorry about the delay - too many things going on.  I am going to switch to a smaller test cluster for these tests.  Let's start with slurmctld logs as I think they are more relevant:

[2017-04-18T09:29:05.170] debug2: Processing RPC: REQUEST_REBOOT_NODES from uid=0
[2017-04-18T09:29:05.170] reboot request queued for nodes mcmillan-r1c1n[15-16]
[2017-04-18T09:29:06.170] debug:  Queuing reboot request for nodes mcmillan-r1c1n[15-16]
[2017-04-18T09:29:06.170] debug2: Spawning RPC agent for msg_type REQUEST_REBOOT_NODES
[2017-04-18T09:29:06.170] debug2: got 2 threads to send out
[2017-04-18T09:29:06.171] debug3: slurm_send_only_node_msg: sent 183
[2017-04-18T09:29:06.171] debug3: slurm_send_only_node_msg: sent 183
[2017-04-18T09:29:06.175] debug:  Still waiting for boot of node mcmillan-r1c1n15
[2017-04-18T09:29:06.175] debug2: node_did_resp mcmillan-r1c1n15
[2017-04-18T09:29:06.175] debug:  Still waiting for boot of node mcmillan-r1c1n16
[2017-04-18T09:29:06.175] debug2: node_did_resp mcmillan-r1c1n16
.....
[2017-04-18T09:30:17.257] debug:  Spawning health check agent for mcmillan-r1c1n[1-16],mcmillan-r1k[1-4],mcmillan-r1n[1-7]
[2017-04-18T09:30:17.257] debug2: Spawning RPC agent for msg_type REQUEST_HEALTH_CHECK
[2017-04-18T09:30:17.273] debug2: Sending tres '1=1312,2=1602000,3=0,4=27' for cluster
[2017-04-18T09:30:17.273] debug2: got 1 threads to send out
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1c1n1
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1c1n3
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1c1n9
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1c1n11
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1c1n12
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1c1n4
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1k3
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1c1n16
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1k4
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1n3
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1c1n13
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1n7
[2017-04-18T09:30:17.273] debug3: Tree sending to mcmillan-r1c1n15
[2017-04-18T09:30:17.274] debug3: Tree sending to mcmillan-r1c1n8
[2017-04-18T09:30:17.274] debug2: slurm_connect failed: Connection refused
[2017-04-18T09:30:17.274] debug2: Error connecting slurm stream socket at 10.36.1.16:6818: Connection refused
[2017-04-18T09:30:17.274] debug3: connect refused, retrying
....
[2017-04-18T09:30:17.275] debug2: Error connecting slurm stream socket at 10.36.1.15:6818: Connection refused
[2017-04-18T09:30:27.280] debug2: Error connecting slurm stream socket at 10.36.1.15:6818: Connection refused
[2017-04-18T09:30:27.280] debug3: problems with mcmillan-r1c1n15
[2017-04-18T09:30:27.280] debug2: Tree head got back 27
[2017-04-18T09:30:27.280] agent/is_node_resp: node:mcmillan-r1c1n16 RPC:REQUEST_HEALTH_CHECK : Communication connection failure
[2017-04-18T09:30:27.280] agent/is_node_resp: node:mcmillan-r1c1n15 RPC:REQUEST_HEALTH_CHECK : Communication connection failure
....
[2017-04-18T09:35:42.592] Node mcmillan-r1c1n15 rebooted 114 secs ago
[2017-04-18T09:35:42.592] Node mcmillan-r1c1n15 now responding
[2017-04-18T09:35:42.592] validate_node_specs: Node mcmillan-r1c1n15 unexpectedly rebooted boot_time=1492522428 last response=1492522206
[2017-04-18T09:35:42.592] debug2: _slurm_rpc_node_registration complete for mcmillan-r1c1n15 usec=116
....
[2017-04-18T09:35:44.784] Node mcmillan-r1c1n16 rebooted 115 secs ago
[2017-04-18T09:35:44.784] Node mcmillan-r1c1n16 now responding
[2017-04-18T09:35:44.784] validate_node_specs: Node mcmillan-r1c1n16 unexpectedly rebooted boot_time=1492522429 last response=1492522206
[2017-04-18T09:35:44.784] debug2: _slurm_rpc_node_registration complete for mcmillan-r1c1n16 usec=103

I'll see about producing some slurmd logs from nodes as well but I suspect this is really up to slurmctld, not slurmd.
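
The validate_node_specs lines above can be made concrete with a little arithmetic (a sketch for illustration only, not Slurm's actual code): the node's reported boot_time is later than the last timestamp at which the controller heard from it, so the reboot looks unsolicited from slurmctld's point of view.

```python
# Epoch values copied from the mcmillan-r1c1n15 log line above.
boot_time = 1492522428      # boot_time= reported by the node on re-registration
last_response = 1492522206  # last response= the controller recorded before the reboot

# The controller saw the node go silent and then come back with a newer
# boot time, i.e. a reboot it has no record of requesting.
gap = boot_time - last_response
print(gap)  # 222 -> roughly four minutes of silence around the reboot
```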
Comment 7 Josko Plazonic 2017-04-18 09:25:12 MDT
Now, regarding slurmd side:

[2017-04-18T11:05:31.707] debug3: in the service_connection
[2017-04-18T11:05:31.708] debug2: got this type of message 1015
[2017-04-18T11:05:31.708] debug2: Processing RPC: REQUEST_REBOOT_NODES
[2017-04-18T11:05:31.708] Node reboot request being processed
[2017-04-18T11:05:31.763] got shutdown request
[2017-04-18T11:05:31.763] waiting on 1 active threads
[2017-04-18T11:11:59.626] debug2: hwloc_topology_init
[2017-04-18T11:11:59.626] debug2: hwloc_topology_load
[2017-04-18T11:11:59.642] debug:  CPUs:12 Boards:1 Sockets:2 CoresPerSocket:6 ThreadsPerCore:1
[2017-04-18T11:11:59.642] debug4: CPU map[0]=>0 S:C:T 0:0:0
...

and I see nothing relevant about the reboot/offline in the remainder of the log.
Comment 8 Tim Shaw 2017-04-18 10:37:54 MDT
I'm looking over the log lines but nothing jumps out at me as to why this is happening.

But regarding your reboot script, it dawned on me that it's not wise to have the script call "/sbin/service slurm stop" when slurmd is what launches the script in the first place.  There might be some timing involved that shuts down slurmd before it can execute the reboot.  Try setting your reboot script to just be:

#!/bin/bash

/sbin/reboot

Whether or not this fixes the problem, "/sbin/service slurm stop" should not be part of your reboot script.  But I think there's a good chance it will fix the problem.

Tim
Comment 9 Josko Plazonic 2017-04-18 11:43:40 MDT
OK, I'll play along:

[root@mcmillan2 slurm]# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Node unexpectedly re slurm     2017-04-18T13:41:20 mcmillan-r1c1n15
Node unexpectedly re slurm     2017-04-18T13:41:12 mcmillan-r1c1n16
old_gpus             root      2017-04-14T16:41:21 mcmillan-r1n[4-5]
old_gpus             root      2017-04-14T16:41:07 mcmillan-r1n[6-7]
[root@mcmillan2 slurm]# ssh mcmillan-r1c1n16
Last login: Tue Apr 18 13:34:15 2017 from mcmillan2.princeton.edu
[root@mcmillan-r1c1n16 ~]# cat /usr/local/sbin/slurmreboot 
#!/bin/bash
reboot

I changed it to just the above and issued another reboot_nodes and it still does the same thing.  

Let's face it - it's busted... Could it be the fact that we use nhc and nodes don't report back during the reboot?
Comment 11 Tim Shaw 2017-04-20 10:22:34 MDT
There is a chance this could be related to the node health check.  I've spent a fair amount of time trying to reproduce it with and without the NHC but so far have been unsuccessful.  Have you tried disabling the node health check and issuing the reboot command?  Let me know if that fixes the problem so we know if it's related.  If you can give me some reproduction steps as well, that would be helpful.

Also, please attach the full log of the slurmctld and slurmd to this ticket.  There may be clues there that will be helpful.

Thanks

Tim
Comment 12 Josko Plazonic 2017-04-23 20:59:33 MDT
Created attachment 4404
Logs and slurm config from a test

Slurm conf file + logs from two rebooted nodes.
Comment 13 Josko Plazonic 2017-04-23 21:01:28 MDT
Attached full logs (trimmed to restart time of daemons after debug increase) and config file. This is with nhc disabled.  Please ignore slurm.conf being different between controller and nodes - that's only because of slurmd debug level.

scontrol show node mcmillan-r1c1n15
NodeName=mcmillan-r1c1n15 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=mcmillan-r1c1n15 NodeHostName=mcmillan-r1c1n15 Version=17.02
   OS=Linux RealMemory=48000 AllocMem=0 FreeMem=47440 Sockets=2 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=all,nongpu 
   BootTime=2017-04-23T22:31:36 SlurmdStartTime=2017-04-23T22:33:16
   CfgTRES=cpu=12,mem=48000M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2017-04-23T22:33:16]

scontrol show node mcmillan-r1c1n16
NodeName=mcmillan-r1c1n16 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.55
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=mcmillan-r1c1n16 NodeHostName=mcmillan-r1c1n16 Version=17.02
   OS=Linux RealMemory=48000 AllocMem=0 FreeMem=45119 Sockets=2 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=all,all,nongpu 
   BootTime=2017-04-23T22:31:36 SlurmdStartTime=2017-04-23T22:33:19
   CfgTRES=cpu=12,mem=48000M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2017-04-23T22:33:19]
Comment 14 Tim Shaw 2017-04-25 16:59:13 MDT
Created attachment 4417
Extra logging patch for manual reboot problem

Josko,

Using the log files I've been able to narrow down the area in the code where the problem is happening.  However, the logs don't provide enough detail to explain why the code is taking a different path for some of your nodes.  I'm surprised this is happening because no other customers have reported this problem.  I've attached a patch to add some extra logging in the code where the problem is happening to hopefully shed some light on the situation.  Can you please apply the patch and rerun your reboot test?  Then reattach the logs to this bug.  The way you did it last time worked well.

Also, although unlikely, could you quickly verify that your nodes reboot in less time than is specified by the SlurmdTimeout in the slurm.conf file?  If the reboot takes longer than the SlurmdTimeout, you will get the message "Node unexpectedly rebooted".
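
One quick way to sanity-check that is to compare the BootTime and SlurmdStartTime stamps from the `scontrol show node` listings earlier in this ticket against the timeout. A minimal sketch (the 300-second timeout here is an assumed example value, not necessarily this site's setting):

```python
from datetime import datetime

# Timestamp format as printed by `scontrol show node` in the listings above.
FMT = "%Y-%m-%dT%H:%M:%S"

def within_timeout(boot_time, slurmd_start, timeout_secs):
    """Return True if slurmd came back within timeout_secs of the node booting."""
    delta = datetime.strptime(slurmd_start, FMT) - datetime.strptime(boot_time, FMT)
    return delta.total_seconds() <= timeout_secs

# Values from the mcmillan-r1c1n15 listing: boot at 22:31:36, slurmd up at 22:33:16,
# i.e. a 100-second gap, comfortably inside an assumed 300-second SlurmdTimeout.
print(within_timeout("2017-04-23T22:31:36", "2017-04-23T22:33:16", 300))  # True
```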

I appreciate your efforts and patience as I continue to investigate this problem.

Tim
Comment 15 Josko Plazonic 2017-04-25 20:48:41 MDT
OK, rebuilt, tested. The first reboot went fine; on the 2nd, things failed the usual way.  I can tar up the whole log tomorrow but maybe all you need is this:

[2017-04-25T22:17:29.284] _queue_reboot_msg: Node mcmillan-r1c1n15, state REBOOT, IS_NODE_DOWN = 1, IS_NODE_REBOOT = 65536, ret2service = 0
[2017-04-25T22:17:29.284] _queue_reboot_msg: Node mcmillan-r1c1n16, state REBOOT, IS_NODE_DOWN = 1, IS_NODE_REBOOT = 65536, ret2service = 0
[2017-04-25T22:23:48.481] validate_node_specs: Node mcmillan-r1c1n15, state REBOOT, IS_NODE_DOWN = 1, IS_NODE_REBOOT = 65536, ret2service = 0
[2017-04-25T22:23:49.825] validate_node_specs: Node mcmillan-r1c1n16, state REBOOT, IS_NODE_DOWN = 1, IS_NODE_REBOOT = 65536, ret2service = 0
[2017-04-25T22:39:08.029] _queue_reboot_msg: Node mcmillan-r1c1n15, state REBOOT, IS_NODE_DOWN = 1, IS_NODE_REBOOT = 65536, ret2service = 0
[2017-04-25T22:39:08.029] _queue_reboot_msg: Node mcmillan-r1c1n16, state REBOOT, IS_NODE_DOWN = 1, IS_NODE_REBOOT = 65536, ret2service = 0
[2017-04-25T22:45:30.624] validate_node_specs: Node mcmillan-r1c1n15, state DOWN, IS_NODE_DOWN = 1, IS_NODE_REBOOT = 0, ret2service = 0
[2017-04-25T22:45:33.411] validate_node_specs: Node mcmillan-r1c1n16, state DOWN, IS_NODE_DOWN = 1, IS_NODE_REBOOT = 0, ret2service = 0

The first 4 lines are from the first (good) reboot, the second 4 from the "bad" one.  I also suspect that on the first run the state did not change from State=REBOOT to State=REBOOT*.

Regarding others not reporting - I think you'll find we've been more frequent users of this feature than others; just look through related tickets...
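
The difference between the two runs in the patched log lines is the IS_NODE_REBOOT value: 65536 on the good reboot, 0 on the bad one. A hypothetical sketch of the bit-flag test (the flag value is taken from the log output; the names here are illustrative, not Slurm's actual identifiers):

```python
# 65536 == 0x10000, the IS_NODE_REBOOT value printed by the extra logging patch.
IS_NODE_REBOOT = 65536

def reboot_expected(node_state_flags):
    """True while the controller still remembers that it requested the reboot."""
    return bool(node_state_flags & IS_NODE_REBOOT)

print(reboot_expected(65536))  # True  -> first (good) run: reboot is expected
print(reboot_expected(0))      # False -> second run: flag cleared, so the node
                               #          registration looks "unexpectedly rebooted"
```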
Comment 16 Josko Plazonic 2017-04-25 21:08:39 MDT
Created attachment 4418
Full slurmctld log at debug5

Take note of the node pings - I suspect the problem is them hitting nodes during their reboot.  Maybe the reboots in your testing are too quick, so they're not getting caught? Just a thought.
Comment 19 Tim Shaw 2017-04-26 14:31:00 MDT
I've been going through the code and testing this more, and I believe I know what is causing the problem.  What I've found is that if a node reboot takes longer than the ResumeTimeout (default 60 seconds) and a node ping occurs, then you will see the "unexpectedly rebooted" message.  ResumeTimeout is the amount of time the reboot can take before the node is marked down.  See the slurm.conf man page for more information or here:

https://slurm.schedmd.com/slurm.conf.html

Can you please verify this by setting the ResumeTimeout in your slurm.conf file to a higher value (maybe twice as long as it normally takes a node to boot)?  For example:

ResumeTimeout=840

Let me know how it goes for you.

Tim
Comment 20 Josko Plazonic 2017-04-26 14:59:00 MDT
OK, it does make sense and it seems to work in the last few tries. Do note that the man page says nothing about ResumeTimeout being applicable to RebootProgram - seems like it should be mentioned?

I'll run a few more tests on production stuff and then definitely confirm whether it solves it or not.

Thanks!
Comment 22 Tim Shaw 2017-04-26 15:23:31 MDT
Agreed.  A comment about ResumeTimeout has been added to the RebootProgram section.  See commit 4a9532aa2a89 in the slurm-17.02 branch.

Tim
Comment 23 Tim Shaw 2017-05-01 16:08:15 MDT
Hey Josko,

Did my solution fix your problem?  Am I good to resolve this bug?

Thanks

Tim
Comment 24 Josko Plazonic 2017-05-02 09:33:10 MDT
Sure, let's close.  I think we are good, though I won't know for sure until next week, once we use this to reboot all of our clusters. Thanks!
Comment 25 Tim Shaw 2017-05-02 09:37:01 MDT
Resolving this bug.  Feel free to reopen if further action is needed.

Tim