Ticket 3392

Summary: automated node reboots can incorrectly resume prior to reboot
Product: Slurm
Reporter: Doug Jacobsen <dmjacobsen>
Component: KNL
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Version: 16.05.8
Hardware: Cray XC
OS: Linux
Site: NERSC
Version Fixed: 16.05.10 and 17.02.1

Description Doug Jacobsen 2017-01-10 23:03:24 MST
Hello,

We've had some issues with nodes getting rebooted, coming back up, and then failing with reason "Unexpected node reboot" (or some similar reason).

This happens when:
1) a job requiring a node reboot is allocated, and the node reboot starts
2) capmc is called requesting node_reinit, but an xtbounce is already occurring, so the request is queued (only one xtbounce can run at a time)
3) the user cancels the job (before the queued xtbounce occurs)
4) immediately after the cancellation, the node is put back to Idle (potentially starting another job)
5) xtremoted finally reinits the node (starting the xtbounce)
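
The race above can be sketched as a timeline (a hypothetical illustration, not Slurm or CAPMC code; names are made up):

```python
# Hypothetical sketch of the race: the reboot request is queued behind a
# running xtbounce, the job is cancelled, the node is returned to service,
# and only then does the queued bounce fire.
from collections import deque

bounce_queue = deque()      # only one xtbounce can run at a time
node_state = "REBOOTING"

# Steps 1-2: reboot requested; xtbounce is busy, so the reinit is queued.
bounce_queue.append("nid11230")

# Steps 3-4: job cancelled; node goes back to Idle while still queued.
node_state = "IDLE"         # a new job could even start here

# Step 5: the queued bounce finally fires and reboots a node that the
# controller now believes is healthy -> "Unexpected node reboot".
rebooted = bounce_queue.popleft()
print(rebooted, node_state == "IDLE")  # → nid11230 True
```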

Would it be possible to ask slurmd to terminate when a node reboot request is queued (or to provide some other means of ensuring that the node is allowed to re-join the cluster only after the next slurmd boot)?

Thanks,
Doug



Supporting information:
excerpt of slurmctld log for node nid11230:

[2017-01-07T03:36:59.003] debug:  Note large processing time from schedule: usec=1112305 began=03:36:57.891
[2017-01-07T06:14:57.804] debug:  Still waiting for boot of node nid11230
[2017-01-07T06:15:08.652] debug:  Still waiting for boot of node nid11230
[2017-01-07T06:16:43.359] update_node: node nid11230 state set to IDLE
[2017-01-07T06:16:43.397] Node nid11230 now responding
[2017-01-07T06:22:26.846] agent/is_node_resp: node:nid11230 RPC:REQUEST_PING : Communication connection failure
[2017-01-07T06:59:20.363] Node nid11230 rebooted 1859 secs ago
[2017-01-07T06:59:20.363] Node nid11230 now responding
[2017-01-07T06:59:20.363] validate_node_specs: Node nid11230 unexpectedly rebooted boot_time=1483799301 last response=1483798730
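
The "unexpectedly rebooted" line suggests the controller compares the node's reported boot time against its last recorded response from that node. A minimal sketch of that check, assuming that interpretation (this is not the Slurm source):

```python
# Sketch of the check implied by the validate_node_specs log line above:
# a node is treated as "unexpectedly rebooted" when its reported boot time
# is newer than the last response slurmctld recorded from it.
def unexpectedly_rebooted(boot_time: int, last_response: int) -> bool:
    """True when the node booted after the controller last heard from it."""
    return boot_time > last_response

# Epoch values taken from the nid11230 log entry:
print(unexpectedly_rebooted(1483799301, 1483798730))  # → True
```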



heavily truncated sinfo output (hundreds of nodes were affected):
...
Node unexpectedly rebooted                                   root      2      2017-01-07T06:58:51 nid[10121,11005]
Node unexpectedly rebooted                                   root      14     2017-01-07T06:58:53 nid[10710,10713,10905,10962,10966,10988,11007,11178-11179,11183-11184,11191,11238,11240]
Node unexpectedly rebooted                                   root      10     2017-01-07T06:58:52 nid[10889,10921,10929,10937,10967,11017,11028,11172,11233,11239]
Node unexpectedly rebooted                                   root      2      2017-01-07T06:59:02 nid[10935,11181]
Node unexpectedly rebooted                                   root      1      2017-01-07T06:59:20 nid11230
...



full slurmd log post reboot (we lose slurmd logs upon reboot, so this shows it really did reboot):
nid11230:~ # cat /var/spool/slurmd/nid11230.log
[2017-01-07T06:59:18.293] Message aggregation disabled
[2017-01-07T06:59:18.525] Resource spec: Reserved system memory limit not configured for this node
[2017-01-07T06:59:18.551] slurmd version 16.05.7 started
[2017-01-07T06:59:18.552] error: No /var/spool/slurmd/job_container_state file for job_container/cncu state recovery
[2017-01-07T06:59:18.553] core_spec/cray: init
[2017-01-07T06:59:18.563] slurmd started on Sat, 07 Jan 2017 06:59:18 -0800
[2017-01-07T06:59:18.563] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96757 TmpDisk=47902 Uptime=1859 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid11230:~ #




from the xtremoted log (receiving agent for the capmc requests):
...
<157>1 2017-01-07T06:14:57.432527-08:00 corismw1 xtremoted_dbutil 30294 - [hss_xtremoted@34] do_key_value_edit: Setting `mcdram_cfg=cache` for nids [10889,10...
...
<157>1 2017-01-07T06:15:02.166370-08:00 corismw1 xtremoted_dbutil 31829 - [hss_xtremoted@34] do_key_value_edit: Setting `numa_cfg=quad` for nids [10100,1...
....
<157>1 2017-01-07T06:15:03.420693-08:00 corismw1 xtremoted 33823 - [hss_xtremoted@34] auth_cb: Remote IP (10.128.3.76) URI(/capmc/node_reinit) request is  authorized
....
<157>1 2017-01-07T06:19:32.419644-08:00 corismw1 xtremoted 33823 - [hss_xtremoted@34] 46459 START xtremoted_node_on_off --reinit 11180,11196,11197,11183,11238,10712,11201,11202,11203,11204,11232,11191,11192,11193,11236,11195,10710,11198,11241,10713,10714,11231,11245,11233,11234,11237,11239,11240,11242,11229,11230,11243,11244,10100,10101,10102,10103,10104,10...
...


So the mode-change requests were processed at 06:14:57 and 06:15:02.  The node reinit was requested at 06:15:03, but for some reason the actual reboot wasn't started until 06:19:32.

During that time, job 3401601, which requested the node reboots, was allocated at 2017-01-07T06:14:53.633 and then cancelled/requeued at 2017-01-07T06:16:43.342, prior to the actual node_reinit command.

Given that Cray (1) queues xtbounce requests, (2) can retry xtbounce several times on failure, and (3) xtbounce can be disrupted by system resiliency operations (which will hopefully re-queue in the future), there is a _significant_ potential delay between the capmc node_reinit request and when the node reboot actually begins.  Thus, there should probably be some mechanism for ignoring slurmd reconnects or other invalid interference until the node has legitimately rebooted, once node_reinit has been called.

I can provide more detailed logs if that will be helpful.
Comment 5 Moe Jette 2017-01-23 14:31:30 MST
I'm not seeing this behavior unless capmc returns an error. Could you look for lines in your slurmctld log file around the reboot time that include the string "capmc_resume"?

If found, they are likely to be rather terse, such as this one:
error: capmc_resume[23797]: capmc(node_reinit,-n,1): 256
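
One quick way to run that check (the log path below is an assumption; substitute your site's SlurmctldLogFile setting):

```shell
# Search the controller log for capmc_resume errors around the reboot window.
# LOG defaults to a common location; override it for your site.
LOG=${LOG:-/var/log/slurm/slurmctld.log}
grep -n "capmc_resume" "$LOG" | tail -20
```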
Comment 7 Moe Jette 2017-02-02 11:14:04 MST
(In reply to Moe Jette from comment #5)
> I'm not seeing this behavior unless capmc returns an error. Could you look
> for lines in your slurmctld log file around the reboot time that include the
> string "capmc_resume"?
> 
> If found, they are likely to be rather terse, such as this one:
> error: capmc_resume[23797]: capmc(node_reinit,-n,1): 256

Any update on this?
Comment 8 Doug Jacobsen 2017-02-02 17:21:39 MST
Hello,
We'll deploy 16.05.9 on cori next week (and 17.01.0-0rc1 on gerty as
well).  We will also re-enable dynamic mode reprovisions at that time, so I
should be able to let you know by late next week if these corrections
helped.
Thanks so much,
Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


Comment 9 Moe Jette 2017-02-27 15:08:08 MST
(In reply to Doug Jacobsen from comment #8)
> Hello,
> We'll deploy 16.05.9 on cori next week (and 17.01.0-0rc1 on gerty aw
> well).  We will also re-enable dynamic mode reprovisions at that time, so I
> should be able to let you know by late next week if these corrections
> helped.
> Thanks so much,
> Doug

Any updates?

The "capmc reinit" command that Slurm uses to reboot nodes still does not return any information about specific failures. If Slurm requests a reboot of 1000 nodes and something bad happens to any of them, I just get back a generic error and have to try to determine what happened. I've got a request to Cray for node-specific errors for the "reinit" command, like those available for "set_mcdram_cfg" and "set_numa_cfg", but that doesn't seem to be going anywhere.
Comment 10 Moe Jette 2017-03-07 10:35:19 MST
I was revisiting this bug and was just able to reproduce the failure.
Comment 12 Moe Jette 2017-03-07 12:05:12 MST
(In reply to Moe Jette from comment #10)
> I was revisiting this bug and was just able to reproduce the failure.

Actually, I reproduced the bug with Slurm version 16.05.9 and can no longer reproduce it with version 16.05.10 (or 17.02.1). I believe the bug was fixed in this commit:
https://github.com/SchedMD/slurm/commit/f6d42fdbb293ca89da609779db8d8c04a86a8d13

Since there's been no update on this since Feb 2 and indications here are that the bug was fixed, I'll close this.
Comment 13 Doug Jacobsen 2017-03-07 12:08:42 MST
Thanks, Moe!

Comment 14 Moe Jette 2017-03-08 08:52:01 MST
Already fixed.