At NERSC, the customer has noticed that network quiesces lead to Slurm socket timeouts, causing job launch failures. Quiesces can cause network traffic to be delayed by up to about 90 seconds, and they currently have MessageTimeout set to 60. Would it be a good idea to increase the MessageTimeout even further to handle this?
If you have access to the Cray bugzilla, there's a wealth of information in bug 846093. The bug is urgent and acceptance-gating, so I've made it high priority here.
FYI - whenever you file against the NERSC site, Doug Jacobsen needs to be added as a CC; he manages the direct support contract for NERSC with SchedMD. If you'd prefer to keep filing this against the internal Cray system currently under acceptance, you should categorize this as Cray Internal instead. Adding Brian as well to keep him in the loop.

I'm assuming that these network quiesces are the cause behind both bug 3313 and 3315 as well; I'd suggest we close those out as duplicates of this issue to avoid further problems keeping everyone in sync. (One of those Doug owns, one of those Brian owns.)

From our perspective, that quiesce period is highly problematic. While you could increase the timeout to avoid this, is there anything that prevents repeated quiesce periods? At a certain point other timers will come into play here; Slurm does prefer to be able to maintain contact with all of the nodes at a reasonable interval, and it sounds like this network issue precludes that on a fairly regular basis.
Here's a comment from Chris Johns with more information about quiesces:

I'd just like to make the point again that a network quiesce is supposed to be a relatively rare event, happening only when links fail, or when a warm swap is done, and is supposed to last for as little time as possible. However, because some relatively complex, and system-wide, operations have to be done while the network is quiesced, there are some unavoidable durations that will occur in the quiesced state. In particular, there is an absolute minimum of 12 seconds to allow the network to drain of packets before and after up-but-unused links are taken down. In addition, distributing actions across the HSS network to all controllers, acting on those requests, and collecting and verifying responses, all take a certain amount of time. Realistically, the total time spent quiesced on any XC system will be at least 25 seconds, and, depending on particular conditions on any of the blade controllers, could be somewhat more than that, perhaps as high as 50 seconds or so.

Note that this is not the total time during which the network could be non-functional, because from the moment a link fails (not a 0x62ff - too many soft errors - failure, but a hard link going down failure), packets could stop getting delivered until the new routes are in place and the network is then unquiesced. In the case of one or more blades losing power, causing link failures, a 30 second timeout is in play to determine which blades failed, adding that much time to the processing of the failure. Consequently, in such a case, total time from the beginning of the network problem to an unquiesced, rerouted, fully working network could be on the order of 90 seconds.

This has been the case for many years now, since the first Gemini systems were shipped. It baffles me that we're just finding out now that a mere 45 seconds of quiesced time is sufficient to cause serious problems with parts of the CLE software stack.
We're not going to reduce the quiesced time below, in the best case, around 30 seconds, and we're not going to reduce the network-is-down time in the case of a blade failure below 60 seconds or so. Consequently, I would suggest that effort is put into understanding why parts of the software stack are having such trouble with these periods of network outage, and attempt to correct them. Other sites (e.g. CSCS, ECMWF, SS*) are not seeing these issues, so why is NERSC special in this regard?
This could be the same issue as 3313 and 3315 but I don't know that we've confirmed that a quiesce happened at the same time those jobs failed. In one set of logs from this system there were 37 quiesces in a time period of 3 days and 3 hours.
Is there any way to tell that a quiesce is underway on the compute nodes and sdb? I'd love to have a way to directly correlate those events to times when there have been other issues on the system; although I'm just speculating what that may entail at this point.
(In reply to Tim Wickberg from comment #8)
> Is there any way to tell that a quiesce is underway on the compute nodes and
> sdb?
>
> I'd love to have a way to directly correlate those events to times when
> there have been other issues on the system; although I'm just speculating
> what that may entail at this point.

Chris Johns says: "No, I don't believe so. There is an area of kernel memory in the ghal driver that has this information, but it's not exposed to user mode."

So the only way to tell is to dig through logs and correlate times. I'm going to try to do that next Tuesday with the logs we've been provided (vacation day Monday).
A really irritating way one can do it on node is to take a look at the kernel message log (dmesg), and see if there is a recent message from the LNet kernel module stating that the aries is quiesced:

ctl1:~ # dmesg | grep -i quies | tail
[1168692.345619] LNet: Quiesce start: hardware quiesce
[1168737.373305] LNet: Quiesce complete: hardware quiesce
[1169123.727892] LNet: Quiesce start: hardware quiesce
[1169178.758209] LNet: Quiesce complete: hardware quiesce
[1169566.178705] LNet: Quiesce start: hardware quiesce
[1169611.204935] LNet: Quiesce complete: hardware quiesce
[1187948.039176] LNet: Quiesce start: hardware quiesce
[1188003.071006] LNet: Quiesce complete: hardware quiesce
[1188382.446692] LNet: Quiesce start: hardware quiesce
[1188427.472886] LNet: Quiesce complete: hardware quiesce
ctl1:~ #

In any case I'm quite sure the sbcast issues are unrelated to quiesces, and do deserve a bug in their own right. A user did effectively do:

#!/bin/bash
set -e
sbcast ....
srun ...

And it still failed with the Text file busy error. In that case the job:

ctl1:~ # sacct -j 3195168 --format=job,start,end
       JobID               Start                 End
------------ ------------------- -------------------
3195168      2016-12-02T09:40:48 2016-12-02T09:41:41
3195168.bat+ 2016-12-02T09:40:48 2016-12-02T09:41:41
3195168.ext+ 2016-12-02T09:40:48 2016-12-02T09:43:36
3195168.0    2016-12-02T09:40:57 2016-12-02T09:41:31
ctl1:~ #

Checking the nlrd log shows that there were no network throttles at 9:40AM. Also there is not time for a 300s timeout between initiation of the batch step and step 0. Also sbcast did not end with a non-zero exit status (which it probably should have done if it exited without sending all data).

So, I think we can probably separate network throttles (generally bad for everyone, but especially TCP/IP traffic) from the sbcast issues.

-Doug
Doug - Before this weekend's fun with 3320 you were going to try to correlate some of these communication issues within slurm against the aries quiesce messages; have you had a chance to do that? Based on discussions here, setting the MessageTimeout to 120 may be an okay compromise; you'll have slightly longer delays detecting failed nodes, but may be able to smoothly work past the quiesce period.

David - Would it be possible to add a file under /sys that we could poll periodically for the Aries network status? Being able to report that in certain slurmd/slurmctld log messages would be a good start to better understanding when these communication issues arise.
The obvious problem with shutting down the network for 90 seconds is that users launching jobs interactively will suddenly find that the time to launch a job goes from milliseconds to tens of seconds, and that happens sporadically every couple of hours or so. It would be really nice to detect the situation and provide some sort of user notification that the network is unavailable, so they don't point fingers at Slurm for its highly inconsistent responsiveness.
What we need to keep in mind is that IP over Aries is not a robust IP network implementation. The Aries network is point-to-point, so all of the IP network functionality that is normally supplied by routers is missing. For example, the ICMP type 3 messages that signal unreachable endpoints are missing. On the Aries, many of the network errors that are smoothly handled by the network stack simply do not happen.
(In reply to Tim Wickberg from comment #12)
> David -
>
> Would it be possible to add a file under /sys that we could poll
> periodically for the Aries network status? Being able to report that in
> certain slurmd/slurmctld log messages would be a good start to better
> understanding when these communication issues arise.

That should already be in the ipogif device:

What:		/sys/class/net/<iface>/carrier
Date:		April 2005
KernelVersion:	2.6.12
Contact:	netdev@vger.kernel.org
Description:
		Indicates the current physical link state of the interface.
		Possible values are:
		0: physical link is down
		1: physical link is up

What:		/sys/class/net/<iface>/dormant
Date:		March 2006
KernelVersion:	2.6.17
Contact:	netdev@vger.kernel.org
Description:
		Indicates whether the interface is in dormant state.
		Possible values are:
		0: interface is not dormant
		1: interface is dormant
		This attribute can be used by supplicant software to signal
		that the device is not usable unless some supplicant-based
		authentication is performed (e.g: 802.1x). 'link_mode'
		attribute will also reflect the dormant state.

What:		/sys/class/net/<iface>/link_mode
Date:		March 2006
KernelVersion:	2.6.17
Contact:	netdev@vger.kernel.org
Description:
		Indicates the interface link mode, as a decimal number.
		This attribute should be used in conjunction with 'dormant'
		attribute to determine the interface usability. Possible
		values:
		0: default link mode
		1: dormant link mode

The IP over GNI driver should also log messages showing link state:

bgilmer@tiger:~> dmesg | awk '/ipogif/ { print }'
[   14.540837] ipogif_init:Cray(R) Gemini IP over Fabric device driver - version 0.17
[   14.553180] ipogif_init:Copyright (C) 2007 Cray Inc.
[   14.560530] ipogif_probe:Immediate checksum off
[   17.076917] ipogif_up:Bringing interface up
bgilmer@tiger:~> dmesg | awk '/eth0/ { print }'
[   15.256772] init: Boot interface (bootif) eth0 not present.
[   15.772228] igb 0000:03:00.0: added PHC on eth0
[   15.790280] igb 0000:03:00.0: eth0: (PCIe:2.5Gb/s:Width x4) 90:e2:ba:00:ab:c2
[   15.802249] igb 0000:03:00.0: eth0: PBA No: Unknown
[  165.604544] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[  167.671517] igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  167.682483] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Hmm, I just increased MessageTimeout on cori to 120s. Now all sinfo and scontrol instances generate the following ominous warning on invocation:

boot-p0:~ # for i in $(sinfo --format="%R" -h) ; do scontrol update part=$i state=up; done
sinfo: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
boot-p0:~ #

-Doug
Doug, that warning is only printed if you are root. Otherwise it is a debug message, emitted if the timeout is > 100. Perhaps we could make a debugflag that handles that differently?
Is there an interface spec for IPoGIF? Perhaps it could notify us that a quiesce is happening. If so, we could handle it in a more graceful manner.
Doug, how is the 120 timeout working? I could raise that check to 120 so the message doesn't get printed if you would like. Just let me know.
Hello,

Increasing MessageTimeout was insufficient to correct this issue. srun and sbcast have still been failing, particularly when individual nodes are getting paused upon ORB timeouts (10-15s localized quiesce).

We've increased TcpTimeout from 2s to 20s on cori (and 10s on edison). The justification: with the old 2s timeout and 3 retries, 2s * 3 retries = 6s (or 8s depending on how the retries are counted). In any case ORB timeouts are 10-15s pauses on either cori or edison, so TcpTimeout times the retries needs to at least exceed that limit. In addition, HSN re-route throttles are 22s on edison and 50s on cori. So a 10s TcpTimeout should cover those on edison, and 20s (with 3 retries) should cover cori.

I'll report again once we have more information about the success or failure of this approach, but I would strongly suspect that all cray/slurm sites would need an increased TcpTimeout, probably chosen based on the reroute throttle time of their network.

-Doug
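The arithmetic above can be sanity-checked with a small helper. This is only a sketch: the three-retry count and the assumption that each attempt waits the full TcpTimeout are taken from the comment, not from slurm's actual retry accounting.

```c
#include <assert.h>
#include <stdbool.h>

/* Worst-case seconds spent in connect retries, assuming each of the
 * `retries` attempts waits the full TcpTimeout before failing. */
static int worst_case_wait(int tcp_timeout_s, int retries)
{
	return tcp_timeout_s * retries;
}

/* True if the retry window outlasts a network pause of `pause_s` seconds. */
static bool covers_pause(int tcp_timeout_s, int retries, int pause_s)
{
	return worst_case_wait(tcp_timeout_s, retries) > pause_s;
}
```

With the old 2s TcpTimeout, 2s * 3 = 6s falls well short of a 10-15s ORB pause; 10s * 3 = 30s clears edison's 22s reroute throttle, and 20s * 3 = 60s clears cori's 50s.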
(In reply to Doug Jacobsen from comment #21)
> Hello,
>
> Increasing MessageTimeout was insufficient to correct this issue. srun and
> sbcasts have still been failing, particularly when individual nodes are
> getting paused upon ORB timeouts (10-15s localized quiesce).
>
> We've increased TcpTimeout from 2s to 20s on cori (and 10s on edison). The
> justification is that with 3 retries 2s * 3 retries = 6s (or 8s depending on
> how the retries are counted). In any case ORB timeouts are 10-15s pauses on
> either cori or edison, so TcpTimeout + retries needs to at least exceed that
> limit. In addition HSN re-route throttles are 22s on edison and 50s on
> cori. So 10s TcpTimeout should cover those on edison, and 20s (with 3
> retries) should cover on cori.
>
> I'll report again once we have more information about the success or failure
> of this approach, but I would strongly suspect that all cray/slurm sites
> would need to have an increased TcpTimeout, probably chosen based on the
> reroute throttle time of their network.

I'm assuming this has been working so far? Any further info you can provide?

Dropping the severity as I believe the increase in TcpTimeout is acting as a sufficient fix for the time being.
We ended up needing to set TcpTimeout to 60s on cori because many of the critical operations are not retried at this time. That said, things have improved for the bulk of our issues by doing that (as far as I can tell).

-Doug
David - Can we revisit exposing that network state bit from the ghal driver? I'm assuming it would be relatively straightforward to expose this info under /sys/ with access to the driver.

As NERSC's new bug 3463 shows, these network quiesce periods are a regular occurrence, and continue to cause unexpected issues for us. I'd also prefer to be able to lower the TcpTimeout on their systems for normal use.

I think that, given a way to monitor the network status, having Slurm preemptively back off on all communication would provide a much more robust method while not requiring us to come up with additional workarounds for each separate subsystem. What I'd like to explore is having slurm effectively 'pause' all operations during these periods, as otherwise the usual timeout mechanisms within Slurm are fighting with the unavailability of the communication fabric.

(In reply to David Gloe from comment #9)
> (In reply to Tim Wickberg from comment #8)
> > Is there any way to tell that a quiesce is underway on the compute nodes and
> > sdb?
> >
> > I'd love to have a way to directly correlate those events to times when
> > there have been other issues on the system; although I'm just speculating
> > what that may entail at this point.
>
> Chris Johns says: "No, I don't believe so. There is an area of kernel memory
> in the ghal driver that has this information, but it's not exposed to user
> mode."
>
> So the only way to tell is to dig through logs and correlate times. I'm
> going to try to do that next Tuesday with the logs we've been provided
> (vacation day Monday).
Unfortunately the work for that has not been scheduled. I've asked if they can bump up the priority.
Unfortunately, the HSS subscription and RCA libraries are Cray proprietary, so there could be legal issues for Slurm to link to them. Marlys Kohnke has already notified SchedMD of a method to use the xtconsumer command to subscribe for events. That method should be fine legally.

Doug, I advise you to remove the snippet from rca_lib.h - here's the disclaimer from the top of that file:

/*
 * Copyright (c) 2002 Cray, Inc.
 *
 * The contents of this file is proprietary information of Cray Inc.
 * and may not be disclosed without prior written consent.
 */
Here's the information from Marlys on using xtconsumer: https://bugs.schedmd.com/show_bug.cgi?id=2873#c6
(In reply to David Gloe from comment #28)
> Here's the information from Marlys on using xtconsumer:
> https://bugs.schedmd.com/show_bug.cgi?id=2873#c6

The use of xtconsumer requires slurmctld to run on the SDB and precludes using a backup controller.
(In reply to Brian F Gilmer from comment #29)
> The use of xtconsumer requires slurmctld to run on the SDB and precludes
> using a backup controller.

That also precludes informing the login and compute nodes that the network is quiescing, since the network is quiescing... What other mechanisms are available?
David, I don't see any way to remove or edit a comment, so I'm making this bug private. That said, I excerpted the pasted bit from a file installed by default on cray login nodes.

I'm not terribly interested in running this on the sdb because that generates a number of other problems with the way we must run slurmctld. Also, it seems to me that slurmd needs this information as well; it would be pretty limited if only slurmctld could access the data. Perhaps Cray could make the data available in some other way that would be accessible on all nodes?
I'm marking comment 26 internal, and removing the NERSC group here. With the NERSC group set, Brian Gilmer wouldn't be able to see this any longer.
I agree, it's silly to claim that installed plaintext header files are proprietary. I am not a lawyer, but it may be possible for Slurm to use the Cray event system if it is considered an operating system component, or if an exception is added to the Slurm license. If you would like I can try contacting Cray's legal team about this. Other than that, you could stress the importance of the RFE I filed with your support representative. It's Cray bug number 846903, JIRA issue NETARIES-44. I think it's not getting as much attention since it was filed internally and not by a customer.
OK, I'll read up on 846903 (assuming it's public).

Is there any chance we can have a 3-way discussion on this topic either Thursday or next week after the holiday? I'm given to understand that Moe may be visiting the Lab on Thursday/Friday (but I won't be in on Friday), so that may be an opportune time for a discussion. Otherwise all on phone next week would be good.

In particular, I think a brief discussion of the current state of affairs, what we think the desired behavior should be, bug 846903, and perhaps discussion of other methods that might have some relevance to allowing (1) a slurm process to determine if it is operating on a node that may not have network connectivity, (2) a slurm process to determine if it is about to send traffic to a disconnected end point (avoid ORB timeout), and possibly (3) getting throttle/congestion events into the user output.

Tina Declerck (system lead of cori) will probably reach out to Cray about this topic tomorrow.
I've made bug 846903 public, you should be able to see it now.
We're moving forward with adding a /proc file which shows quiesce status. I think that will be available in CLE 6.0up04 in June. I'll give further details on the path and format as I get them.
(In reply to David Gloe from comment #38)
> We're moving forward with adding a /proc file which shows quiesce status. I
> think that will be available in CLE 6.0up04 in June. I'll give further
> details on the path and format as I get them.

David - Any update on this? Would it be possible to get a patched version on Kachina ahead of June to start developing a fix?
This may be available on kachina now, though I can't confirm at the moment. Quiesce status is available in the file /sys/class/gni/ghal0/quiesce_status. A nonzero value means a quiesce is occurring.
(In reply to David Gloe from comment #40)
> This may be available on kachina now, though I can't confirm at the moment.
> Quiesce status is available in the file /sys/class/gni/ghal0/quiesce_status
>
> A nonzero value means a quiesce is occurring.

Where is this available? All nodes on the network?
(In reply to Moe Jette from comment #41)
> (In reply to David Gloe from comment #40)
> > This may be available on kachina now, though I can't confirm at the moment.
> > Quiesce status is available in the file /sys/class/gni/ghal0/quiesce_status
> >
> > A nonzero value means a quiesce is occurring.
>
> Where is this available? All nodes on the network?

Yes, it should be available on all nodes. It's also world readable.

dgloe@opal-p2:~> ls -l /sys/class/gni/ghal0/quiesce_status
-r--r--r-- 1 root root 4096 Apr 19 09:25 /sys/class/gni/ghal0/quiesce_status
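A minimal reader for that file might look like the following. This is a sketch only: the path is Cray CLE-specific, so the helper takes it as a parameter and reports -1 where the file does not exist (e.g. on non-Cray systems), rather than hard-coding the Cray path.

```c
#include <fcntl.h>
#include <unistd.h>

/* Read a one-character quiesce status file such as
 * /sys/class/gni/ghal0/quiesce_status.  Returns 1 if a quiesce is in
 * progress (nonzero status), 0 if not ('0'), or -1 if the file cannot
 * be read (e.g. not running on a Cray CLE node). */
static int quiesce_active(const char *path)
{
	char c;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = read(fd, &c, 1);
	close(fd);
	if (n != 1)
		return -1;
	return c != '0';
}
```

A caller such as slurmd could log the return value alongside communication errors to correlate failures with quiesce windows.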
I'm updating this to match our current understanding and proposed approach, and acknowledge that there is yet more work to tackle on this. I believe the next Cray release will expose /sys/class/gni/ghal0/quiesce_status, which we can then use to inform slurmctld/slurmd to back off their usual timeouts. Unfortunately, I have not had time to pursue this further, and this is unlikely to happen before we feature freeze 17.11 (although the associated code could likely be backported easily into 17.11 / 17.02 when/if finished). I'm reclassifying this as a Sev5 enhancement to reflect that.
FYI, it appears we have this path now:

nid00009:/proc/sys/kgnilnd # cat /sys/class/gni/ghal0/quiesce_status
0
nid00009:/proc/sys/kgnilnd #

as well as:

nid00009:/proc/sys/kgnilnd # cat /sys/class/gni/ghal0/quiesce
0
nid00009:/proc/sys/kgnilnd #

as well as:

nid00009:/proc/sys/kgnilnd # cat /proc/sys/kgnilnd/hw_quiesce
0
nid00009:/proc/sys/kgnilnd #

What is the difference between these?

This is ramping back up my priority list as the pressure to stop downing partitions during warmswap maintenance operations is building. I suppose that connections would be most sensitive, and thus an initial stab might be (against the 17.02 codebase right now):

diff --git a/src/common/slurm_protocol_socket_implementation.c b/src/common/slurm_protocol_socket_implementation.c
index 9f4608b..b5cec6d 100644
--- a/src/common/slurm_protocol_socket_implementation.c
+++ b/src/common/slurm_protocol_socket_implementation.c
@@ -470,6 +470,23 @@ extern int slurm_open_stream(slurm_addr_t *addr, bool retry)
 	uint16_t port;
 	char ip[32];
 
+#ifdef HAVE_NATIVE_CRAY
+	char buffer[20];
+	int max_retry = 300;
+	int quiesce_fd = open("/sys/class/gni/ghal0/quiesce_status", O_RDONLY);
+	while (quiesce_fd >= 0 && retry < max_retry) {
+		if (read(quiesce_fd, buffer, sizeof(buffer)) > 0) {
+			if (buffer[0] == '0')
+				break;
+		}
+		usleep(500000);
+		retry++;
+		close(quiesce_fd);
+		quiesce_fd = open("/sys/class/gni/ghal0/quiesce_status",
+				  O_RDONLY);
+	}
+#endif
+
 	if ( (addr->sin_family == 0) || (addr->sin_port == 0) ) {
 		error("Error connecting, bad data: family = %u, port = %u",
 		      addr->sin_family, addr->sin_port);

David, is there a way to generate an event that will force a quiesce? I suppose I could force a warmswap and try to stage things to "test" this, but it feels like forcing an event with xtgenevent on the SMW might make the testing a little more reliable.

I'm testing a version of the smw_xtconsumer_slurm_helperd I've been working on and will send that shortly.
doh, forgot to close the fd in all (most) cases:

diff --git a/src/common/slurm_protocol_socket_implementation.c b/src/common/slurm_protocol_socket_implementation.c
index 9f4608b..859d49e 100644
--- a/src/common/slurm_protocol_socket_implementation.c
+++ b/src/common/slurm_protocol_socket_implementation.c
@@ -470,6 +470,25 @@ extern int slurm_open_stream(slurm_addr_t *addr, bool retry)
 	uint16_t port;
 	char ip[32];
 
+#ifdef HAVE_NATIVE_CRAY
+	char buffer[20];
+	int max_retry = 300;
+	int quiesce_fd = open("/sys/class/gni/ghal0/quiesce_status", O_RDONLY);
+	while (quiesce_fd >= 0 && retry < max_retry) {
+		if (read(quiesce_fd, buffer, sizeof(buffer)) > 0) {
+			if (buffer[0] == '0')
+				break;
+		}
+		usleep(500000);
+		retry++;
+		close(quiesce_fd);
+		quiesce_fd = open("/sys/class/gni/ghal0/quiesce_status",
+				  O_RDONLY);
+	}
+	if (quiesce_fd >= 0)
+		close(quiesce_fd);
+#endif
+
 	if ( (addr->sin_family == 0) || (addr->sin_port == 0) ) {
 		error("Error connecting, bad data: family = %u, port = %u",
 		      addr->sin_family, addr->sin_port);
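Refactored into a standalone helper, the same back-off loop might look like this. A sketch only, assuming the one-character '0'/'1' format of quiesce_status; unlike the patch above, it simply reports failure (rather than proceeding with the connect) when the status file is missing, so the caller decides what to do on non-Cray systems.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Poll a quiesce-status file until it reads '0' (unquiesced), sleeping
 * `interval_us` microseconds between polls, up to `max_polls` attempts.
 * Returns true once the network is seen unquiesced; false on timeout or
 * when the status file cannot be read at all. */
static bool wait_for_unquiesce(const char *path, unsigned int interval_us,
			       int max_polls)
{
	int i;

	for (i = 0; i < max_polls; i++) {
		char c = 0;
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return false;	/* no status file to consult */
		if (read(fd, &c, 1) == 1 && c == '0') {
			close(fd);
			return true;	/* unquiesced */
		}
		close(fd);
		usleep(interval_us);
	}
	return false;			/* still quiesced after max_polls */
}
```

With a 500ms interval and 300 polls this gives the same 150-second upper bound as the patch's max_retry of 300.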
Created attachment 5546 [details]
slurmsmwd initial patch

17.02 patch (only because the slurm.spec and automake changes are presently against slurm 17.02). This is lightly tested and appears functional. The spec file modifications are essentially untested.

NERSC is using our elogin build of slurm on the SMW (built with --enable-really-no-cray), copying the elogin version of /etc/slurm, and running munge on the SMW to use this daemon. We are presently only testing it on a test system, but initial results function as expected. When an ec_node_failed or ec_node_unavailable message is received, the targeted nodes are marked NotResponding. This code assumes Cascade hardware (it encodes a mechanism for translating cnames to nids).

Example /etc/slurm/slurmsmwd.conf:

CabinetsPerRow=12
LogFile=/var/opt/cray/log/p0-current/slurmsmwd.log
DebugLevel=debug3

CabinetsPerRow is the only input required on XC systems to convert cnames to nids. For single-row systems (or air-cooled systems), any value greater than or equal to the number of cabinets is fine. A systemd service file is included.
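For reference, the cname-to-nid translation that CabinetsPerRow feeds can be sketched as follows. This is an illustrative reimplementation, not the attachment's code, and it assumes the standard XC geometry of 3 chassis per cabinet, 16 slots per chassis, and 4 nodes per slot (192 nodes per cabinet).

```c
#include <stdio.h>

/* Convert an XC cname such as "c2-0c1s8n2" (cabinet 2, row 0,
 * chassis 1, slot 8, node 2) to a nid, given the system's
 * CabinetsPerRow.  Assumes 192 nodes per cabinet (3 chassis x 16
 * slots x 4 nodes); returns -1 on parse failure. */
static int cname_to_nid(const char *cname, int cabinets_per_row)
{
	int cab, row, chassis, slot, node;

	if (sscanf(cname, "c%d-%dc%ds%dn%d",
		   &cab, &row, &chassis, &slot, &node) != 5)
		return -1;
	return (row * cabinets_per_row + cab) * 192
		+ chassis * 64 + slot * 4 + node;
}
```

This reproduces the mappings visible in the slurmsmwd log excerpts later in this ticket: c2-0c1s8n2 maps to nid 482, and c1-2c2s4n1 (with CabinetsPerRow=8) maps to nid 3409.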
I'm not sure what the quiesce and hw_quiesce files do. The quiesce_status file will tell you if there's a quiesce happening at the moment. You can use xtwarmswap without changing out hardware at all to force a quiesce. I don't know of other ways to quiesce the network on command.
Created attachment 5554 [details] aries resiliency protecting connect() during local quiesce
Created attachment 5633 [details] 17.11 slurmsmwd patch
Created attachment 5634 [details] 17.11 check_ghal patch
Created attachment 5635 [details] 17.11 slurmsmwd spec file patch
I've been using these patches (and nearly identical patches against slurm 17.02) on my test systems for a couple weeks now with no obvious negative impact. Will be rolling out to production in slurm 17.02.9 (or .10 depending on the release date) in the next week or two. If it's possible for inclusion in 17.11, I think that would be ideal -- especially since the bulk of the code is in contrib. Otherwise I'll just keep patching and will have to somehow direct other interested parties in the Cray/Slurm community to this bug.
Thanks for getting this submitted, this is looking really nice. But I can't get this reviewed properly for 17.11.0 - we're still expecting to cut the release tomorrow. I'll discuss internally whether we can override our usual restrictions and get slurmsmwd included in 17.11.1 or some later maintenance release. - Tim
OK, no trouble. It seems to merge in fine for me, so I'll live with it in our build system for the time being.
FYI, we've been operating slurm 17.11 with slurmsmwd and the ghal patches on all of our cray systems. We've been allowing jobs to schedule through warmswaps and have not noticed any negative effects so far. (In the past we would mark partitions down before starting a warmswap operation; this prevented large-scale sruns from crashing, on the theory that job starts lead to srun starts.)

Example debug3 slurmsmwd log output:

[2018-01-22T14:01:13.739] down node cnt: 36
[2018-01-22T14:01:13.739] setting nid[01340-01343,01488-01491,01572-01575,01668-01671,02732-02735,02952-02955,03196-03199,03408-03411,03852-03855] to NotResponding
[2018-01-23T11:00:02.909] debug3: read 97 bytes
[2018-01-23T11:00:02.925] debug3: got line: 2018-01-23 11:00:02|2018-01-23 11:00:02|0x400020e8 - ec_node_unavailable|src=:1:s0|::c2-0c1s8n2
[2018-01-23T11:00:02.925] received event: ec_node_unavailable, nodelist: ::c2-0c1s8n2
[2018-01-23T11:00:03.709] down node cnt: 1
[2018-01-23T11:00:03.709] setting nid[00482] to NotResponding
[2018-01-23T11:00:27.546] debug3: read 97 bytes
[2018-01-23T11:00:27.546] debug3: got line: :2018-01-23 11:00:27|2018-01-23 11:00:27|0x400020e8 - ec_node_unavailable|src=:1:s0|::c2-0c1s8n2
[2018-01-23T11:00:27.546] received event: ec_node_unavailable, nodelist: ::c2-0c1s8n2
[2018-01-23T11:00:28.959] down node cnt: 1
[2018-01-23T11:00:28.960] setting nid[00482] to NotResponding
[2018-01-23T20:56:05.713] debug3: read 189 bytes
[2018-01-23T20:56:05.713] debug3: got line: 2018-01-23 20:56:05|2018-01-23 20:56:05|0x40008063 - ec_node_failed|src=:1:s0|::c1-2c2s4n1
[2018-01-23T20:56:05.713] received event: ec_node_failed, nodelist: ::c1-2c2s4n1
[2018-01-23T20:56:05.713] debug3: got line: 2018-01-23 20:56:05|2018-01-23 20:56:05|0x400020e8 - ec_node_unavailable|src=:1:s0|::c1-2c2s4n1
[2018-01-23T20:56:05.713] received event: ec_node_unavailable, nodelist: ::c1-2c2s4n1
[2018-01-23T20:56:06.418] down node cnt: 2
[2018-01-23T20:56:06.418] setting nid[03409] to NotResponding
edismw:/var/opt/cray/disk/1/log/p0-20180109t183252 # cat /etc/slurm/slurmsmwd.conf
CabinetsPerRow=8
LogFile=/var/opt/cray/log/p0-current/slurmsmwd.log
DebugLevel=debug3
Created attachment 6367 [details]
Revised patch to work in 17.11

Doug, attached you will find a patch we plan to include in 17.11.6. Please let me know if you see anything I messed up. The code is largely the same as yours, with just style edits.
Doug, this patch is now part of 17.11.6+, commit 9c3441909bc078. Please report any issues if any are seen. We are still mulling over attachment 5634 [details] and will get back to you more on this. I am guessing that since you installed this patch you have not had any quiesce issues?
Doug,

The Ghal patch has been added to 18.08. In commit ee649a605814 we added a new slurm.conf option, CommunicationParameters; we then added your patch in commit 6c8389533b74322, enabled with CommunicationParameters=CheckGhalQuiesce.

One other note: going forward we moved No*InAddrAny from TopologyParam to CommunicationParameters as well, in commit 5e87370ddd729. The old method will still work, but is no longer documented.

Please reopen if anything else is needed on this. Until 18.08, using your patch as a local mod is the best option.
*** Ticket 3769 has been marked as a duplicate of this ticket. ***