At NERSC, the customer has noticed that network quiesces lead to Slurm socket timeouts, causing job launch failures. Quiesces can cause network traffic to be delayed by up to about 90 seconds, and they currently have MessageTimeout set to 60. Would it be a good idea to increase the MessageTimeout even further to handle this?
If you have access to the Cray bugzilla, there's a wealth of information in bug 846093. The bug is urgent and acceptance-gating, so I've made it high priority here.
FYI - whenever you file against the NERSC site, Doug Jacobsen needs to be added as a CC; he manages the direct support contract for NERSC with SchedMD. If you'd prefer to keep filing this against the internal Cray system currently under acceptance, you should categorize this as Cray Internal instead. Adding Brian as well to keep him in the loop.

I'm assuming that these network quiesces are the cause behind both bug 3313 and 3315 as well; I'd suggest we close those out as duplicates of this issue to avoid further problems keeping everyone in sync. (One of those Doug owns, one of those Brian owns.)

From our perspective, that quiesce period is highly problematic. While you could increase the timeout to avoid this, is there anything that prevents repeated quiesce periods? At a certain point other timers will come into play here; Slurm does prefer to be able to maintain contact with all of the nodes at a reasonable interval, and it sounds like this network issue precludes that on a fairly regular basis.
Here's a comment from Chris Johns with more information about quiesces:

I'd just like to make the point again that a network quiesce is supposed to be a relatively rare event, happening only when links fail, or when a warm swap is done, and is supposed to last for as little time as possible. However, because some relatively complex, and system-wide, operations have to be done while the network is quiesced, there are some unavoidable durations that will occur in the quiesced state. In particular, there is an absolute minimum of 12 seconds to allow the network to drain of packets before and after up-but-unused links are taken down. In addition, distributing actions across the HSS network to all controllers, acting on those requests, and collecting and verifying responses, all take a certain amount of time. Realistically, the total time spent quiesced on any XC system will be at least 25 seconds, and, depending on particular conditions on any of the blade controllers, could be somewhat more than that, perhaps as high as 50 seconds or so.

Note that this is not the total time during which the network could be non-functional, because from the moment a link fails (not a 0x62ff - too many soft errors - failure, but a hard link going down failure), packets could stop getting delivered until the new routes are in place and the network is then unquiesced. In the case of one or more blades losing power, causing link failures, a 30 second timeout is in play to determine which blades failed, adding that much time to the processing of the failure. Consequently, in such a case, total time from the beginning of the network problem to an unquiesced, rerouted, fully working network could be on the order of 90 seconds.

This has been the case for many years now, since the first Gemini systems were shipped. It baffles me that we're just finding out now that a mere 45 seconds of quiesced time is sufficient to cause serious problems with parts of the CLE software stack.
We're not going to reduce the quiesced time below, in the best case, around 30 seconds, and we're not going to reduce the network-is-down time in the case of a blade failure below 60 seconds or so. Consequently, I would suggest that effort is put into understanding why parts of the software stack are having such trouble with these periods of network outage, and attempt to correct them. Other sites (e.g. CSCS, ECMWF, SS*) are not seeing these issues, so why is NERSC special in this regard?
This could be the same issue as 3313 and 3315 but I don't know that we've confirmed that a quiesce happened at the same time those jobs failed. In one set of logs from this system there were 37 quiesces in a time period of 3 days and 3 hours.
Is there any way to tell that a quiesce is underway on the compute nodes and sdb? I'd love to have a way to directly correlate those events to times when there have been other issues on the system; although I'm just speculating what that may entail at this point.
(In reply to Tim Wickberg from comment #8)
> Is there any way to tell that a quiesce is underway on the compute nodes and
> sdb?
>
> I'd love to have a way to directly correlate those events to times when
> there have been other issues on the system; although I'm just speculating
> what that may entail at this point.

Chris Johns says: "No, I don't believe so. There is an area of kernel memory in the ghal driver that has this information, but it's not exposed to user mode."

So the only way to tell is to dig through logs and correlate times. I'm going to try to do that next Tuesday with the logs we've been provided (vacation day Monday).
A really irritating way one can do it on node is to take a look at the kernel message log (dmesg), and see if there is a recent message from the LNet kernel module stating that the aries is quiesced:

ctl1:~ # dmesg | grep -i quies | tail
[1168692.345619] LNet: Quiesce start: hardware quiesce
[1168737.373305] LNet: Quiesce complete: hardware quiesce
[1169123.727892] LNet: Quiesce start: hardware quiesce
[1169178.758209] LNet: Quiesce complete: hardware quiesce
[1169566.178705] LNet: Quiesce start: hardware quiesce
[1169611.204935] LNet: Quiesce complete: hardware quiesce
[1187948.039176] LNet: Quiesce start: hardware quiesce
[1188003.071006] LNet: Quiesce complete: hardware quiesce
[1188382.446692] LNet: Quiesce start: hardware quiesce
[1188427.472886] LNet: Quiesce complete: hardware quiesce
ctl1:~ #

In any case I'm quite sure the sbcast issues are unrelated to quiesces, and do deserve a bug in their own right. A user did effectively do:

#!/bin/bash
set -e
sbcast ....
srun ...

And it still failed with the Text file busy error. In that case the job:

ctl1:~ # sacct -j 3195168 --format=job,start,end
       JobID               Start                 End
------------ ------------------- -------------------
3195168      2016-12-02T09:40:48 2016-12-02T09:41:41
3195168.bat+ 2016-12-02T09:40:48 2016-12-02T09:41:41
3195168.ext+ 2016-12-02T09:40:48 2016-12-02T09:43:36
3195168.0    2016-12-02T09:40:57 2016-12-02T09:41:31
ctl1:~ #

Checking the nlrd log shows that there were no network throttles at 9:40AM. Also there is not time for a 300s timeout between initiation of the batch step and step 0. Also sbcast did not end with a non-zero exit status (which it probably should have done if it exited without sending all data).

So, I think we can probably separate network throttles (generally bad for everyone, but especially TCP/IP traffic) from the sbcast issues.

-Doug
Doug - Before this weekend's fun with 3320 you were going to try to correlate some of these communication issues within slurm against the aries quiesce messages; have you had a chance to do that? Based on discussions here, setting the MessageTimeout to 120 may be an okay compromise; you'll have slightly longer delays detecting failed nodes, but may be able to smoothly work past the quiesce period.

David - Would it be possible to add a file under /sys that we could poll periodically for the Aries network status? Being able to report that in certain slurmd/slurmctld log messages would be a good start to better understanding when these communication issues arise.
The obvious problem with shutting down the network for 90 seconds is that users launching jobs interactively will suddenly find that the time to launch a job goes from milliseconds to tens of seconds, and that happens sporadically every couple of hours or so. It would be really nice to detect the situation and provide some sort of user notification that the network is unavailable, so they don't point fingers at Slurm for its highly inconsistent responsiveness.
What we need to keep in mind is that IP over Aries is not a robust IP network implementation. The Aries network is point-to-point, so all of the IP network functionality that is normally supplied by routers is missing. For example, the ICMP type 3 messages that signal unreachable endpoints are missing. On the Aries, many of the network errors that are smoothly handled by the network stack simply do not happen.
(In reply to Tim Wickberg from comment #12)
> David -
>
> Would it be possible to add a file under /sys that we could poll
> periodically for the Aries network status? Being able to report that in
> certain slurmd/slurmctld log messages would be a good start to better
> understanding when these communication issues arise.

That should already be in the ipogif device:

What:		/sys/class/net/<iface>/carrier
Date:		April 2005
KernelVersion:	2.6.12
Contact:	netdev@vger.kernel.org
Description:
		Indicates the current physical link state of the interface.
		Possible values are:
		0: physical link is down
		1: physical link is up

What:		/sys/class/net/<iface>/dormant
Date:		March 2006
KernelVersion:	2.6.17
Contact:	netdev@vger.kernel.org
Description:
		Indicates whether the interface is in dormant state.
		Possible values are:
		0: interface is not dormant
		1: interface is dormant
		This attribute can be used by supplicant software to signal
		that the device is not usable unless some supplicant-based
		authentication is performed (e.g: 802.1x). 'link_mode'
		attribute will also reflect the dormant state.

What:		/sys/class/net/<iface>/link_mode
Date:		March 2006
KernelVersion:	2.6.17
Contact:	netdev@vger.kernel.org
Description:
		Indicates the interface link mode, as a decimal number.
		This attribute should be used in conjunction with 'dormant'
		attribute to determine the interface usability. Possible
		values:
		0: default link mode
		1: dormant link mode

The IP over GNI driver should also log messages showing link state:

bgilmer@tiger:~> dmesg | awk '/ipogif/ { print }'
[   14.540837] ipogif_init:Cray(R) Gemini IP over Fabric device driver - version 0.17
[   14.553180] ipogif_init:Copyright (C) 2007 Cray Inc.
[   14.560530] ipogif_probe:Immediate checksum off
[   17.076917] ipogif_up:Bringing interface up
bgilmer@tiger:~> dmesg | awk '/eth0/ { print }'
[   15.256772] init: Boot interface (bootif) eth0 not present.
[   15.772228] igb 0000:03:00.0: added PHC on eth0
[   15.790280] igb 0000:03:00.0: eth0: (PCIe:2.5Gb/s:Width x4) 90:e2:ba:00:ab:c2
[   15.802249] igb 0000:03:00.0: eth0: PBA No: Unknown
[  165.604544] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[  167.671517] igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  167.682483] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Hmm, I just increased MessageTimeout on cori to 120s. Now all sinfo and scontrol instances generate the following ominous warning on invocation:

boot-p0:~ # for i in $(sinfo --format="%R" -h) ; do scontrol update part=$i state=up; done
sinfo: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
scontrol: WARNING: MessageTimeout is too high for effective fault-tolerance
boot-p0:~ #

-Doug
Doug, that warning is only printed if you are root. Otherwise it is a debug message, emitted if the timeout is > 100. Perhaps we could make a debugflag that handles that differently?
Is there an interface spec for IPoGIF? Perhaps it could notify us that a quiesce is happening. If so, we could handle it in a more graceful manner.
Doug, how is the 120 timeout working? I could raise that check to 120 so the message doesn't get printed if you would like. Just let me know.
Hello,

Increasing MessageTimeout was insufficient to correct this issue. srun and sbcast have still been failing, particularly when individual nodes are getting paused upon ORB timeouts (10-15s localized quiesce).

We've increased TcpTimeout from 2s to 20s on cori (and 10s on edison). The justification: with the old 2s timeout and 3 retries, 2s * 3 retries = 6s (or 8s depending on how the retries are counted). In any case ORB timeouts are 10-15s pauses on either cori or edison, so TcpTimeout times the retries needs to at least exceed that limit. In addition, HSN re-route throttles are 22s on edison and 50s on cori. So a 10s TcpTimeout should cover those on edison, and 20s (with 3 retries) should cover cori.

I'll report again once we have more information about the success or failure of this approach, but I would strongly suspect that all cray/slurm sites would need an increased TcpTimeout, probably chosen based on the reroute throttle time of their network.

-Doug
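The arithmetic above can be sanity-checked with a small helper. This is only a sketch: the three-retry count and the assumption that each attempt waits the full TcpTimeout are taken from the comment, not from slurm's actual retry accounting.

```c
#include <assert.h>
#include <stdbool.h>

/* Worst-case seconds spent in connect retries, assuming each of the
 * `retries` attempts waits the full TcpTimeout before failing. */
static int worst_case_wait(int tcp_timeout_s, int retries)
{
	return tcp_timeout_s * retries;
}

/* True if the retry window outlasts a network pause of `pause_s` seconds. */
static bool covers_pause(int tcp_timeout_s, int retries, int pause_s)
{
	return worst_case_wait(tcp_timeout_s, retries) > pause_s;
}
```

With the old 2s TcpTimeout, 2s * 3 = 6s falls well short of a 10-15s ORB pause; 10s * 3 = 30s clears edison's 22s reroute throttle, and 20s * 3 = 60s clears cori's 50s.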
(In reply to Doug Jacobsen from comment #21)
> Hello,
>
> Increasing MessageTimeout was insufficient to correct this issue. srun and
> sbcasts have still been failing, particularly when individual nodes are
> getting paused upon ORB timeouts (10-15s localized quiesce).
>
> We've increased TcpTimeout from 2s to 20s on cori (and 10s on edison). The
> justification is that with 3 retries 2s * 3 retries = 6s (or 8s depending on
> how the retries are counted). In any case ORB timeouts are 10-15s pauses on
> either cori or edison, so TcpTimeout + retries needs to at least exceed that
> limit. In addition HSN re-route throttles are 22s on edison and 50s on
> cori. So 10s TcpTimeout should cover those on edison, and 20s (with 3
> retries) should cover on cori.
>
> I'll report again once we have more information about the success or failure
> of this approach, but I would strongly suspect that all cray/slurm sites
> would need to have an increased TcpTimeout, probably chosen based on the
> reroute throttle time of their network.

I'm assuming this has been working so far? Any further info you can provide?

Dropping the severity as I believe the increase in TcpTimeout is acting as a sufficient fix for the time being.
We ended up needing to set TcpTimeout to 60s on cori because many of the critical operations are not retried at this time. That said, things have improved for the bulk of our issues by doing that (as far as I can tell).

-Doug
David - Can we revisit exposing that network state bit from the ghal driver? I'm assuming it would be relatively straightforward to expose this info under /sys/ with access to the driver.

As NERSC's new bug 3463 shows, these network quiesce periods are a regular occurrence, and continue to cause unexpected issues for us. I'd also prefer to be able to lower the TcpTimeout on their systems for normal use.

I think that, given a way to monitor the network status, having Slurm preemptively back off on all communication would provide a much more robust method while not requiring us to come up with additional workarounds for each separate subsystem. What I'd like to explore is having slurm effectively 'pause' all operations during these periods, as otherwise the usual timeout mechanisms within Slurm are fighting with the unavailability of the communication fabric.

(In reply to David Gloe from comment #9)
> (In reply to Tim Wickberg from comment #8)
> > Is there any way to tell that a quiesce is underway on the compute nodes and
> > sdb?
> >
> > I'd love to have a way to directly correlate those events to times when
> > there have been other issues on the system; although I'm just speculating
> > what that may entail at this point.
>
> Chris Johns says: "No, I don't believe so. There is an area of kernel memory
> in the ghal driver that has this information, but it's not exposed to user
> mode."
>
> So the only way to tell is to dig through logs and correlate times. I'm
> going to try to do that next Tuesday with the logs we've been provided
> (vacation day Monday).
Unfortunately the work for that has not been scheduled. I've asked if they can bump up the priority.
Unfortunately, the HSS subscription and RCA libraries are Cray proprietary, so there could be legal issues for Slurm to link to them. Marlys Kohnke has already notified SchedMD of a method to use the xtconsumer command to subscribe for events. That method should be fine legally.

Doug, I advise you to remove the snippet from rca_lib.h - here's the disclaimer from the top of that file:

/*
 * Copyright (c) 2002 Cray, Inc.
 *
 * The contents of this file is proprietary information of Cray Inc.
 * and may not be disclosed without prior written consent.
 */
Here's the information from Marlys on using xtconsumer: https://bugs.schedmd.com/show_bug.cgi?id=2873#c6
(In reply to David Gloe from comment #28)
> Here's the information from Marlys on using xtconsumer:
> https://bugs.schedmd.com/show_bug.cgi?id=2873#c6

The use of xtconsumer requires slurmctld to run on the SDB and precludes using a backup controller.
(In reply to Brian F Gilmer from comment #29)
> The use of xtconsumer requires slurmctld to run on the SDB and precludes
> using a backup controller.

That also precludes informing the login and compute nodes that the network is quiescing, since the network is quiescing... What other mechanisms are available?
David, I don't see any way to remove or edit a comment, so I'm making this bug private. That said, I excerpted the pasted bit from a file installed by default on cray login nodes.

I'm not terribly interested in running this on the sdb because that generates a number of other problems with the way we must run slurmctld. Also, it seems to me that slurmd needs this information as well; it would be pretty limited if only slurmctld could access the data. Perhaps Cray could make the data available in some other way that would be accessible on all nodes?
I'm marking comment 26 internal, and removing the NERSC group here. With the NERSC group set, Brian Gilmer wouldn't be able to see this any longer.
I agree, it's silly to claim that installed plaintext header files are proprietary. I am not a lawyer, but it may be possible for Slurm to use the Cray event system if it is considered an operating system component, or if an exception is added to the Slurm license. If you would like I can try contacting Cray's legal team about this. Other than that, you could stress the importance of the RFE I filed with your support representative. It's Cray bug number 846903, JIRA issue NETARIES-44. I think it's not getting as much attention since it was filed internally and not by a customer.
OK, I'll read up on 846903 (assuming it's public).

Is there any chance we can have a 3-way discussion on this topic either Thursday or next week after the holiday? I'm given to understand that Moe may be visiting the Lab on Thursday/Friday (but I won't be in on Friday), so that may be an opportune time for a discussion. Otherwise all on phone next week would be good.

In particular, I think a brief discussion of the current state of affairs, what we think the desired behavior should be, bug 846903, and perhaps discussion of other methods that might have some relevance to allowing (1) a slurm process to determine if it is operating on a node that may not have network connectivity, (2) a slurm process to determine if it is about to send traffic to a disconnected end point (avoid ORB timeout), and possibly (3) getting throttle/congestion events into the user output.

Tina Declerck (system lead of cori) will probably reach out to Cray about this topic tomorrow.
I've made bug 846903 public, you should be able to see it now.
We're moving forward with adding a /proc file which shows quiesce status. I think that will be available in CLE 6.0up04 in June. I'll give further details on the path and format as I get them.
(In reply to David Gloe from comment #38)
> We're moving forward with adding a /proc file which shows quiesce status. I
> think that will be available in CLE 6.0up04 in June. I'll give further
> details on the path and format as I get them.

David - Any update on this? Would it be possible to get a patched version on Kachina ahead of June to start developing a fix?
This may be available on kachina now, though I can't confirm at the moment. Quiesce status is available in the file /sys/class/gni/ghal0/quiesce_status. A nonzero value means a quiesce is occurring.
(In reply to David Gloe from comment #40)
> This may be available on kachina now, though I can't confirm at the moment.
> Quiesce status is available in the file /sys/class/gni/ghal0/quiesce_status
>
> A nonzero value means a quiesce is occurring.

Where is this available? All nodes on the network?
(In reply to Moe Jette from comment #41)
> (In reply to David Gloe from comment #40)
> > This may be available on kachina now, though I can't confirm at the moment.
> > Quiesce status is available in the file /sys/class/gni/ghal0/quiesce_status
> >
> > A nonzero value means a quiesce is occurring.
>
> Where is this available? All nodes on the network?

Yes, it should be available on all nodes. It's also world readable.

dgloe@opal-p2:~> ls -l /sys/class/gni/ghal0/quiesce_status
-r--r--r-- 1 root root 4096 Apr 19 09:25 /sys/class/gni/ghal0/quiesce_status
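A minimal reader for that file might look like the following. This is a sketch only: the path is Cray CLE-specific, so the helper takes it as a parameter and reports -1 where the file does not exist (e.g. on non-Cray systems), rather than hard-coding the Cray path.

```c
#include <fcntl.h>
#include <unistd.h>

/* Read a one-character quiesce status file such as
 * /sys/class/gni/ghal0/quiesce_status.  Returns 1 if a quiesce is in
 * progress (nonzero status), 0 if not ('0'), or -1 if the file cannot
 * be read (e.g. not running on a Cray CLE node). */
static int quiesce_active(const char *path)
{
	char c;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = read(fd, &c, 1);
	close(fd);
	if (n != 1)
		return -1;
	return c != '0';
}
```

A caller such as slurmd could log the return value alongside communication errors to correlate failures with quiesce windows.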
I'm updating this to match our current understanding and proposed approach, and acknowledge that there is yet more work to tackle on this. I believe the next Cray release will expose /sys/class/gni/ghal0/quiesce_status, which we can then use to inform slurmctld/slurmd to back off their usual timeouts. Unfortunately, I have not had time to pursue this further, and this is unlikely to happen before we feature freeze 17.11 (although the associated code could likely be backported easily into 17.11 / 17.02 when/if finished). I'm reclassifying this as a Sev5 enhancement to reflect that.
FYI, it appears we have this path now:

nid00009:/proc/sys/kgnilnd # cat /sys/class/gni/ghal0/quiesce_status
0
nid00009:/proc/sys/kgnilnd #

as well as:

nid00009:/proc/sys/kgnilnd # cat /sys/class/gni/ghal0/quiesce
0
nid00009:/proc/sys/kgnilnd #

as well as:

nid00009:/proc/sys/kgnilnd # cat /proc/sys/kgnilnd/hw_quiesce
0
nid00009:/proc/sys/kgnilnd #

What is the difference between these?

This is ramping back up my priority list as the pressure to stop downing partitions during warmswap maintenance operations is building. I suppose that connections would be most sensitive, and thus an initial stab might be (against the 17.02 codebase right now):

diff --git a/src/common/slurm_protocol_socket_implementation.c b/src/common/slurm_protocol_socket_implementation.c
index 9f4608b..b5cec6d 100644
--- a/src/common/slurm_protocol_socket_implementation.c
+++ b/src/common/slurm_protocol_socket_implementation.c
@@ -470,6 +470,23 @@ extern int slurm_open_stream(slurm_addr_t *addr, bool retry)
 	uint16_t port;
 	char ip[32];
 
+#ifdef HAVE_NATIVE_CRAY
+	char buffer[20];
+	int max_retry = 300;
+	int quiesce_fd = open("/sys/class/gni/ghal0/quiesce_status", O_RDONLY);
+	while (quiesce_fd >= 0 && retry < max_retry) {
+		if (read(quiesce_fd, buffer, sizeof(buffer)) > 0) {
+			if (buffer[0] == '0')
+				break;
+		}
+		usleep(500000);
+		retry++;
+		close(quiesce_fd);
+		quiesce_fd = open("/sys/class/gni/ghal0/quiesce_status",
+				  O_RDONLY);
+	}
+#endif
+
 	if ( (addr->sin_family == 0) || (addr->sin_port == 0) ) {
 		error("Error connecting, bad data: family = %u, port = %u",
 		      addr->sin_family, addr->sin_port);

David, is there a way to generate an event that will force a quiesce? I suppose I could force a warmswap and try to stage things to "test" this, but it feels like forcing an event with xtgenevent on the SMW might make the testing a little more reliable.

I'm testing a version of the smw_xtconsumer_slurm_helperd I've been working on and will send that shortly.
doh, forgot to close the fd in all (most) cases:

diff --git a/src/common/slurm_protocol_socket_implementation.c b/src/common/slurm_protocol_socket_implementation.c
index 9f4608b..859d49e 100644
--- a/src/common/slurm_protocol_socket_implementation.c
+++ b/src/common/slurm_protocol_socket_implementation.c
@@ -470,6 +470,25 @@ extern int slurm_open_stream(slurm_addr_t *addr, bool retry)
 	uint16_t port;
 	char ip[32];
 
+#ifdef HAVE_NATIVE_CRAY
+	char buffer[20];
+	int max_retry = 300;
+	int quiesce_fd = open("/sys/class/gni/ghal0/quiesce_status", O_RDONLY);
+	while (quiesce_fd >= 0 && retry < max_retry) {
+		if (read(quiesce_fd, buffer, sizeof(buffer)) > 0) {
+			if (buffer[0] == '0')
+				break;
+		}
+		usleep(500000);
+		retry++;
+		close(quiesce_fd);
+		quiesce_fd = open("/sys/class/gni/ghal0/quiesce_status",
+				  O_RDONLY);
+	}
+	if (quiesce_fd >= 0)
+		close(quiesce_fd);
+#endif
+
 	if ( (addr->sin_family == 0) || (addr->sin_port == 0) ) {
 		error("Error connecting, bad data: family = %u, port = %u",
 		      addr->sin_family, addr->sin_port);
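Refactored into a standalone helper, the same back-off loop might look like this. A sketch only, assuming the one-character '0'/'1' format of quiesce_status; unlike the patch above, it simply reports failure (rather than proceeding with the connect) when the status file is missing, so the caller decides what to do on non-Cray systems.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Poll a quiesce-status file until it reads '0' (unquiesced), sleeping
 * `interval_us` microseconds between polls, up to `max_polls` attempts.
 * Returns true once the network is seen unquiesced; false on timeout or
 * when the status file cannot be read at all. */
static bool wait_for_unquiesce(const char *path, unsigned int interval_us,
			       int max_polls)
{
	int i;

	for (i = 0; i < max_polls; i++) {
		char c = 0;
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return false;	/* no status file to consult */
		if (read(fd, &c, 1) == 1 && c == '0') {
			close(fd);
			return true;	/* unquiesced */
		}
		close(fd);
		usleep(interval_us);
	}
	return false;			/* still quiesced after max_polls */
}
```

With a 500ms interval and 300 polls this gives the same 150-second upper bound as the patch's max_retry of 300.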
Created attachment 5546 [details]
slurmsmwd initial patch

17.02 patch (only because the slurm.spec and automake changes are presently against slurm 17.02). This is lightly tested and appears functional. The spec file modifications are essentially untested.

NERSC is using our elogin build of slurm on the SMW (built with --enable-really-no-cray), copying the elogin version of /etc/slurm, and running munge on the SMW to use this daemon. We are presently only testing it on a test system, but initial results function as expected. When an ec_node_failed or ec_node_unavailable message is received, the targeted nodes are marked NotResponding. This code assumes Cascade hardware (it encodes a mechanism for translating cnames to nids).

Example /etc/slurm/slurmsmwd.conf:

CabinetsPerRow=12
LogFile=/var/opt/cray/log/p0-current/slurmsmwd.log
DebugLevel=debug3

CabinetsPerRow is the only input required on XC systems to convert cnames to nids. For single-row systems (or air-cooled systems), any value greater than or equal to the number of cabinets is fine. A systemd service file is included.
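For reference, the cname-to-nid translation that CabinetsPerRow feeds can be sketched as follows. This is an illustrative reimplementation, not the attachment's code, and it assumes the standard XC geometry of 3 chassis per cabinet, 16 slots per chassis, and 4 nodes per slot (192 nodes per cabinet).

```c
#include <stdio.h>

/* Convert an XC cname such as "c2-0c1s8n2" (cabinet 2, row 0,
 * chassis 1, slot 8, node 2) to a nid, given the system's
 * CabinetsPerRow.  Assumes 192 nodes per cabinet (3 chassis x 16
 * slots x 4 nodes); returns -1 on parse failure. */
static int cname_to_nid(const char *cname, int cabinets_per_row)
{
	int cab, row, chassis, slot, node;

	if (sscanf(cname, "c%d-%dc%ds%dn%d",
		   &cab, &row, &chassis, &slot, &node) != 5)
		return -1;
	return (row * cabinets_per_row + cab) * 192
		+ chassis * 64 + slot * 4 + node;
}
```

This reproduces the mappings visible in the slurmsmwd log excerpts later in this ticket: c2-0c1s8n2 maps to nid 482, and c1-2c2s4n1 (with CabinetsPerRow=8) maps to nid 3409.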
I'm not sure what the quiesce and hw_quiesce files do. The quiesce_status file will tell you if there's a quiesce happening at the moment. You can use xtwarmswap without changing out hardware at all to force a quiesce. I don't know of other ways to quiesce the network on command.
Created attachment 5554 [details] aries resiliency protecting connect() during local quiesce
Created attachment 5633 [details] 17.11 slurmsmwd patch
Created attachment 5634 [details] 17.11 check_ghal patch
Created attachment 5635 [details] 17.11 slurmsmwd spec file patch
I've been using these patches (and nearly identical patches against slurm 17.02) on my test systems for a couple weeks now with no obvious negative impact. Will be rolling out to production in slurm 17.02.9 (or .10 depending on the release date) in the next week or two. If it's possible for inclusion in 17.11, I think that would be ideal -- especially since the bulk of the code is in contrib. Otherwise I'll just keep patching and will have to somehow direct other interested parties in the Cray/Slurm community to this bug.
Thanks for getting this submitted, this is looking really nice. But I can't get this reviewed properly for 17.11.0 - we're still expecting to cut the release tomorrow. I'll discuss internally whether we can override our usual restrictions and get slurmsmwd included in 17.11.1 or some later maintenance release. - Tim
OK, no trouble. It seems to merge in fine for me, so I'll live with it in our build system for the time being.
FYI, we've been operating slurm 17.11 with slurmsmwd and the ghal patches on all of our cray systems. We've been allowing jobs to schedule through warmswaps and have not noticed any negative effects so far. (In the past we would mark partitions down before starting a warmswap operation; this prevented large-scale sruns from crashing, on the theory that job starts lead to srun starts.)

Example debug3 slurmsmwd log output:

[2018-01-22T14:01:13.739] down node cnt: 36
[2018-01-22T14:01:13.739] setting nid[01340-01343,01488-01491,01572-01575,01668-01671,02732-02735,02952-02955,03196-03199,03408-03411,03852-03855] to NotResponding
[2018-01-23T11:00:02.909] debug3: read 97 bytes
[2018-01-23T11:00:02.925] debug3: got line: 2018-01-23 11:00:02|2018-01-23 11:00:02|0x400020e8 - ec_node_unavailable|src=:1:s0|::c2-0c1s8n2
[2018-01-23T11:00:02.925] received event: ec_node_unavailable, nodelist: ::c2-0c1s8n2
[2018-01-23T11:00:03.709] down node cnt: 1
[2018-01-23T11:00:03.709] setting nid[00482] to NotResponding
[2018-01-23T11:00:27.546] debug3: read 97 bytes
[2018-01-23T11:00:27.546] debug3: got line: :2018-01-23 11:00:27|2018-01-23 11:00:27|0x400020e8 - ec_node_unavailable|src=:1:s0|::c2-0c1s8n2
[2018-01-23T11:00:27.546] received event: ec_node_unavailable, nodelist: ::c2-0c1s8n2
[2018-01-23T11:00:28.959] down node cnt: 1
[2018-01-23T11:00:28.960] setting nid[00482] to NotResponding
[2018-01-23T20:56:05.713] debug3: read 189 bytes
[2018-01-23T20:56:05.713] debug3: got line: 2018-01-23 20:56:05|2018-01-23 20:56:05|0x40008063 - ec_node_failed|src=:1:s0|::c1-2c2s4n1
[2018-01-23T20:56:05.713] received event: ec_node_failed, nodelist: ::c1-2c2s4n1
[2018-01-23T20:56:05.713] debug3: got line: 2018-01-23 20:56:05|2018-01-23 20:56:05|0x400020e8 - ec_node_unavailable|src=:1:s0|::c1-2c2s4n1
[2018-01-23T20:56:05.713] received event: ec_node_unavailable, nodelist: ::c1-2c2s4n1
[2018-01-23T20:56:06.418] down node cnt: 2
[2018-01-23T20:56:06.418] setting nid[03409] to NotResponding
edismw:/var/opt/cray/disk/1/log/p0-20180109t183252 # cat /etc/slurm/slurmsmwd.conf
CabinetsPerRow=8
LogFile=/var/opt/cray/log/p0-current/slurmsmwd.log
DebugLevel=debug3
Created attachment 6367 [details]
Revised patch to work in 17.11

Doug, attached you will find a patch we plan to include in 17.11.6. Please let me know if you see anything I messed up. The code is largely the same as yours, with just style edits.
Doug, this patch is now part of 17.11.6+, commit 9c3441909bc078. Please report any issues if any are seen. We are still mulling over attachment 5634 [details] and will get back to you more on this. I am guessing that since you installed this patch you have not had any quiesce issues?
Doug,

The Ghal patch has been added to 18.08. In commit ee649a605814 we added a new slurm.conf option, CommunicationParameters; we then added your patch in commit 6c8389533b74322, enabled with CommunicationParameters=CheckGhalQuiesce.

One other note: going forward we moved No*InAddrAny from TopologyParam to CommunicationParameters as well, in commit 5e87370ddd729. The old method will still work, but is no longer documented.

Please reopen if anything else is needed on this. Until 18.08, using your patch as a local mod is the best option.
*** Ticket 3769 has been marked as a duplicate of this ticket. ***