| Summary: | [849333] Backup controller take over problem | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Brian F Gilmer <brian.gilmer> |
| Component: | slurmctld | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex, da, dmjacobsen, tshaw, will |
| Version: | 16.05.9 | | |
| Hardware: | Cray XC | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=3159 | ||
| Site: | CRAY | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | CSCS - Swiss National Supercomputing Centre | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | SN2546 |
| CLE Version: | --- | Version Fixed: | 17.11.0-pre3 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | SFDC Case comment 0, scontrol show config, slurm.conf, backup controller slurmctld.log, backup controller slurmctld.log | | |
Description
Brian F Gilmer
2017-03-16 15:26:46 MDT
Brian - are you able to get me a backtrace from the abort()? I'd also appreciate the full logs around that time. I'll follow up with some architectural questions later - unfortunately I think some assumptions made within the slurmctld HA model make network partitioning events hard to recover from on Cray, but I want to do some more background research first.

Here is the backtrace that was in the case:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm/16.05.8/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007fbd730b70c7 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fbd730b70c7 in raise () from /lib64/libc.so.6
#1  0x00007fbd730b8478 in abort () from /lib64/libc.so.6
#2  0x000000000044183a in run_backup (callbacks=0x7ffd66693910) at backup.c:239
#3  0x00000000004447cd in main (argc=1, argv=0x7ffd66693af8) at controller.c:476
(gdb)

I have the core file, but it is pretty clear where slurmctld exited.

Created attachment 4218 [details]
SFDC Case comment 0

Here is the case from the customer. Here are 3 of the 4 quiesce events.
The first is associated with a pause in the slurmctld log:

[2017-03-14T10:49:57.357] debug: updating assoc 1786
[2017-03-14T10:51:15.018] backfill: continuing to yield locks, 113 RPCs pending

2017-03-14T10:49:45.988428+01:00 Finished quiescing the high-speed network
2017-03-14T10:50:18.275010+01:00 Finished unquiescing the high-speed network
2017-03-14T10:53:12.722328+01:00 Finished quiescing the high-speed network
2017-03-14T10:53:48.966898+01:00 Finished unquiescing the high-speed network
2017-03-14T11:13:21.691587+01:00 Finished quiescing the high-speed network
2017-03-14T11:13:51.507933+01:00 Finished unquiescing the high-speed network

Brian,

One recommendation we give customers using HA on Cray systems, which helps avoid the "split-brain" scenario (where the primary and backup controllers both try to schedule jobs at the same time), is to set "no_backup_scheduling" in slurm.conf. With this option the backup controller will not try to schedule jobs; it simply continues to accept job submissions while it waits to reestablish contact with the primary controller. That way it cannot end up scheduling jobs at the same time as the primary after a connection timeout. Once contact with the primary is reestablished, the backup controller will update the primary. See the link below for more details:

https://slurm.schedmd.com/slurm.conf.html#OPT_no_backup_scheduling

We have had a lot of success with this method at other Cray sites, and hope you will consider making this change. Another thing that might be helpful is to increase SlurmctldTimeout to 500 in slurm.conf:

SlurmctldTimeout=500

This can also help the primary controller be more resistant to network errors, though it may or may not be necessary for your site. Let me know if you have any further questions about these suggestions.

Regards,
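For reference, the two suggestions above would look like this in slurm.conf (a sketch only; no_backup_scheduling is set via SchedulerParameters per the linked documentation, and any existing SchedulerParameters values at the site would need to be merged in):

```
# Keep the backup controller from scheduling jobs while it cannot
# reach the primary (Cray-specific recommendation from above).
SchedulerParameters=no_backup_scheduling

# Make the primary more tolerant of transient network outages.
SlurmctldTimeout=500
```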
Tim

Hi Tim,

Thanks for the suggestions, but I don't think they are sufficient to address the problem. In this case the backup controller called abort() before it started scheduling jobs, so "no_backup_scheduling" would not make a difference. It also does not resolve the problem of two controllers running. Increasing SlurmctldTimeout to 500 would not make a difference in this case because the primary was not able to handle incoming requests. I was thinking along the lines of a new mechanism to initiate a switchover.

Brian
Cray Software Product Support

Brian - You're right, in that no_backup_scheduling would not have had a significant impact on this problem. Although increasing SlurmctldTimeout may help avoid problems related to what I presume was an Aries network quiesce caused by the HSN cable replacement. At the moment I'm still trying to piece together the exact failure mode they hit here, and have a few followup questions:

- Can you send a copy of the slurm.conf for this system? It doesn't appear that we have it on file.
- Are they running burst_buffer/cray (DataWarp) here?
- The StateSaveLocation - what filesystem is that on, and what mount options are used there?

thanks,
- Tim

Created attachment 4231 [details]
scontrol show config
Created attachment 4232 [details]
slurm.conf
Created attachment 4233 [details]
backup controller slurmctld.log
Created attachment 4234 [details]
backup controller slurmctld.log
The other attachment is from the primary controller.
> The StateSaveLocation - what filesystem is that on, and what mount options
> are used there?
daintsl01:~ # grep StateSaveLocation /etc/opt/slurm/slurm.conf
StateSaveLocation=/non_volatile/slurm/shared
daintsl01:~ #
daintsl01:~ # df | grep slurm
boot:/non_volatile/c8-1c0s12n0/var/spool/slurm  52403200  1313792  51089408  3%  /var/spool/slurm
daintsl01:~ #
boot:~ # exportfs
/cray_home 148.187.32.0/255.255.224.0
/non_volatile 148.187.32.0/255.255.224.0
/var/opt/cray/imps
148.187.32.0/255.255.224.0
/ 148.187.32.0/255.255.224.0
_ping_controller() {
	...
	if (slurm_send_recv_rc_msg_only_one(&req, &rc, 0) < 0) {
		error("_ping_controller/slurm_send_node_msg error: %m");
		return SLURM_ERROR;
	}
}
backup controller log:
[2017-03-14T10:50:12.187] error: _ping_controller/slurm_send_node_msg error: Connection timed out
The primary controller is waiting for a free server thread to process the ping request from the backup:
/*
 * Process incoming RPCs until told to shutdown
 */
while (_wait_for_server_thread()) {
	...

_wait_for_server_thread() {
	...
	if (slurmctld_config.server_thread_count < max_server_threads) {
		slurmctld_config.server_thread_count++;
		break;
	} else {
		/* wait for state change and retry,
		 * just a delay and not an error.
		 * This can happen when the epilog completes
		 * on a bunch of nodes at the same time, which
		 * can easily happen for highly parallel jobs. */
		...
		pthread_cond_wait(&server_thread_cond,
				  &slurmctld_config.thread_count_lock);
As evidence, the backfill message:

	if ((defer_rpc_cnt == 0) ||
	    (slurmctld_config.server_thread_count <= max_rpc_cnt))
		break;
	verbose("backfill: continuing to yield locks, %d RPCs pending",
		slurmctld_config.server_thread_count);
}
----------
What is happening is that the backup controller is mistaking an exhausted server thread count for the primary being out of service. What I am suggesting is that using the normal RPC processing path for the liveness ping is problematic given the load that the XC puts on the controller.
I think the fault detection regime can be improved to be more comprehensive or more direct. What would a person do to see if the controller was running? Maybe those steps could be coded either into the backup controller or into an independent monitor.
Just an update - we're working on a patch that mitigates the most common unintentional takeover scenario here, and will ensure that the backup does not assert control from a primary that is still making changes to the shared state files. This should be in 17.02.3 when released; I will send you a link to the commit when done, or if you'd like it attached here (or if you'd like a preliminary version) please let me know.

- Tim

Thanks Tim,

My concern is bigger than just a single failure mode. The backup controller switchover has been a problem at CSCS for years. At CSCS the backup controller failed twice this year. The reliability of Slurm is lower because the switchover mechanism is less reliable than the redundant systems. On the XC the switchover mechanism is coupled to the Aries network, which is much less reliable and predictable than other TCP/IP network implementations.

What I would like to see is a more independent mechanism for determining the state of health of the controller, and a more reliable communications architecture between the two controllers. Since the communication problem is specific to the XC, it would be reasonable to offload that function into a plugin or an external process. The state of health of the controller needs to be internal and as independent as feasible - something along the lines of the Linux watchdog.

(In reply to Brian F Gilmer from comment #21)
> Thanks Tim,
>
> My concern is bigger than just a single failure mode. The backup controller
> switchover has been a problem at CSCS for years. At CSCS the backup
> controller failed twice this year. The reliability of Slurm is lower
> because the switch over mechanism is less reliable than the redundant
> systems. On the XC the switch over mechanism is coupled to the Aries
> network which is much less reliable and predictable than other TCP/IP
> network implementations.
Certainly understood; I'm still hammering on our proposed fix for this, and won't send it over until I'm confident it will behave properly. The main issue we face on Cray systems is that the network connection between the controllers stops, but the storage used for StateSaveLocation remains accessible from both nodes. The approach we're looking into will start testing for liveness of the other slurmctld through those state files when the network is inaccessible.

> What I would like to see is a more independent mechanism for determining the
> state of health of the controller. And a more reliable communications
> architecture between the two controllers. Since the communication problem
> is specific to the XC is would be reasonable to offload that function into
> plug-in or an external process. The state of health of the controller needs
> to be internal and as independent as feasible. Something along the lines of
> the Linux watchdog.

We're discussing some changes in 17.11 that will add on to the fix in 17.02, and start using the persistent connection mechanisms that have been introduced. Those help avoid issues around RPC storms when the Aries network returns to service.

- Tim

Tim,

I think we are on the same page. This has been an ongoing problem on the XC, so I would like to see a long-term engagement instead of a quick point solution.

Brian

Just a status update - I have a preliminary patch for 17.02 in testing right now, and will make that available once we've gone through a bit more verification. This is obviously a rather sensitive part of the code, and we want to make sure we thoroughly vet any changes to it.

- Tim

*** Ticket 3692 has been marked as a duplicate of this ticket. ***

I've decided to focus on hardening this in the upcoming 17.11 release, as the required changes are a bit too aggressive for a maintenance release.
There are a few further adjustments we plan to make, but to summarize the changes made thus far:
- A new "heartbeat" thread was added to the primary slurmctld. It operates independently of the main service threads and is thus not subject to performance issues (e.g., when slurmctld is overloaded with outside RPC requests). That thread updates a file ("heartbeat" in StateSaveLocation) with the current timestamp.
- The backup controller will only assume control once (a) communication with the primary has been lost, and (b) the last recorded timestamp in the heartbeat file is at least SlurmctldTimeout seconds old.
This closes the two failure modes that have been reported:
1) An overloaded slurmctld cannot respond to the backup's ping fast enough to prevent it from asserting control. Solved by the independent heartbeat thread.
2) The network path used to communicate with the compute nodes, and between the primary and backup, does not match the path to the StateSaveLocation storage. A network outage in that scenario results in both primary and backup deciding to load the state files and prepare to service requests once the network returns. This is handled by the heartbeat file living on that storage, and being verified before the backup asserts control.
I will note there is still a potential failure mode that we have not addressed, and do not have an immediate plan to solve: if the primary loses its network connection but maintains access to the shared storage, the backup will never assert control. I view that as a safer state than the prior behavior, where both primary and backup could run as the master simultaneously, potentially leading to corruption of the state directory.
The relevant commits follow below, and will be in 17.11-pre3 and up. I will leave this bug open while a few last (comparatively minor) details are finalized.
cheers,
- Tim
commit 29d82d0d364b08adaf2306960e0dcd54be18ebd4
Author: Tim Wickberg <tim@schedmd.com>
Date: Fri Sep 8 17:34:09 2017 -0600
Rearrange some debug levels.
commit 4066aa868def436bdf1fddf02e8104d41b112bd6
Author: Tim Wickberg <tim@schedmd.com>
Date: Fri Sep 8 17:32:04 2017 -0600
Rename get_last_state_write() to _get_last_job_state_write().
commit 192eeb9b50f736b965c440cc2b157e280d3fc069
Author: Tim Wickberg <tim@schedmd.com>
Date: Fri Sep 8 17:26:26 2017 -0600
Change the backup to use the heartbeat file instead of the job_state.
Only cut over when the heartbeat file is not being updated any longer.
Bug 4142.
commit 26bf905c1ccaae153286eafc358dae81e72898bd
Author: Tim Wickberg <tim@schedmd.com>
Date: Fri Sep 8 17:02:27 2017 -0600
Add a separate heartbeat thread for HA.
Will write out a timestamp into a 'heartbeat' file in StateSaveLocation
every (SlurmctldTimeout / 4) seconds to demonstrate that the primary
controller still has access to the directory, and thus the backup
should avoid taking control.
Bug 4142.