| Summary: | [849333] Backup controller take over problem | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Brian F Gilmer <brian.gilmer> |
| Component: | slurmctld | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex, da, dmjacobsen, tshaw, will |
| Version: | 16.05.9 | | |
| Hardware: | Cray XC | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=3159 | ||
| Site: | CRAY | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | CSCS - Swiss National Supercomputing Centre | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | SN2546 |
| CLE Version: | --- | Version Fixed: | 17.11.0-pre3 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | SFDC Case comment 0, scontrol show config, slurm.conf, backup controller slurmctld.log, backup controller slurmctld.log | | |
Description
Brian F Gilmer
2017-03-16 15:26:46 MDT
Brian - are you able to get me a backtrace from the abort()? I'd also appreciate the full logs around that time. I'll follow up with some architectural questions later - unfortunately I think some assumptions made within the slurmctld HA model make network partitioning events hard to recover from on Cray, but I want to do some more background research first.

Here is the backtrace that was in the case:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm/16.05.8/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007fbd730b70c7 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fbd730b70c7 in raise () from /lib64/libc.so.6
#1  0x00007fbd730b8478 in abort () from /lib64/libc.so.6
#2  0x000000000044183a in run_backup (callbacks=0x7ffd66693910) at backup.c:239
#3  0x00000000004447cd in main (argc=1, argv=0x7ffd66693af8) at controller.c:476
(gdb)

I have the core file, but it is pretty clear where slurmctld exited.

Created attachment 4218 [details]
SFDC Case comment 0

Here is the case from the customer. Here are 3 of the 4 quiesce events.
The first is associated with a pause in the slurmctld log:

[2017-03-14T10:49:57.357] debug: updating assoc 1786
[2017-03-14T10:51:15.018] backfill: continuing to yield locks, 113 RPCs pending

2017-03-14T10:49:45.988428+01:00 Finished quiescing the high-speed network
2017-03-14T10:50:18.275010+01:00 Finished unquiescing the high-speed network
2017-03-14T10:53:12.722328+01:00 Finished quiescing the high-speed network
2017-03-14T10:53:48.966898+01:00 Finished unquiescing the high-speed network
2017-03-14T11:13:21.691587+01:00 Finished quiescing the high-speed network
2017-03-14T11:13:51.507933+01:00 Finished unquiescing the high-speed network

Brian,

One recommendation we give customers using HA on Cray systems, which helps avoid the "split-brain" scenario (where the primary and backup controllers both try to schedule jobs at the same time), is to set "no_backup_scheduling" in slurm.conf. With this option the backup controller will not try to schedule jobs; it simply continues to accept job submissions while it waits to reestablish contact with the primary controller. That way it cannot end up scheduling jobs at the same time as the primary after a connection timeout. Once contact with the primary is reestablished, the backup controller will update the primary. See the link below for more details:

https://slurm.schedmd.com/slurm.conf.html#OPT_no_backup_scheduling

We have had a lot of success with this method at other Cray sites, and hope you will consider making this change. Another thing that might be helpful is to increase SlurmctldTimeout to 500 in slurm.conf:

SlurmctldTimeout=500

This can also help the primary controller be more resistant to network errors, though it may or may not be necessary for your site. Let me know if you have any further questions about these suggestions.

Regards,
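For reference, the two suggestions above would look like this in slurm.conf (a sketch only; no_backup_scheduling is set via SchedulerParameters per the linked documentation, and any existing SchedulerParameters values at the site would need to be merged in):

```
# Keep the backup controller from scheduling jobs while it cannot
# reach the primary (Cray-specific recommendation from above).
SchedulerParameters=no_backup_scheduling

# Make the primary more tolerant of transient network outages.
SlurmctldTimeout=500
```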
Tim

Hi Tim,

Thanks for the suggestions, but I don't think they are sufficient to address the problem. In this case the backup controller called abort() before it started scheduling jobs, so "no_backup_scheduling" would not make a difference. It also does not resolve the problem of two controllers running. Increasing SlurmctldTimeout to 500 would not make a difference in this case because the primary was not able to handle incoming requests. I was thinking along the lines of a new mechanism to initiate a switchover.

Brian
Cray Software Product Support

Brian - You're right, in that no_backup_scheduling would not have had a significant impact on this problem. Although increasing SlurmctldTimeout may help avoid problems related to what I presume was an Aries network quiesce caused by the HSN cable replacement. At the moment I'm still trying to piece together the exact failure mode they hit here, and have a few followup questions:

- Can you send a copy of the slurm.conf for this system? It doesn't appear that we have it on file.
- Are they running burst_buffer/cray (DataWarp) here?
- The StateSaveLocation - what filesystem is that on, and what mount options are used there?

thanks,
- Tim

Created attachment 4231 [details]
scontrol show config
Created attachment 4232 [details]
slurm.conf
Created attachment 4233 [details]
backup controller slurmctld.log
Created attachment 4234 [details]
backup controller slurmctld.log
The other attachment is from the primary controller.
> The StateSaveLocation - what filesystem is that on, and what mount options
> are used there?
daintsl01:~ # grep StateSaveLocation /etc/opt/slurm/slurm.conf
StateSaveLocation=/non_volatile/slurm/shared
daintsl01:~ #
daintsl01:~ # df | grep slurm
boot:/non_volatile/c8-1c0s12n0/var/spool/slurm  52403200  1313792  51089408  3%  /var/spool/slurm
daintsl01:~ #
boot:~ # exportfs
/cray_home 148.187.32.0/255.255.224.0
/non_volatile 148.187.32.0/255.255.224.0
/var/opt/cray/imps
148.187.32.0/255.255.224.0
/ 148.187.32.0/255.255.224.0
_ping_controller() {
	...
	if (slurm_send_recv_rc_msg_only_one(&req, &rc, 0) < 0) {
		error("_ping_controller/slurm_send_node_msg error: %m");
		return SLURM_ERROR;
	}
}
backup controller log:
[2017-03-14T10:50:12.187] error: _ping_controller/slurm_send_node_msg error: Connection timed out
The primary controller is waiting for a free server thread to process the ping request from the backup:
/*
 * Process incoming RPCs until told to shutdown
 */
while (_wait_for_server_thread()) {
	...

_wait_for_server_thread() {
	...
	if (slurmctld_config.server_thread_count < max_server_threads) {
		slurmctld_config.server_thread_count++;
		break;
	} else {
		/* wait for state change and retry,
		 * just a delay and not an error.
		 * This can happen when the epilog completes
		 * on a bunch of nodes at the same time, which
		 * can easily happen for highly parallel jobs. */
		...
		pthread_cond_wait(&server_thread_cond,
				  &slurmctld_config.thread_count_lock);
As evidence, the backfill message:

	if ((defer_rpc_cnt == 0) ||
	    (slurmctld_config.server_thread_count <= max_rpc_cnt))
		break;
	verbose("backfill: continuing to yield locks, %d RPCs pending",
		slurmctld_config.server_thread_count);
}
----------
What is happening is that the backup controller is mistaking an exhausted server thread count for the primary being out of service. What I am suggesting is that using the normal RPC processing path for the liveness ping is problematic given the load that the XC puts on the controller.
I think the fault detection regime can be improved to be more comprehensive or more direct. What would a person do to see if the controller was running? Maybe those steps could be coded either into the backup controller or into an independent monitor.
Just an update - we're working on a patch that mitigates the most common unintentional takeover scenario here, and will ensure that the backup does not assert control from a primary that is still making changes to the shared state files. This should be in 17.02.3 when released; I will send you a link to the commit when done, or if you'd like it attached here (or if you'd like a preliminary version) please let me know.

- Tim

Thanks Tim,

My concern is bigger than just a single failure mode. The backup controller switchover has been a problem at CSCS for years. At CSCS the backup controller failed twice this year. The reliability of Slurm is lower because the switchover mechanism is less reliable than the redundant systems. On the XC the switchover mechanism is coupled to the Aries network, which is much less reliable and predictable than other TCP/IP network implementations.

What I would like to see is a more independent mechanism for determining the state of health of the controller, and a more reliable communications architecture between the two controllers. Since the communication problem is specific to the XC, it would be reasonable to offload that function into a plugin or an external process. The state of health of the controller needs to be internal and as independent as feasible - something along the lines of the Linux watchdog.

(In reply to Brian F Gilmer from comment #21)
> Thanks Tim,
>
> My concern is bigger than just a single failure mode. The backup controller
> switchover has been a problem at CSCS for years. At CSCS the backup
> controller failed twice this year. The reliability of Slurm is lower
> because the switch over mechanism is less reliable than the redundant
> systems. On the XC the switch over mechanism is coupled to the Aries
> network which is much less reliable and predictable than other TCP/IP
> network implementations.
Certainly understood; I'm still hammering on our proposed fix for this, and won't send it over until I'm confident it will behave properly. The main issue we face on Cray systems is that the network connection between the controllers stops, but the storage used for StateSaveLocation remains accessible from both nodes. The approach we're looking into will start testing for liveness of the other slurmctld through those state files when the network is inaccessible.

> What I would like to see is a more independent mechanism for determining the
> state of health of the controller. And a more reliable communications
> architecture between the two controllers. Since the communication problem
> is specific to the XC is would be reasonable to offload that function into
> plug-in or an external process. The state of health of the controller needs
> to be internal and as independent as feasible. Something along the lines of
> the Linux watchdog.

We're discussing some changes in 17.11 that will add on to the fix in 17.02, and start using the persistent connection mechanisms that have been introduced. Those help avoid issues around RPC storms when the Aries network returns to service.

- Tim

Tim,

I think we are on the same page. This has been an ongoing problem on the XC, so I would like to see a long-term engagement instead of a quick point solution.

Brian

Just a status update - I have a preliminary patch for 17.02 in testing right now, and will make that available once we've gone through a bit more verification. This is obviously a rather sensitive part of the code, and we want to make sure we thoroughly vet any changes to it.

- Tim

*** Ticket 3692 has been marked as a duplicate of this ticket. ***

I've decided to focus on hardening this in the upcoming 17.11 release, as the required changes are a bit too aggressive for a maintenance release.
There are a few further adjustments we plan to make, but to summarize the changes made thus far:
- A new "heartbeat" thread was added to the primary slurmctld. It operates independently of the main service threads and is thus not subject to performance issues (e.g., when slurmctld is overloaded with outside RPC requests). That thread updates a file ("heartbeat" in StateSaveLocation) with the current timestamp.
- The backup controller will only assume control once (a) communication with the primary has been lost, and (b) the last recorded timestamp in the heartbeat file is at least SlurmctldTimeout seconds old.
This closes the two failure modes that have been reported:
1) An overloaded slurmctld cannot respond to the backup's ping fast enough to prevent it from asserting control. Solved by the independent heartbeat thread.
2) The network path used to communicate with the compute nodes, and between the primary and backup, does not match the path to the StateSaveLocation storage. A network outage in that scenario results in both primary and backup deciding to load the state files and prepare to service requests once the network returns. This is handled by the heartbeat file living on that storage, and being verified before the backup asserts control.
I will note there is still a potential failure mode that we have not addressed, and do not have an immediate plan to solve: if the primary loses its network connection but maintains access to the shared storage, the backup will never assert control. I view that as a safer state than the prior behavior, where both primary and backup could run as the master simultaneously, potentially leading to corruption of the state directory.
The relevant commits follow below, and will be in 17.11-pre3 and up. I will leave this bug open while a few last (comparatively minor) details are finalized.
cheers,
- Tim
commit 29d82d0d364b08adaf2306960e0dcd54be18ebd4
Author: Tim Wickberg <tim@schedmd.com>
Date: Fri Sep 8 17:34:09 2017 -0600
Rearrange some debug levels.
commit 4066aa868def436bdf1fddf02e8104d41b112bd6
Author: Tim Wickberg <tim@schedmd.com>
Date: Fri Sep 8 17:32:04 2017 -0600
Rename get_last_state_write() to _get_last_job_state_write().
commit 192eeb9b50f736b965c440cc2b157e280d3fc069
Author: Tim Wickberg <tim@schedmd.com>
Date: Fri Sep 8 17:26:26 2017 -0600
Change the backup to use the heartbeat file instead of the job_state.
Only cut over when the heartbeat file is not being updated any longer.
Bug 4142.
commit 26bf905c1ccaae153286eafc358dae81e72898bd
Author: Tim Wickberg <tim@schedmd.com>
Date: Fri Sep 8 17:02:27 2017 -0600
Add a separate heartbeat thread for HA.
Will write out a timestamp into a 'heartbeat' file in StateSaveLocation
every (SlurmctldTimeout / 4) seconds to demonstrate that the primary
controller still has access to the directory, and thus the backup
should avoid taking control.
Bug 4142.