Over the last few days we've been observing the following pattern:

- Our backup controller has trouble communicating with the primary controller, with messages like the following in the backup controller's logs:

Apr 12 23:23:13 slurmsched2 slurmctld[5443]: error: Invalid RPC received 2040 while in standby mode
Apr 12 23:23:20 slurmsched2 slurmctld[5443]: error: Invalid RPC received 2009 while in standby mode

- It looks like the backup controller eventually decides the primary is dead, and assumes control:

Apr 12 23:52:06 slurmsched2 slurmctld[5443]: error: slurm_receive_msg: Socket timed out on send/recv operation
Apr 12 23:52:06 slurmsched2 slurmctld[5443]: error: _ping_controller/slurm_send_node_msg error: Socket timed out on send/recv operation
Apr 12 23:52:06 slurmsched2 slurmctld[5443]: error: ControlMachine slurmsched1 not responding, BackupController slurmsched2 taking over
Apr 12 23:52:06 slurmsched2 slurmctld[5443]: Terminate signal (SIGINT or SIGTERM) received
Apr 12 23:52:19 slurmsched2 slurmctld[5443]: layouts: loading entities/relations information
Apr 12 23:52:19 slurmsched2 slurmctld[5443]: Recovered state of 605 nodes
Apr 12 23:52:20 slurmsched2 slurmctld[5443]: Recovered state of 6 partitions

- A few minutes later the backup sees there is a problem and kills itself:

Apr 12 23:53:23 slurmsched2 slurmctld[5443]: error: Bad job state save file time. We wrote it at time 1492059146, but the file contains a time stamp of 1492059198.
Apr 12 23:53:23 slurmsched2 slurmctld[5443]: fatal: Two slurmctld daemons are running as primary. Shutting down this daemon to avoid inconsistent state due to split brain.

Do you have any insight into what might be going on here? I'm not sure whether the socket timeout errors are related to the "Invalid RPC" messages, but I assume they are.
In other tickets, we've recently been tweaking our max_rpc_cnt per your recommendation to help address some SLURM sluggishness issues we occasionally encounter when users do silly things like submit a massive batch of very short jobs at once. Our current max_rpc_cnt value is 160. I'm attaching our slurm.conf. Let me know if you need any more data.

Thanks for the amazing support,
Will
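For reference, the relevant line in our slurm.conf currently looks roughly like this (the full file is attached; I've omitted the other SchedulerParameters entries here):

```
# max_rpc_cnt=160: defer new scheduling work once this many RPCs are
# already pending, to keep slurmctld responsive under load.
SchedulerParameters=max_rpc_cnt=160
```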
Created attachment 4352 [details] slurm.conf
That is definitely not a good state to wind up in. I'd suggest leaving the backup controller offline until I can get you some further recommendations on how to prevent this. The split-brain scenario it is warning about can lead to corruption of the state files, which is obviously quite bad.

Are you able to attach a section of logs from both slurmsched1 and slurmsched2 from around that time? I'd be curious to know why the primary shrugged off its responsibilities.

Are these both physical servers, or VMs? And the 10.0.0.0 network they're attached to - is that Ethernet / IPoIB, or something else? And are there any devices in between the two controllers that could potentially block traffic?
(In reply to Tim Wickberg from comment #2)
> That is definitely not a good state to wind up in. I'd suggest leaving the
> backup controller offline until I can get you some further recommendations
> on how to prevent this. The split-brain scenario it is warning about can
> lead to corruption of the state files, which is obviously quite bad.

We've stopped the controller on the backup for the time being.

> Are you able to attach a section of logs from both slurmsched1 and
> slurmsched2 from around that time? I'd be curious to know why the primary
> shrugged off its responsibilities.

I'm attaching these.

> Are these both physical servers, or VMs? And the 10.0.0.0 network they're
> attached to - is that Ethernet / IPoIB, or something else? And are there any
> devices in between the two controllers that could potentially block traffic?

Physical servers. Ethernet. And no, there are not any devices between the controllers that could block traffic.

Sorry for brevity, I'm off to a meeting.
Created attachment 4353 [details] slurmctld log on primary
Created attachment 4354 [details] slurmctld log on backup
> We've stopped the controller on the backup for the time being.

Ok - that should be safest for the time being.

> > Are these both physical servers, or VMs? And the 10.0.0.0 network they're
> > attached to - is that Ethernet / IPoIB, or something else? And are there any
> > devices in between the two controllers that could potentially block traffic?
>
> Physical servers. Ethernet. And no, there are not any devices between the
> controllers that could block traffic.
>
> Sorry for brevity, I'm off to a meeting.

No problem. Just wanted to make sure they weren't a worst-case scenario of Xen VMs and/or a firewall sitting in between.

Looking through the log, the takeover is correlated with a backfill scheduler pass completing with:

Apr 12 23:52:10 slurmsched1 slurmctld[44462]: backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt

This would lead to the ping messages from the backup controller not being handled in time, which in turn leads the backup controller to decide to assert control. There are some related outstanding issues around how this takeover is handled, and I'm trying to find ways to re-architect the failover code to handle this situation better. Unfortunately, the HA stack was designed around the assumption that the primary would disappear completely, not just become overloaded.
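As an aside, if you want to track how often that backfill overload warning fires, a quick sketch like this against the slurmctld log can help (the function name and log path here are just illustrative):

```shell
# count_rpc_warnings: count how many times the backfill scheduler ended a
# cycle with a backlog of pending RPCs, by scanning a slurmctld log file.
# The pattern matches the "pending RPCs at cycle end" warning shown above.
count_rpc_warnings() {
    grep -c 'pending RPCs at cycle end' "$1"
}

# Example invocation (path is an assumption -- use your SlurmctldLogFile):
# count_rpc_warnings /var/log/slurm/slurmctld.log
```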
A few error lines like:

primary.log:Apr 13 02:05:35 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp677
primary.log:Apr 13 02:06:05 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps13
primary.log:Apr 13 02:06:23 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp676
primary.log:Apr 13 02:06:34 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp677
primary.log:Apr 13 02:07:01 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for ipoda2
primary.log:Apr 13 02:07:05 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp676
primary.log:Apr 13 02:07:29 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for rokasgate2
primary.log:Apr 13 02:07:56 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps50
primary.log:Apr 13 02:08:20 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps55

indicate you may have some additional nodes running slurmd that are attempting to register with the cluster - I'd suggest shutting slurmd off there. While this shouldn't hurt anything, it does clutter the logs and add some additional RPC traffic that you don't need.

At the moment, the best recommendation I can make is to spend a bit of time tuning the scheduler to avoid these overload situations. Setting a lower bf_yield_interval=1000000 may help with this, as well as an increase to max_rpc_cnt=256.

If you can attach current 'sdiag' output, that would help me understand your current workload and make some further recommendations. I'd like to get to a point where the "256 pending RPCs at cycle end" message does not show up at all before you re-enable the backup slurmctld.

- Tim
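In slurm.conf terms, those two recommendations would look roughly like the following (merge them into your existing SchedulerParameters line rather than adding a second one):

```
# bf_yield_interval is in microseconds: yield the backfill scheduler's
# locks every 1 second instead of the 2-second default, so other RPCs
# (including the backup's pings) get serviced sooner.
# max_rpc_cnt=256: defer new scheduling work once 256 RPCs are pending.
SchedulerParameters=bf_yield_interval=1000000,max_rpc_cnt=256
```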
>
> A few error lines like:
> primary.log:Apr 13 02:05:35 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp677
> primary.log:Apr 13 02:06:05 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps13
> primary.log:Apr 13 02:06:23 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp676
> primary.log:Apr 13 02:06:34 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp677
> primary.log:Apr 13 02:07:01 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for ipoda2
> primary.log:Apr 13 02:07:05 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp676
> primary.log:Apr 13 02:07:29 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for rokasgate2
> primary.log:Apr 13 02:07:56 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps50
> primary.log:Apr 13 02:08:20 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps55

Some of these we are currently decommissioning so they will be removed in the next few days, but others are gateways/login nodes where slurmd is not running. For example:

14:01:29-frenchwr@vmps55:~$ ps -ef | grep slurm
frenchwr 21104 20940 0 14:01 pts/0 00:00:00 grep --color=auto slurm
14:01:32-frenchwr@vmps55:~$ ps -ef | grep munge
munge 4311 1 0 Feb13 ? 00:10:36 /usr/sbin/munged --key-file /etc/munge/munge.key --num-threads 10
frenchwr 21111 20940 0 14:01 pts/0 00:00:00 grep --color=auto munge

Note that munge is running so jobs can be submitted from these servers. Here's another one:

14:03:27-frenchwr@vmps50:~$ ps -ef | grep slurm
frenchwr 29342 29178 0 14:03 pts/1 00:00:00 grep --color=auto slurm
14:03:30-frenchwr@vmps50:~$ ps -ef | grep munge
munge 5993 1 0 Mar22 ? 00:01:13 /usr/sbin/munged --key-file /etc/munge/munge.key --num-threads 10
frenchwr 29347 29178 0 14:03 pts/1 00:00:00 grep --color=auto munge

> At the moment, the best recommendations I can make are to spend a bit of
> time tuning the scheduler to avoid these overload situations.
>
> Setting a lower bf_yield_interval=1000000 may help with this, as well as an
> increase of max_rpc_cnt=256.

I've made these changes. Thanks for the tips.

> If you can attach current 'sdiag' output that'd help me understand your
> current workload and make some further recommendations. I'd like to get to a
> point where the "256 pending RPCs at cycle end" message does not show up at
> all before you re-enable the backup slurmctld.

I'm attaching this output.

Thank you, Tim!
Will
Created attachment 4355 [details] sdiag output
(In reply to Will French from comment #8)
> Created attachment 4355 [details]
> sdiag output

Thanks. I'm not seeing anything unusual in there, and I think the other tuning I'd mentioned should alleviate most of the symptoms. It appears that a "storm" of jobs all completing at the same time caused the spike in in-process threads, which then led to the heartbeat to the backup not being sent in time to prevent it from asserting control.

We have a patch we're now testing internally that should prevent the backup controller from asserting control while the primary is still actively changing the job state files. That should be in the 17.02.3 release when it is made, although I can provide a version back-ported to the 16.05 branch once we've completed testing, if you'd like to have that. (We do not expect to make any further 16.05 maintenance releases barring some exceptional issues, and thus will not land these changes on that maintenance branch.)

- Tim
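To illustrate the idea behind that patch (this is a sketch of the logic only, not Slurm's actual code; all names here are made up): the backup should assert control only when the primary has both stopped answering pings and stopped touching the state save files.

```python
def should_take_over(now, last_ping_ok, state_file_mtime, timeout):
    """Illustrative failover decision, per the description above.

    The backup asserts control only if the primary has missed pings for
    longer than the timeout AND the job state files have gone untouched
    for that long. A primary that is merely overloaded keeps writing
    state, so the backup stays in standby instead of splitting the brain.
    All arguments are in seconds (e.g. Unix timestamps).
    """
    primary_silent = (now - last_ping_ok) > timeout
    state_files_quiet = (now - state_file_mtime) > timeout
    return primary_silent and state_files_quiet
```

With a 120-second timeout, a primary that last answered a ping 300 seconds ago but wrote its state files 5 seconds ago would be treated as overloaded rather than dead, and the backup would hold off.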
Hi again Tim,

Our secondary controller is still offline. As I mentioned in Bug 3554, the load on our cluster has been much lower this week, so there have been no timeout issues to report.

> Setting a lower bf_yield_interval=1000000 may help with this, as well as an
> increase of max_rpc_cnt=256.

Out of desperation I ended up reverting these changes late last week, as there were issues with socket timeouts and the like a few hours after I made the suggested changes. However, there also seemed to be some sluggishness on our GPFS filesystem around the same time, which may have actually been the source of the socket timeouts.

We will eventually want to incorporate the patch that you all develop, and we will keep our secondary controller offline until we have applied it. Just keep us posted.

Thanks,
Will
Just an update - we're still testing this patch internally. I'll make sure to get it over to you when we're a bit more confident in the approach - HA is a tricky thing to get right, and I don't want to ship this until I'm confident the approach will work out well. - Tim
Just an update - a fix for this is still being worked on. Adjusting the HA code is obviously not something we do lightly. If you're interested in running one of the preliminary versions of the patch please let me know and I can share that. - Tim
Will - I'm tidying up a bit and marking this as a duplicate of bug 3592, which is tracking the HA modifications. A patch for that should be available soon.

- Tim

*** This ticket has been marked as a duplicate of ticket 3592 ***