Over the last few days we've been observing the following pattern:

- Our backup controller has trouble communicating with the primary controller, with messages like the following in the backup controller's logs:

Apr 12 23:23:13 slurmsched2 slurmctld[5443]: error: Invalid RPC received 2040 while in standby mode
Apr 12 23:23:20 slurmsched2 slurmctld[5443]: error: Invalid RPC received 2009 while in standby mode

- It looks like the backup controller eventually decides the primary is dead, and assumes control:

Apr 12 23:52:06 slurmsched2 slurmctld[5443]: error: slurm_receive_msg: Socket timed out on send/recv operation
Apr 12 23:52:06 slurmsched2 slurmctld[5443]: error: _ping_controller/slurm_send_node_msg error: Socket timed out on send/recv operation
Apr 12 23:52:06 slurmsched2 slurmctld[5443]: error: ControlMachine slurmsched1 not responding, BackupController slurmsched2 taking over
Apr 12 23:52:06 slurmsched2 slurmctld[5443]: Terminate signal (SIGINT or SIGTERM) received
Apr 12 23:52:19 slurmsched2 slurmctld[5443]: layouts: loading entities/relations information
Apr 12 23:52:19 slurmsched2 slurmctld[5443]: Recovered state of 605 nodes
Apr 12 23:52:20 slurmsched2 slurmctld[5443]: Recovered state of 6 partitions

- A few minutes later the backup sees there is a problem and kills itself:

Apr 12 23:53:23 slurmsched2 slurmctld[5443]: error: Bad job state save file time. We wrote it at time 1492059146, but the file contains a time stamp of 1492059198.
Apr 12 23:53:23 slurmsched2 slurmctld[5443]: fatal: Two slurmctld daemons are running as primary. Shutting down this daemon to avoid inconsistent state due to split brain.

Do you have any insight into what might be going on here? I'm not sure whether the socket timeout errors are related to the "Invalid RPC" messages, but I assume they are.
In other tickets, we've recently been tweaking our max_rpc_cnt per your recommendation to help address some SLURM sluggishness issues we occasionally encounter when users do silly things like submit a massive batch of very short jobs at once. Our current max_rpc_cnt value is 160. I'm attaching our slurm.conf. Let me know if you need any more data.

Thanks for the amazing support,
Will
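For reference, the relevant line in our slurm.conf currently looks roughly like this (the full file is attached; I've omitted the other SchedulerParameters entries here):

```
# max_rpc_cnt=160: defer new scheduling work once this many RPCs are
# already pending, to keep slurmctld responsive under load.
SchedulerParameters=max_rpc_cnt=160
```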
Created attachment 4352 [details] slurm.conf
That is definitely not a good state to wind up in. I'd suggest leaving the backup controller offline until I can get you some further recommendations on how to prevent this. The split-brain scenario it is warning about can lead to corruption of the state files, which is obviously quite bad.

Are you able to attach a section of logs from both slurmsched1 and slurmsched2 from around that time? I'd be curious to know why the primary shrugged off its responsibilities.

Are these both physical servers, or VMs? And the 10.0.0.0 network they're attached to - is that Ethernet / IPoIB, or something else? And are there any devices in between the two controllers that could potentially block traffic?
(In reply to Tim Wickberg from comment #2)
> That is definitely not a good state to wind up in. I'd suggest leaving the
> backup controller offline until I can get you some further recommendations
> on how to prevent this. The split-brain scenario it is warning about can
> lead to corruption of the state files, which is obviously quite bad.

We've stopped the controller on the backup for the time being.

> Are you able to attach a section of logs from both slurmsched1 and
> slurmsched2 from around that time? I'd be curious to know why the primary
> shrugged off its responsibilities.

I'm attaching these.

> Are these both physical servers, or VMs? And the 10.0.0.0 network they're
> attached to - is that Ethernet / IPoIB, or something else? And are there any
> devices in between the two controllers that could potentially block traffic?

Physical servers. Ethernet. And no, there are not any devices between the controllers that could block traffic.

Sorry for brevity, I'm off to a meeting.
Created attachment 4353 [details] slurmctld log on primary
Created attachment 4354 [details] slurmctld log on backup
> We've stopped the controller on the backup for the time being.

Ok - that should be safest for the time being.

> > Are these both physical servers, or VMs? And the 10.0.0.0 network they're
> > attached to - is that Ethernet / IPoIB, or something else? And are there any
> > devices in between the two controllers that could potentially block traffic?
>
> Physical servers. Ethernet. And no, there are not any devices between the
> controllers that could block traffic.
>
> Sorry for brevity, I'm off to a meeting.

No problem. Just wanted to make sure they weren't a worst-case scenario of Xen VMs and/or a firewall sitting in between.

Looking through the log, the takeover is correlated with a backfill scheduler pass completing with:

Apr 12 23:52:10 slurmsched1 slurmctld[44462]: backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt

This would lead to the ping messages from the backup controller not being handled in time, which in turn leads the backup controller to decide to assert control. There are some related outstanding issues around how this takeover is handled, and I'm trying to find ways to re-architect the failover code to handle this situation better. Unfortunately, the HA stack was designed around the assumption that the primary would disappear completely, not just become overloaded.
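As an aside, if you want to track how often that backfill overload warning fires, a quick sketch like this against the slurmctld log can help (the function name and log path here are just illustrative):

```shell
# count_rpc_warnings: count how many times the backfill scheduler ended a
# cycle with a backlog of pending RPCs, by scanning a slurmctld log file.
# The pattern matches the "pending RPCs at cycle end" warning shown above.
count_rpc_warnings() {
    grep -c 'pending RPCs at cycle end' "$1"
}

# Example invocation (path is an assumption -- use your SlurmctldLogFile):
# count_rpc_warnings /var/log/slurm/slurmctld.log
```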
A few error lines like:

primary.log:Apr 13 02:05:35 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp677
primary.log:Apr 13 02:06:05 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps13
primary.log:Apr 13 02:06:23 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp676
primary.log:Apr 13 02:06:34 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp677
primary.log:Apr 13 02:07:01 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for ipoda2
primary.log:Apr 13 02:07:05 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp676
primary.log:Apr 13 02:07:29 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for rokasgate2
primary.log:Apr 13 02:07:56 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps50
primary.log:Apr 13 02:08:20 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps55

indicate you may have some additional nodes running slurmd that are attempting to register with the cluster - I'd suggest shutting slurmd off there. While this shouldn't hurt anything, it does clutter the logs and add some additional RPC traffic that you don't need.

At the moment, the best recommendation I can make is to spend a bit of time tuning the scheduler to avoid these overload situations. Setting a lower bf_yield_interval=1000000 may help with this, as well as an increase to max_rpc_cnt=256.

If you can attach current 'sdiag' output, that would help me understand your current workload and make some further recommendations. I'd like to get to a point where the "256 pending RPCs at cycle end" message does not show up at all before you re-enable the backup slurmctld.

- Tim
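In slurm.conf terms, those two recommendations would look roughly like the following (merge them into your existing SchedulerParameters line rather than adding a second one):

```
# bf_yield_interval is in microseconds: yield the backfill scheduler's
# locks every 1 second instead of the 2-second default, so other RPCs
# (including the backup's pings) get serviced sooner.
# max_rpc_cnt=256: defer new scheduling work once 256 RPCs are pending.
SchedulerParameters=bf_yield_interval=1000000,max_rpc_cnt=256
```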
>
> A few error lines like:
> primary.log:Apr 13 02:05:35 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp677
> primary.log:Apr 13 02:06:05 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps13
> primary.log:Apr 13 02:06:23 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp676
> primary.log:Apr 13 02:06:34 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp677
> primary.log:Apr 13 02:07:01 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for ipoda2
> primary.log:Apr 13 02:07:05 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmp676
> primary.log:Apr 13 02:07:29 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for rokasgate2
> primary.log:Apr 13 02:07:56 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps50
> primary.log:Apr 13 02:08:20 slurmsched1 slurmctld[44462]: error: find_node_record: lookup failure for vmps55

Some of these we are currently decommissioning so they will be removed in the next few days, but others are gateways/login nodes where slurmd is not running. For example:

14:01:29-frenchwr@vmps55:~$ ps -ef | grep slurm
frenchwr 21104 20940 0 14:01 pts/0 00:00:00 grep --color=auto slurm
14:01:32-frenchwr@vmps55:~$ ps -ef | grep munge
munge 4311 1 0 Feb13 ? 00:10:36 /usr/sbin/munged --key-file /etc/munge/munge.key --num-threads 10
frenchwr 21111 20940 0 14:01 pts/0 00:00:00 grep --color=auto munge

Note that munge is running so jobs can be submitted from these servers. Here's another one:

14:03:27-frenchwr@vmps50:~$ ps -ef | grep slurm
frenchwr 29342 29178 0 14:03 pts/1 00:00:00 grep --color=auto slurm
14:03:30-frenchwr@vmps50:~$ ps -ef | grep munge
munge 5993 1 0 Mar22 ? 00:01:13 /usr/sbin/munged --key-file /etc/munge/munge.key --num-threads 10
frenchwr 29347 29178 0 14:03 pts/1 00:00:00 grep --color=auto munge

> At the moment, the best recommendations I can make are to spend a bit of
> time tuning the scheduler to avoid these overload situations.
>
> Setting a lower bf_yield_interval=1000000 may help with this, as well as an
> increase of max_rpc_cnt=256.

I've made these changes. Thanks for the tips.

> If you can attach current 'sdiag' output that'd help me understand your
> current workload and make some further recommendations. I'd like to get to a
> point where the "256 pending RPCs at cycle end" message does not show up at
> all before you re-enable the backup slurmctld.

I'm attaching this output.

Thank you, Tim!
Will
Created attachment 4355 [details] sdiag output
(In reply to Will French from comment #8)
> Created attachment 4355 [details]
> sdiag output

Thanks. I'm not seeing anything unusual in there, and I think the other tuning I'd mentioned should alleviate most of the symptoms. It appears that a "storm" of jobs all completing at the same time caused the spike in in-process threads, which then led to the heartbeat to the backup not being sent in time to prevent it from asserting control.

We have a patch we're now testing internally that should prevent the backup controller from asserting control while the primary is still actively changing the job state files. That should be in the 17.02.3 release when it is made, although I can provide a version back-ported to the 16.05 branch once we've completed testing, if you'd like to have that. (We do not expect to make any further 16.05 maintenance releases barring some exceptional issues, and thus will not land these changes on that maintenance branch.)

- Tim
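To illustrate the idea behind that patch (this is a sketch of the logic only, not Slurm's actual code; all names here are made up): the backup should assert control only when the primary has both stopped answering pings and stopped touching the state save files.

```python
def should_take_over(now, last_ping_ok, state_file_mtime, timeout):
    """Illustrative failover decision, per the description above.

    The backup asserts control only if the primary has missed pings for
    longer than the timeout AND the job state files have gone untouched
    for that long. A primary that is merely overloaded keeps writing
    state, so the backup stays in standby instead of splitting the brain.
    All arguments are in seconds (e.g. Unix timestamps).
    """
    primary_silent = (now - last_ping_ok) > timeout
    state_files_quiet = (now - state_file_mtime) > timeout
    return primary_silent and state_files_quiet
```

With a 120-second timeout, a primary that last answered a ping 300 seconds ago but wrote its state files 5 seconds ago would be treated as overloaded rather than dead, and the backup would hold off.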
Hi again Tim,

Our secondary controller is still offline. As I mentioned in Bug 3554, the load on our cluster has been much lower this week, so there have been no timeout issues to report.

> Setting a lower bf_yield_interval=1000000 may help with this, as well as an
> increase of max_rpc_cnt=256.

Out of desperation I ended up reverting these changes late last week, as there were issues with socket timeouts and the like a few hours after I made the suggested changes. However, there also seemed to be some sluggishness on our GPFS filesystem around the same time, which may have actually been the source of the socket timeouts.

We will eventually want to incorporate the patch that you all develop, and we will keep our secondary controller offline until we have applied it. Just keep us posted.

Thanks,
Will
Just an update - we're still testing this patch internally. I'll make sure to get it over to you when we're a bit more confident in the approach - HA is a tricky thing to get right, and I don't want to ship this until I'm confident the approach will work out well. - Tim
Just an update - a fix for this is still being worked on. Adjusting the HA code is obviously not something we do lightly. If you're interested in running one of the preliminary versions of the patch please let me know and I can share that. - Tim
Will - I'm tidying up a bit and marking this as a duplicate of bug 3592, which is tracking the HA modifications. A patch for that should be available soon.

- Tim

*** This ticket has been marked as a duplicate of ticket 3592 ***