We are aware of a few sync issues with federation and are handling them in an internal bug. Once we have a patch ready, are you interested in trying it out to see if it will solve your sync issues?
> I've tried enabling the federated debug flag and boosting debug level to 2, but still don't see anything else in those time frames that reveal any more information about this.
> Is there something more I can enable to try and debug this issue better?
The federation debug flag is the best option, but I don't see any attached logs with that flag turned on. Do you have any logs with the flag enabled? They could give us some context around what is happening.
One cause of federation sync issues is network problems between the clusters. We have simulated this by using iptables to temporarily block ports. Do you occasionally have transient network problems between the clusters?
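For reference, a minimal sketch of that kind of simulation, assuming port 6817 (the SlurmctldPort seen in the logs in this ticket). The commands are only echoed here, since applying them requires root; run the printed lines on one controller to drop sibling traffic, then run the matching `-D` lines to restore it:

```shell
PORT=6817   # SlurmctldPort used by the sibling controllers in this ticket
# Print the rules that would briefly black-hole federation traffic:
echo "iptables -A INPUT  -p tcp --dport $PORT -j DROP"
echo "iptables -A OUTPUT -p tcp --dport $PORT -j DROP"
# ...wait for the desync to show up in the logs, then print the cleanup:
echo "iptables -D INPUT  -p tcp --dport $PORT -j DROP"
echo "iptables -D OUTPUT -p tcp --dport $PORT -j DROP"
```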
Created attachment 31755 [details]
miller clusters debug logs
Trying again with federation enabled.
Created attachment 31756 [details]
fawbush logs
We aren't aware of any network issues between the controller nodes, but I'll ask our networking team to investigate. Here are the detections of the clusters being out of sync over the last few days. These times are in Eastern; some of the attached logs are in UTC.

Service Warning [2023-08-11 12:53:48] SERVICE ALERT: afw-slurm.afw.ccs.ornl.gov;SLURM_SYNC;WARNING;SOFT;1;WARNING: Slurm is out of sync with cluster(s): Sibling: miller:172.30.167.196:6817 ID:1 FedState:ACTIVE Features:green PersistConnSend/Recv:No/No Synced:No
Service Warning [2023-08-13 10:18:28] SERVICE ALERT: afw-slurm.afw.ccs.ornl.gov;SLURM_SYNC;WARNING;SOFT;1;WARNING: Slurm is out of sync with cluster(s): Sibling: miller:172.30.167.196:6817 ID:1 FedState:ACTIVE Features:green PersistConnSend/Recv:No/No Synced:No

Are there more clusters in the federation besides miller and fawbush?

Yes, the login cluster, but it doesn't normally get out of sync. I can turn on debugging on it as well and capture the next event.

Created attachment 31879 [details]
all fed controller logs
Incident today starting at about 11:28AM.
Which cluster(s) was/were out of sync? Did you by chance run `scontrol -M<clustername> show fed` on each cluster? Also, we have a patch that in theory should correct out-of-sync issues. I am unable to reproduce it, but do you have a test environment where you are able to reproduce it and can test a patch?

The miller cluster was out of sync. I only ran `scontrol show fed` on the afw cluster, though:

[root@hallc-mgmt02.hallc ~]# scontrol show fed
Federation: usafw
Self:    afw:172.30.254.230:6817 ID:2 FedState:ACTIVE Features:
Sibling: fawbush:172.30.167.197:6817 ID:3 FedState:ACTIVE Features:green PersistConnSend/Recv:Yes/Yes Synced:Yes
Sibling: miller::0 ID:1 FedState:ACTIVE Features:green PersistConnSend/Recv:Yes/Yes Synced:No

Yes, we'd be willing to run a patch; we are running 23.02.4.

Paul

> Yes, we'd be willing to run a patch, we are running 23.02.4.
Okay, I am working to get it reviewed so we can actually share the patch with you.
Created attachment 31958 [details]
close conn patch

Hey Paul,

Would you be willing to run with this attached patch plus the following commits that will be in 23.02.5?

https://github.com/SchedMD/slurm/commit/81b247cebc
https://github.com/SchedMD/slurm/commit/8aa6000e2c

Thanks,
Brian

That is fine. Do you know when you'll be releasing .5? We need to schedule the update with our customer, so a rough release date would help us plan with them.

Thanks,
Paul

Tentative date is Aug. 31st.

Just a heads up: we've pushed the tentative date for 23.02.5 to Sept. 5.

Another update on the release date for 23.02.5: the tentative date is now Sept. 7. It will probably happen on that day; if it does not, expect it very soon after. We want to release as soon as we can, but have a few fixes that we'd like to get in first. Have you had a chance to test with the patch that Brian uploaded and commits 81b247cebc and 8aa6000e2c?

No. When I first read that message I didn't realize you were saying we could apply those to the .4 version; I didn't read it closely enough. It looks like those patches only affect slurmdbd and slurmctld, right? We might be able to get them built before the .5 release, but can't guarantee we can get the update scheduled before the release is out anyway.

Yes, the commits and patches only affect slurmdbd and slurmctld. Don't worry about trying to test before 23.02.5. It would be helpful to see whether you hit this bug on a vanilla 23.02.5 (without the close conn patch Brian uploaded here), and then to test the close conn patch. The close conn patch fixes a couple of possible race conditions by locking in places that need it. And then, if for some reason the clusters get out of sync, the patch will let the clusters try to get in sync for some time.
If that does not happen, the cluster will close the connection to the other cluster, allowing a chance to re-open the connection and get in sync again.

We had another synchronization loss today after upgrading to .5 on Wednesday. I don't have any additional logging with federation debug flags set, though. We'll try the patch next and see if that helps.

Thanks for the update. The patch should make slurmctld detect a desync, then close and re-open the connection between clusters to recover from it. Have you had any out-of-sync issues since applying the patch?

The patch is going into place this morning. We'll let you know how it goes.

Were you able to deploy the patch?

Yes, we applied the patch and haven't seen an incident of the clusters losing sync since then. We'll keep you updated if we see it go out of sync again.

Clusters lost federation sync this morning. Debug flags were not enabled after the last patch, but I can upload logs if you would like. I've re-enabled federation debug flags for the clusters for next time.

(In reply to Paul Peltz from comment #29)
> Clusters lost federation sync this morning. Debug flags were not enabled
> after the last patch, but I can upload logs if you would like. I've
> reenabled federation debug flags for the clusters for next time.

Did it automatically recover, or did you manually restart the slurmctld's to fix the desync? To clarify, this patch does not prevent federation desync from occurring; it is designed to automatically recover from a desync, though that may take some time (perhaps a few minutes).

No, it has not recovered. Our Nagios check indicates it has been over 50 minutes since they lost sync, and they are still in that state.

Okay, thanks. I would appreciate the logs that you have, even if not all of them have the federation debug flag.

Created attachment 32664 [details]
20231010 desynch
Created attachment 32690 [details]
second incident
This is the second incident today of losing sync and not reconnecting, at approximately 20:00 GMT in the logs.
Created attachment 33356 [details]
gdb capture
Larry captured the gdb output of another slurmdbd issue this afternoon.
Unfortunately, the gdb output to get threads is useless because no debug symbols exist.
Also unfortunately, no federation debug flag output appears in the 20231010 logs, so there's not much to go on. Either the flag wasn't enabled, or you had SlurmctldDebug=info rather than verbose; the debug flags only log at the verbose level. When you enabled the federation debug flag, was the slurmctld log level at info, or at verbose or higher?
The little that I do see:
afw:
[2023-10-10T12:27:58.980] error: slurm_receive_msg: No response to persist_init
[2023-10-10T12:27:58.980] error: _agent_thread: Failed to send RPC: Unspecified error
[2023-10-10T13:38:55.517] error: _conn_readable: persistent connection for fd 17 experienced error[104]: Connection reset by peer
afw was up and not synced. Looks like afw tried to send a connection request but that failed.
fawbush:
[2023-10-10T12:27:31.957] fed_mgr_sibs_synced: sibling afw up but not synced yet
[2023-10-10T12:27:31.957] sched: schedule() returning, federation siblings not synced yet
miller:
[2023-10-10T12:27:28.952] error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:172.30.167.197:6817: Connection refused
[2023-10-10T12:27:28.952] error: fed_mgr: Unable to open connection to cluster fawbush using host 172.30.167.197(6817)
[2023-10-10T15:06:20.180] error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:172.30.254.230:6817: Connection refused
[2023-10-10T15:06:20.180] error: fed_mgr: Unable to open connection to cluster afw using host 172.30.254.230(6817)
Failed to connect with fawbush and afw.
man connect:
ECONNREFUSED
A connect() on a stream socket found no one listening on the remote address.
So the other clusters were not listening when miller tried to connect. That can happen if they are not up yet.
However, the other clusters should send a connection request when they come up.
I'm not sure what happened, and unfortunately it's really hard to know what happened without the federation debug flag.
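As an aside, the ECONNREFUSED case from the man page above is easy to reproduce from the shell by connecting to a port with no listener. The port below (59817) is an arbitrary unused port standing in for SlurmctldPort; this is only an illustration, not something from the ticket:

```shell
# bash's /dev/tcp pseudo-device performs a connect(); with no listener
# on the port, connect() fails with ECONNREFUSED, as in miller's logs.
if bash -c 'exec 3<>/dev/tcp/127.0.0.1/59817' 2>/dev/null; then
  echo "connected"
else
  echo "connection refused: no one listening on the remote address"
fi
```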
Can you enable the federation debug flag and ensure that the slurmctld log level is at verbose or higher on all three clusters and let it stay up until this happens again?
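For reference, a sketch of the settings being asked for, assuming an otherwise default slurm.conf (debug-flag messages are only emitted at verbose or higher):

```
SlurmctldDebug=verbose
DebugFlags=Federation
```

The same can be applied at runtime (until the next slurmctld restart) with `scontrol setdebug verbose` and `scontrol setdebugflags +federation`.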
Hello,

I've been working on a couple of race conditions I've noticed with federation sync issues. If we could get your logs with the federation flag enabled and a higher debug level (preferably "debug", but at least "verbose") as Marshall mentioned, I may be able to determine whether the issues I'm working on are related to your problems, and/or analyze your logs to find new problems.

Also, I am curious: do you often restart slurmctld's in rapid succession, i.e. one right after the other? The race conditions I've seen include this scenario, so if you are doing this, staggering their starts may help in the meantime while we try to find and fix the issue.

Also, please give an update on how often this has been happening and whether you've figured anything else out regarding the federation sync issues.

Hi Ben - Thanks for the update. The out-of-sync condition is happening less often now; I would guess it drops out of sync about every 7-10 days (although it happened twice yesterday). It is a 3-cluster federation. Most commonly the clusters "afw" and "fawbush" drop out of sync. The system appears to continue operating normally, and sometimes it resyncs itself within a few hours. The rest of the time I clear the alert by restarting the fawbush controller. It looks like we have the controller logs set to debug, so I will grab the last few days' worth on all 3 controllers and upload them.

Created attachment 36284 [details]
cluster logs
Ignore that upload - it didn't have the federation log flag set. I'm getting it set now and will pull logs again after the next occurrence.

(In reply to orcuttlk from comment #42)
> ignore that upload - it didn't have the federation log flag set. I'm getting
> it set now and will pull logs again after the next occurrence.

Okay, sounds good.

I just saw an out-of-sync condition occur. I restarted the "fawbush" controller, verified that it was back in sync, and immediately grabbed the logs from all 4 nodes. Uploading now.

Created attachment 36365 [details]
logs during federation out of sync episode
Unfortunately those logs don't really reveal anything more about the out-of-sync issue. I see a small hiccup after you restart fawbush, but things seemed to recover smoothly after that restart, right?

Let's get a deeper look at what's going on in each process. Please gather the following when an out-of-sync event happens:

1. On the login cluster (the one that seems to not go out of sync), run this and attach the output:
> $ date && scontrol show federation
2. For all 3 clusters, get slurmctld backtraces:
> $ sudo gdb -ex 'set pagination off' -ex 't a a bt f' -ex 'set print pretty on' -batch -p $(pgrep slurmctld) > slurmctld_gdb.txt
3. On the slurmdbd machine, get a slurmdbd backtrace:
> $ sudo gdb -ex 'set pagination off' -ex 't a a bt f' -ex 'set print pretty on' -batch -p $(pgrep slurmdbd) > slurmdbd_gdb.txt
4. Attach slurmdbd and slurmctld logs.

Let me know if you have any questions.

Any updates on this? I understand that getting backtraces from each slurmctld and the slurmdbd is quite cumbersome, but it's probably the best thing to do here since I'm not gathering any useful information from the logs.

Sorry - I was traveling and missed your request. I haven't seen any more out-of-sync conditions for a while, and I'm getting ready to upgrade to 24.05.2.

This may not be applicable any more. Let's leave it open a little longer; if I don't see the issue again and I can get the upgrade done, then let's assume it no longer applies.

I will let you know.

(In reply to orcuttlk from comment #48)
> Sorry - I was traveling and missed your request. I haven't seen any more out
> of sync conditions for a while and I'm getting ready to upgrade to 24.05.2.
>
> This may not be applicable any more. Lets leave it open a little longer and
> if I don't see the issue again and I can get the upgrade done then lets
> assume it no longer applies.
>
> I will let you know.

Okay, thanks for the update. Setting to sev 4.

Closing now.
Please reopen or open a new ticket if you see this issue again. (A new ticket referencing this one would probably be best, for the sake of a cleaner history to look at.)
Created attachment 31687 [details]
slurmctld log and slurm.conf

We have been having issues on our federated AFW cluster for a while now where federation gets out of sync between the clusters. It used to happen only about once a month or so, but recently, since the 23.02 upgrade, we've been seeing it much more frequently, on the order of every day. We have Nagios checks to detect this, and it disrupts scheduling every time it gets out of sync: only the primary sibling will get jobs scheduled on it.

So far the only debug information I can find is the following:

[2023-08-02T18:20:31.491] error: slurm_receive_msg: No response to persist_init
[2023-08-02T18:20:31.491] error: _agent_thread: Failed to send RPC: Unspecified error

That happens across all three systems. One is the login cluster and the two others are compute clusters, all of which are in the same afw federated instance. I've tried enabling the federation debug flag and boosting the debug level to 2, but still don't see anything else in those time frames that reveals any more information about this.

This sometimes also causes issues with slurmdbd and MariaDB. We are starting to see instances of:

[2023-08-10T09:10:43.033] thread_count over limit (100), waiting

They don't always line up with the federation sync breaking, but they may possibly have something to do with it.

Is there something more I can enable to try to debug this issue better?