We are aware of a few sync issues with federation and are handling them in an internal bug. Once we have a patch ready, are you interested in trying it out to see if it will solve your sync issues?
> I've tried enabling the federated debug flag and boosting debug level to 2, but still don't see anything else in those time frames that reveal any more information about this.
> Is there something more I can enable to try and debug this issue better?
The federation debug flag is the best option, but I don't see any attached logs with that flag turned on. Do you have any logs with the flag enabled? They could give us some context around what is happening.
One cause of federation sync issues is network problems between the clusters. We have simulated this by using iptables to temporarily block ports. Do you occasionally have transient network problems between the clusters?
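For reference, a minimal sketch of that kind of simulation, assuming port 6817 (the SlurmctldPort seen in the logs in this ticket). The commands are only echoed here, since applying them requires root; run the printed lines on one controller to drop sibling traffic, then run the matching `-D` lines to restore it:

```shell
PORT=6817   # SlurmctldPort used by the sibling controllers in this ticket
# Print the rules that would briefly black-hole federation traffic:
echo "iptables -A INPUT  -p tcp --dport $PORT -j DROP"
echo "iptables -A OUTPUT -p tcp --dport $PORT -j DROP"
# ...wait for the desync to show up in the logs, then print the cleanup:
echo "iptables -D INPUT  -p tcp --dport $PORT -j DROP"
echo "iptables -D OUTPUT -p tcp --dport $PORT -j DROP"
```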
Created attachment 31755 [details]
miller clusters debug logs
Trying again with federation enabled.
Created attachment 31756 [details]
fawbush logs
We aren't aware of any network issues between the controller nodes, but I'll ask our networking team to investigate. Here are the detections of the clusters being out of sync over the last few days. These times are in Eastern; some of the attached logs are in UTC.

Service Warning [2023-08-11 12:53:48] SERVICE ALERT: afw-slurm.afw.ccs.ornl.gov;SLURM_SYNC;WARNING;SOFT;1;WARNING: Slurm is out of sync with cluster(s): Sibling: miller:172.30.167.196:6817 ID:1 FedState:ACTIVE Features:green PersistConnSend/Recv:No/No Synced:No
Service Warning [2023-08-13 10:18:28] SERVICE ALERT: afw-slurm.afw.ccs.ornl.gov;SLURM_SYNC;WARNING;SOFT;1;WARNING: Slurm is out of sync with cluster(s): Sibling: miller:172.30.167.196:6817 ID:1 FedState:ACTIVE Features:green PersistConnSend/Recv:No/No Synced:No

Are there more clusters in the federation besides miller and fawbush?

Yes, the login cluster, but it doesn't normally get out of sync. I can turn on debugging on it as well and capture the next event.

Created attachment 31879 [details]
all fed controller logs
Incident today starting at about 11:28AM.
Which cluster(s) was/were out of sync? Did you by chance run `scontrol -M<clustername> show fed` on each cluster? Also, we have a patch that in theory should correct out-of-sync issues. I am unable to reproduce it, but do you have a test environment where you are able to reproduce it and can test a patch?

The miller cluster was out of sync. I only ran `scontrol show fed` on the afw cluster, though:

[root@hallc-mgmt02.hallc ~]# scontrol show fed
Federation: usafw
Self:    afw:172.30.254.230:6817 ID:2 FedState:ACTIVE Features:
Sibling: fawbush:172.30.167.197:6817 ID:3 FedState:ACTIVE Features:green PersistConnSend/Recv:Yes/Yes Synced:Yes
Sibling: miller::0 ID:1 FedState:ACTIVE Features:green PersistConnSend/Recv:Yes/Yes Synced:No

Yes, we'd be willing to run a patch; we are running 23.02.4.

Paul

> Yes, we'd be willing to run a patch, we are running 23.02.4.
Okay, I am working to get it reviewed so we can actually share the patch with you.
Created attachment 31958 [details]
close conn patch

Hey Paul,

Would you be willing to run with this attached patch plus the following commits that will be in 23.02.5?

https://github.com/SchedMD/slurm/commit/81b247cebc
https://github.com/SchedMD/slurm/commit/8aa6000e2c

Thanks,
Brian

That is fine. Do you know when you'll be releasing .5? We need to schedule the update with our customer, so a rough release date would help us plan with them.

Thanks,
Paul

Tentative date is Aug. 31st.

Just a heads up: we've pushed the tentative date for 23.02.5 to Sept. 5.

Another update on the release date for 23.02.5: the tentative date is now Sept. 7. It will probably happen on that day; if it does not, expect it very soon after. We want to release as soon as we can, but have a few fixes that we'd like to get in first. Have you had a chance to test with the patch that Brian uploaded and commits 81b247cebc and 8aa6000e2c?

No. When I first read that message I didn't realize you were saying we could apply those to the .4 version; I didn't read it closely enough. It looks like those patches only affect slurmdbd and slurmctld, right? We might be able to get them built before the .5 release, but can't guarantee we can get the update scheduled before the release is out anyway.

Yes, the commits and patches only affect slurmdbd and slurmctld. Don't worry about trying to test before 23.02.5. It would be helpful to see whether you hit this bug on a vanilla 23.02.5 (without the close conn patch Brian uploaded here), and then to test the close conn patch. The close conn patch fixes a couple of possible race conditions by locking in places that need it. And then, if for some reason the clusters get out of sync, the patch will let the clusters try to get in sync for some time.
If that does not happen, the cluster will close the connection to the other cluster, allowing a chance to re-open the connection and get in sync again.

We had another synchronization loss today after upgrading to .5 on Wednesday. I don't have any additional logging with federation debug flags set, though. We'll try the patch next and see if that helps.

Thanks for the update. The patch should make slurmctld detect a desync, then close and re-open the connection between clusters to recover from it. Have you had any out-of-sync issues since applying the patch?

The patch is going into place this morning. We'll let you know how it goes.

Were you able to deploy the patch?

Yes, we applied the patch and haven't seen an incident of the clusters losing sync since then. We'll keep you updated if we see it go out of sync again.

Clusters lost federation sync this morning. Debug flags were not enabled after the last patch, but I can upload logs if you would like. I've re-enabled federation debug flags for the clusters for next time.

(In reply to Paul Peltz from comment #29)
> Clusters lost federation sync this morning. Debug flags were not enabled
> after the last patch, but I can upload logs if you would like. I've
> reenabled federation debug flags for the clusters for next time.

Did it automatically recover, or did you manually restart the slurmctld's to fix the desync? To clarify, this patch does not prevent federation desync from occurring; it is designed to automatically recover from a desync, though that may take some time (perhaps a few minutes).

No, it has not recovered. Our Nagios check indicates it has been over 50 minutes since they lost sync, and they are still in that state.

Okay, thanks. I would appreciate the logs that you have, even if not all of them have the federation debug flag.

Created attachment 32664 [details]
20231010 desynch
Created attachment 32690 [details]
second incident
This is the second incident today of losing sync and not reconnecting, at approximately 20:00 GMT in the logs.
Created attachment 33356 [details]
gdb capture
Larry captured the gdb output of another slurmdbd issue this afternoon.
Unfortunately, the gdb output to get threads is useless because no debug symbols exist.
Also unfortunately, no federation debug flag output appears in the 20231010 logs, so there's not much to go on. Either the flag wasn't enabled, or you had SlurmctldDebug=info rather than verbose; the debug flags only log at the verbose level. When you enabled the federation debug flag, was the slurmctld log level at info, or at verbose or higher?
The little that I do see:
afw:
[2023-10-10T12:27:58.980] error: slurm_receive_msg: No response to persist_init
[2023-10-10T12:27:58.980] error: _agent_thread: Failed to send RPC: Unspecified error
[2023-10-10T13:38:55.517] error: _conn_readable: persistent connection for fd 17 experienced error[104]: Connection reset by peer
afw was up and not synced. Looks like afw tried to send a connection request but that failed.
fawbush:
[2023-10-10T12:27:31.957] fed_mgr_sibs_synced: sibling afw up but not synced yet
[2023-10-10T12:27:31.957] sched: schedule() returning, federation siblings not synced yet
miller:
[2023-10-10T12:27:28.952] error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:172.30.167.197:6817: Connection refused
[2023-10-10T12:27:28.952] error: fed_mgr: Unable to open connection to cluster fawbush using host 172.30.167.197(6817)
[2023-10-10T15:06:20.180] error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:172.30.254.230:6817: Connection refused
[2023-10-10T15:06:20.180] error: fed_mgr: Unable to open connection to cluster afw using host 172.30.254.230(6817)
Failed to connect with fawbush and afw.
man connect:
ECONNREFUSED
A connect() on a stream socket found no one listening on the remote address.
So the other clusters were not listening when miller tried to connect. That can happen if they are not up yet.
However, the other clusters should send a connection request when they come up.
I'm not sure what happened, and unfortunately it's really hard to know what happened without the federation debug flag.
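As an aside, the ECONNREFUSED case from the man page above is easy to reproduce from the shell by connecting to a port with no listener. The port below (59817) is an arbitrary unused port standing in for SlurmctldPort; this is only an illustration, not something from the ticket:

```shell
# bash's /dev/tcp pseudo-device performs a connect(); with no listener
# on the port, connect() fails with ECONNREFUSED, as in miller's logs.
if bash -c 'exec 3<>/dev/tcp/127.0.0.1/59817' 2>/dev/null; then
  echo "connected"
else
  echo "connection refused: no one listening on the remote address"
fi
```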
Can you enable the federation debug flag and ensure that the slurmctld log level is at verbose or higher on all three clusters and let it stay up until this happens again?
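For reference, a sketch of the settings being asked for, assuming an otherwise default slurm.conf (debug-flag messages are only emitted at verbose or higher):

```
SlurmctldDebug=verbose
DebugFlags=Federation
```

The same can be applied at runtime (until the next slurmctld restart) with `scontrol setdebug verbose` and `scontrol setdebugflags +federation`.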
Hello,

I've been working on a couple of race conditions I've noticed with federation sync issues. If we could get your logs with the federation flag enabled and a higher debug level (preferably "debug", but at least "verbose") as Marshall mentioned, I may be able to determine whether the issues I'm working on are related to your problems, and/or analyze your logs to find new problems.

Also, I am curious: do you often restart slurmctld's in rapid succession, i.e. one right after the other? The race conditions I've seen include this scenario, so if you are doing this, staggering their starts may help in the meantime while we try to find and fix the issue.

Also, please give an update on how often this has been happening and whether you've figured anything else out regarding the federation sync issues.

Hi Ben - Thanks for the update. The out-of-sync condition is happening less often now; I would guess it drops out of sync about every 7-10 days (although it happened twice yesterday). It is a 3-cluster federation. Most commonly the clusters "afw" and "fawbush" drop out of sync. The system appears to continue operating normally, and sometimes it resyncs itself within a few hours. The rest of the time I clear the alert by restarting the fawbush controller. It looks like we have the controller logs set to debug, so I will grab the last few days' worth on all 3 controllers and upload them.

Created attachment 36284 [details]
cluster logs
Ignore that upload - it didn't have the federation log flag set. I'm getting it set now and will pull logs again after the next occurrence.

(In reply to orcuttlk from comment #42)
> ignore that upload - it didn't have the federation log flag set. I'm getting
> it set now and will pull logs again after the next occurrence.

Okay, sounds good.

I just saw an out-of-sync condition occur. I restarted the "fawbush" controller, verified that it was back in sync, and immediately grabbed the logs from all 4 nodes. Uploading now.

Created attachment 36365 [details]
logs during federation out of sync episode
Unfortunately those logs don't really reveal anything more about the out-of-sync issue. I see a small hiccup after you restart fawbush, but things seemed to recover smoothly after that restart, right?

Let's get a deeper look at what's going on in each process. Please gather the following when an out-of-sync event happens:

1. On the login cluster (the one that seems to not go out of sync), run this and attach the output:
> $ date && scontrol show federation
2. For all 3 clusters, get slurmctld backtraces:
> $ sudo gdb -ex 'set pagination off' -ex 't a a bt f' -ex 'set print pretty on' -batch -p $(pgrep slurmctld) > slurmctld_gdb.txt
3. On the slurmdbd machine, get a slurmdbd backtrace:
> $ sudo gdb -ex 'set pagination off' -ex 't a a bt f' -ex 'set print pretty on' -batch -p $(pgrep slurmdbd) > slurmdbd_gdb.txt
4. Attach slurmdbd and slurmctld logs.

Let me know if you have any questions.

Any updates on this? I understand that getting backtraces from each slurmctld and the slurmdbd is quite cumbersome, but it's probably the best thing to do here since I'm not gathering any useful information from the logs.

Sorry - I was traveling and missed your request. I haven't seen any more out-of-sync conditions for a while, and I'm getting ready to upgrade to 24.05.2.

This may not be applicable any more. Let's leave it open a little longer; if I don't see the issue again and I can get the upgrade done, then let's assume it no longer applies.

I will let you know.

(In reply to orcuttlk from comment #48)
> Sorry - I was traveling and missed your request. I haven't seen any more out
> of sync conditions for a while and I'm getting ready to upgrade to 24.05.2.
>
> This may not be applicable any more. Lets leave it open a little longer and
> if I don't see the issue again and I can get the upgrade done then lets
> assume it no longer applies.
>
> I will let you know.

Okay, thanks for the update. Setting to sev 4.

Closing now.
Please reopen or open a new ticket if you see this issue again. (A new ticket referencing this one would probably be best, for the sake of a cleaner history to look at.)
Created attachment 31687 [details]
slurmctld log and slurm.conf

We have been having issues on our federated AFW cluster for a while now where federation gets out of sync between the clusters. It used to happen only about once a month or so, but recently, since the 23.02 upgrade, we've been seeing it much more frequently, on the order of every day. We have Nagios checks to detect this, and it disrupts scheduling every time it gets out of sync: only the primary sibling will get jobs scheduled on it.

So far the only debug information I can find is the following:

[2023-08-02T18:20:31.491] error: slurm_receive_msg: No response to persist_init
[2023-08-02T18:20:31.491] error: _agent_thread: Failed to send RPC: Unspecified error

That happens across all three systems. One is the login cluster and the two others are compute clusters, all of which are in the same afw federated instance. I've tried enabling the federation debug flag and boosting the debug level to 2, but still don't see anything else in those time frames that reveals any more information about this.

This sometimes also causes issues with slurmdbd and MariaDB. We are starting to see instances of:

[2023-08-10T09:10:43.033] thread_count over limit (100), waiting

They don't always line up with the federation sync breaking, but they may possibly have something to do with it.

Is there something more I can enable to try to debug this issue better?