Summary: | Unable to register with dbd | ||
---|---|---|---|
Product: | Slurm | Reporter: | Matt Ezell <ezellma> |
Component: | Database | Assignee: | Oscar Hernández <oscar.hernandez> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | stephen |
Version: | 24.11.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | ORNL-OLCF | Slinky Site: | --- |
Description
Matt Ezell
2025-04-01 08:57:37 MDT
We have 2 clusters set up as `AccountingStorageExternalHost`. They are still running 24.05, so those connections likely don't work. If I comment out that directive, I no longer see the error.

From reading your description, it sounds like you have a 'slurmctld' on 24.11.3 trying to connect to a 'slurmdbd' on 24.05.x through 'AccountingStorageExternalHost'. This is expected to fail, as the 'slurmdbd' should always be the same version or newer than other Slurm components. The solution would be to either disable the 'AccountingStorageExternalHost' field, or to upgrade the 'slurmdbd' components specified to 24.11.3. (It is not necessary to upgrade other components in those clusters.)
https://slurm.schedmd.com/upgrades.html#slurmdbd

Are there any issues you're still seeing after commenting out the 'AccountingStorageExternalHost'? If so, please enable the 'Protocol' debug flag on the controller temporarily and provide more recent log entries.
https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags

Best regards,
Stephen

(In reply to Stephen Kendall from comment #3)
> From reading your description, it sounds like you have a 'slurmctld' on
> 24.11.3 trying to connect to a 'slurmdbd' on 24.05.x through
> 'AccountingStorageExternalHost'. This is expected to fail, as the 'slurmdbd'
> should always be the same version or newer than other Slurm components. The
> solution would be to either disable the 'AccountingStorageExternalHost'
> field, or to upgrade the 'slurmdbd' components specified to 24.11.3. (It is
> not necessary to upgrade other components in those clusters.)
> https://slurm.schedmd.com/upgrades.html#slurmdbd

Understood. My concern is that the 2 AccountingStorageExternalHost entries failing should not cause the connection to the main slurmdbd (at the correct version) to also fail.

> Are there any issues you're still seeing after commenting out the
> 'AccountingStorageExternalHost'? If so, please enable the 'Protocol' debug
> flag on the controller temporarily and provide more recent log entries.
> https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags

No, things are working correctly with those disabled. I dropped this to sev-4. I would appreciate it if you could check whether the above interaction is a bug.

> My concern is that the 2 AccountingStorageExternalHost entries failing should not
> cause the connection to the main slurmdbd (at the correct version) to also fail.
> . . .
> I would appreciate if you can check if the above interaction is a bug.
That's a good question. I am able to replicate those errors in the main 'slurmdbd' log file with a similar mixed-version setup. So far it seems like the errors are just cosmetic and don't actually interfere with the main accounting storage system. If there are any functional issues with the cluster, that would definitely be a bug, so let us know if you see any indications of such issues. We'll keep looking on our end and see what else we can find.
Best regards,
Stephen
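
For reference, the version rule cited from comment #3 (the 'slurmdbd' should be the same version or newer than the components connecting to it) can be checked before pointing a controller at an external slurmdbd. This is a minimal Python sketch, not anything Slurm ships: the helper names `parse_release` and `dbd_accepts` are hypothetical, and it assumes that only the major release (the `YY.MM` part) matters for the comparison.

```python
def parse_release(version: str) -> tuple[int, int]:
    """Return the (year, month) release from a string such as '24.11.3' or '24.05.x'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def dbd_accepts(ctld_version: str, dbd_version: str) -> bool:
    """True if slurmdbd is on the same release as slurmctld, or a newer one.

    Assumption: the micro version (third field) is ignored.
    """
    return parse_release(dbd_version) >= parse_release(ctld_version)


# The combination described in this ticket: slurmctld 24.11.3, external slurmdbd 24.05.x.
print(dbd_accepts("24.11.3", "24.05.x"))  # False
print(dbd_accepts("24.11.3", "24.11.3"))  # True
```

The sketch only encodes the "same or newer" rule quoted in the thread; it says nothing about how many releases back a newer slurmdbd will still accept.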
Hi Matt,

Just updating to let you know that I have been investigating the issues you reported. I have been running some tests setting AccountingStorageExternalHost and upgrading the controller and slurmdbd in different ways, but I have not been able to reproduce the issue. In fact, when the controller tried to contact slurmdbds running older versions, I got errors like:

>[2025-05-01T11:51:09.015] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 2302 with slurmdbd localhost:23332
>[2025-05-01T11:51:09.030] error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:23332: Failed to unpack SLURM_PERSIST_INIT message
>[2025-05-01T11:51:09.030] error: Sending PersistInit msg: Incompatible versions of client and server code

These directly report the expected version mismatch error that Stephen mentioned. But, according to the logs you shared, that was not your case.

The only way I have been able to get the same errors you saw is by setting the same host in AccountingStorageHost and AccountingStorageExternalHost:

>AccountingStorageHost = localhost
>AccountingStoragePort = 23020
>AccountingStorageExternalHost = localhost:23020,localhost:23020

With that config, I get slurmdbd to complain with the same security violation you got:

>[2025-05-01T13:39:08.002] error: CONN:12 Security violation, DBD_REGISTER_CTLD

This is expected, because the same slurmdbd service cannot be registered as external for a cluster for which it is already the main one. The controller also attempted to connect every 5s, twice each time because I had 2 entries in AccountingStorageExternalHost.

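
The configuration overlap described above can be checked with a short script. The following Python sketch is illustrative only: `parse_conf` and `external_host_problems` are hypothetical helpers, not Slurm tools. It assumes `Key = Value` lines with case-insensitive keys, and it assumes that an external entry without a port falls back to AccountingStoragePort (or 6819 if that is also unset). It compares names as plain text, so two different hostnames that resolve to the same slurmdbd would not be caught.

```python
def parse_conf(text: str) -> dict[str, str]:
    """Parse 'Key = Value' lines into a dict with lower-cased keys; '#' starts a comment."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip().lower()] = value.strip()
    return conf


def external_host_problems(text: str, default_port: str = "6819") -> list[str]:
    """List external entries that repeat the main slurmdbd or an earlier external entry."""
    conf = parse_conf(text)
    main_port = conf.get("accountingstorageport", default_port)  # assumed fallback
    main = (conf.get("accountingstoragehost", ""), main_port)
    seen = set()
    problems = []
    for entry in conf.get("accountingstorageexternalhost", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        endpoint = (host, port or main_port)
        if endpoint == main:
            problems.append(f"{entry}: same as AccountingStorageHost")
        elif endpoint in seen:
            problems.append(f"{entry}: listed more than once")
        seen.add(endpoint)
    return problems


# The reproduction config from this comment.
sample_conf = """\
AccountingStorageHost = localhost
AccountingStoragePort = 23020
AccountingStorageExternalHost = localhost:23020,localhost:23020
"""

for problem in external_host_problems(sample_conf):
    print(problem)
```

With the reproduction config, both external entries are flagged as matching the main slurmdbd.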
Looking at your logs, it seems that the controller was also trying to register twice with the same host (slurmdbd slurm2.frontier.olcf.ornl.gov:6819):

>[2025-04-01T10:50:39.001] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
>[2025-04-01T10:50:39.004] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819

So, my current theory is that the controller never tried to connect to the older slurmdbds (we did not see version mismatch errors), but was instead repeatedly trying to connect to the same (updated) slurmdbd. I would appreciate it if you could help me with these questions:

By any chance, do you still have the original line that caused the issues (AccountingStorageExternalHost)?
Could it be that the hosts defined there both resolved to the updated slurmdbd?
Is slurm2.frontier.olcf.ornl.gov:6819 the main slurmdbd host?

Thanks for your patience,
Oscar

Hi Matt,

As mentioned in my last comment, I was able to reproduce your reported behavior, but only under circumstances in which that would be considered expected. I will proceed with resolving this one. But if you can provide any of the extra details requested, feel free to re-open it.

Kind regards,
Oscar
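
One way to test the theory above against a slurmctld log is to count which slurmdbd endpoints the controller reports registering with. This is a rough Python sketch under stated assumptions: the regular expression is based only on the two "Registering slurmctld" lines quoted in this ticket, and `registration_targets` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern taken from the log lines quoted in this ticket.
REGISTER_RE = re.compile(r"Registering slurmctld at port \d+ with slurmdbd (\S+)")


def registration_targets(log_text: str) -> Counter:
    """Count registration attempts per slurmdbd host:port found in the log text."""
    return Counter(match.group(1) for match in REGISTER_RE.finditer(log_text))


# The two lines from the reporter's log, as quoted above.
sample_log = """\
[2025-04-01T10:50:39.001] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
[2025-04-01T10:50:39.004] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
"""

for target, count in registration_targets(sample_log).items():
    print(target, count)
```

If a configuration lists several external hosts but only one distinct endpoint ever appears in the output, that would be consistent with all entries pointing at the same slurmdbd.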