Summary: | Unable to register with dbd | ||
---|---|---|---|
Product: | Slurm | Reporter: | Matt Ezell <ezellma> |
Component: | Database | Assignee: | Stephen Kendall <stephen> |
Status: | OPEN --- | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | stephen |
Version: | 24.11.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | ORNL-OLCF | ||
Description
Matt Ezell
2025-04-01 08:57:37 MDT
We have 2 clusters setup as `AccountingStorageExternalHost`. They are still running 24.05, so those connections likely don't work. If I comment out that directive, I no longer see the error.

From reading your description, it sounds like you have a 'slurmctld' on 24.11.3 trying to connect to a 'slurmdbd' on 24.05.x through 'AccountingStorageExternalHost'. This is expected to fail, as the 'slurmdbd' should always be the same version or newer than other Slurm components. The solution would be to either disable the 'AccountingStorageExternalHost' field, or to upgrade the 'slurmdbd' components specified to 24.11.3. (It is not necessary to upgrade other components in those clusters.)
https://slurm.schedmd.com/upgrades.html#slurmdbd

Are there any issues you're still seeing after commenting out the 'AccountingStorageExternalHost'? If so, please enable the 'Protocol' debug flag on the controller temporarily and provide more recent log entries.
https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags

Best regards,
Stephen

(In reply to Stephen Kendall from comment #3)
> From reading your description, it sounds like you have a 'slurmctld' on
> 24.11.3 trying to connect to a 'slurmdbd' on 24.05.x through
> 'AccountingStorageExternalHost'. This is expected to fail, as the 'slurmdbd'
> should always be the same version or newer than other Slurm components. The
> solution would be to either disable the 'AccountingStorageExternalHost'
> field, or to upgrade the 'slurmdbd' components specified to 24.11.3. (It is
> not necessary to upgrade other components in those clusters.)
> https://slurm.schedmd.com/upgrades.html#slurmdbd

Understood. My concern is that the 2 AccountingStorageExternalHost entries failing should not cause the connection to the main slurmdbd (at the correct version) to also fail.

> Are there any issues you're still seeing after commenting out the
> 'AccountingStorageExternalHost'? If so, please enable the 'Protocol' debug
> flag on the controller temporarily and provide more recent log entries.
> https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags

No, things are working correctly with those disabled. I dropped this to sev-4. I would appreciate if you can check if the above interaction is a bug.

> My concern is that the 2 AccountingStorageExternalHost entries failing should not
> cause the connection to the main slurmdbd (at the correct version) to also fail.
> . . .
> I would appreciate if you can check if the above interaction is a bug.
That's a good question. I am able to replicate those errors in the main 'slurmdbd' log file with a similar mixed-version setup. So far it seems like the errors are just cosmetic and don't actually interfere with the main accounting storage system. If there are any functional issues with the cluster, that would definitely be a bug, so let us know if you see any indications of such issues. We'll keep looking on our end and see what else we can find.
Best regards,
Stephen
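
For anyone checking a similar mixed-version setup, the rule quoted above ("the 'slurmdbd' should always be the same version or newer than other Slurm components") can be sketched as a small check. This is only an illustration, not Slurm code: the function names `parse_version` and `dbd_is_compatible` are hypothetical, and comparing only the first two version fields (the major release, e.g. 24.05 vs 24.11) is an assumption made here.

```python
def parse_version(text):
    """Turn a version string such as '24.11.3' into a tuple of ints, e.g. (24, 11, 3)."""
    return tuple(int(part) for part in text.split("."))


def dbd_is_compatible(client_version, dbd_version):
    """Return True if the slurmdbd release is the same as or newer than the client's.

    Hypothetical helper: only the first two fields (assumed here to identify
    the major release) are compared.
    """
    return parse_version(dbd_version)[:2] >= parse_version(client_version)[:2]


# The situation in this ticket: slurmctld on 24.11.3, external slurmdbd on 24.05.
print(dbd_is_compatible("24.11.3", "24.05"))    # False
# The main slurmdbd at the matching version.
print(dbd_is_compatible("24.11.3", "24.11.3"))  # True
```

Running a check like this against each host listed in 'AccountingStorageExternalHost' would flag the two 24.05 clusters before the controller tries to register with them.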