Ticket 22477

Summary: Unable to register with dbd
Product: Slurm    Reporter: Matt Ezell <ezellma>
Component: Database    Assignee: Stephen Kendall <stephen>
Status: OPEN    QA Contact: ---
Severity: 4 - Minor Issue    
Priority: --- CC: stephen
Version: 24.11.3   
Hardware: Linux   
OS: Linux   
Site: ORNL-OLCF

Description Matt Ezell 2025-04-01 08:57:37 MDT
Hopefully this is an easy issue - normally I would do more research myself, but I'm in the middle of an upgrade/outage.

I updated the slurmdbd and slurmctld to 24.11.3, but the slurmctld is failing to register with the slurmdbd.

Every 5 seconds, the controller is logging:
[2025-04-01T10:50:34.004] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
[2025-04-01T10:50:39.001] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
[2025-04-01T10:50:39.004] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
[2025-04-01T10:50:44.001] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
[2025-04-01T10:50:44.003] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819

And the database logs:
[2025-04-01T10:50:39.001] debug:  REQUEST_PERSIST_INIT: CLUSTER:frontier VERSION:10752 UID:6826 IP:10.128.0.19 CONN:15
[2025-04-01T10:50:39.001] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: request new connection 1
[2025-04-01T10:50:39.001] debug2: Attempting to connect to localhost:3306
[2025-04-01T10:50:39.001] debug2: DBD_REGISTER_CTLD: called in CONN 15 for frontier(6817)
[2025-04-01T10:50:39.001] debug2: slurmctld at ip:10.128.0.19, port:6817
[2025-04-01T10:50:39.002] error: CONN:15 Security violation, DBD_REGISTER_CTLD
[2025-04-01T10:50:39.002] error: Processing last message from connection 15(10.128.0.19) uid(6826)
[2025-04-01T10:50:39.003] debug:  REQUEST_PERSIST_INIT: CLUSTER:frontier VERSION:10752 UID:6826 IP:10.128.0.19 CONN:13
[2025-04-01T10:50:39.003] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: request new connection 1
[2025-04-01T10:50:39.003] debug2: Attempting to connect to localhost:3306
[2025-04-01T10:50:39.004] debug2: DBD_REGISTER_CTLD: called in CONN 13 for frontier(6817)
[2025-04-01T10:50:39.004] debug2: slurmctld at ip:10.128.0.19, port:6817
[2025-04-01T10:50:39.005] error: CONN:13 Security violation, DBD_REGISTER_CTLD
[2025-04-01T10:50:39.005] error: Processing last message from connection 13(10.128.0.19) uid(6826)



[root@slurm1.frontier ~]# scontrol show config|grep SlurmUser
SlurmUser               = slurm(6826)
[root@slurm1.frontier ~]# id slurm
uid=6826(slurm) gid=9526(slurm) groups=9526(slurm),27493(pkpass),27480(stf002everest),2324(stf002),7832(cluster-admins),2001(systems),1099(ccsstaff),22738(service),24665(globus-sharing),26694(nccsstaff),2046(everest),2075(vizstaff),28724(stf020),24121(mfa4),25385(vendordistsw)


[root@slurm2.frontier ~]# grep SlurmUser /etc/slurm/slurmdbd.conf 
SlurmUser=slurm
[root@slurm2.frontier ~]# id slurm
uid=6826(slurm) gid=9526(slurm) groups=9526(slurm),27493(pkpass),27480(stf002everest),2324(stf002),7832(cluster-admins),2001(systems),1099(ccsstaff),22738(service),24665(globus-sharing),26694(nccsstaff),2046(everest),2075(vizstaff),28724(stf020),24121(mfa4),25385(vendordistsw)
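
For reference, both hosts were updated to 24.11.3 together; the daemons' own version output is the quickest confirmation (commands shown for completeness, both should report slurm 24.11.3):

slurmctld -V    # on slurm1 (controller)
slurmdbd -V     # on slurm2 (dbd)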
Comment 1 Matt Ezell 2025-04-01 09:01:25 MDT
We have 2 clusters set up as `AccountingStorageExternalHost` entries. They are still running 24.05, so those connections likely don't work. If I comment out that directive, I no longer see the error.
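
For illustration, the relevant slurm.conf section is roughly of this shape (the external hostnames below are placeholders, not our real values):

# main accounting storage; this slurmdbd is already on 24.11.3
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm2.frontier.olcf.ornl.gov
AccountingStoragePort=6819
# external dbd hosts for the two other clusters, still on 24.05;
# commenting this out makes the errors stop
#AccountingStorageExternalHost=ext-dbd-a.example.org:6819,ext-dbd-b.example.org:6819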
Comment 3 Stephen Kendall 2025-04-01 10:07:47 MDT
From reading your description, it sounds like you have a 'slurmctld' on 24.11.3 trying to connect to a 'slurmdbd' on 24.05.x through 'AccountingStorageExternalHost'. This is expected to fail, as the 'slurmdbd' should always be the same version or newer than other Slurm components. The solution would be to either disable the 'AccountingStorageExternalHost' field, or to upgrade the 'slurmdbd' components specified to 24.11.3. (It is not necessary to upgrade other components in those clusters.)
https://slurm.schedmd.com/upgrades.html#slurmdbd

Are there any issues you're still seeing after commenting out the 'AccountingStorageExternalHost'? If so, please enable the 'Protocol' debug flag on the controller temporarily and provide more recent log entries.
https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags
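
For example, the flag can be toggled at runtime without restarting the controller (a minimal illustration):

scontrol setdebugflags +Protocol    # enable protocol-level logging on the slurmctld
# ...let a few registration attempts occur, then collect slurmctld.log...
scontrol setdebugflags -Protocol    # turn the flag back off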

Best regards,
Stephen
Comment 4 Matt Ezell 2025-04-01 10:17:46 MDT
(In reply to Stephen Kendall from comment #3)
> From reading your description, it sounds like you have a 'slurmctld' on
> 24.11.3 trying to connect to a 'slurmdbd' on 24.05.x through
> 'AccountingStorageExternalHost'. This is expected to fail, as the 'slurmdbd'
> should always be the same version or newer than other Slurm components. The
> solution would be to either disable the 'AccountingStorageExternalHost'
> field, or to upgrade the 'slurmdbd' components specified to 24.11.3. (It is
> not necessary to upgrade other components in those clusters.)
> https://slurm.schedmd.com/upgrades.html#slurmdbd

Understood. My concern is that the 2 AccountingStorageExternalHost entries failing should not cause the connection to the main slurmdbd (at the correct version) to also fail.

> Are there any issues you're still seeing after commenting out the
> 'AccountingStorageExternalHost'? If so, please enable the 'Protocol' debug
> flag on the controller temporarily and provide more recent log entries.
> https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags

No, things are working correctly with those disabled. I dropped this to sev-4. I would appreciate it if you could check whether the above interaction is a bug.
Comment 5 Stephen Kendall 2025-04-01 11:45:51 MDT
> My concern is that the 2 AccountingStorageExternalHost entries failing should not
> cause the connection to the main slurmdbd (at the correct version) to also fail.
> . . .
> I would appreciate it if you could check whether the above interaction is a bug.

That's a good question. I am able to replicate those errors in the main 'slurmdbd' log file with a similar mixed-version setup. So far it seems like the errors are just cosmetic and don't actually interfere with the main accounting storage system. If there are any functional issues with the cluster, that would definitely be a bug, so let us know if you see any indications of such issues. We'll keep looking on our end and see what else we can find.
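
In the meantime, a couple of quick sanity checks on the main accounting path (suggestions only, adjust as you see fit):

sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC   # the ctld should show as registered with the main dbd
sacct -a -X -S now-2hours -o JobID,State,End | tail                # recent job records should be reaching the database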

Best regards,
Stephen