Ticket 22477

Summary: Unable to register with dbd
Product: Slurm
Reporter: Matt Ezell <ezellma>
Component: Database
Assignee: Oscar Hernández <oscar.hernandez>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
CC: stephen
Version: 24.11.3
Hardware: Linux
OS: Linux
Site: ORNL-OLCF

Description Matt Ezell 2025-04-01 08:57:37 MDT
Hopefully this is an easy issue - normally I would do more research myself, but I'm in the middle of an upgrade/outage.

I updated the slurmdbd and slurmctld to 24.11.3, but the slurmctld is failing to register with the slurmdbd.

Every 5 seconds, the controller is logging:
[2025-04-01T10:50:34.004] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
[2025-04-01T10:50:39.001] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
[2025-04-01T10:50:39.004] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
[2025-04-01T10:50:44.001] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
[2025-04-01T10:50:44.003] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819

And the database logs:
[2025-04-01T10:50:39.001] debug:  REQUEST_PERSIST_INIT: CLUSTER:frontier VERSION:10752 UID:6826 IP:10.128.0.19 CONN:15
[2025-04-01T10:50:39.001] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: request new connection 1
[2025-04-01T10:50:39.001] debug2: Attempting to connect to localhost:3306
[2025-04-01T10:50:39.001] debug2: DBD_REGISTER_CTLD: called in CONN 15 for frontier(6817)
[2025-04-01T10:50:39.001] debug2: slurmctld at ip:10.128.0.19, port:6817
[2025-04-01T10:50:39.002] error: CONN:15 Security violation, DBD_REGISTER_CTLD
[2025-04-01T10:50:39.002] error: Processing last message from connection 15(10.128.0.19) uid(6826)
[2025-04-01T10:50:39.003] debug:  REQUEST_PERSIST_INIT: CLUSTER:frontier VERSION:10752 UID:6826 IP:10.128.0.19 CONN:13
[2025-04-01T10:50:39.003] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: request new connection 1
[2025-04-01T10:50:39.003] debug2: Attempting to connect to localhost:3306
[2025-04-01T10:50:39.004] debug2: DBD_REGISTER_CTLD: called in CONN 13 for frontier(6817)
[2025-04-01T10:50:39.004] debug2: slurmctld at ip:10.128.0.19, port:6817
[2025-04-01T10:50:39.005] error: CONN:13 Security violation, DBD_REGISTER_CTLD
[2025-04-01T10:50:39.005] error: Processing last message from connection 13(10.128.0.19) uid(6826)



[root@slurm1.frontier ~]# scontrol show config|grep SlurmUser
SlurmUser               = slurm(6826)
[root@slurm1.frontier ~]# id slurm
uid=6826(slurm) gid=9526(slurm) groups=9526(slurm),27493(pkpass),27480(stf002everest),2324(stf002),7832(cluster-admins),2001(systems),1099(ccsstaff),22738(service),24665(globus-sharing),26694(nccsstaff),2046(everest),2075(vizstaff),28724(stf020),24121(mfa4),25385(vendordistsw)


[root@slurm2.frontier ~]# grep SlurmUser /etc/slurm/slurmdbd.conf 
SlurmUser=slurm
[root@slurm2.frontier ~]# id slurm
uid=6826(slurm) gid=9526(slurm) groups=9526(slurm),27493(pkpass),27480(stf002everest),2324(stf002),7832(cluster-admins),2001(systems),1099(ccsstaff),22738(service),24665(globus-sharing),26694(nccsstaff),2046(everest),2075(vizstaff),28724(stf020),24121(mfa4),25385(vendordistsw)
Comment 1 Matt Ezell 2025-04-01 09:01:25 MDT
We have 2 clusters set up in `AccountingStorageExternalHost`. They are still running 24.05, so those connections likely don't work. If I comment out that directive, I no longer see the error.
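
For reference, that directive takes a comma-separated list of external slurmdbd host[:port] entries. A sketch of the commented-out slurm.conf line, with placeholder hostnames rather than our actual hosts:

#AccountingStorageExternalHost=dbd.cluster-a.example.org:6819,dbd.cluster-b.example.org:6819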
Comment 3 Stephen Kendall 2025-04-01 10:07:47 MDT
From reading your description, it sounds like you have a 'slurmctld' on 24.11.3 trying to connect to a 'slurmdbd' on 24.05.x through 'AccountingStorageExternalHost'. This is expected to fail, as the 'slurmdbd' should always be the same version or newer than other Slurm components. The solution would be to either disable the 'AccountingStorageExternalHost' field, or to upgrade the 'slurmdbd' components specified to 24.11.3. (It is not necessary to upgrade other components in those clusters.)
https://slurm.schedmd.com/upgrades.html#slurmdbd
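
As a quick sanity check, each Slurm daemon reports its version with '-V', so you can confirm what each host is actually running, e.g.:

slurmdbd -V
slurmctld -V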

Are there any issues you're still seeing after commenting out the 'AccountingStorageExternalHost'? If so, please enable the 'Protocol' debug flag on the controller temporarily and provide more recent log entries.
https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags
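
The 'Protocol' flag can be toggled at runtime without restarting the controller:

scontrol setdebugflags +Protocol
scontrol setdebugflags -Protocol

Alternatively, set 'DebugFlags=Protocol' in slurm.conf and run 'scontrol reconfigure'.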

Best regards,
Stephen
Comment 4 Matt Ezell 2025-04-01 10:17:46 MDT
(In reply to Stephen Kendall from comment #3)
> From reading your description, it sounds like you have a 'slurmctld' on
> 24.11.3 trying to connect to a 'slurmdbd' on 24.05.x through
> 'AccountingStorageExternalHost'. This is expected to fail, as the 'slurmdbd'
> should always be the same version or newer than other Slurm components. The
> solution would be to either disable the 'AccountingStorageExternalHost'
> field, or to upgrade the 'slurmdbd' components specified to 24.11.3. (It is
> not necessary to upgrade other components in those clusters.)
> https://slurm.schedmd.com/upgrades.html#slurmdbd

Understood. My concern is that the 2 AccountingStorageExternalHost entries failing should not cause the connection to the main slurmdbd (at the correct version) to also fail.

> Are there any issues you're still seeing after commenting out the
> 'AccountingStorageExternalHost'? If so, please enable the 'Protocol' debug
> flag on the controller temporarily and provide more recent log entries.
> https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags

No, things are working correctly with those disabled. I dropped this to sev-4. I would appreciate it if you could check whether the above interaction is a bug.
Comment 5 Stephen Kendall 2025-04-01 11:45:51 MDT
> My concern is that the 2 AccountingStorageExternalHost entries failing should not
> cause the connection to the main slurmdbd (at the correct version) to also fail.
> . . .
> I would appreciate if you can check if the above interaction is a bug.

That's a good question. I am able to replicate those errors in the main 'slurmdbd' log file with a similar mixed-version setup. So far the errors appear to be just cosmetic and don't actually interfere with the main accounting storage system. If there are any functional issues with the cluster, that would definitely be a bug, so let us know if you see any indications of such issues. We'll keep looking on our end and see what else we can find.

Best regards,
Stephen
Comment 8 Oscar Hernández 2025-05-23 11:14:13 MDT
Hi Matt,

Just updating to let you know that I have been investigating the issues you reported.

I have been running tests setting AccountingStorageExternalHost and upgrading the controller and slurmdbd in different combinations, but I have not been able to reproduce the issue.

In fact, when the controller tried to contact slurmdbds running older versions, I got errors like:

>[2025-05-01T11:51:09.015] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 2302 with slurmdbd localhost:23332
>[2025-05-01T11:51:09.030] error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:23332: Failed to unpack SLURM_PERSIST_INIT message
>[2025-05-01T11:51:09.030] error: Sending PersistInit msg: Incompatible versions of client and server code
These directly report the expected version mismatch error that Stephen mentioned. But, according to the logs you shared, that was not your case.

The only way I have been able to reproduce your exact errors is by setting the same host in both AccountingStorageHost and AccountingStorageExternalHost:

>AccountingStorageHost = localhost
>AccountingStoragePort = 23020
>AccountingStorageExternalHost = localhost:23020,localhost:23020
With that config, slurmdbd complains with the same security violation you got:

>[2025-05-01T13:39:08.002] error: CONN:12 Security violation, DBD_REGISTER_CTLD
This is expected, because a slurmdbd cannot be registered as external for a cluster it already serves as the main slurmdbd. The controller also attempted to connect every 5s, and twice per interval because I had 2 entries in AccountingStorageExternalHost.
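
For contrast, a valid setup would point AccountingStorageExternalHost at a different cluster's slurmdbd. Illustrative hosts/ports following my test convention above, not your site's values:

>AccountingStorageHost = localhost
>AccountingStoragePort = 23020
>AccountingStorageExternalHost = externalhost:23030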

Looking at your logs, it seems that the controller was also trying to register twice with the same host (slurm2.frontier.olcf.ornl.gov:6819):

>[2025-04-01T10:50:39.001] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
>[2025-04-01T10:50:39.004] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd slurm2.frontier.olcf.ornl.gov:6819
So, my current theory is that the controller never tried to connect to the older slurmdbds (we did not see version mismatch errors), but was instead connecting all the time to the same (updated) slurmdbd. I would appreciate it if you could help me with these questions:

By any chance, would you still have the original line that caused the issues (AccountingStorageExternalHost)? Could it be that the hosts defined there both resolved to the updated slurmdbd? Is slurm2.frontier.olcf.ornl.gov:6819 the main slurmdbd host?
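
For example, standard tooling can show how each configured hostname resolves, to rule out both entries pointing at the updated slurmdbd:

>getent hosts slurm2.frontier.olcf.ornl.gov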

Thanks for your patience,
Oscar
Comment 9 Oscar Hernández 2025-06-03 07:06:35 MDT
Hi Matt,

As mentioned in my last comment, I was able to reproduce your reported behavior, but only under circumstances in which it would be considered expected.

I will proceed to resolve this ticket. But if you can provide any of the extra details requested, feel free to reopen it.

Kind regards,
Oscar