Ticket 19255

Summary:	slurmdbd not ready after systemctl start
Product:	Slurm	Reporter:	David Matthews <david.matthews>
Component:	slurmdbd	Assignee:	Felip Moll <felip.moll>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	23.02.7
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=8066
Site:	Met Office	Slinky Site:	---

Description David Matthews 2024-03-08 09:18:23 MST

(This is really a duplicate of Bug 18222 but that was not raised by a supported site.)

I've run into a scripting problem when trying to issue sacctmgr commands after starting the slurmdbd service using systemctl. I now realise this is because the slurmdbd service uses "Type=simple" (unlike the slurmctld service, which uses "Type=notify"). I can work around this (e.g. using sleep) now that I understand why it's happening, but it would be nicer if systemctl start waited for the service to be usable.

Comment 1 Felip Moll 2024-04-04 07:13:34 MDT

(In reply to David Matthews from comment #0)
> (This is really a duplicate of Bug 18222 but that was not raised by a
> supported site.)
> 
> I've run into a scripting problem when trying to issue sacctmgr commands
> after starting the slurmdbd service using systemctl. I now realise this is
> because the slurmdbd service uses "Type=simple" (unlike the slurmctld
> service which uses "Type=notify"). I can work around this (e.g. using sleep)
> now I understand why it's happening but it would be nicer if the systemctl
> start waited for the service to be usable.

Hi David,

We will look into that, but here are a couple of thoughts. On one hand, slurmctld can start up without slurmdbd: it caches the work and sends it to slurmdbd when it becomes available, so the dependency is not a hard one, and I am not sure why bug 18222 claims that slurmctld crashes if there is no dbd. On the other hand, slurmdbd sometimes needs to perform maintenance operations at startup, like a rollup or updating tables (after an upgrade). In that case the port might not be available yet even though the service has started, so systemd shouldn't consider the service down.

Can you please describe your exact workflow to see if there's something else you can do instead of relying on systemd reported status?

Comment 2 Felip Moll 2024-04-04 07:28:07 MDT

Also note that systemd ordering is not a guarantee. Frequently slurmdbd is not on the same host as slurmctld.

Comment 3 David Matthews 2024-04-05 03:06:24 MDT

I encountered this as part of an automated cloud deployment. There is a script that puts all the Slurm config files in place, then starts slurmdbd and then runs a number of sacctmgr commands to configure accounts and QOSs. The sacctmgr commands fail instantly if slurmdbd is not responding.
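
For a deployment script like the one described above, one possible workaround is to retry a command until it succeeds instead of sleeping for a fixed time. The following is a minimal sketch, not part of the original report: `run_with_retry` is a hypothetical helper, and the commented usage assumes that `sacctmgr show cluster` exits non-zero while slurmdbd is not responding.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=10, delay=3.0):
    """Run cmd until it exits with status 0 or the attempts are used up.

    Returns the subprocess.CompletedProcess of the last attempt, so the
    caller can inspect returncode, stdout and stderr.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            break
        if attempt < attempts:
            # Give the service some time before the next try.
            time.sleep(delay)
    return result


# Hypothetical usage in a deployment script (assumes a read-only sacctmgr
# query fails while slurmdbd is not yet answering):
#
#   probe = run_with_retry(["sacctmgr", "show", "cluster"])
#   if probe.returncode != 0:
#       raise SystemExit("slurmdbd did not become usable: " + probe.stderr)
```

Retrying a read-only query first keeps the later configuration commands from being repeated if one of them fails for a reason unrelated to startup.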

Comment 4 Felip Moll 2024-04-09 06:55:50 MDT

(In reply to David Matthews from comment #3)
> I encountered this as part of an automated cloud deployment. There is a
> script that puts all the Slurm config files in place, then starts slurmdbd
> and then runs a number of sacctmgr commands to configure accounts and QOSs.
> The sacctmgr commands fail instantly if slurmdbd is not responding.

Our suggestion at the moment is to adapt your script to check for sacctmgr port availability. That would be more reliable than trusting systemd, and it would also allow running slurmdbd on a separate server.

Making slurmctld dependent on slurmdbd is not always correct, as one can even run without accounting.

Does this make sense?

Comment 5 Felip Moll 2024-04-09 06:56:26 MDT

Typo:

> Our suggestion at the moment is to adapt your script to check for **slurmdbd** port availability.
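
A port availability check of this kind could look roughly like the sketch below. It is an illustration, not code from this ticket: `wait_for_port` is a hypothetical helper, and the usage comment assumes slurmdbd listens on the default DbdPort (6819) on a host named in the deployment script.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0, connect_timeout=2.0):
    """Poll host:port until a TCP connection succeeds.

    Returns True as soon as a connection is accepted, or False if no
    connection could be made before `timeout` seconds elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Hypothetical usage before running the sacctmgr commands
# (host name and port are assumptions about the site configuration):
#
#   if not wait_for_port("localhost", 6819, timeout=120.0):
#       raise SystemExit("slurmdbd port did not open in time")
```

Because the check only needs a host name and a port, it works the same way when slurmdbd runs on a different machine from the script.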

Comment 6 David Matthews 2024-04-09 07:05:54 MDT

That's fine - thanks.

I can see that relying on systemd is not going to work given your earlier response
("sometimes slurmdbd needs to start up and perform maintenance operations, like a rollup or updating tables (after an upgrade). In that case the port might not be available yet but the service is started")