Ticket 19255 - slurmdbd not ready after systemctl start
Summary: slurmdbd not ready after systemctl start
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 23.02.7
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
Reported: 2024-03-08 09:18 MST by David Matthews
Modified: 2024-04-09 07:05 MDT
Site: Met Office


Description David Matthews 2024-03-08 09:18:23 MST
(This is really a duplicate of Bug 18222 but that was not raised by a supported site.)

I've run into a scripting problem when trying to issue sacctmgr commands after starting the slurmdbd service with systemctl. I now realise this is because the slurmdbd service uses "Type=simple" (unlike the slurmctld service, which uses "Type=notify"). I can work around this (e.g. using sleep) now that I understand why it's happening, but it would be nicer if systemctl start waited for the service to be usable.
Comment 1 Felip Moll 2024-04-04 07:13:34 MDT
(In reply to David Matthews from comment #0)
> (This is really a duplicate of Bug 18222 but that was not raised by a
> supported site.)
> 
> I've run into a scripting problem when trying to issue sacctmgr commands
> after starting the slurmdbd service using systemctl. I now realise this is
> because the slurmdbd service uses "Type=simple" (unlike the slurmctld
> service which uses "Type=notify"). I can work around this (e.g. using sleep)
> now I understand why it's happening but it would be nicer if the systemctl
> start waited for the service to be usable.

Hi David,

We will look into that, but here are a couple of thoughts. On one hand, slurmctld can start up without slurmdbd: it will cache the work and send it to dbd when it is available, so the dependency is not a hard one, and I am not exactly sure why in bug 18222 they claim slurmctld crashes if there's no dbd. On the other hand, slurmdbd sometimes needs to perform maintenance operations at startup, like a rollup or updating tables (after an upgrade). In that case the port might not be available yet, but the service has started, so systemd shouldn't consider the service down.

Can you please describe your exact workflow, so we can see whether there's something else you can do instead of relying on the status reported by systemd?
Comment 2 Felip Moll 2024-04-04 07:28:07 MDT
Also note that systemd ordering is not a guarantee. Frequently slurmdbd is not on the same host as slurmctld.
Comment 3 David Matthews 2024-04-05 03:06:24 MDT
I encountered this as part of an automated cloud deployment. There is a script that puts all the Slurm config files in place, then starts slurmdbd and then runs a number of sacctmgr commands to configure accounts and QOSs. The sacctmgr commands fail instantly if slurmdbd is not responding.
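
One way a deployment script like this could tolerate a slow slurmdbd start is to retry a harmless probe command before issuing the real sacctmgr commands. The sketch below is only an illustration, not something proposed in this ticket: `run_with_retry` is a hypothetical helper, the attempt count and delay are arbitrary, and it assumes sacctmgr exits non-zero while slurmdbd is unreachable, which is worth verifying on the target system.

```python
import subprocess
import time
from typing import Sequence


def run_with_retry(
    cmd: Sequence[str], attempts: int = 30, delay: float = 2.0
) -> subprocess.CompletedProcess:
    """Run cmd until it exits with status 0 or the attempts are used up.

    Returns the result of the last run, so the caller can inspect
    returncode, stdout and stderr. (Hypothetical helper, not part of Slurm.)
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Assumption: a non-zero exit means slurmdbd is not answering yet.
        if attempt < attempts:
            time.sleep(delay)
    return result
```

A script could call, for example, `run_with_retry(["sacctmgr", "--noheader", "show", "cluster"])` and only continue with the account and QOS setup once the returned `returncode` is 0.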
Comment 4 Felip Moll 2024-04-09 06:55:50 MDT
(In reply to David Matthews from comment #3)
> I encountered this as part of an automated cloud deployment. There is a
> script that puts all the Slurm config files in place, then starts slurmdbd
> and then runs a number of sacctmgr commands to configure accounts and QOSs.
> The sacctmgr commands fail instantly if slurmdbd is not responding.

Our suggestion at the moment is to adapt your script to check for sacctmgr port availability. That would be more reliable than trusting systemd, and would also allow running slurmdbd on separate servers.

Making slurmctld dependent on slurmdbd is not always correct, as one can even run without accounting.

Does this make sense?
Comment 5 Felip Moll 2024-04-09 06:56:26 MDT
Typo:

> Our suggestion at the moment is to adapt your script to check for **slurmdbd** port availability.
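
A minimal sketch of such a port check, written in Python for illustration, is shown here. It is not taken from the ticket: `wait_for_port` is a hypothetical helper, and the timeout and polling interval are arbitrary assumptions. The port to probe is whatever DbdPort is set to in slurmdbd.conf (6819 by default).

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout: float = 300.0, interval: float = 2.0
) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout expires.

    Returns True as soon as a connection is accepted, False if the
    deadline passes first. (Hypothetical helper, not part of Slurm.)
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only shows that something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # refused, unreachable or timed out: try again later
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, a deployment script could call `wait_for_port("dbd-host", 6819)` (with `dbd-host` standing in for the real slurmdbd host) after `systemctl start slurmdbd` and abort if it returns False. A generous timeout seems sensible, since a rollup or table update after an upgrade can delay the port becoming available.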
Comment 6 David Matthews 2024-04-09 07:05:54 MDT
That's fine - thanks.

I can see that relying on systemd is not going to work given your earlier response
("sometimes slurmdbd needs to start up and perform maintenance operations, like a rollup or updating tables (after an upgrade). In that case the port might not be available yet but the service is started")