Ticket 18222 - slurmdbd: ready-ness notification
Summary: slurmdbd: ready-ness notification
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 23.02.6
Hardware: Linux Linux
: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2023-11-16 13:34 MST by sergei.kozlukov
Modified: 2023-11-16 13:34 MST
Site: -Other-


Description sergei.kozlukov 2023-11-16 13:34:58 MST
At least in `23.02.6`, slurmdbd is started with `Type=simple` under systemd, so the corresponding unit is reported as "active" immediately, before the TCP port (6819) is open.

As a result, even when slurmdbd and slurmctld run on the same node and slurmctld has `After=slurmdbd.service`, slurmctld can (and in practice does) start before slurmdbd is ready to accept connections. In this case, slurmctld crashes with `slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused`.

I think it would be nice for slurmctld to wait longer (AFAIU it ignores MessageTimeout when that is set to a large value, but maybe I'm misinterpreting), and/or for slurmdbd to have a readiness notification mechanism.
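
To illustrate what I mean by a notification mechanism, here is a minimal sketch of the systemd notify protocol: the service sends `READY=1` as a datagram to the Unix socket named in `$NOTIFY_SOCKET`. It is written in Python only for brevity. slurmdbd is C and would presumably call `sd_notify(3)` from libsystemd or send the datagram itself, and the unit would need `Type=notify`. The function name `notify_ready` is hypothetical, not anything in Slurm.

```python
import os
import socket


def notify_ready() -> bool:
    """Tell systemd the service is ready (hypothetical helper, not Slurm code).

    Returns False when $NOTIFY_SOCKET is unset, i.e. the process was not
    started by systemd with notification enabled.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes a Linux abstract-namespace socket, which is
    # addressed with a leading NUL byte instead.
    if path.startswith("@"):
        addr = b"\0" + path[1:].encode()
    else:
        addr = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", addr)
    return True
```

The call would go right after the listening socket on 6819 is bound, so that `After=slurmdbd.service` in slurmctld's unit only takes effect once connections can be accepted.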

An ugly hack that I tried is to run `nc -z localhost 6819` in a loop in slurmdbd's `ExecStartPost`. This way systemd doesn't mark the unit as active until nc succeeds. This works but is clearly sub-optimal.
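
For reference, the same polling workaround can be sketched without depending on `nc`. This is only an illustration of the hack, not a proposed fix. The function name `wait_for_port` and the timeout and interval defaults are my own assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True on the first successful connect, or False once `timeout`
    seconds have passed without one.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Equivalent in spirit to `nc -z host port`: connect, then close.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # not listening yet (e.g. "Connection refused")
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A small wrapper script would call `wait_for_port("localhost", 6819)` and exit non-zero on `False`, so the unit fails instead of hanging forever if slurmdbd never opens the port.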

Thanks!