Ticket 18222

Summary: slurmdbd: readiness notification
Product: Slurm
Reporter: sergei.kozlukov
Component: slurmdbd
Assignee: Jacob Jenson <jacob>
Status: OPEN ---
QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 23.02.6   
Hardware: Linux   
OS: Linux   
Site: -Other-

Description sergei.kozlukov 2023-11-16 13:34:58 MST
At least in `23.02.6`, slurmdbd is started with `Type=simple` when used with systemd, so the corresponding unit reports as "active" immediately, even before the TCP port (6819) is open.

As a result, even when slurmdbd and slurmctld run on the same node and slurmctld has `After=slurmdbd.service`, slurmctld might (would) start running before slurmdbd is ready to accept connections. In this case, slurmctld crashes with `slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused`.

I think it would be nice for slurmctld to wait longer for slurmdbd (AFAIU it ignores MessageTimeout if it's set to a big value, but maybe I'm misinterpreting), and/or for slurmdbd to have a readiness notification mechanism.
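
For illustration, here is a minimal sketch of what a readiness notification looks like at the protocol level. With `Type=notify`, systemd passes a Unix datagram socket path in the `NOTIFY_SOCKET` environment variable and treats the unit as started once the service sends `READY=1` to it. slurmdbd is written in C and would presumably call `sd_notify(3)` from libsystemd after binding its port; the Python below is only a stand-in to show the mechanism, and the function name `notify_ready` is hypothetical.

```python
import os
import socket


def notify_ready(state: str = "READY=1") -> bool:
    """Send a state string to systemd's notification socket, if one is set.

    Returns True if a message was sent, False when NOTIFY_SOCKET is unset
    (for example when the process was not started by systemd).
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

The key point is ordering: the daemon would send the message only after its listening socket is open, so that units ordered with `After=` start only once connections can be accepted.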

An ugly hack that I tried is to run `nc -z localhost 6819` in a loop in slurmdbd's `ExecStartPost`. This way systemd doesn't mark the unit as active until nc succeeds. This works but is clearly suboptimal.
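
The polling workaround can be sketched roughly as follows. This is not the exact command I used, just the same idea with an upper bound on the wait; the helper name `wait_for_port` and the default timeout and interval are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Same idea as `nc -z host port`: connect, then close immediately.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A wrapper script calling `wait_for_port("localhost", 6819)` from `ExecStartPost=` would exit non-zero when the function returns False, so that a slurmdbd that never opens its port is reported as a failed start instead of blocking forever.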

Thanks!