Ticket 8066

Summary: Systemd service startup ordering and slurmdbd startup problem
Product: Slurm
Reporter: Pär Lindfors <par.lindfors>
Component: Other
Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---
CC: cinek, nate
Version: 19.05.3
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8067
https://bugs.schedmd.com/show_bug.cgi?id=16075
https://bugs.schedmd.com/show_bug.cgi?id=19255
Site: SNIC
SNIC sites: UPPMAX
Version Fixed: 21.08.0

Description Pär Lindfors 2019-11-07 06:06:06 MST
This bug is actually two problems in one, as I can't figure out how to report them separately.

All my tests are with Slurm 19.05.3 on CentOS 7.7.

The systemd service files do not specify enough ordering for Slurm services. When mariadb, slurmdbd and slurmctld are running on the same host, systemd will happily attempt to start all three services simultaneously. This sort-of works but causes errors/warnings in logs and unnecessary delays.

This happens every boot, but can be tested by running "systemctl restart mariadb slurmdbd slurmctld" and checking the logs.

SlurmDBD logs errors because it tries to connect to MariaDB before it is ready. slurmdbd.log:

      error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
      error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.

According to my tests this can be avoided completely by adding "After=mariadb.service" to slurmdbd.service.

Slurmctld also fails to connect to slurmdbd as it is not running yet, and logs several errors. (This actually causes other problems, requiring a later slurmctld restart, but I will open a separate bug about that). Example output from slurmctld.log:

    slurmctld version 19.05.3-2 started on cluster terry
    ...
    error: slurm_persist_conn_open_without_init: failed to open persistent connection to terry-q:7031: Connection refused
    error: slurmdbd: Sending PersistInit msg: Connection refused
    error: Association database appears down, reading from state file.
    error: slurmdbd: Sending PersistInit msg: Connection refused

Adding "Before=slurmctld.service" to slurmdbd.service, or "After=slurmdbd.service" to slurmctld.service, is unfortunately not enough to prevent this.

The slurmdbd service finishes startup before the daemon is actually ready.

My current work-around for this is to use ExecStartPost in slurmdbd.service which waits until the DbdPort is open (we use DbdPort=7031):

[Service]
ExecStartPost=-/bin/bash -c "while ! ss -nltH|awk '{print $$4}'|grep '*:7031'; do sleep 0.5;done"
TimeoutStartSec=0


For completeness, our current slurmdbd.service override looks like this.

/etc/systemd/system/slurmdbd.service.d/override.conf:
########################################
# Ensure services start in the proper order:
# mariadb -> slurmdbd -> slurmctld
[Unit]
After=mariadb.service
Before=slurmctld.service

[Service]
# Delay until the DbdPort is listening
ExecStartPost=-/bin/bash -c "while ! ss -nltH|awk '{print $$4}'|grep '*:7031'; do sleep 0.5;done"
# Disable start timeout. During Slurm major version upgrades starting
# SlurmDBD can take a long time because of database format
# changes. Killing the process during this would be bad.
TimeoutStartSec=0
########################################
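
The port wait in ExecStartPost can also live in a small script, which keeps the unit file readable. What follows is only a sketch under stated assumptions: the function name wait_for_port and its timeout argument are made up for illustration, and it relies on bash's /dev/tcp redirection, so it needs bash rather than plain sh. Unlike the ss check above, it opens a real TCP connection to the port, which the listening daemon may log.

```shell
#!/bin/bash
# wait_for_port HOST PORT [TIMEOUT_SECONDS]   (hypothetical helper, not part of Slurm)
# Poll until a TCP connection to HOST:PORT succeeds; give up after the timeout.
wait_for_port() {
    local host=$1 port=$2 timeout=${3:-60}
    local deadline=$(( SECONDS + timeout ))
    while (( SECONDS < deadline )); do
        # The connection is attempted in a subshell, so the descriptor
        # is closed again as soon as the subshell exits.
        if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
            return 0
        fi
        sleep 0.5
    done
    return 1
}
```

A site could call it as "wait_for_port localhost 7031 300" from ExecStartPost. Note that a finite timeout brings back the concern about long database conversions described in the override comments, so the value needs to be chosen with upgrades in mind.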


It would be nice if the Before=/After= ordering could be added to the included service files. This should be safe for sites that are using separate hosts for mariadb, slurmdbd and slurmctld, as it only specifies ordering and does not add requirements.

My ExecStartPost kludge should obviously not be included anywhere. But it would be good if the slurmdbd service startup could be fixed so that it does not finish before being ready.
Comment 3 Pär Lindfors 2019-11-07 11:15:01 MST
The slurmctld issue I referred to is now reported as bug 8067.
Comment 5 Felip Moll 2019-11-12 10:56:37 MST
Hi Pär,

I have investigated your issue and, although it would seem correct to add an After= or Before= to the systemd unit files shipped with Slurm, it can also be problematic. Depending on the architecture one designs, services may or may not be on the same server, so making one service like slurmdbd *always* wait for mariadb has two problems. First, it won't work on installations where the database is on a different server than the one where slurmdbd resides. Second, not everybody uses mariadb for the database; MySQL or maybe others could be used.

Moreover, in a typical cluster installation services may be managed by third-party clustering software. It is not uncommon to use solutions like Pacemaker, which already takes care of the ordering.

Therefore I think not hardcoding the After= in our unit files and leaving this entirely to the administrator is the most correct solution.

As for the second part, regarding the ExecStartPost and slurmctld having to wait for slurmdbd: as you said, it cannot be included either, because it is a hackish solution.

In any case I will investigate why systemd thinks slurmdbd is up even if it is still not listening on the port.

I also think the solution must come from bug 8067: slurmdbd/ctld should just not generate errors other than informative ones.

What do you think?
Comment 6 Pär Lindfors 2019-11-13 07:50:59 MST
> I have investigated your issue and, although it would seem correct to add
> an After= or Before= to the systemd unit files shipped with Slurm, it can
> also be problematic. Depending on the architecture one designs, services
> may or may not be on the same server, so making one service like slurmdbd
> *always* wait for mariadb has two problems. First, it won't work on
> installations where the database is on a different server than the one
> where slurmdbd resides. Second, not everybody uses mariadb for the
> database; MySQL or maybe others could be used.

No, I think you have not understood what Before=,After= does.
I tried explaining briefly at the end of my description why it
will not cause the issues you mention.

Please check out the systemd.unit documentation:

https://www.freedesktop.org/software/systemd/man/systemd.unit.html#

Before=,After= only controls the ordering during service start-up
and shut-down. It does not initiate any startup or add any
requirement that other services actually be running. It is also still
possible to start/stop each service individually. So this will
not make slurmdbd startup wait for a not-enabled or even
non-existing mariadb service.

This is very different compared to options like Wants= or
Requires= which would cause the problems you describe.

Non-existing services are ignored completely. So you could
specify both mariadb and mysql if you prefer:
"After=mysql.service mariadb.service". For unit files in
RPM packages it would however make sense to only specify the one
used during build.
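
As a sketch of the two-name variant, the function below writes a drop-in that orders slurmdbd after both service names. The function name write_ordering_dropin and the file name ordering.conf are hypothetical; the default directory follows the override path used earlier in this ticket. Treat it as an illustration, not as a packaged solution.

```shell
#!/bin/bash
# write_ordering_dropin [DROPIN_DIR]   (hypothetical helper name)
# Writes ordering.conf (hypothetical file name) into the drop-in directory.
write_ordering_dropin() {
    local dir=${1:-/etc/systemd/system/slurmdbd.service.d}
    mkdir -p "$dir" || return 1
    cat > "$dir/ordering.conf" <<'EOF'
[Unit]
# Ordering only: a name that does not exist on this host is ignored,
# and After= neither starts nor requires the listed services.
After=mysql.service mariadb.service
EOF
}
```

After writing the file, "systemctl daemon-reload" is needed before systemd picks up the new ordering.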

> Even more, in a typical cluster installation services may be managed by
> third party clustering software. It is not uncommon to use solutions like
> Pacemaker which already takes care of the ordering.

A Pacemaker setup should be completely unaffected by this change.

That some sites might be using clustering software is no reason
not to fix the startup ordering in the simple case where people
just install the RPMs.

> Therefore I think not hardcoding the After= in our unit files and leaving
> this entirely to the administrator is the most correct solution.

I think this was based on an incorrect understanding of what
Before=,After= does, see previous explanation.

> As for the second part, regarding the ExecStartPost and slurmctld having
> to wait for slurmdbd: as you said, it cannot be included either, because
> it is a hackish solution.

That was mostly included to show that "Before=" is currently not sufficient. Before= still makes sense (or an After=slurmdbd in the slurmctld unit).

Also, this made the hackish solution available for others who
might find this bug.

> In any case I will investigate why systemd thinks slurmdbd is up even if it
> is still not listening on the port.

Please do!

I suspect that fixing this in a way that avoids ugly external
hacks would require native systemd support in slurmdbd and calling
sd_notify() during startup:
https://www.freedesktop.org/software/systemd/man/sd_notify.html#

This would be a great new feature.
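
To illustrate the readiness protocol behind sd_notify(): a Type=notify service is only considered started once it sends READY=1 to the socket that systemd passes in $NOTIFY_SOCKET. Native support would make that call from inside slurmdbd once the DbdPort is listening. The same message can be sent from a shell with the systemd-notify tool, which is what this sketch does. The function name notify_ready is hypothetical, and a unit using it would need Type=notify plus, because systemd-notify runs as a separate process, a suitable NotifyAccess= setting (for example NotifyAccess=all).

```shell
#!/bin/bash
# notify_ready: tell systemd that startup has finished (hypothetical helper).
notify_ready() {
    # systemd exports NOTIFY_SOCKET to services that may send notifications;
    # when it is absent there is nobody to notify, so do nothing.
    if [ -z "${NOTIFY_SOCKET:-}" ]; then
        echo "NOTIFY_SOCKET not set, skipping readiness notification" >&2
        return 0
    fi
    # Sends READY=1 to the socket named in NOTIFY_SOCKET.
    systemd-notify --ready
}
```

Outside of systemd the function is a no-op, so the same start script can still be run by hand.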
Comment 8 Felip Moll 2019-11-13 15:02:49 MST
Hi Pär,

First of all, I apologize for having responded from memory without checking the systemd documentation.
You are right, After= and Before= only apply to start jobs launched by systemd, essentially during the boot process (though I suppose there could be other situations where they apply).

But yes, the general case is that After= or Before= alone, without any combination of Requires= or Wants= (which is what I really had in mind), should make slurmdbd start after mariadb or whatever we put in there. That would probably only work, though, for services like MariaDB which implement sd_notify and are therefore Type=notify services, because with forking-type services we would probably end up in the same situation that we have between slurmdbd and slurmctld.

I will propose adding mariadb and/or mysql to the slurmdbd unit file. I am seeing that it may be sufficient to just put mysql, since mariadb seems to add a link from mysql.service to mariadb.service. I still have to check whether that is standard or something specific to my RH packaging.

Is this ok for you?


> > In any case I will investigate why systemd thinks slurmdbd is up even if it
> > is still not listening on the port.
> 
> Please do!
> 
> I suspect that fixing this in a way that avoids ugly external
> hacks would require native systemd support in slurmdbd and calling
> sd_notify() during startup:
> https://www.freedesktop.org/software/systemd/man/sd_notify.html#
> 
> This would be a great new feature.

That's easy. We daemonize in slurmdbd before opening any ports; it is the first thing we do. Once we have daemonized, the PID tracked by systemd exits with a return code of 0, and therefore systemd thinks the service is up and running. In fact it is running, though it is not fully initialized. Changing from a systemd forking-type service to a notify-type one requires linking with the systemd libraries. I am studying this possibility, but it would be a feature request.

In fact, the last thing we do as part of the init process is to open the port when creating the rpc_mgr thread, which obviously cannot be done earlier, as we need everything set up before we start receiving RPCs.

Does it make sense now?

I apologize again for having responded too quickly and incorrectly the first time.
Comment 15 Felip Moll 2020-01-10 06:07:37 MST
Hi Pär,

I am just adding a message here to confirm that a fix is still pending review by our QA team. The fix concerns the order in which we start services in systemd.

Other errors are being fixed in bug 8067.

Thanks for your enormous patience and understanding.
Comment 18 Felip Moll 2021-03-04 03:20:31 MST
(In reply to Felip Moll from comment #15)
> Hi Pär,
> 
> I am just adding a message here to confirm that a fix is still pending
> review by our QA team. The fix concerns the order in which we start
> services in systemd.
> 
> Other errors are being fixed in bug 8067.
> 
> Thanks for your enormous patience and understanding.

First of all sorry for the unusually long delay with this.

The fix is committed into master and will be there starting from 21.08.

commit 53bcf76ed35b3603bf17bd9e7c1fbb86a3e711e6
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Wed Mar 3 14:52:00 2021 -0700

    slurmdbd.service - add "After" relationship for mariadb.
    
    Include all common names for MariaDB/MySQL services. Only use After= and
    not Requires= to avoid issues if the database is located on a different
    host, and to avoid needing to deduce the appropriate service name for
    MariaDB which will vary by distribution.
    
    Bug 8066.