Ticket 8109

Summary: slurmdbd shuts down after a while if started with systemctl
Product: Slurm
Reporter: Rex Chen <shuningc>
Component: Accounting
Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 19.05.3   
Hardware: Linux   
OS: Linux   
Site: AWS+Sixnines Social

Description Rex Chen 2019-11-14 16:05:49 MST
I am running Slurm on Ubuntu 16.04 and trying to set up accounting. Whenever I use systemctl to run slurmdbd, it stops after a while:

[2019-11-13T22:22:37.659] slurmdbd version 19.05.3-2 started
...
# Runs for a while
...
[2019-11-13T22:24:07.595] Terminate signal (SIGINT or SIGTERM) received
[2019-11-13T22:24:07.595] debug2: Closed connection 8 uid(997)
[2019-11-13T22:24:07.596] debug:  rpc_mgr shutting down
[2019-11-13T22:24:07.596] debug4: got 0 commits
[2019-11-13T22:24:07.596] debug4: got 0 commits
[2019-11-13T22:24:07.597] debug4: got 0 commits
[2019-11-13T22:24:07.597] debug4: got 0 commits
[2019-11-13T22:24:07.598] Unable to remove pidfile '/var/tmp/jette/slurmdbd.pid': No such file or directory
[2019-11-13T22:24:07.598] debug3: starting mysql cleaning up
[2019-11-13T22:24:07.598] debug3: finished mysql cleaning up

This issue does not happen if I run slurmdbd directly. 
Is slurmdbd not expected to work with systemctl?
Comment 1 Jason Booth 2019-11-15 11:31:50 MST
Hi Rex,

> Is slurmdbd not expected to work with systemctl?
It is expected to work; in fact, most sites use systemd with Slurm. I suspect that an application may be stopping slurmdbd or sending it a SIGTERM (signal 15), as indicated by the message quoted below:

> Terminate signal (SIGINT or SIGTERM) received

You may want to look at your system logs or turn on some type of auditing to figure out who or what is calling systemctl. You can also run "systemctl status <service_name>". It is possible that systemd thinks slurmdbd is in some type of failed state, in which case you may have the service misconfigured.
Comment 2 Rex Chen 2019-11-15 13:08:54 MST
Hi Jason,

I tried stopping slurmdbd and slurmctld, then starting slurmdbd followed by slurmctld. Here are the logs from journalctl.
It looks like slurmdbd is terminated because of a timeout. However, before the timeout sacct was working, and there is no error in the slurmdbd log. Again, everything seems to work if I run slurmdbd directly. Any ideas?

Nov 15 19:55:56 ip-172-31-24-191 systemd[1]: Stopped Slurm controller daemon.
Nov 15 19:55:56 ip-172-31-24-191 sudo[6060]: pam_unix(sudo:session): session closed for user root
Nov 15 19:56:01 ip-172-31-24-191 sudo[6064]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/systemctl stop slurmdbd
Nov 15 19:56:01 ip-172-31-24-191 sudo[6064]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:56:01 ip-172-31-24-191 systemd[1]: Stopped Slurm database daemon.
Nov 15 19:56:01 ip-172-31-24-191 sudo[6064]: pam_unix(sudo:session): session closed for user root
Nov 15 19:56:06 ip-172-31-24-191 sudo[6067]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/systemctl start slurmdbd
Nov 15 19:56:06 ip-172-31-24-191 sudo[6067]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:56:06 ip-172-31-24-191 systemd[1]: Starting Slurm database daemon...
Nov 15 19:56:06 ip-172-31-24-191 systemd[1]: slurmdbd.service: PID file /var/run/slurmdbd.pid not readable (yet?) after start: No such file or directory
Nov 15 19:56:11 ip-172-31-24-191 sudo[6067]: pam_unix(sudo:session): session closed for user root
Nov 15 19:56:18 ip-172-31-24-191 sudo[6081]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/systemctl start slurmctld
Nov 15 19:56:18 ip-172-31-24-191 sudo[6081]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:56:18 ip-172-31-24-191 systemd[1]: Starting Slurm controller daemon...
Nov 15 19:56:18 ip-172-31-24-191 systemd[1]: slurmctld.service: PID file /var/run/slurmctld.pid not readable (yet?) after start: No such file or directory
Nov 15 19:56:18 ip-172-31-24-191 systemd[1]: Started Slurm controller daemon.
Nov 15 19:56:18 ip-172-31-24-191 sudo[6081]: pam_unix(sudo:session): session closed for user root
Nov 15 19:56:22 ip-172-31-24-191 sudo[6104]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/journalctl
Nov 15 19:56:22 ip-172-31-24-191 sudo[6104]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:56:57 ip-172-31-24-191 sudo[6104]: pam_unix(sudo:session): session closed for user root
Nov 15 19:57:05 ip-172-31-24-191 sudo[6110]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/journalctl
Nov 15 19:57:05 ip-172-31-24-191 sudo[6110]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:57:35 ip-172-31-24-191 sudo[6110]: pam_unix(sudo:session): session closed for user root
Nov 15 19:57:36 ip-172-31-24-191 systemd[1]: slurmdbd.service: Start operation timed out. Terminating.
Nov 15 19:57:36 ip-172-31-24-191 systemd[1]: Failed to start Slurm database daemon.
Nov 15 19:57:36 ip-172-31-24-191 systemd[1]: slurmdbd.service: Unit entered failed state.
Nov 15 19:57:36 ip-172-31-24-191 systemd[1]: slurmdbd.service: Failed with result 'timeout'.
Nov 15 19:57:41 ip-172-31-24-191 sudo[6122]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/journalctl


For reference, this is my slurmdbd.conf:
# Sample /etc/slurmdbd.conf
#
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive
AuthType=auth/munge
DbdHost=ip-172-31-24-191
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
DebugLevel=debug5
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
PidFile=/var/tmp/jette/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageUser=slurm
StoragePort=3306



Thanks,
Rex
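
As a quick check on the timeout, the timestamps already quoted in this ticket can be compared. The Python sketch below is only an illustration (the helper name `elapsed_seconds` is made up for this example). It computes the gap between start and termination in both log excerpts. Both gaps come out at roughly 90 seconds, which is systemd's default start timeout (`DefaultTimeoutStartSec=90s`), assuming the unit file does not override `TimeoutStartSec`. That suggests the same timeout also explains the terminate signal in the original description.

```python
from datetime import datetime


def elapsed_seconds(start, end, fmt):
    """Return the seconds between two clock times taken from the same day."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


# journalctl excerpt: "Starting Slurm database daemon..." to "Start operation timed out."
journal_gap = elapsed_seconds("19:56:06", "19:57:36", "%H:%M:%S")

# slurmdbd.log excerpt: "slurmdbd version 19.05.3-2 started" to "Terminate signal ... received"
slurmdbd_gap = elapsed_seconds("22:22:37.659", "22:24:07.595", "%H:%M:%S.%f")

print(journal_gap)   # 90.0
print(slurmdbd_gap)  # just under 90 seconds
```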
Comment 3 Jason Booth 2019-11-15 14:07:04 MST
Hi Rex - just a few notes here based upon the config file you uploaded and the log files you have in the comments.

slurmdbd.conf

The PID file needs to be accessible by the SlurmUser, and its path should be valid.

My guess is that this is not a valid path, since it points to a tmp directory named after one of Slurm's founders. You probably copied and pasted it from an example. You may want to go over these settings and change them to meet your site's needs.
> PidFile=/var/tmp/jette/slurmdbd.pid
> SlurmUser=slurm

As mentioned above, the slurm user should have access to the PID file in whatever directory you choose.
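
To illustrate the kind of mismatch described above, here is a minimal Python sketch that compares the `PidFile` in slurmdbd.conf with the `PIDFile` that systemd waits for. The path /var/run/slurmdbd.pid comes from the journal message in Comment 2. The function names, the `Type=forking` line, and the rest of the sample unit text are assumptions made for the example, not the site's actual unit file. The sketch checks only the path and its directory, not file permissions.

```python
import os


def read_setting(text, key):
    """Return the value of `key` from KEY=VALUE text, or None if absent.

    Keys are matched case-insensitively; blank lines, '#' comments and
    lines without '=' (such as "[Service]") are skipped.
    """
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip().lower() == key.lower():
            return value.strip()
    return None


def pidfile_problems(conf_text, unit_text, dir_exists=os.path.isdir):
    """List reasons systemd might never see the daemon's PID file."""
    conf_pid = read_setting(conf_text, "PidFile")
    unit_pid = read_setting(unit_text, "PIDFile")
    problems = []
    if conf_pid is None:
        problems.append("slurmdbd.conf does not set PidFile")
    elif not dir_exists(os.path.dirname(conf_pid)):
        problems.append("directory for %s does not exist" % conf_pid)
    if conf_pid and unit_pid and conf_pid != unit_pid:
        problems.append(
            "slurmdbd writes %s but systemd waits for %s" % (conf_pid, unit_pid)
        )
    return problems


if __name__ == "__main__":
    # Values from this ticket; the unit text is a hypothetical stand-in.
    conf = "SlurmUser=slurm\nPidFile=/var/tmp/jette/slurmdbd.pid\n"
    unit = "[Service]\nType=forking\nPIDFile=/var/run/slurmdbd.pid\n"
    for problem in pidfile_problems(conf, unit):
        print(problem)
```

The `dir_exists` parameter exists only so the check can be exercised without touching the real filesystem. On a live system the default `os.path.isdir` is used.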
Comment 4 Rex Chen 2019-11-15 14:22:02 MST
Hi Jason,

The PID file was causing the issue. Once we had the correct path and permissions, we were able to start slurmdbd with systemctl.

Thank you for your help!


Best,
Rex
Comment 5 Jason Booth 2019-11-15 14:24:21 MST
Resolving