| Summary: | slurmdbd shuts down after a while if started with systemctl | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Rex Chen <shuningc> |
| Component: | Accounting | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 19.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AWS+Sixnines Social | Alineos Sites: | --- |
Description
Rex Chen
2019-11-14 16:05:49 MST
Hi Rex,

> Is slurmdbd not expected to work with systemctl?

Yes, it is. In fact, most sites use systemd with Slurm. I suspect that you may have an application stopping slurmdbd or sending it signal 15 (SIGTERM), as demonstrated by the message quoted below:

> Terminate signal (SIGINT or SIGTERM) received

You may want to look at your system logs or turn on some type of auditing to figure out who or what is calling systemctl. You can also run "systemctl status <service_name>". It is possible that systemd thinks slurmdbd is in some type of failed state, in which case you may have the service misconfigured.

Hi Jason,

I tried stopping slurmdbd and slurmctld, then starting slurmdbd, then starting slurmctld. Here are the logs from journalctl. It looks like slurmdbd is terminated because of a timeout. However, before the timeout sacct was working, and there is no error in the slurmdbd log. Again, everything seems to work if I run slurmdbd directly. Any ideas?

Nov 15 19:55:56 ip-172-31-24-191 systemd[1]: Stopped Slurm controller daemon.
Nov 15 19:55:56 ip-172-31-24-191 sudo[6060]: pam_unix(sudo:session): session closed for user root
Nov 15 19:56:01 ip-172-31-24-191 sudo[6064]: ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/systemctl stop slurmdbd
Nov 15 19:56:01 ip-172-31-24-191 sudo[6064]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:56:01 ip-172-31-24-191 systemd[1]: Stopped Slurm database daemon.
Nov 15 19:56:01 ip-172-31-24-191 sudo[6064]: pam_unix(sudo:session): session closed for user root
Nov 15 19:56:06 ip-172-31-24-191 sudo[6067]: ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/systemctl start slurmdbd
Nov 15 19:56:06 ip-172-31-24-191 sudo[6067]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:56:06 ip-172-31-24-191 systemd[1]: Starting Slurm database daemon...
Nov 15 19:56:06 ip-172-31-24-191 systemd[1]: slurmdbd.service: PID file /var/run/slurmdbd.pid not readable (yet?) after start: No such file or directory
Nov 15 19:56:11 ip-172-31-24-191 sudo[6067]: pam_unix(sudo:session): session closed for user root
Nov 15 19:56:18 ip-172-31-24-191 sudo[6081]: ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/systemctl start slurmctld
Nov 15 19:56:18 ip-172-31-24-191 sudo[6081]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:56:18 ip-172-31-24-191 systemd[1]: Starting Slurm controller daemon...
Nov 15 19:56:18 ip-172-31-24-191 systemd[1]: slurmctld.service: PID file /var/run/slurmctld.pid not readable (yet?) after start: No such file or directory
Nov 15 19:56:18 ip-172-31-24-191 systemd[1]: Started Slurm controller daemon.
Nov 15 19:56:18 ip-172-31-24-191 sudo[6081]: pam_unix(sudo:session): session closed for user root
Nov 15 19:56:22 ip-172-31-24-191 sudo[6104]: ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/journalctl
Nov 15 19:56:22 ip-172-31-24-191 sudo[6104]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:56:57 ip-172-31-24-191 sudo[6104]: pam_unix(sudo:session): session closed for user root
Nov 15 19:57:05 ip-172-31-24-191 sudo[6110]: ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/journalctl
Nov 15 19:57:05 ip-172-31-24-191 sudo[6110]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Nov 15 19:57:35 ip-172-31-24-191 sudo[6110]: pam_unix(sudo:session): session closed for user root
Nov 15 19:57:36 ip-172-31-24-191 systemd[1]: slurmdbd.service: Start operation timed out. Terminating.
Nov 15 19:57:36 ip-172-31-24-191 systemd[1]: Failed to start Slurm database daemon.
Nov 15 19:57:36 ip-172-31-24-191 systemd[1]: slurmdbd.service: Unit entered failed state.
Nov 15 19:57:36 ip-172-31-24-191 systemd[1]: slurmdbd.service: Failed with result 'timeout'.
Nov 15 19:57:41 ip-172-31-24-191 sudo[6122]: ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/journalctl

For reference, this is my slurmdbd.conf:

# Sample /etc/slurmdbd.conf
#
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive
AuthType=auth/munge
DbdHost=ip-172-31-24-191
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
DebugLevel=debug5
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
PidFile=/var/tmp/jette/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageUser=slurm
StoragePort=3306

Thanks,
Rex

Hi Rex - just a few notes here based upon the config file you uploaded and the log files you have in the comments.
slurmdbd.conf
The PID file needs to be accessible by the SlurmUser, and its path must be a valid path.
My guess is that this is not a valid path, since it points to a tmp directory with the name of one of Slurm's founders in it. It looks like you copied and pasted this from an example. You may want to go over these settings and change them to meet your site's needs.
> PidFile=/var/tmp/jette/slurmdbd.pid
> SlurmUser=slurm
As mentioned above, the slurm user should have access to the PID file in whatever directory you choose.
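
The two checks above (a valid path, and access for the SlurmUser) can be sketched in Python. This is only an illustration, not a Slurm tool: the function names `parse_conf`, `pid_dir_problems`, and `unit_pid_file` are hypothetical, the permission check is a rough approximation that ignores supplementary groups and ACLs, and the unit-file fragment in the demo is an assumption reconstructed from the "PID file /var/run/slurmdbd.pid not readable" journal message. If the unit uses a `PIDFile=` setting that differs from `PidFile` in slurmdbd.conf, systemd may wait for a file that never appears, which would be consistent with the start timeout in the log.

```python
import os
import pwd
import re
import stat


def parse_conf(text):
    """Parse 'Key=Value' lines, dropping '#' comments. Keys are lowercased."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip().lower()] = value.strip()
    return settings


def pid_dir_problems(pid_file, user):
    """Return reasons why `user` probably cannot create `pid_file`.

    Approximation only: supplementary groups and ACLs are ignored.
    """
    pid_dir = os.path.dirname(pid_file)
    if not os.path.isdir(pid_dir):
        return [f"directory {pid_dir} does not exist"]
    try:
        account = pwd.getpwnam(user)
    except KeyError:
        return [f"user {user} does not exist"]
    if account.pw_uid == 0:
        return []  # root is not restricted by the mode bits
    info = os.stat(pid_dir)
    if info.st_uid == account.pw_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif info.st_gid == account.pw_gid:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    if info.st_mode & needed != needed:
        return [f"user {user} cannot create files in {pid_dir}"]
    return []


def unit_pid_file(unit_text):
    """Return the PIDFile= value from systemd unit text, or None if absent."""
    match = re.search(r"^\s*PIDFile\s*=\s*(\S+)", unit_text, re.MULTILINE)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Values taken from the thread; the unit text is an assumed fragment.
    conf = parse_conf("SlurmUser=slurm\nPidFile=/var/tmp/jette/slurmdbd.pid\n")
    unit = "[Service]\nType=forking\nPIDFile=/var/run/slurmdbd.pid\n"

    for problem in pid_dir_problems(conf["pidfile"], conf["slurmuser"]):
        print("slurmdbd.conf:", problem)
    expected = unit_pid_file(unit)
    if expected and expected != conf["pidfile"]:
        print(f"systemd expects {expected}, slurmdbd.conf says {conf['pidfile']}")
```

On a real system the two strings in the demo would be replaced by the contents of the site's slurmdbd.conf and of the slurmdbd unit file, whose locations vary by installation.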
Hi Jason,

The PID file was causing the issue. Once we had the correct path and permissions, we were able to start slurmdbd with systemctl. Thank you for your help!

Best,
Rex

Resolving