| Summary: | Help setting up Slurm accounting | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Mitul Patel <mitul.patel> |
| Component: | Accounting | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 18.08.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | UT Arlington | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi Mitul Patel,

Can you attach the slurmdbd log file (/var/log/slurm/slurmdbd.log based on what you have provided)? It would also be helpful to attach the entire slurm.conf file to this ticket. Thanks!

Hi,

The slurmdbd log file is empty.

[root@hpcrnt ~]# cat /etc/slurm/slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
#ClusterName=linux
ClusterName=hpcrnt
ControlMachine=hpcrnt
EnforcePartLimits=YES
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-60500
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm-states
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/pgid
RebootProgram="/usr/sbin/reboot"
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStoreJobComment=YES
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=edsmysqldb011t.uta.edu
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
AccountingStorageType=accounting_storage/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
#NodeName=compute-6-7-0 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f2800 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f2900 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f3001 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f3002 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3003 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3004 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3005 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3006 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3007 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3008 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3009 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3010 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3011 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3012 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3013 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3014 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3015 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3016 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#PartitionName=normal Nodes=cn-2f3001,cn-2f3002,cn-2f3003,cn-2f3004,cn-2f3005,cn-2f3006,cn-2f3007,cn-2f3008,cn-2f3009,cn-2f3010,cn-2f3011,cn-2f3012,cn-2f3013,cn-2f3014,cn-2f3015,cn-2f3016 Default=YES MaxTime=48:00:00 State=UP
#PartitionName=normal Nodes=cn-2f30[01-16] Default=YES MaxTime=48:00:00 State=UP
#PartitionName=normal Nodes=cn-2f30[01-12] Default=YES MaxTime=48:00:00 State=UP
PartitionName=normal Nodes=cn-2f30[01-02] Default=YES MaxTime=48:00:00 State=UP
PartitionName=gpu Nodes=cn-2f2800,cn-2f2900 MaxTime=48:00:00 State=UP
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# ls -l /var/log/slurm/slurmdbd.log
-rw-r--r-- 1 slurm slurm 0 Oct 21 09:56 /var/log/slurm/slurmdbd.log
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# cat /var/log/slurm/slurmdbd.log
[root@hpcrnt ~]#

Thanks,
Mitul Patel

If the log file is empty, I believe we will need to start slurmdbd manually rather than via systemctl. Can you run "slurmdbd -D -vvv" as the "srv_slurm_acct" user? Running the command manually will cause slurmdbd to send all logging to the terminal rather than to a log file. Please send the output of that command.

Hi,

srv_slurm_acct is the user account set up by the DB team, so I changed that to slurm in slurmdbd.conf. I also added the storage type per the error.
[root@hpcrnt ~]#
[root@hpcrnt ~]# su - srv_slurm_acct
su: user srv_slurm_acct does not exist
[root@hpcrnt ~]#
[root@hpcrnt ~]# grep -i slurm /egc/passwd
grep: /egc/passwd: No such file or directory
[root@hpcrnt ~]# grep -i slurm /etc/passwd
slurm:x:4000:4000::/home/slurm:/bin/bash
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# su - slurm
Last login: Wed Apr 10 11:50:41 CDT 2019 on pts/11
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: fatal: Invalid user for SlurmUser srv_slurm_acct, ignored
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=edsmysqldb011t.uta.edu # Replace by the slurmdbd server hostname (for example, slurmdbd.my.domain)
DbdPort=2114 # The default value
#SlurmUser=slurm
SlurmUser=srv_slurm_acct
StorageHost=localhost
StoragePass=....
# The above defined database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ vi /etc/slurm/slurmdbd.conf
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: fatal: StorageType must be specified
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ vi /etc/slurm/slurmdbd.conf
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=edsmysqldb011t.uta.edu # Replace by the slurmdbd server hostname (for example, slurmdbd.my.domain)
DbdPort=2114 # The default value
SlurmUser=slurm
#SlurmUser=srv_slurm_acct
StorageHost=localhost
StoragePass=....
# The above defined database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
(null): _log_init: Unable to open logfile `/var/log/slurm/slurmdbd.log': Permission denied
slurmdbd: error: chown(/var/log/slurm/slurmdbd.log, 4000, 4000): Permission denied
slurmdbd: debug: Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug: Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
^C
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ ls -l /var/log/slurm/slurmdbd.log
ls: cannot access /var/log/slurm/slurmdbd.log: Permission denied
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ exit
logout
[root@hpcrnt ~]#
[root@hpcrnt /]#
[root@hpcrnt /]#
[root@hpcrnt /]# ls -l /var/log/slurm/slurmdbd.log
-rw-r--r-- 1 slurm slurm 0 Oct 21 09:56 /var/log/slurm/slurmdbd.log
[root@hpcrnt /]#
[root@hpcrnt /]# ls -l /var/log/slurm
total 100968
-rw-r--r-- 1 slurm slurm 0 Oct 21 09:56 slurmdbd.log
-rw-------. 1 slurm slurm 103387781 Apr 10 2019 slurm.log
[root@hpcrnt /]#
[root@hpcrnt /]#
[root@hpcrnt /]# ls -l /var/log/ | grep slurm
drwxrwx---. 2 root root 43 Oct 21 09:56 slurm
-rw------- 1 slurm slurm 7427035 Oct 22 16:54 slurmctld.log
-rw-r--r-- 1 slurm slurm 24373 Apr 25 12:54 slurm_jobacct.log
[root@hpcrnt /]#
[root@hpcrnt /]#
[root@hpcrnt /]# cd /var/log
[root@hpcrnt log]#
[root@hpcrnt log]# chown slurm slurm
[root@hpcrnt log]#
[root@hpcrnt log]# ls -l /var/log/ | grep slurm
drwxrwx---. 2 slurm root 43 Oct 21 09:56 slurm
-rw------- 1 slurm slurm 7443319 Oct 22 16:55 slurmctld.log
-rw-r--r-- 1 slurm slurm 24373 Apr 25 12:54 slurm_jobacct.log
[root@hpcrnt log]#
[root@hpcrnt log]# su - slurm
Last login: Tue Oct 22 16:44:17 CDT 2019 on pts/0
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: debug: Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug: Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
^C
[slurm@hpcrnt ~]$

Thanks,
Mitul Patel

I think some things may have been misconfigured in the slurmdbd.conf file. I believe this is your slurmdbd.conf:

LogFile=/var/log/slurm/slurmdbd.log
DbdHost=edsmysqldb011t.uta.edu # Replace by the slurmdbd server hostname (for example, slurmdbd.my.domain)
DbdPort=2114 # The default value
SlurmUser=slurm
#SlurmUser=srv_slurm_acct
StorageHost=localhost
StoragePass=....
# The above defined database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql

I believe you will want to add "StorageUser=srv_slurm_acct" and also switch the DbdHost and StorageHost parameters. StorageUser and StorageHost should be the user name and host of the machine running mysql/mariadb; DbdHost and SlurmUser should be the host and user of the machine running slurmdbd.

Hi,

I changed the file as recommended and am still getting a connection error. I see it's trying to connect to port 3306. Do I need to add "AccountingStoragePort=2114" to /etc/slurm/slurm.conf? I already have "DbdPort=2114" in /etc/slurm/slurmdbd.conf.

#############
This is the info that we got from the DB team regarding the database.
Server: edsmysqldb011t.uta.edu
Database: slurm_acct_db
Port: 2114
Service ID: srv_slurm_acct
Password: ......
#############

I am able to connect to the DB server on port 2114.

[root@hpcrnt ~]# telnet edsmysqldb011t.uta.edu 2114
Trying 129.107.56.232...
Connected to edsmysqldb011t.uta.edu.
Escape character is '^]'.
U
^CConnection closed by foreign host.
[root@hpcrnt ~]# ^C
[root@hpcrnt ~]#
################
[slurm@hpcrnt ~]$ cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=hpcrnt.uta.edu # Should be host of machine running slurmdbd.
DbdPort=2114 # Database Port
SlurmUser=slurm # Should be user of machine running slurmdbd.
StorageUser=srv_slurm_acct # Account of DB server
StorageHost=edsmysqldb011t.uta.edu # Name of DB server
StoragePass=............
# Database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: debug: Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug: Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:3306
slurmdbd: error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'edsmysqldb011t.uta.edu' (4)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:3306
slurmdbd: error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'edsmysqldb011t.uta.edu' (4)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:3306
slurmdbd: error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'edsmysqldb011t.uta.edu' (4)
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:3306
^C
[slurm@hpcrnt ~]$
#############
[root@hpcrnt ~]# grep -i port /etc/slurm/slurm.conf
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-60500
[root@hpcrnt ~]#
[root@hpcrnt ~]# grep -i AccountingStoragePort /etc/slurm/slurm.conf
[root@hpcrnt ~]#

Thanks,
Mitul Patel

You will want to set "StoragePort=2114". That is the parameter that configures which port slurmdbd uses to connect to the mysql database. You will also want to unset or comment out DbdPort, as that is the port the slurmdbd/slurmctld use to communicate, and using the default value (rather than 2114) is correct there.

Hi,

I removed DbdPort and added StoragePort as requested. After that I started the service and got an error.
I talked with the DB team and they told me to uninstall mariadb and install mysql. I am doing that now.

[root@hpcrnt ~]# cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=hpcrnt.uta.edu # Should be host of machine running slurmdbd.
StoragePort=2114 # DB Port
SlurmUser=slurm # Should be user of machine running slurmdbd.
StorageUser=srv_slurm_acct # Account of DB server
StorageHost=edsmysqldb011t.uta.edu # Name of DB server
StoragePass=........... # Database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql
[root@hpcrnt ~]#
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: debug: Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug: Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: error: mysql_real_connect failed: 2059 Authentication plugin 'sha256_password' cannot be loaded: /usr/lib64/mysql/plugin/sha256_password.so: cannot open shared object file: No such file or directory
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: error: mysql_real_connect failed: 2059 Authentication plugin 'sha256_password' cannot be loaded: /usr/lib64/mysql/plugin/sha256_password.so: cannot open shared object file: No such file or directory
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: error: mysql_real_connect failed: 2059 Authentication plugin 'sha256_password' cannot be loaded: /usr/lib64/mysql/plugin/sha256_password.so: cannot open shared object file: No such file or directory
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: error: mysql_real_connect failed: 2059 Authentication plugin 'sha256_password' cannot be loaded: /usr/lib64/mysql/plugin/sha256_password.so: cannot open shared object file: No such file or directory
slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
^C
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ ls -l /usr/lib64/mysql/plugin/sha256_password.so
ls: cannot access /usr/lib64/mysql/plugin/sha256_password.so: No such file or directory
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$
[slurm@hpcrnt ~]$ systemctl status slrumdbd
Unit slrumdbd.service could not be found.
[slurm@hpcrnt ~]$ systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2019-10-22 15:51:55 CDT; 19h ago
Process: 25241 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=1/FAILURE)
[slurm@hpcrnt ~]$

Thanks,
Mitul Patel

Hi,
After uninstalling mariadb and installing mysql as requested by the DB team, I am able to connect now. I ran a couple of jobs on the test HPC. Is there a command to pull the accounting data?
[slurm@hpcrnt system]$ slurmdbd -D -vvv
slurmdbd: debug: Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug: Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: debug2: innodb_buffer_pool_size: 3221225472
slurmdbd: debug2: innodb_log_file_size: 262144000
slurmdbd: debug2: innodb_lock_wait_timeout: 50
slurmdbd: error: Database settings not recommended values: innodb_lock_wait_timeout
slurmdbd: converting QOS table
slurmdbd: Conversion done: success!
slurmdbd: Accounting storage MYSQL plugin loaded
slurmdbd: Not running as root. Can't drop supplementary groups
slurmdbd: debug2: ArchiveDir = /tmp
slurmdbd: debug2: ArchiveScript = (null)
slurmdbd: debug2: AuthInfo = (null)
slurmdbd: debug2: AuthType = auth/munge
slurmdbd: debug2: CommitDelay = 0
slurmdbd: debug2: DbdAddr = hpcrnt.uta.edu
slurmdbd: debug2: DbdBackupHost = (null)
slurmdbd: debug2: DbdHost = hpcrnt.uta.edu
slurmdbd: debug2: DbdPort = 6819
slurmdbd: debug2: DebugFlags = (null)
slurmdbd: debug2: DebugLevel = 6
slurmdbd: debug2: DebugLevelSyslog = 10
slurmdbd: debug2: DefaultQOS = (null)
slurmdbd: debug2: LogFile = /var/log/slurm/slurmdbd.log
slurmdbd: debug2: MessageTimeout = 10
slurmdbd: debug2: Parameters = (null)
slurmdbd: debug2: PidFile = /var/run/slurmdbd.pid
slurmdbd: debug2: PluginDir = /usr/lib64/slurm
slurmdbd: debug2: PrivateData = none
slurmdbd: debug2: PurgeEventAfter = NONE
slurmdbd: debug2: PurgeJobAfter = NONE
slurmdbd: debug2: PurgeResvAfter = NONE
slurmdbd: debug2: PurgeStepAfter = NONE
slurmdbd: debug2: PurgeSuspendAfter = NONE
slurmdbd: debug2: PurgeTXNAfter = NONE
slurmdbd: debug2: PurgeUsageAfter = NONE
slurmdbd: debug2: SlurmUser = slurm(4000)
slurmdbd: debug2: StorageBackupHost = (null)
slurmdbd: debug2: StorageHost = edsmysqldb011t.uta.edu
slurmdbd: debug2: StorageLoc = slurm_acct_db
slurmdbd: debug2: StoragePort = 2114
slurmdbd: debug2: StorageType = accounting_storage/mysql
slurmdbd: debug2: StorageUser = srv_slurm_acct
slurmdbd: debug2: TCPTimeout = 2
slurmdbd: debug2: TrackWCKey = 0
slurmdbd: debug2: TrackSlurmctldDown= 0
slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: slurmdbd version 18.08.7 started
slurmdbd: debug2: running rollup at Wed Oct 23 11:51:43 2019
slurmdbd: debug2: Everything rolled up
^Cslurmdbd: Terminate signal (SIGINT or SIGTERM) received
slurmdbd: debug: rpc_mgr shutting down
slurmdbd: Unable to remove pidfile '/var/run/slurmdbd.pid': No such file or directory
[slurm@hpcrnt system]$
[slurm@hpcrnt system]$
[slurm@hpcrnt system]$
[slurm@hpcrnt system]$ exit
logout
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# systemctl restart slurmdbd
[root@hpcrnt ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2019-10-23 11:52:49 CDT; 1s ago
Process: 95839 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 95841 (slurmdbd)
CGroup: /system.slice/slurmdbd.service
└─95841 /usr/sbin/slurmdbd
Oct 23 11:52:49 hpcrnt.uta.edu systemd[1]: Starting Slurm DBD accounting daemon...
Oct 23 11:52:49 hpcrnt.uta.edu systemd[1]: Started Slurm DBD accounting daemon.
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[patelmn@hpcrnt ~]$ sbatch batchHelloWorldC.slurm
Submitted batch job 34
[patelmn@hpcrnt ~]$ sacct --format="JobID,user,account,elapsed,Timelimit,MaxRSS,ReqMem,MaxVMSize,ncpus,ExitCode"
JobID User Account Elapsed Timelimit MaxRSS ReqMem MaxVMSize NCPUS ExitCode
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------
25 patelmn (null) 00:00:00 0n 0 0:0
26 patelmn (null) 00:00:00 0n 0 0:0
27 patelmn (null) 00:00:00 0n 0 0:0
31 patelmn (null) 00:00:01 0n 0 1:0
32 patelmn (null) 00:00:00 0n 0 1:0
33 patelmn (null) 00:00:00 0n 0 0:0
34 patelmn (null) 00:00:00 0n 0 0:0
[patelmn@hpcrnt ~]$
Thanks,
Mitul Patel
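Once accounting records are landing in slurmdbd, sacct (demonstrated above) pulls per-job data and sreport produces cluster-level rollups of the same database. A hedged sketch of both; the cluster name hpcrnt comes from this ticket, and the date range is an illustrative assumption:

```shell
# Per-job accounting records, as shown earlier in this ticket
sacct --format="JobID,User,Account,Elapsed,Timelimit,MaxRSS,ReqMem,NCPUS,ExitCode"

# Overall cluster utilization (allocated/idle/down time) for a date range
sreport cluster utilization start=2019-10-01 end=2019-10-24

# Usage broken down by account and user, reported in hours
sreport cluster AccountUtilizationByUser cluster=hpcrnt start=2019-10-01 -t hours
```

Both tools talk to slurmdbd, so they only work once slurm.conf points AccountingStorageType at accounting_storage/slurmdbd.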
I'm glad things are working. You've already seen sacct; I would also take a look at sreport for other types of utilization reports. Now that things are working I will plan on closing this support ticket.

Hi,

When I run sreport I am told I am not running a supported accounting storage plugin. Do I need to change the storage type in slurm.conf to "AccountingStorageType=accounting_storage/mysql"?

[root@hpcrnt ~]# sreport
You are not running a supported accounting_storage plugin
(accounting_storage/filetxt).
Only 'accounting_storage/slurmdbd' and 'accounting_storage/mysql' are supported.
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# sacctmgr
You are not running a supported accounting_storage plugin
(accounting_storage/filetxt).
Only 'accounting_storage/slurmdbd' and 'accounting_storage/mysql' are supported.
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# grep -i accounting /etc/slurm/slurm.conf
# ACCOUNTING
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStoreJobComment=YES
#AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=edsmysqldb011t.uta.edu
#AccountingStoragePort=2114
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
AccountingStorageType=accounting_storage/filetxt
[root@hpcrnt ~]#

Thanks,
Mitul Patel

AccountingStorageHost should point to the host running slurmdbd (not the host running mysql) and AccountingStorageType should be accounting_storage/slurmdbd. Please let me know if that fixes things.

Hi,

I added "AccountingStoragePort=2114" to slurm.conf. Waiting on sreport output.

[root@hpcrnt ~]#
[root@hpcrnt ~]# grep -i port /etc/slurm/slurm.conf
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-60500
AccountingStoragePort=2114
[root@hpcrnt ~]#
[root@hpcrnt ~]# systemctl restart slurmctld
[root@hpcrnt ~]# systemctl restart slurmdbd
[root@hpcrnt ~]#
[root@hpcrnt ~]# sreport

Thanks,
Mitul Patel

Unfortunately that was not the right configuration parameter to change.
That parameter should be deleted or commented out (so as to use the default). Port 2114 is for slurmdbd to communicate with mysql (based on your previous comments), NOT for sreport-to-slurmdbd communication. You will most likely need to delete AccountingStoragePort and change the two parameters I mentioned in my last comment: AccountingStorageType and AccountingStorageHost.

Hi,

After commenting out "AccountingStoragePort=2114" I get the previous message. I have attached both files, slurm.conf and slurmdbd.conf.
sreport: error: slurm_persist_conn_open_without_init: failed to open persistent connection to edsmysqldb011t.uta.edu:6819: Connection timed out
sreport: error: slurmdbd: Sending PersistInit msg: Connection timed out
sreport: fatal: Problem connecting to the database: Connection timed out
=====================================
[root@hpcrnt ~]#
[root@hpcrnt ~]# grep -i AccountingStorageType /etc/slurm/slurm.conf
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/filetxt
AccountingStorageType=accounting_storage/slurmdbd
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# grep -i AccountingStorageHost /etc/slurm/slurm.conf
AccountingStorageHost=edsmysqldb011t.uta.edu
[root@hpcrnt ~]#
[root@hpcrnt ~]# grep -i AccountingStoragePort /etc/slurm/slurm.conf
AccountingStoragePort=2114
[root@hpcrnt ~]#
[root@hpcrnt ~]# vi /etc/slurm/slurm.conf
[root@hpcrnt ~]#
[root@hpcrnt ~]# grep -i AccountingStoragePort /etc/slurm/slurm.conf
#AccountingStoragePort=2114
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# systemctl restart slurmctld
[root@hpcrnt ~]# systemctl restart slurmctld
[root@hpcrnt ~]# systemctl restart slurmdbd
[root@hpcrnt ~]#
[root@hpcrnt ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2019-10-23 16:57:36 CDT; 12s ago
Process: 113549 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Process: 113546 ExecStartPre=/usr/bin/chown -R slurm:slurm /var/run/slurm (code=exited, status=0/SUCCESS)
Process: 113544 ExecStartPre=/usr/bin/mkdir -m 0750 -p /var/run/slurm (code=exited, status=0/SUCCESS)
Main PID: 113552 (slurmctld)
CGroup: /system.slice/slurmctld.service
└─113552 /usr/sbin/slurmctld
Oct 23 16:57:36 hpcrnt.uta.edu systemd[1]: Starting Slurm controller daemon...
Oct 23 16:57:36 hpcrnt.uta.edu systemd[1]: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Oct 23 16:57:36 hpcrnt.uta.edu systemd[1]: Started Slurm controller daemon.
[root@hpcrnt ~]#
[root@hpcrnt ~]# systemctl status slurmdmd
Unit slurmdmd.service could not be found.
[root@hpcrnt ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2019-10-23 16:57:42 CDT; 17s ago
Process: 113566 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 113568 (slurmdbd)
CGroup: /system.slice/slurmdbd.service
└─113568 /usr/sbin/slurmdbd
Oct 23 16:57:42 hpcrnt.uta.edu systemd[1]: Stopped Slurm DBD accounting daemon.
Oct 23 16:57:42 hpcrnt.uta.edu systemd[1]: Starting Slurm DBD accounting daemon...
Oct 23 16:57:42 hpcrnt.uta.edu systemd[1]: PID file /var/run/slurmdbd.pid not readable (yet?) after start.
Oct 23 16:57:42 hpcrnt.uta.edu systemd[1]: Started Slurm DBD accounting daemon.
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# sreport
sreport: error: slurm_persist_conn_open_without_init: failed to open persistent connection to edsmysqldb011t.uta.edu:6819: Connection timed out
sreport: error: slurmdbd: Sending PersistInit msg: Connection timed out
sreport: fatal: Problem connecting to the database: Connection timed out
[root@hpcrnt ~]#
=====================================
slurm.conf
[root@hpcrnt ~]# cat /etc/slurm/slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
#ClusterName=linux
ClusterName=hpcrnt
ControlMachine=hpcrnt
EnforcePartLimits=YES
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-60500
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm-states
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/pgid
RebootProgram="/usr/sbin/reboot"
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStoreJobComment=YES
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=edsmysqldb011t.uta.edu
#AccountingStoragePort=2114
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
#AccountingStorageType=accounting_storage/filetxt
AccountingStorageType=accounting_storage/slurmdbd
Epilog=/etc/slurm/slurm.epilog.clean
#NodeName=compute-6-7-0 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f2800 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f2900 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f3001 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f3002 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3003 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3004 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3005 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3006 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3007 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3008 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3009 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3010 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3011 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3012 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3013 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3014 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3015 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3016 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#PartitionName=normal Nodes=cn-2f3001,cn-2f3002,cn-2f3003,cn-2f3004,cn-2f3005,cn-2f3006,cn-2f3007,cn-2f3008,cn-2f3009,cn-2f3010,cn-2f3011,cn-2f3012,cn-2f3013,cn-2f3014,cn-2f3015,cn-2f3016 Default=YES MaxTime=48:00:00 State=UP
#PartitionName=normal Nodes=cn-2f30[01-16] Default=YES MaxTime=48:00:00 State=UP
#PartitionName=normal Nodes=cn-2f30[01-12] Default=YES MaxTime=48:00:00 State=UP
PartitionName=normal Nodes=cn-2f30[01-02] Default=YES MaxTime=48:00:00 State=UP
PartitionName=gpu Nodes=cn-2f2800,cn-2f2900 MaxTime=48:00:00 State=UP
[root@hpcrnt ~]#
=====================================
Slurmdbd.conf
[root@hpcrnt ~]# cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=hpcrnt.uta.edu # Should be host of machine running slurmdbd.
StoragePort=2114 # DB Port
SlurmUser=slurm # Should be user of machine running slurmdbd.
StorageUser=srv_slurm_acct # Account of DB server
StorageHost=edsmysqldb011t.uta.edu # Name of DB server
StoragePass=............ # Database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql
[root@hpcrnt ~]#
Thanks,
Mitul Patel
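[Editor's note: for cross-reference, the intended split between the two files pasted above (slurmdbd running on hpcrnt, MySQL on edsmysqldb011t) would look roughly like this. This is a sketch assembled from the hostnames and values in this ticket, not a verified configuration:]

```
# /etc/slurm/slurmdbd.conf -- lives on the host running slurmdbd (hpcrnt)
DbdHost=hpcrnt.uta.edu              # the slurmdbd host itself
StorageType=accounting_storage/mysql
StorageHost=edsmysqldb011t.uta.edu  # the remote MySQL server
StoragePort=2114
StorageUser=srv_slurm_acct
StorageLoc=slurm_acct_db
SlurmUser=slurm

# /etc/slurm/slurm.conf -- points at the slurmdbd host, NOT the database
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=hpcrnt.uta.edu
```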
This parameter in your slurm.conf looks like it might be wrong:

AccountingStorageHost=edsmysqldb011t.uta.edu

As I mentioned in comment 12, AccountingStorageHost in slurm.conf should point to the host running slurmdbd (hpcrnt.uta.edu). Could you try changing it to:

AccountingStorageHost=hpcrnt.uta.edu

Hi,
It looks like it's working now. However, I do not see any reports for jobs that ran previously. I am assuming that data was wiped out when slurmdbd was set up. Is there a way to look at it?
I am going to run a couple of jobs and see whether they show up in the report.
Is there a way to get a report of all jobs? I tried "sreport -a" and it does not work.
[root@hpcrnt ~]# sreport job
too few arguments for keyword:job
[root@hpcrnt ~]# sreport jobs
invalid keyword: jobs
[root@hpcrnt ~]#
[root@hpcrnt ~]# sreport
sreport: exit
[root@hpcrnt ~]# sreport -a job SizesByAccount All_Clusters
--------------------------------------------------------------------------------
Job Sizes 2019-10-22T00:00:00 - 2019-10-22T23:59:59 (86400 secs)
Time reported in Minutes
--------------------------------------------------------------------------------
Cluster Account 0-49 CPUs 50-249 CPUs 250-499 CPUs 500-999 CPUs >= 1000 CPUs % of cluster
--------- --------- ------------- ------------- ------------- ------------- ------------- ------------
[root@hpcrnt ~]#
[root@hpcrnt ~]# sreport -a All_Clusters
invalid keyword: All_Clusters
[root@hpcrnt ~]# sreport -a job All_Clusters
Not valid report All_Clusters
Valid job reports are, "SizesByAccount, SizesByAccountAndWcKey, and SizesByWckey"
[root@hpcrnt ~]#
[root@hpcrnt ~]# sreport -a jobs All_Clusters
invalid keyword: jobs
[root@hpcrnt ~]#
[root@hpcrnt ~]# sreport -a cluster All_Clusters
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2019-10-22T00:00:00 - 2019-10-22T23:59:59 (86400 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
Cluster Account Login Proper Name Used Energy
--------- --------------- --------- --------------- -------- --------
[root@hpcrnt ~]#
[root@hpcrnt ~]# sbatch
^C
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 2-00:00:00 2 idle cn-2f[3001-3002]
gpu up 2-00:00:00 2 idle cn-2f[2800,2900]
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[root@hpcrnt ~]#
[root@hpcrnt ~]# sacct --format="JobID,user,account,elapsed,Timelimit,MaxRSS,ReqMem,MaxVMSize,ncpus,ExitCode"
JobID User Account Elapsed Timelimit MaxRSS ReqMem MaxVMSize NCPUS ExitCode
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------
sacct: error: slurmdbd: Unknown error 1064
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]#
[root@hpcrnt ~]# sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: error: slurmdbd: Unknown error 1064
[root@hpcrnt ~]#
Thanks,
Mitul Patel
If the slurmctld->slurmdbd communication path is now working, slurmctld will start collecting accounting data; any data from before will not have been collected. You can verify the connection using the "DBD Agent queue size" field in the "sdiag" output (see the sdiag man page for more information).

The available reports are outlined in the sreport documentation. I recommend reading that page to find the report that works for you. These links contain the relevant information:

https://slurm.schedmd.com/sreport.html
https://slurm.schedmd.com/sdiag.html
https://slurm.schedmd.com/sacct.html

It appears that accounting/slurmdbd is now set up on your system. If you have any further issues, feel free to open another ticket.
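[Editor's note: as an illustration of the commands referenced above (the dates are placeholders; see the man pages for the full option lists):]

```shell
# Check whether slurmctld still has accounting records queued for slurmdbd;
# a persistently non-zero "DBD Agent queue size" suggests a broken connection.
sdiag | grep -i "DBD Agent"

# Per-account cluster utilization over an explicit time window.
sreport cluster utilization start=2019-10-01 end=2019-10-23

# One of the three valid "job" reports listed in the error output above.
sreport job sizesbyaccount start=2019-10-01 end=2019-10-23

# All jobs for all users since a given date, rather than today's default window.
sacct -a -S 2019-10-01 -E now
```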
Hi,

I would like help setting up Slurm accounting. We already have a database set up on another server and would like to set up accounting on the root node. I followed the Slurm online documentation and am getting an error. I added the DB server to /etc/slurm/slurmdbd.conf and also to slurm.conf. Please take a look and let me know what I am missing.

--------------------
[root@hpcrnt ~]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[root@hpcrnt ~]#
[root@hpcrnt ~]# ps -ef | grep -i slurm
slurm    25200     1  0 15:51 ?        00:00:00 /usr/sbin/slurmctld
root     25694 24922  0 16:00 pts/0    00:00:00 grep --color=auto -i slurm
[root@hpcrnt ~]#
[root@hpcrnt ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-10-22 15:51:15 CDT; 1s ago
  Process: 25197 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
  Process: 25195 ExecStartPre=/usr/bin/chown -R slurm:slurm /var/run/slurm (code=exited, status=0/SUCCESS)
  Process: 25193 ExecStartPre=/usr/bin/mkdir -m 0750 -p /var/run/slurm (code=exited, status=0/SUCCESS)
 Main PID: 25200 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─25200 /usr/sbin/slurmctld

Oct 22 15:51:15 hpcrnt.uta.edu systemd[1]: Stopped Slurm controller daemon.
Oct 22 15:51:15 hpcrnt.uta.edu systemd[1]: Starting Slurm controller daemon...
Oct 22 15:51:15 hpcrnt.uta.edu systemd[1]: Failed to read PID from file /var/run/slurm/slurmctld.pid: Invalid argument
Oct 22 15:51:15 hpcrnt.uta.edu systemd[1]: Started Slurm controller daemon.
[root@hpcrnt ~]#
[root@hpcrnt ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2019-10-21 14:10:31 CDT; 1 day 1h ago

Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: Starting Slurm DBD accounting daemon...
Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: slurmdbd.service: control process exited, code=exited status=1
Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: Failed to start Slurm DBD accounting daemon.
Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: Unit slurmdbd.service entered failed state.
Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: slurmdbd.service failed.
[root@hpcrnt ~]#
[root@hpcrnt ~]# systemctl restart slurmdbd
Job for slurmdbd.service failed because the control process exited with error code. See "systemctl status slurmdbd.service" and "journalctl -xe" for details.
[root@hpcrnt ~]#
[root@hpcrnt ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2019-10-22 15:51:55 CDT; 1s ago
  Process: 25241 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=1/FAILURE)

Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: Starting Slurm DBD accounting daemon...
Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: slurmdbd.service: control process exited, code=exited status=1
Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: Failed to start Slurm DBD accounting daemon.
Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: Unit slurmdbd.service entered failed state.
Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: slurmdbd.service failed.
[root@hpcrnt ~]#
[root@hpcrnt ~]# cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=edsmysqldb011t.uta.edu # Replace by the slurmdbd server hostname (for example, slurmdbd.my.domain)
DbdPort=2114 # The default value
#SlurmUser=slurm
SlurmUser=srv_slurm_acct
StorageHost=localhost
StoragePass=........ # The above defined database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
[root@hpcrnt ~]#
[root@hpcrnt ~]# grep -i host /etc/slurm/slurm.conf
AccountingStorageHost=edsmysqldb011t.uta.edu
[root@hpcrnt ~]#

Thanks,
Mitul Patel
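[Editor's note: when slurmdbd dies at startup with nothing written to its log file, as in the session above, running it in the foreground usually surfaces the error directly. A diagnostic sketch, assuming the standard binary location:]

```shell
# Stop the systemd unit first, then run slurmdbd in the foreground (-D)
# with extra verbosity (-vvv); startup errors print to the terminal
# instead of the (empty) log file.
systemctl stop slurmdbd
/usr/sbin/slurmdbd -D -vvv
```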