Ticket 7974

Summary: Help setting up Slurm accounting
Product: Slurm    Reporter: Mitul Patel <mitul.patel>
Component: Accounting    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 18.08.6   
Hardware: Linux   
OS: Linux   
Site: UT Arlington Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Mitul Patel 2019-10-22 15:04:38 MDT
Hi,

I would like help setting up Slurm accounting. We already have the database set up on another server and would like to enable accounting on the head node.

I followed the Slurm online documentation but am getting an error.


I added the DB server to /etc/slurm/slurmdbd.conf and also to slurm.conf.

Please take a look and let me know what I am missing to set up accounting.


--------------------




[root@hpcrnt ~]# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# ps -ef | grep -i slurm
slurm     25200      1  0 15:51 ?        00:00:00 /usr/sbin/slurmctld
root      25694  24922  0 16:00 pts/0    00:00:00 grep --color=auto -i slurm
[root@hpcrnt ~]# 
[root@hpcrnt ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-10-22 15:51:15 CDT; 1s ago
  Process: 25197 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
  Process: 25195 ExecStartPre=/usr/bin/chown -R slurm:slurm /var/run/slurm (code=exited, status=0/SUCCESS)
  Process: 25193 ExecStartPre=/usr/bin/mkdir -m 0750 -p /var/run/slurm (code=exited, status=0/SUCCESS)
 Main PID: 25200 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─25200 /usr/sbin/slurmctld

Oct 22 15:51:15 hpcrnt.uta.edu systemd[1]: Stopped Slurm controller daemon.
Oct 22 15:51:15 hpcrnt.uta.edu systemd[1]: Starting Slurm controller daemon...
Oct 22 15:51:15 hpcrnt.uta.edu systemd[1]: Failed to read PID from file /var/run/slurm/slurmctld.pid: Invalid argument
Oct 22 15:51:15 hpcrnt.uta.edu systemd[1]: Started Slurm controller daemon.
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2019-10-21 14:10:31 CDT; 1 day 1h ago

Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: Starting Slurm DBD accounting daemon...
Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: slurmdbd.service: control process exited, code=exited status=1
Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: Failed to start Slurm DBD accounting daemon.
Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: Unit slurmdbd.service entered failed state.
Oct 21 14:10:31 hpcrnt.uta.edu systemd[1]: slurmdbd.service failed.
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# systemctl restart slurmdbd
Job for slurmdbd.service failed because the control process exited with error code. See "systemctl status slurmdbd.service" and "journalctl -xe" for details.
[root@hpcrnt ~]# 
[root@hpcrnt ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2019-10-22 15:51:55 CDT; 1s ago
  Process: 25241 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=1/FAILURE)

Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: Starting Slurm DBD accounting daemon...
Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: slurmdbd.service: control process exited, code=exited status=1
Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: Failed to start Slurm DBD accounting daemon.
Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: Unit slurmdbd.service entered failed state.
Oct 22 15:51:55 hpcrnt.uta.edu systemd[1]: slurmdbd.service failed.
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=edsmysqldb011t.uta.edu		 # Replace by the slurmdbd server hostname (for example, slurmdbd.my.domain)
DbdPort=2114			   	 # The default value
#SlurmUser=slurm
SlurmUser=srv_slurm_acct
StorageHost=localhost
StoragePass=........		# The above defined database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# grep -i host /etc/slurm/slurm.conf
AccountingStorageHost=edsmysqldb011t.uta.edu
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 






Thanks,
Mitul Patel
Comment 1 Douglas Wightman 2019-10-22 15:19:36 MDT
Hi Mitul Patel,

Can you attach the slurmdbd log file (/var/log/slurm/slurmdbd.log based on what you have provided)?

It would also be helpful to attach the entire slurm.conf file to this ticket.

Thanks!
Comment 2 Mitul Patel 2019-10-22 15:38:51 MDT
Hi,

The slurmdbd log file is empty.





[root@hpcrnt ~]# cat /etc/slurm/slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
#ClusterName=linux
ClusterName=hpcrnt
ControlMachine=hpcrnt
EnforcePartLimits=YES 

#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-60500
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm-states
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/pgid
RebootProgram="/usr/sbin/reboot"
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStoreJobComment=YES
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=edsmysqldb011t.uta.edu
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
AccountingStorageType=accounting_storage/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
#NodeName=compute-6-7-0 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f2800 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f2900 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f3001 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f3002 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3003 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3004 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3005 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3006 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3007 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3008 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3009 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3010 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3011 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3012 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3013 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3014 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3015 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3016 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844

#PartitionName=normal Nodes=cn-2f3001,cn-2f3002,cn-2f3003,cn-2f3004,cn-2f3005,cn-2f3006,cn-2f3007,cn-2f3008,cn-2f3009,cn-2f3010,cn-2f3011,cn-2f3012,cn-2f3013,cn-2f3014,cn-2f3015,cn-2f3016 Default=YES MaxTime=48:00:00 State=UP
#PartitionName=normal Nodes=cn-2f30[01-16] Default=YES MaxTime=48:00:00 State=UP
#PartitionName=normal Nodes=cn-2f30[01-12] Default=YES MaxTime=48:00:00 State=UP
PartitionName=normal Nodes=cn-2f30[01-02] Default=YES MaxTime=48:00:00 State=UP
PartitionName=gpu Nodes=cn-2f2800,cn-2f2900 MaxTime=48:00:00 State=UP
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# ls -l /var/log/slurm/slurmdbd.log
-rw-r--r-- 1 slurm slurm 0 Oct 21 09:56 /var/log/slurm/slurmdbd.log
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# cat /var/log/slurm/slurmdbd.log
[root@hpcrnt ~]# 


Thanks,
Mitul Patel
Comment 3 Douglas Wightman 2019-10-22 15:43:01 MDT
If the log file is empty, I believe we will need to start slurmdbd manually rather than via systemctl.

Can you run "slurmdbd -D -vvv" as the "srv_slurm_acct" user?

Running the command manually will cause slurmdbd to write all logging to the terminal rather than to the log file. Please send the output of that command.
Comment 4 Mitul Patel 2019-10-22 15:58:51 MDT
Hi,

srv_slurm_acct is the user account set up by the DB team, and it does not exist locally, so I changed SlurmUser to slurm in slurmdbd.conf. I also added StorageType, as indicated by the error.




[root@hpcrnt ~]# 
[root@hpcrnt ~]# su - srv_slurm_acct
su: user srv_slurm_acct does not exist
[root@hpcrnt ~]# 
[root@hpcrnt ~]# grep -i slurm /egc/passwd
grep: /egc/passwd: No such file or directory
[root@hpcrnt ~]# grep -i slurm /etc/passwd
slurm:x:4000:4000::/home/slurm:/bin/bash
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# su - slurm
Last login: Wed Apr 10 11:50:41 CDT 2019 on pts/11
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: fatal: Invalid user for SlurmUser srv_slurm_acct, ignored
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=edsmysqldb011t.uta.edu		 # Replace by the slurmdbd server hostname (for example, slurmdbd.my.domain)
DbdPort=2114			   	 # The default value
#SlurmUser=slurm
SlurmUser=srv_slurm_acct
StorageHost=localhost
StoragePass=....		# The above defined database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ vi /etc/slurm/slurmdbd.conf
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: fatal: StorageType must be specified
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ vi /etc/slurm/slurmdbd.conf
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=edsmysqldb011t.uta.edu		 # Replace by the slurmdbd server hostname (for example, slurmdbd.my.domain)
DbdPort=2114			   	 # The default value
SlurmUser=slurm
#SlurmUser=srv_slurm_acct
StorageHost=localhost
StoragePass=....		# The above defined database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
(null): _log_init: Unable to open logfile `/var/log/slurm/slurmdbd.log': Permission denied
slurmdbd: error: chown(/var/log/slurm/slurmdbd.log, 4000, 4000): Permission denied
slurmdbd: debug:  Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
^C
[slurm@hpcrnt ~]$ 





[slurm@hpcrnt ~]$ ls -l /var/log/slurm/slurmdbd.log
ls: cannot access /var/log/slurm/slurmdbd.log: Permission denied
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ exit
logout
[root@hpcrnt ~]# 
[root@hpcrnt /]# 
[root@hpcrnt /]# ls -l /var/log/slurm/slurmdbd.log
-rw-r--r-- 1 slurm slurm 0 Oct 21 09:56 /var/log/slurm/slurmdbd.log
[root@hpcrnt /]# 
[root@hpcrnt /]# ls -l /var/log/slurm
total 100968
-rw-r--r--  1 slurm slurm         0 Oct 21 09:56 slurmdbd.log
-rw-------. 1 slurm slurm 103387781 Apr 10  2019 slurm.log
[root@hpcrnt /]# 
[root@hpcrnt /]# 
[root@hpcrnt /]# ls -l /var/log/ | grep slurm
drwxrwx---. 2 root  root            43 Oct 21 09:56 slurm
-rw-------  1 slurm slurm      7427035 Oct 22 16:54 slurmctld.log
-rw-r--r--  1 slurm slurm        24373 Apr 25 12:54 slurm_jobacct.log
[root@hpcrnt /]# 
[root@hpcrnt /]# 
[root@hpcrnt /]# cd /var/log
[root@hpcrnt log]# 
[root@hpcrnt log]# chown slurm slurm
[root@hpcrnt log]# 
[root@hpcrnt log]# ls -l /var/log/ | grep slurm
drwxrwx---. 2 slurm root            43 Oct 21 09:56 slurm
-rw-------  1 slurm slurm      7443319 Oct 22 16:55 slurmctld.log
-rw-r--r--  1 slurm slurm        24373 Apr 25 12:54 slurm_jobacct.log
[root@hpcrnt log]# 
[root@hpcrnt log]# su - slurm
Last login: Tue Oct 22 16:44:17 CDT 2019 on pts/0
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: debug:  Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'patelmn'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
^C
[slurm@hpcrnt ~]$


Thanks,
Mitul Patel
Comment 5 Douglas Wightman 2019-10-22 16:10:48 MDT
I think some things may have been misconfigured in the slurmdbd.conf file. I believe this is your slurmdbd.conf:

LogFile=/var/log/slurm/slurmdbd.log
DbdHost=edsmysqldb011t.uta.edu		 # Replace by the slurmdbd server hostname (for example, slurmdbd.my.domain)
DbdPort=2114			   	 # The default value
SlurmUser=slurm
#SlurmUser=srv_slurm_acct
StorageHost=localhost
StoragePass=....		# The above defined database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql

I believe you will want to add "StorageUser=srv_slurm_acct" and also swap the DbdHost and StorageHost values. StorageUser and StorageHost should be the database account and the host of the machine running mysql/mariadb, while DbdHost and SlurmUser should be the host and the user of the machine running slurmdbd.
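Putting that advice together, a slurmdbd.conf along these lines should be closer to correct. This is a sketch only, using the hostnames already given in this ticket; the password stays redacted:

```
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=hpcrnt.uta.edu                # host running slurmdbd itself
SlurmUser=slurm                       # local user slurmdbd runs as
StorageType=accounting_storage/mysql
StorageHost=edsmysqldb011t.uta.edu    # host running mysql/mariadb
StorageUser=srv_slurm_acct            # database account
StoragePass=........                  # database password (redacted)
StorageLoc=slurm_acct_db
DebugLevel=verbose
```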
Comment 6 Mitul Patel 2019-10-22 17:11:30 MDT
Hi,

I changed the file as recommended but am still getting a connection error. I see it is trying to connect to port 3306.

Do I need to add "AccountingStoragePort=2114" in /etc/slurm/slurm.conf?

I already have "DbdPort=2114" in /etc/slurm/slurmdbd.conf


#############
This is the info that we got from DB team regarding database.



Server:              edsmysqldb011t.uta.edu
Database:            slurm_acct_db
Port:                2114
Service ID:          srv_slurm_acct
Password:            ......

#############

I am able to connect to DB server on port 2114.


[root@hpcrnt ~]# telnet edsmysqldb011t.uta.edu 2114
Trying 129.107.56.232...
Connected to edsmysqldb011t.uta.edu.
Escape character is '^]'.
U
^CConnection closed by foreign host.
[root@hpcrnt ~]# ^C
[root@hpcrnt ~]#


################


[slurm@hpcrnt ~]$ cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=hpcrnt.uta.edu			 # Should be host of machine running slurmdbd.
DbdPort=2114			   	 # Database Port
SlurmUser=slurm				 # Should be user of maching running slurmdbd.
StorageUser=srv_slurm_acct		 # Account of DB server
StorageHost=edsmysqldb011t.uta.edu	 # Name of DB server
StoragePass=............		 # Database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: debug:  Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:3306
slurmdbd: error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'edsmysqldb011t.uta.edu' (4)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:3306
slurmdbd: error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'edsmysqldb011t.uta.edu' (4)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:3306
slurmdbd: error: mysql_real_connect failed: 2003 Can't connect to MySQL server on 'edsmysqldb011t.uta.edu' (4)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:3306
^C
[slurm@hpcrnt ~]$ 


#############


[root@hpcrnt ~]# grep -i port /etc/slurm/slurm.conf
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-60500
[root@hpcrnt ~]# 
[root@hpcrnt ~]# grep -i AccountingStoragePort /etc/slurm/slurm.conf
[root@hpcrnt ~]# 




Thanks,
Mitul Patel
Comment 7 Douglas Wightman 2019-10-23 08:11:29 MDT
You will want to set "StoragePort=2114"; that is the parameter that configures which port slurmdbd uses to connect to the mysql database. You will also want to unset or comment out DbdPort, since that is the port slurmdbd and slurmctld use to communicate with each other, and the default value (rather than 2114) is correct there.
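Independently of slurmdbd, it can help to confirm the database is reachable on the non-default port using the mysql client. This is a sketch, assuming the client is installed on the slurmdbd host; the credentials are the ones the DB team provided in this ticket:

```
mysql -h edsmysqldb011t.uta.edu -P 2114 -u srv_slurm_acct -p slurm_acct_db
```

If this prompts for a password and then drops you into a mysql shell, connectivity and credentials are good, and any remaining failure is on the slurmdbd side.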
Comment 8 Mitul Patel 2019-10-23 10:04:45 MDT
Hi,


I removed DbdPort and added StoragePort as requested. After that I started the service and got a different error. I talked with the DB team and they told me to uninstall mariadb and install mysql. I am doing that now.



[root@hpcrnt ~]# cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=hpcrnt.uta.edu       # Should be host of machine running slurmdbd.
StoragePort=2114         # DB Port
SlurmUser=slurm        # Should be user of maching running slurmdbd.
StorageUser=srv_slurm_acct     # Account of DB server
StorageHost=edsmysqldb011t.uta.edu   # Name of DB server
StoragePass=...........    # Database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql
[root@hpcrnt ~]# 




[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ slurmdbd -D -vvv
slurmdbd: debug:  Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: error: mysql_real_connect failed: 2059 Authentication plugin 'sha256_password' cannot be loaded: /usr/lib64/mysql/plugin/sha256_password.so: cannot open shared object file: No such file or directory
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: error: mysql_real_connect failed: 2059 Authentication plugin 'sha256_password' cannot be loaded: /usr/lib64/mysql/plugin/sha256_password.so: cannot open shared object file: No such file or directory
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: error: mysql_real_connect failed: 2059 Authentication plugin 'sha256_password' cannot be loaded: /usr/lib64/mysql/plugin/sha256_password.so: cannot open shared object file: No such file or directory
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: error: mysql_real_connect failed: 2059 Authentication plugin 'sha256_password' cannot be loaded: /usr/lib64/mysql/plugin/sha256_password.so: cannot open shared object file: No such file or directory
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
^C
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ ls -l /usr/lib64/mysql/plugin/sha256_password.so
ls: cannot access /usr/lib64/mysql/plugin/sha256_password.so: No such file or directory
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ 
[slurm@hpcrnt ~]$ systemctl status slrumdbd
Unit slrumdbd.service could not be found.
[slurm@hpcrnt ~]$ systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2019-10-22 15:51:55 CDT; 19h ago
  Process: 25241 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=1/FAILURE)
[slurm@hpcrnt ~]$ 



Thanks,
Mitul Patel
Comment 9 Mitul Patel 2019-10-23 12:31:57 MDT
Hi,

After uninstalling mariadb and installing MySQL as requested by the DB team, I am able to connect now. I ran a couple of jobs on the test HPC system. Is there a command to pull accounting data?



[slurm@hpcrnt system]$ slurmdbd -D -vvv
slurmdbd: debug:  Log file re-opened
slurmdbd: error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: debug2: innodb_buffer_pool_size: 3221225472
slurmdbd: debug2: innodb_log_file_size: 262144000
slurmdbd: debug2: innodb_lock_wait_timeout: 50
slurmdbd: error: Database settings not recommended values: innodb_lock_wait_timeout
slurmdbd: converting QOS table
slurmdbd: Conversion done: success!
slurmdbd: Accounting storage MYSQL plugin loaded
slurmdbd: Not running as root. Can't drop supplementary groups
slurmdbd: debug2: ArchiveDir        = /tmp
slurmdbd: debug2: ArchiveScript     = (null)
slurmdbd: debug2: AuthInfo          = (null)
slurmdbd: debug2: AuthType          = auth/munge
slurmdbd: debug2: CommitDelay       = 0
slurmdbd: debug2: DbdAddr           = hpcrnt.uta.edu
slurmdbd: debug2: DbdBackupHost     = (null)
slurmdbd: debug2: DbdHost           = hpcrnt.uta.edu
slurmdbd: debug2: DbdPort           = 6819
slurmdbd: debug2: DebugFlags        = (null)
slurmdbd: debug2: DebugLevel        = 6
slurmdbd: debug2: DebugLevelSyslog  = 10
slurmdbd: debug2: DefaultQOS        = (null)
slurmdbd: debug2: LogFile           = /var/log/slurm/slurmdbd.log
slurmdbd: debug2: MessageTimeout    = 10
slurmdbd: debug2: Parameters        = (null)
slurmdbd: debug2: PidFile           = /var/run/slurmdbd.pid
slurmdbd: debug2: PluginDir         = /usr/lib64/slurm
slurmdbd: debug2: PrivateData       = none
slurmdbd: debug2: PurgeEventAfter   = NONE
slurmdbd: debug2: PurgeJobAfter     = NONE
slurmdbd: debug2: PurgeResvAfter    = NONE
slurmdbd: debug2: PurgeStepAfter    = NONE
slurmdbd: debug2: PurgeSuspendAfter = NONE
slurmdbd: debug2: PurgeTXNAfter = NONE
slurmdbd: debug2: PurgeUsageAfter = NONE
slurmdbd: debug2: SlurmUser         = slurm(4000)
slurmdbd: debug2: StorageBackupHost = (null)
slurmdbd: debug2: StorageHost       = edsmysqldb011t.uta.edu
slurmdbd: debug2: StorageLoc        = slurm_acct_db
slurmdbd: debug2: StoragePort       = 2114
slurmdbd: debug2: StorageType       = accounting_storage/mysql
slurmdbd: debug2: StorageUser       = srv_slurm_acct
slurmdbd: debug2: TCPTimeout        = 2
slurmdbd: debug2: TrackWCKey        = 0
slurmdbd: debug2: TrackSlurmctldDown= 0
slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: Attempting to connect to edsmysqldb011t.uta.edu:2114
slurmdbd: slurmdbd version 18.08.7 started
slurmdbd: debug2: running rollup at Wed Oct 23 11:51:43 2019
slurmdbd: debug2: Everything rolled up



^Cslurmdbd: Terminate signal (SIGINT or SIGTERM) received
slurmdbd: debug:  rpc_mgr shutting down
slurmdbd: Unable to remove pidfile '/var/run/slurmdbd.pid': No such file or directory
[slurm@hpcrnt system]$ 
[slurm@hpcrnt system]$ 
[slurm@hpcrnt system]$ 
[slurm@hpcrnt system]$ exit
logout
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# systemctl restart slurmdbd
[root@hpcrnt ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-10-23 11:52:49 CDT; 1s ago
  Process: 95839 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 95841 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─95841 /usr/sbin/slurmdbd

Oct 23 11:52:49 hpcrnt.uta.edu systemd[1]: Starting Slurm DBD accounting daemon...
Oct 23 11:52:49 hpcrnt.uta.edu systemd[1]: Started Slurm DBD accounting daemon.
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 



[patelmn@hpcrnt ~]$ sbatch batchHelloWorldC.slurm 
Submitted batch job 34
[patelmn@hpcrnt ~]$ sacct --format="JobID,user,account,elapsed,Timelimit,MaxRSS,ReqMem,MaxVMSize,ncpus,ExitCode"
       JobID      User    Account    Elapsed  Timelimit     MaxRSS     ReqMem  MaxVMSize      NCPUS ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- 
25             patelmn     (null)   00:00:00                               0n                     0      0:0 
26             patelmn     (null)   00:00:00                               0n                     0      0:0 
27             patelmn     (null)   00:00:00                               0n                     0      0:0 
31             patelmn     (null)   00:00:01                               0n                     0      1:0 
32             patelmn     (null)   00:00:00                               0n                     0      1:0 
33             patelmn     (null)   00:00:00                               0n                     0      0:0 
34             patelmn     (null)   00:00:00                               0n                     0      0:0 
[patelmn@hpcrnt ~]$ 





Thanks,
Mitul Patel
Comment 10 Douglas Wightman 2019-10-23 12:55:32 MDT
I'm glad things are working.

You've already seen sacct; I would also take a look at sreport for other types of utilization reports.

Now that things are working I will plan on closing this support ticket.
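For anyone following along, a few sreport invocations worth trying. This is a sketch; the date ranges are examples, and report names are case-insensitive:

```
sreport cluster utilization start=2019-10-01 end=2019-10-24
sreport user topusage start=2019-10-01 end=2019-10-24 TopCount=10
sreport job sizesbyaccount start=2019-10-01 end=2019-10-24
```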
Comment 11 Mitul Patel 2019-10-23 15:30:21 MDT
Hi,

When I run sreport, I get a message saying I am not running a supported accounting storage plugin.

Do I need to change the storage type in slurm.conf to "AccountingStorageType=accounting_storage/mysql"?



[root@hpcrnt ~]# sreport
You are not running a supported accounting_storage plugin
(accounting_storage/filetxt).
Only 'accounting_storage/slurmdbd' and 'accounting_storage/mysql' are supported.
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sacctmgr
You are not running a supported accounting_storage plugin
(accounting_storage/filetxt).
Only 'accounting_storage/slurmdbd' and 'accounting_storage/mysql' are supported.
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# grep -i accounting /etc/slurm/slurm.conf
# ACCOUNTING
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStoreJobComment=YES
#AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=edsmysqldb011t.uta.edu
#AccountingStoragePort=2114
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
AccountingStorageType=accounting_storage/filetxt
[root@hpcrnt ~]# 



Thanks,
Mitul Patel
Comment 12 Douglas Wightman 2019-10-23 15:36:59 MDT
AccountingStorageHost should point to the host running slurmdbd (not the host running mysql) and AccountingStorageType should be accounting_storage/slurmdbd. Please let me know if that fixes things.
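In other words, the ACCOUNTING section of slurm.conf should end up looking something like this (a sketch based on the hostnames in this ticket):

```
# ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=hpcrnt.uta.edu   # host running slurmdbd, NOT the mysql host
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
```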
Comment 13 Mitul Patel 2019-10-23 15:47:28 MDT
Hi,

I added "AccountingStoragePort=2114" to slurm.conf.

I am still waiting on sreport output.



[root@hpcrnt ~]# 
[root@hpcrnt ~]# grep -i port /etc/slurm/slurm.conf
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-60500
AccountingStoragePort=2114
[root@hpcrnt ~]# 
[root@hpcrnt ~]# systemctl restart slurmctld
[root@hpcrnt ~]# systemctl restart slurmdbd
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sreport


Thanks,
Mitul Patel
Comment 14 Douglas Wightman 2019-10-23 15:50:54 MDT
Unfortunately that was not the right configuration parameter to change. That parameter should be deleted or commented out (so that the default is used). Port 2114 is for slurmdbd to communicate with mysql (based on your previous comments), NOT for sreport-to-slurmdbd communication.

You will most likely need to delete AccountingStoragePort and change the two parameters I mentioned in my last comment: AccountingStorageType and AccountingStorageHost.
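To summarize the port layout as configured in this ticket (a sketch; 6819 is the slurmdbd default, which also matches the DbdPort shown in the earlier debug output):

```
# client tools / slurmctld  --> port 6819 (DbdPort, default)  --> slurmdbd on hpcrnt.uta.edu
# slurmdbd                  --> port 2114 (StoragePort)       --> mysql on edsmysqldb011t.uta.edu
#
# AccountingStoragePort in slurm.conf, if set, must match DbdPort,
# so leaving it unset is simplest here.
```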
Comment 15 Mitul Patel 2019-10-23 16:02:38 MDT
Hi,

After commenting out "AccountingStoragePort=2114" I get the previous error. I have pasted both files, slurm.conf and slurmdbd.conf, below.
 

sreport: error: slurm_persist_conn_open_without_init: failed to open persistent connection to edsmysqldb011t.uta.edu:6819: Connection timed out
sreport: error: slurmdbd: Sending PersistInit msg: Connection timed out
sreport: fatal: Problem connecting to the database: Connection timed out
=====================================


[root@hpcrnt ~]# 
[root@hpcrnt ~]# grep -i AccountingStorageType /etc/slurm/slurm.conf
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/filetxt
AccountingStorageType=accounting_storage/slurmdbd
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# grep -i AccountingStorageHost /etc/slurm/slurm.conf
AccountingStorageHost=edsmysqldb011t.uta.edu
[root@hpcrnt ~]# 
[root@hpcrnt ~]# grep -i AccountingStoragePort /etc/slurm/slurm.conf
AccountingStoragePort=2114
[root@hpcrnt ~]# 
[root@hpcrnt ~]# vi /etc/slurm/slurm.conf
[root@hpcrnt ~]# 
[root@hpcrnt ~]# grep -i AccountingStoragePort /etc/slurm/slurm.conf
#AccountingStoragePort=2114
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# systemctl restart slurmctld
[root@hpcrnt ~]# systemctl restart slurmctld
[root@hpcrnt ~]# systemctl restart slurmdbd
[root@hpcrnt ~]# 
[root@hpcrnt ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-10-23 16:57:36 CDT; 12s ago
  Process: 113549 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
  Process: 113546 ExecStartPre=/usr/bin/chown -R slurm:slurm /var/run/slurm (code=exited, status=0/SUCCESS)
  Process: 113544 ExecStartPre=/usr/bin/mkdir -m 0750 -p /var/run/slurm (code=exited, status=0/SUCCESS)
 Main PID: 113552 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─113552 /usr/sbin/slurmctld

Oct 23 16:57:36 hpcrnt.uta.edu systemd[1]: Starting Slurm controller daemon...
Oct 23 16:57:36 hpcrnt.uta.edu systemd[1]: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Oct 23 16:57:36 hpcrnt.uta.edu systemd[1]: Started Slurm controller daemon.
[root@hpcrnt ~]# 
[root@hpcrnt ~]# systemctl status slurmdmd
Unit slurmdmd.service could not be found.
[root@hpcrnt ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-10-23 16:57:42 CDT; 17s ago
  Process: 113566 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 113568 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─113568 /usr/sbin/slurmdbd

Oct 23 16:57:42 hpcrnt.uta.edu systemd[1]: Stopped Slurm DBD accounting daemon.
Oct 23 16:57:42 hpcrnt.uta.edu systemd[1]: Starting Slurm DBD accounting daemon...
Oct 23 16:57:42 hpcrnt.uta.edu systemd[1]: PID file /var/run/slurmdbd.pid not readable (yet?) after start.
Oct 23 16:57:42 hpcrnt.uta.edu systemd[1]: Started Slurm DBD accounting daemon.
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sreport
sreport: error: slurm_persist_conn_open_without_init: failed to open persistent connection to edsmysqldb011t.uta.edu:6819: Connection timed out
sreport: error: slurmdbd: Sending PersistInit msg: Connection timed out
sreport: fatal: Problem connecting to the database: Connection timed out
[root@hpcrnt ~]# 
=====================================
slurm.conf


[root@hpcrnt ~]# cat /etc/slurm/slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
#ClusterName=linux
ClusterName=hpcrnt
ControlMachine=hpcrnt
EnforcePartLimits=YES 

#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-60500
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm-states
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/pgid
RebootProgram="/usr/sbin/reboot"
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStoreJobComment=YES
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=edsmysqldb011t.uta.edu
#AccountingStoragePort=2114
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
#AccountingStorageType=accounting_storage/filetxt
AccountingStorageType=accounting_storage/slurmdbd
Epilog=/etc/slurm/slurm.epilog.clean
#NodeName=compute-6-7-0 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f2800 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f2900 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f3001 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
NodeName=cn-2f3002 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3003 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3004 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3005 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3006 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3007 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3008 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3009 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3010 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3011 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3012 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3013 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3014 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3015 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844
#NodeName=cn-2f3016 Sockets=2 CoresPerSocket=22 ThreadsPerCore=1 RealMemory=257844

#PartitionName=normal Nodes=cn-2f3001,cn-2f3002,cn-2f3003,cn-2f3004,cn-2f3005,cn-2f3006,cn-2f3007,cn-2f3008,cn-2f3009,cn-2f3010,cn-2f3011,cn-2f3012,cn-2f3013,cn-2f3014,cn-2f3015,cn-2f3016 Default=YES MaxTime=48:00:00 State=UP
#PartitionName=normal Nodes=cn-2f30[01-16] Default=YES MaxTime=48:00:00 State=UP
#PartitionName=normal Nodes=cn-2f30[01-12] Default=YES MaxTime=48:00:00 State=UP
PartitionName=normal Nodes=cn-2f30[01-02] Default=YES MaxTime=48:00:00 State=UP
PartitionName=gpu Nodes=cn-2f2800,cn-2f2900 MaxTime=48:00:00 State=UP
[root@hpcrnt ~]# 
=====================================
slurmdbd.conf



[root@hpcrnt ~]# cat /etc/slurm/slurmdbd.conf
LogFile=/var/log/slurm/slurmdbd.log
DbdHost=hpcrnt.uta.edu			 # Should be host of machine running slurmdbd.
StoragePort=2114		   	 # DB Port
SlurmUser=slurm				 # Should be user of machine running slurmdbd.
StorageUser=srv_slurm_acct		 # Account of DB server
StorageHost=edsmysqldb011t.uta.edu	 # Name of DB server
StoragePass=............		 # Database password
StorageLoc=slurm_acct_db
DebugLevel=verbose
StorageType=accounting_storage/mysql
[root@hpcrnt ~]# 



Thanks,
Mitul Patel
Comment 16 Douglas Wightman 2019-10-23 16:11:20 MDT
This parameter in your slurm.conf looks like it might be wrong:

AccountingStorageHost=edsmysqldb011t.uta.edu

As I mentioned in comment 12, slurm.conf AccountingStorageHost should point to the host running slurmdbd (hpcrnt.uta.edu).  Could you try changing it to:

AccountingStorageHost=hpcrnt.uta.edu
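Also, since slurmdbd.conf holds the database password (StoragePass) in clear text, the slurmdbd.conf man page recommends making it readable only by the SlurmUser (a general recommendation, not something specific to this system):

```
chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
```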
Comment 17 Mitul Patel 2019-10-23 16:34:49 MDT
Hi,

It looks like it's working. I do not see any reports for jobs that ran previously; I assume that data was lost when slurmdbd was set up. Is there a way to look at it?

I am going to run a couple of jobs and see if they show up in the reports.

Also, is there a way to get a report of all jobs? I tried "sreport -a" and it does not work.
[root@hpcrnt ~]# sreport job
too few arguments for keyword:job
[root@hpcrnt ~]# sreport jobs
invalid keyword: jobs
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sreport
sreport: exit
[root@hpcrnt ~]# sreport -a job SizesByAccount All_Clusters
--------------------------------------------------------------------------------
Job Sizes 2019-10-22T00:00:00 - 2019-10-22T23:59:59 (86400 secs)
Time reported in Minutes
--------------------------------------------------------------------------------
  Cluster   Account     0-49 CPUs   50-249 CPUs  250-499 CPUs  500-999 CPUs  >= 1000 CPUs % of cluster 
--------- --------- ------------- ------------- ------------- ------------- ------------- ------------ 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sreport -a All_Clusters
invalid keyword: All_Clusters
[root@hpcrnt ~]# sreport -a job All_Clusters
Not valid report All_Clusters
Valid job reports are, "SizesByAccount, SizesByAccountAndWcKey, and  SizesByWckey"
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sreport -a jobs All_Clusters
invalid keyword: jobs
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sreport -a cluster All_Clusters
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2019-10-22T00:00:00 - 2019-10-22T23:59:59 (86400 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name     Used   Energy 
--------- --------------- --------- --------------- -------- -------- 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sbatch
^C
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 2-00:00:00      2   idle cn-2f[3001-3002]
gpu          up 2-00:00:00      2   idle cn-2f[2800,2900]
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sacct --format="JobID,user,account,elapsed,Timelimit,MaxRSS,ReqMem,MaxVMSize,ncpus,ExitCode"
       JobID      User    Account    Elapsed  Timelimit     MaxRSS     ReqMem  MaxVMSize      NCPUS ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- 
sacct: error: slurmdbd: Unknown error 1064
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# 
[root@hpcrnt ~]# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
sacct: error: slurmdbd: Unknown error 1064
[root@hpcrnt ~]# 
Thanks,
Mitul Patel
Comment 18 Douglas Wightman 2019-10-23 16:48:00 MDT
If the slurmctld->slurmdbd communication path is now working, slurmdbd will start collecting accounting data; any data from before will not have been collected. You can check the "DBD Agent queue size" reported by sdiag to verify that (see the sdiag man page for more information).

The available reports are outlined in the sreport documentation. I recommend reading these pages to find the report that will work for you:

https://slurm.schedmd.com/sreport.html
https://slurm.schedmd.com/sdiag.html
https://slurm.schedmd.com/sacct.html
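For example, once accounting data is flowing, commands along these lines should start returning data (illustrative invocations; the cluster name and dates are placeholders to adjust):

```
# Verify nothing is stuck in the slurmctld -> slurmdbd queue:
sdiag | grep -i "DBD Agent"

# Job sizes per account over an explicit time window:
sreport job SizesByAccount cluster=hpcrnt start=2019-10-23 end=2019-10-25

# Per-user cluster utilization:
sreport cluster AccountUtilizationByUser start=2019-10-23 end=2019-10-25

# All users' jobs since a given date:
sacct --allusers --starttime=2019-10-23 --format=JobID,User,Account,Elapsed,State,ExitCode
```

Note that sreport defaults to the current day's window, which is why your earlier SizesByAccount output was empty; specifying start= and end= explicitly avoids that.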
Comment 19 Douglas Wightman 2019-10-24 15:29:22 MDT
It appears that accounting via slurmdbd is now set up on your system. If you have any further issues, feel free to open another ticket.