Ticket 9453

Summary: slurmdbd frequent crashes - no errors recorded
Product: Slurm
Reporter: Nathan Elger <elger>
Component: Database
Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---
CC: albert.gil
Version: - Unsupported Older Versions
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5817
          https://bugs.schedmd.com/show_bug.cgi?id=5632
Site: U of South Carolina
Version Fixed: 16.05.08
Attachments: slurmctld log 7-21-2020
slurmdbd log 7-21-2020
slurmctld log 7-21-2020
slurmdbd log 7-29
slurmctld log 7-29
mariadb log

Description Nathan Elger 2020-07-22 12:01:43 MDT
We're dealing with an issue with slurmdbd crashing frequently, sometimes as often as every 15 minutes. This behavior has been going on for about 3 weeks and has been steadily getting worse.

We are using CentOS 7.3 and Slurm 16.05.08, and slurmdbd is running on MariaDB 5.5 as part of a Bright Cluster Manager 7.3 installation. This configuration has been running with very little intervention for almost 4 years. There are about 8 million jobs in the DB, and we usually do around 2k-5k jobs a day.

The headnode has 128G of system RAM and runs the Bright Cluster Manager infrastructure as well as slurmctld and slurmdbd. The typical course of the issue is that slurmdbd slowly (or, recently, quickly) gobbles up all available memory, gets killed, and restarts cleanly. I have also seen recently that the MariaDB tmp space in /var/tmp grows incredibly large; I have seen it as big as 120GB. Several times this has filled up the 200G /var mount, and the headnode crashes and requires a reboot.

I bumped the slurmdbd logging up to debug3, but I haven't seen any errors being logged, just slurmdbd service restart info and generic connection information. We've stood up XDMoD to handle our historical data reporting, so I've recently started purging the DB slowly to try to get down to a year's worth of records, at most. I have also set innodb_buffer_pool_size to ~80% of system RAM as recommended.
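For reference, the arithmetic behind that ~80% setting, using the numbers from this ticket (a rough sketch, not a sizing recommendation):

```shell
# Rough headroom check: 128 GiB total RAM on the headnode vs.
# innodb_buffer_pool_size=100000M from innodb.cnf below. Whatever is
# left over must hold slurmdbd, slurmctld, Bright, and the OS.
total_mib=$((128 * 1024))
pool_mib=100000
headroom_mib=$((total_mib - pool_mib))
echo "headroom: ${headroom_mib} MiB (~$((headroom_mib / 1024)) GiB)"
```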

Here are my password-sanitized my.cnf, my.cnf.d/innodb.cnf, slurm.conf, and slurmdbd.conf. Most of these settings are the default Bright configs.

#
##my.cnf
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
port=3306
user=mysql
key_buffer_size=1500M
delay_key_write=on
myisam_sort_buffer_size=400M
max_allowed_packet=1000M
bulk_insert_buffer_size=256M
# Default to using old password format for compatibility with mysql 3.x
# clients (those using the mysqlclient10 compatibility package).
old_passwords=1
myisam-recover=FORCE
skip-external-locking
table_open_cache=64
sort_buffer_size=512K
net_buffer_length=8K
read_buffer_size=256K
read_rnd_buffer_size=512K
log-bin=mysql-bin
binlog_format=mixed
server-id=1
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd

# some suggestions from wget http://mysqltuner.pl/ -O mysqltuner.pl
query_cache_size=8M
tmp_table_size=16M
max_heap_table_size=16M
thread_cache_size=4
join_buffer_size=128M

[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid
socket=/var/lib/mysql/mysql.sock


[mysqldump]
socket=/var/lib/mysql/mysql.sock
quick
max_allowed_packet=16M


[mysql]
no-auto-rehash


[myisamchk]
key_buffer_size=20M
sort_buffer_size=20M
read_buffer=2M
write_buffer=2M


[mysqlhotcopy]
interactive-timeout


[mysqld_multi]
mysqld=/usr/bin/mysqld_safe
mysqladmin=/usr/bin/mysqladmin
log=/var/log/mysqld_multi.log


#
# include all files from the config directory
#
!includedir /etc/my.cnf.d
 
----------------------------------------------------

#
##innodb.cnf
[mysqld]
innodb_buffer_pool_size=100000M
innodb_lock_wait_timeout=900

----------------------------------------------------

#
# slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
PurgeEventAfter=1000days
PurgeJobAfter=1000days
PurgeResvAfter=1000days
PurgeStepAfter=1000days
PurgeSuspendAfter=1000days
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=DBD_ADDR
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=debug3
#DefaultQOS=normal,standby
LogFile=/var/log/slurmdbd
PidFile=/var/run/slurmdbd.pid
#PluginDir=/cm/shared/apps/slurm/current/lib64:/usr/lib64:/cm/shared/apps/slurm/current/lib64/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
StorageHost=master
#StoragePort=1234
StorageUser=slurm
StorageLoc=slurm_acct_db

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
DbdHost=hn1-hyperion
# END AUTOGENERATED SECTION   -- DO NOT REMOVE

----------------------------------------------------

#
# See the slurm.conf man page for more information.
#

ClusterName=SLURM_CLUSTER
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd

#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log

#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
AccountingStorageEnforce=limits
#JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Scheduler
SchedulerType=sched/backfill
# Master nodes
ControlMachine=hn1-hyperion
ControlAddr=hn1-hyperion
AccountingStorageHost=hn1-hyperion
# Nodes
NodeName=node[003-010]  CoresPerSocket=10 Sockets=4
NodeName=node[363-382]  CoresPerSocket=12 Sockets=4 Gres=gpu:2
NodeName=hn1-thoth,node[017-238]  CoresPerSocket=14 Sockets=2
NodeName=node[011-016,241,242]  CoresPerSocket=14 Sockets=2 Gres=gpu:2
NodeName=ultra  CoresPerSocket=22 Sockets=24 NodeAddr=10.16.255.249
NodeName=node[243-362]  CoresPerSocket=24 Sockets=2
NodeName=node[383-406]  CoresPerSocket=24 Sockets=2 Gres=gpu:2
NodeName=minksy1  CoresPerSocket=8 Sockets=2 Gres=gpu:4
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=GANG,SUSPEND ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[026-172,174-182,184,187-189,200,209-213,219,220]
PartitionName=business Default=NO MinNodes=1 AllowGroups=business DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[219,220]
PartitionName=BigMem Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=GANG,SUSPEND ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[003-010]
PartitionName=soph Default=NO MinNodes=1 AllowGroups=soph DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[217,218]
PartitionName=neset-lab Default=NO MinNodes=1 AllowGroups=neset-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node210
PartitionName=gpu Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[011-013,015,241,242]
PartitionName=varshna-lab Default=NO MinNodes=1 AllowGroups=varshna-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node209
PartitionName=vatch-lab Default=NO MinNodes=1 AllowGroups=vatch-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[014,016]
PartitionName=farouk-lab Default=NO MinNodes=1 AllowGroups=farouk-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node211
PartitionName=heyden-lab Default=NO MinNodes=1 AllowGroups=heyden-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[190-199,221-238]
PartitionName=scopatz-lab Default=NO MinNodes=1 AllowGroups=scopatz-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[007-010]
PartitionName=hoy-lab Default=NO MinNodes=1 AllowGroups=hoy-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[184,212,213]
PartitionName=no-kill Default=NO MinNodes=1 AllowGroups=nokill DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=hn1-thoth,node[017-025]
PartitionName=flora-lab Default=NO MinNodes=1 AllowGroups=flora-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[214-216]
PartitionName=jerlab-gpu Default=NO MinNodes=1 AllowGroups=jer-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node015
PartitionName=jerlab Default=NO MinNodes=1 AllowGroups=jer-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=GANG,SUSPEND ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[183,185,186,201-208]
PartitionName=microbiome Default=NO MinNodes=1 AllowGroups=microbiome DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node200
PartitionName=test Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO
PartitionName=msmoms Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=ultra
PartitionName=ddlab Default=NO MinNodes=1 AllowGroups=ddlab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node026
PartitionName=vendemia-lab Default=NO MinNodes=1 AllowGroups=vendemia-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[187-189]
PartitionName=gcai-lab Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node178
PartitionName=minsky Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=minksy1
PartitionName=ziehl-lab Default=NO MinNodes=1 AllowGroups=ziehl-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node173
PartitionName=defq-48core Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[281,321-362]
PartitionName=gpu-v100-16gb Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[384-386,389-406]
PartitionName=gpu-v100-32gb Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[363-380]
PartitionName=jerlab-48core Default=NO MinNodes=1 AllowGroups=jer-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[243,279,280,282-286,311]
PartitionName=hulab-48core Default=NO MinNodes=1 AllowGroups=hu-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[254-278]
PartitionName=caicedolab-48core Default=NO MinNodes=1 AllowGroups=caicedo-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[245,317]
PartitionName=besmannlab-48core Default=NO MinNodes=1 AllowGroups=besmann-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[246-253]
PartitionName=grosslab-48core Default=NO MinNodes=1 AllowGroups=gross-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node244
PartitionName=heydenlab-48core Default=NO MinNodes=1 AllowGroups=heyden-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[287-306]
PartitionName=qiwang-gpu Default=NO MinNodes=1 AllowGroups=qiwang-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node383
PartitionName=hannolab-48core Default=NO MinNodes=1 AllowGroups=hanno-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[307-310]
PartitionName=nagarkatti-lab Default=NO MinNodes=1 AllowGroups=nagarkatti-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node172
PartitionName=ge-lab Default=NO MinNodes=1 AllowGroups=gelab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[312-316]
PartitionName=AI_Center Default=NO MinNodes=1 AllowGroups=aic DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[381,382]
PartitionName=floralab-48core Default=NO MinNodes=1 AllowGroups=flora-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node320
PartitionName=yiwang-48core Default=NO MinNodes=1 AllowGroups=yiwang-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[318,319]
PartitionName=jamshidi-lab Default=NO MinNodes=1 AllowGroups=jamshidi-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[387,388]
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=0
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
# END AUTOGENERATED SECTION   -- DO NOT REMOVE

SelectType=select/cons_res
SelectTypeParameters=CR_Core


PreemptMode=GANG,SUSPEND
PreemptType=preempt/partition_prio

MaxArraySize=2000
FirstJobId=4600000
#NodeName=ultra NodeAddr=10.16.255.249 CoresPerSocket=22 Sockets=24
#PartitionName=msmoms Default=NO MinNodes=1 AllowGroups=all DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1000 PriorityTier=1000 OverSubscribe=NO State=UP Nodes=ultra
Comment 1 Marshall Garey 2020-07-22 17:51:43 MDT
Hi Nathan,

I'm looking at this issue. I'm still thinking about the problem and may ask for more information, but let's start with this:


Even though you said the slurmdbd log doesn't have errors, can you upload about an hour of the slurmdbd log file which includes a period of time where the slurmdbd got killed, restarted, and got killed again? Can you also upload the slurmctld log during that same time frame (so I can get some more context about what's going on)?


Is anybody running sacct queries that request lots of jobs? Have you seen any sacct queries fail?

The way Slurm implements List structures, when memory is allocated for a list and later freed, Slurm may not return the memory to the OS; instead it keeps a pointer to it, essentially caching the memory for later use. This means that the memory footprint of slurmdbd (or slurmctld, for that matter) can grow whenever a large query requires a list bigger than any of the currently cached lists. This is very old code and was found to be unnecessary and even slower on modern systems, so it was removed in commit fc952769, which is in Slurm 20.02. This could be one likely explanation for the growing memory footprint of slurmdbd.

Another potential issue: Slurm limits the amount of data it can send back to a client (like sacct) to 3 GB. However, when someone requests more data than this with an sacct query, slurmdbd still loads all of that data into memory with a mysql query. It then checks the size of the data, sees that it's too big, and sends an error to the client (sacct). But that memory might not be freed to the OS; it may remain cached in slurmdbd's memory for later use. This is described in bug 5817, which is a public bug you can view. This particular problem was experienced by another site in bug 5632. The workaround was to prevent users from doing overly aggressive queries with sacct.
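One way to apply that workaround is to bound every sacct call with a start/end window and walk through history in chunks. The sketch below only prints the bounded commands (the sacct invocation itself needs a live cluster, and the output field list is illustrative):

```shell
# Emit month-bounded sacct commands for the first half of 2020 instead
# of one unbounded query; each window keeps the mysql result set small.
cmds=$(for m in 01 02 03 04 05 06; do
    next=$(printf '%02d' $((10#$m + 1)))
    echo "sacct -a -X -S 2020-${m}-01 -E 2020-${next}-01 -o jobid,state,elapsed"
done)
printf '%s\n' "$cmds"
```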



Can you also run:

sacctmgr show runaway

If it reports runaway jobs and asks whether you would like to fix them, type N for no, but upload the output. I'd like to see what's going on first before fixing any runaway jobs.

I also wonder if 80% of RAM for innodb_buffer_pool_size is too much - whether the remaining RAM on the server is sufficient for slurmdbd, slurmctld, Bright, the OS, and anything else going on. Don't worry about changing this for now, but it's something we might consider depending on the diagnosis of the problem.



Slurm 16.05 is very old and isn't technically supported anymore. I understand your site has a temporary support contract, so I'll do my best to help diagnose and fix (or find a good workaround for) this issue, but I won't be able to provide any patches for 16.05.
Comment 4 Nathan Elger 2020-07-23 12:59:49 MDT
Created attachment 15149 [details]
slurmctld log 7-21-2020
Comment 5 Nathan Elger 2020-07-23 13:00:19 MDT
Created attachment 15150 [details]
slurmdbd log 7-21-2020
Comment 6 Marshall Garey 2020-07-27 11:16:28 MDT
Hey Nathan,

I'm looking through the slurmdbd log file you uploaded. I noticed that the slurmctld log file is empty (0 bytes) - can you try getting that again and uploading it? Are you able to answer my other questions from comment 1?

> Is anybody running sacct queries that request lots of jobs? Have you seen any sacct queries fail?
> Can you also run sacctmgr show runaway? (and press N for no)

I'll ask more things once I'm done studying the slurmdbd log file.
Comment 7 Nathan Elger 2020-07-27 11:28:23 MDT
Created attachment 15177 [details]
slurmctld log 7-21-2020
Comment 8 Nathan Elger 2020-07-27 11:45:56 MDT
That's strange, I see it as the correct size:

slurmctld log 7-21-2020 (4.78 MB, text/plain)
2020-07-23 12:59 MDT, Nathan Elger 

...and my comment is gone/never posted as well. I have uploaded the file again and checked to make sure it is complete.

"Is anybody running sacct queries that request lots of jobs? Have you seen any sacct queries fail?"

No, I don't see any users running sacct for any slurm info. The only notable use of sacct that we use is running xdmod's shredder once a week to gather data.

show runaway still shows 0 jobs. I cleared the runaways a few weeks ago when the problem first manifested; it was ~50 jobs. I did not actually set the innodb buffer size option until a few days before I submitted this bug. After setting it, it seemed like the slurmdbd crashes increased in frequency, but the large writes to /var/tmp/systemdxxx-mariadb/tmp actually slowed down and stopped growing larger than 1-2GB. In fact, after a day or two the slurmdbd service stopped crashing entirely. At the time of this comment, the service has not crashed in 5 days. Prior to this, it had been crashing reliably dozens of times a day for over a month. The only changes I had made since the last headnode crash/reboot were setting innodb_buffer_pool_size to 100G and stepping the slurmdbd purge settings from 1200 down to 1000 days, as shown in my posted slurmdbd.conf.
Comment 9 Nathan Elger 2020-07-27 11:48:10 MDT
Is there a good slurm-side way to track sacct usage by users other than combing histories and submit scripts?
Comment 11 Marshall Garey 2020-07-27 15:13:42 MDT
RE the 0-size log file: I think I copied it to another location before it finished downloading, so my copy didn't have any data in it; your original upload wasn't empty. User error on my part.


> How to see usage of sacct other than combing histories and submit scripts?
There is the command sacctmgr show stats, but it doesn't exist in 16.05; it only exists in 17.02 and up. It's very much like the command sdiag: sdiag shows stats for slurmctld, while sacctmgr show stats shows stats for slurmdbd. You'd be interested in the bottom section, which shows RPC stats by message type and by user (though not which user issued which RPCs). Sample output from my command in 20.02:

Remote Procedure Call statistics by message type
        SLURM_PERSIST_INIT       ( 6500) count:8      ave_time:1054   total_time:8438
        DBD_FINI                 ( 1401) count:7      ave_time:621    total_time:4348
        DBD_GET_FEDERATIONS      ( 1494) count:5      ave_time:864    total_time:4321
        DBD_REGISTER_CTLD        ( 1434) count:3      ave_time:28434  total_time:85303
        DBD_GET_ASSOCS           ( 1410) count:3      ave_time:7087   total_time:21263
        DBD_GET_RES              ( 1478) count:3      ave_time:1119   total_time:3357
        DBD_GET_USERS            ( 1415) count:3      ave_time:575    total_time:1727
        DBD_CLUSTER_TRES         ( 1407) count:3      ave_time:516    total_time:1548
        DBD_GET_TRES             ( 1486) count:3      ave_time:501    total_time:1505
        DBD_GET_QOS              ( 1448) count:3      ave_time:451    total_time:1354
        DBD_GET_JOBS_COND        ( 1444) count:2      ave_time:3462   total_time:6925
        DBD_GET_STATS            ( 1489) count:2      ave_time:94     total_time:188

Remote Procedure Call statistics by user
        marshall            (      1000) count:45     ave_time:3117   total_time:140277


Unfortunately this isn't available in 16.05 so you'd have to upgrade to get it.

sacct is fine to run and shouldn't be problematic even when run fairly often, but it does cause problems when requesting large amounts of data that result in large mysql queries (as I explained).


> I did not actually set the innodb buffer size option until a few days before I submitted this bug.
> After setting it, it seemed like the slurmdb crashes increased in frequency, but the large writes to
> /var/tmp/systemdxxx-mariadb/tmp actually slowed down and stopped growing larger than 1-2GB. In fact,
> after a day or two the slurmdb service stopped crashing entirely. At the time of this comment, the
> service has not crashed in 5 days.

Since you haven't seen any slurmdbd crashes since increasing innodb_buffer_pool_size, I assume that fixed whatever problems you were seeing. Let's keep this bug open for a bit so you can monitor slurmdbd. If it crashes again, I'd be interested in the slurmdbd log file again and also the output of SHOW ENGINE INNODB STATUS. You might also look at our accounting web page (https://slurm.schedmd.com/accounting.html). In addition to increasing innodb_buffer_pool_size and innodb_lock_wait_timeout (which you've already done), we also recommend increasing innodb_log_file_size.
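For reference, those three settings map to a my.cnf fragment like the following (the sizes shown are illustrative placeholders, not recommendations for this particular host; size them per the accounting page):

```ini
[mysqld]
# Illustrative values only; tune for your host's RAM and workload.
innodb_buffer_pool_size=100000M
innodb_lock_wait_timeout=900
# Also recommended on the Slurm accounting page:
innodb_log_file_size=64M
```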





RE the log files - I didn't see anything very relevant to this particular bug. Here are my thoughts on what I saw anyway:

I didn't see anything too interesting in the slurmctld log file besides the abundance of this set of errors:

[2020-07-21T10:17:31.786] error: slurm_receive_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential 
[2020-07-21T10:17:31.786] error: slurm_receive_msg: Protocol authentication error
[2020-07-21T10:17:31.796] error: slurm_receive_msg [10.16.2.34:33698]: Protocol authentication error
[2020-07-21T10:17:31.796] error: invalid type trying to be freed 65534
[2020-07-21T10:17:32.807] error: Munge decode failed: Invalid credential

That looks like some of your nodes are having an issue with munge (MESSAGE_NODE_REGISTRATION_STATUS is a node "registering" with slurmctld). That would be good to look into, but it doesn't have anything to do with the database. It also looks like the slurmctld log level is at info (3), so there wasn't much to go on.


slurmdbd log file:

* 12 of these at the beginning of the log file (looks like after a restart of slurmdbd) and one each hour (at rollup):
[2020-07-21T09:52:26.711] error: id_assoc 321 doesn't have any tres

I've seen these errors before and I know we've done some fixes where this error has shown up, so it's possible whatever caused this has been fixed in a newer Slurm version.


* Invalid permissions for log file directory:

[2020-07-21T10:17:23.141] error: chdir(/var/log): Permission denied
[2020-07-21T10:17:23.141] chdir to /var/tmp

Make sure that slurmdbd has the correct permissions for the file specified by LogFile in slurmdbd.conf.



* There are users in the database where slurmdbd can't find UID's:

[2020-07-21T10:17:23.248] debug:  post user: couldn't get a uid for user guoshuai
[2020-07-21T10:17:23.253] debug:  post user: couldn't get a uid for user hulb
[2020-07-21T10:17:23.261] debug:  post user: couldn't get a uid for user jturner

It might be good to delete those users with sacctmgr del user if they aren't active users anymore.
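If those accounts are indeed inactive, the cleanup can be scripted. This sketch only prints the sacctmgr commands rather than running them; deletion is destructive, so run them manually after verifying:

```shell
# Print (not run) one delete command per stale user from the log.
# The -i flag makes sacctmgr skip its interactive confirmation
# when you do run them.
for u in guoshuai hulb jturner; do
    echo "sacctmgr -i delete user name=${u}"
done
```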


* These happen once per minute on the minute:

[2020-07-21T09:54:00.877] debug:  DBD_INIT: CLUSTER:slurm_cluster VERSION:7680 UID:0 IP:10.16.255.254 CONN:13

This indicates something connecting to slurmdbd. Is there some cron job doing a query (running sacct, sacctmgr, or sreport) every minute? If so, what is it doing? If not, then I guess it's something within slurmctld in 16.05. I didn't see slurmctld connecting to slurmdbd every minute anywhere, but then again, that RPC hasn't been used since 16.05 (it was replaced).
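A self-contained way to confirm that cadence from a log excerpt (the second sample line below is hypothetical, one minute after the real one, just to exercise the pipeline):

```shell
# Bucket DBD_INIT lines by minute; a steady count of 1 per minute
# points at a cron job or monitoring agent rather than user activity.
cat > /tmp/dbd_init_sample.log <<'EOF'
[2020-07-21T09:54:00.877] debug:  DBD_INIT: CLUSTER:slurm_cluster VERSION:7680 UID:0 IP:10.16.255.254 CONN:13
[2020-07-21T09:55:00.881] debug:  DBD_INIT: CLUSTER:slurm_cluster VERSION:7680 UID:0 IP:10.16.255.254 CONN:14
EOF
grep 'DBD_INIT' /tmp/dbd_init_sample.log | cut -c2-17 | sort | uniq -c
```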
Comment 12 Nathan Elger 2020-07-28 13:27:09 MDT
Marshall,

Thanks for the update and the advice. I will continue to monitor for slurmdbd crashes and update this bug if they crop up again.

"I didn't see anything too interesting in the slurmctld log file besides the abundance of this set of errors:

[2020-07-21T10:17:31.786] error: slurm_receive_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential"

Yeah, I recently upgraded BCM and had to wait to reboot a few nodes, which made them complain about the munge credential. I have fixed these since the log upload.


"slurmdbd log file:

* 12 of these at the beginning of the log file (looks like after a restart of slurmdbd) and one each hour (at rollup):
[2020-07-21T09:52:26.711] error: id_assoc 321 doesn't have any tres

I've seen these errors before and I know we've done some fixes where this error has shown up, so it's possible whatever caused this has been fixed in a newer Slurm version."

Yeah, I have always seen this error. I've never been able to figure out the source, but it's never seemed to cause any problems, so I just left it alone.


"* Invalid permissions for log file directory:

[2020-07-21T10:17:23.141] error: chdir(/var/log): Permission denied
[2020-07-21T10:17:23.141] chdir to /var/tmp

Make sure that slurmdbd has the correct permissions for the file specified by LogFile in slurmdbd.conf."

Fixed


"* There are users in the database where slurmdbd can't find UID's:

[2020-07-21T10:17:23.248] debug:  post user: couldn't get a uid for user guoshuai
[2020-07-21T10:17:23.253] debug:  post user: couldn't get a uid for user hulb
[2020-07-21T10:17:23.261] debug:  post user: couldn't get a uid for user jturner

It might be good to delete those users with sacctmgr del user if they aren't active users anymore."

Fixed; the del user command took a very long time to complete (2+ minutes).

"* These happen once per minute on the minute:

[2020-07-21T09:54:00.877] debug:  DBD_INIT: CLUSTER:slurm_cluster VERSION:7680 UID:0 IP:10.16.255.254 CONN:13

This indicates something connecting to slurmdbd. Is there some cron job doing a query (running sacct, sacctmgr, or sreport) every minute? If so, what is it doing? If not, then I guess it's something within slurmctld in 16.05. I didn't see slurmctld connecting to slurmdbd every minute anywhere, but then again, that RPC hasn't been used since 16.05 (it was replaced)."

There is nothing that should be running that command with such regularity, unless it is something in BCM 7.3 that is monitoring the scheduler. I will open a ticket with them to check whether this is expected.

I know we need to look into updating Slurm; I just feel it will be tricky since it is built into this Bright Cluster Manager deployment. I read that you can run slurmdbd up to 2 releases ahead of the Slurm controller. If that is true, I will try to update the DB daemon to the latest 18.x release. I also read that it is problematic to attempt the database migration on MariaDB 5.5. Do you recommend that we update to MariaDB 10 first?

Thanks again for your help
Comment 13 Nathan Elger 2020-07-29 09:30:45 MDT
Created attachment 15219 [details]
slurmdbd log 7-29
Comment 14 Nathan Elger 2020-07-29 09:33:33 MDT
Created attachment 15220 [details]
slurmctld log 7-29
Comment 15 Nathan Elger 2020-07-29 09:43:12 MDT
Hi Marshall,

The crash has cropped up again, along with the huge mariadb directory in /var/tmp:

[root@hn1-hyperion ~]# du -hs /var/tmp/*
0       /var/tmp/abrt
0       /var/tmp/lockINTEL
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-colord.service-MyKJRd
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-httpd.service-FaeVn8
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-named.service-sSxR6v
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-ntpd.service-aZ2kcV
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-rtkit-daemon.service-qxag1p
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-colord.service-31tvnD
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-httpd.service-a2Ww7T
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-mariadb.service-1p6QSm
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-named.service-ymF7BB
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-ntpd.service-17zESF
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-rtkit-daemon.service-9Vg3t8
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-colord.service-3lldCn
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-httpd.service-8EcJwr
106G    /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-mariadb.service-SBzvv6
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-named.service-eXG5jQ
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-ntpd.service-Eui6Ap
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-rtkit-daemon.service-UGIMkA
20K     /var/tmp/yum-root-oRMbnC

I bumped the slurmctld logging up to debug3 and captured a crash in the attached logs. The crash occurred at 11:12am. Please let me know if you need any further info.

I have been crawling user submit scripts and histories to see if anyone has been making use of sacct or sacctmgr, but so far I haven't found anything obvious. I have several users that use bulk submit scripts with some logic that queries squeue to get job info. Could a large number of rapid squeue or sinfo calls cause this problem? Some of these scripts have been in use for a long time and I've never seen any problems with them.
Comment 16 Marshall Garey 2020-07-30 16:24:28 MDT
I'll look at your logs and get back to you. Has the slurmdbd been continuing to crash regularly like it was before, or did it only crash once?

squeue and sinfo don't communicate with slurmdbd at all, so they shouldn't affect slurmdbd. However, job submissions do cause slurmctld to send RPCs to slurmdbd to have slurmdbd insert a new row into the job table. When the job runs, slurmdbd inserts additional rows into the step table for each job step (srun) in the job. Normally I wouldn't expect job submissions to cause a huge spike in database memory usage. However, a couple of things about this situation make me suspicious.


(1) Since sacctmgr del user took 2+ minutes to complete (which results in a relatively simple set of queries), it's apparent that even single queries are slow. If a few queries are that slow, then many job submissions could cause big issues. Reducing the size of the database further might help - if possible, I'd recommend purging more jobs, though I don't know your site's policy on database records. Another thing we have done to make the database faster is add more indexes; some indexes have been added since Slurm 16.05, and I'm not sure which indexes exist in 16.05.



(2) Since slurmctld and slurmdbd are on the same node, memory/CPU usage in one will affect the other. So I also wonder if CPU and memory usage in slurmctld spiked and caused issues in slurmdbd. Having a loop that continuously calls squeue and sinfo without any sleep calls in between will definitely be problematic for slurmctld, since it generates tons of RPCs. We strongly recommend against that approach. If users need to poll slurmctld, they ought to space their queries at least a second apart, perhaps more.
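For instance, a user polling loop with a sleep in it is fine; the same loop without the sleep is not (this is a sketch, not taken from your users' actual scripts, and $jobid is a placeholder):

```shell
# Wait for a job to leave the queue, pausing between queries
# so slurmctld isn't flooded with RPCs.
while squeue -j "$jobid" --noheader | grep -q .; do
    sleep 30
done
```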

Could you run the command sdiag and upload the output next time this happens? I'll need sdiag before restarting slurmctld since the stats are reset on restarting slurmctld. (The stats also reset every day at midnight.)

Do your users know about job arrays? A job array is a way to submit multiple jobs with a single command, which results in a single RPC to slurmctld and is much faster than submitting many jobs in parallel. Job arrays are ideal for a workload running the same program many times but with different parameters.

https://slurm.schedmd.com/sbatch.html#OPT_array
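For example, instead of a loop that calls sbatch 100 times, the whole set can be submitted once with a single script (the script contents and file naming here are hypothetical):

```shell
#!/bin/bash
#SBATCH --array=0-99              # 100 tasks submitted as one RPC to slurmctld
#SBATCH --job-name=bulk_example
#SBATCH --output=bulk_%A_%a.out   # %A = array job id, %a = array task id

# Each array task picks its own input via SLURM_ARRAY_TASK_ID
# (program and file names are made up for illustration)
./my_program "input_${SLURM_ARRAY_TASK_ID}.dat"
```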

If you're willing to risk another slurmdbd crash, could you (or one of your users) run a bulk submit script similar to what your users do and monitor the slurmctld/slurmdbd node and slurmctld and slurmdbd processes? I'd also like the output of sdiag run once every 5 minutes during this experiment.
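Something like the following loop would cover the 5-minute sampling (the log path is just an example; run it under nohup, screen, or tmux so it survives your session):

```shell
# Append a timestamped sdiag snapshot every 5 minutes
while true; do
    {
        echo "===== $(date) ====="
        sdiag
    } >> /root/sdiag-monitor.log
    sleep 300
done
```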





Now responding to your previous comment:

> Fixed, the del user command took a very long time to complete, 2+ minutes. 

Normally this shouldn't take that long. I'm guessing that because the database is so large, even simple mysql queries are taking a while. Also, what kind of hardware and filesystem is the database on? (SSD or HDD, local vs shared filesystem)



> I know we need to look into updating slurm, I just feel like it will be tricky
> as it is built into this bright cluster manager deployment. I read that you can
> update the slurmdb up to 2 releases ahead of the slurm controller. If that is
> true, I will try to update the db to the latest 18 release. I also read that it
> is problematic to attempt the database migration on mariadb 5.5. Do you
> recommend that we update to mariadb 10 first?

Yes, we recommend upgrading to MariaDB 10 first.

Also yes, you can update by 2 major releases at a time. However, you can't upgrade directly to 18.08. The major releases are numbered year.month (like Ubuntu) and are on a 9-month release cycle. You are currently on 16.05; the next major release is 17.02, then 17.11, then 18.08 - and 17.02 and 17.11 are separate major releases. So 18.08 is actually 3 major releases above 16.05, and you may lose data if you try to upgrade to it directly. Our current supported releases are 19.05 and 20.02. To get to 19.05, you'd first have to upgrade to 17.11 and then could upgrade to 19.05.

Also, the slurmdbd must be upgraded first, then the slurmctld, then slurmd and slurmstepd, then all client commands. Or, everything can be upgraded at once. All binaries must be within 2 major releases of each other, so if you wanted to upgrade to 19.05, you would have to upgrade everything to 17.11 first, then could upgrade everything to 19.05.

Definitely create backups of the database and the slurmctld state files (which reside in "StateSaveLocation" which is defined in slurm.conf) before upgrading.
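A rough sketch of those backups, assuming the default accounting database name (slurm_acct_db) and an example StateSaveLocation - adjust both to match your slurmdbd.conf and slurm.conf:

```shell
# Dump the accounting database without locking it for the whole dump
# (database name is the default; verify yours in slurmdbd.conf)
mysqldump --single-transaction slurm_acct_db > /root/slurm_acct_db.sql

# Archive the slurmctld state files
# (path is an example; use your StateSaveLocation from slurm.conf)
tar czf /root/slurmctld-state.tar.gz /var/spool/slurmctld
```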


This is documented on our upgrading guide - I recommend reading it before you plan an upgrade.

https://slurm.schedmd.com/quickstart_admin.html#upgrade

The rest of that web page (quickstart_admin.html) may have useful information for you, too.

If/when you do plan an upgrade, I'd recommend upgrading all the way to 19.05 or 20.02.
Comment 17 Nathan Elger 2020-07-31 08:18:41 MDT
Hi Marshall,

We're back to the constant crashes, every hour or so right now. Here is the sdiag output right before and right after a crash:

[root@hn1-hyperion ~]# free -hlm
              total        used        free      shared  buff/cache   available
Mem:           125G        116G        712M         21M        8.7G        8.0G
Low:           125G        124G        712M
High:            0B          0B          0B
Swap:           15G         14G        1.7G
[root@hn1-hyperion ~]# sdiag
*******************************************************
sdiag output at Fri Jul 31 09:54:49 2020
Data since      Fri Jul 31 07:51:32 2020
*******************************************************
Server thread count: 3
Agent queue size:    0
 
Jobs submitted: 91
Jobs started:   77
Jobs completed: 59
Jobs canceled:  4
Jobs failed:    0
 
Main schedule statistics (microseconds):
        Last cycle:   3434
        Max cycle:    318046
        Total cycles: 209
        Mean cycle:   5056
        Mean depth cycle:  128
        Cycles per minute: 1
        Last queue length: 170
 
Backfilling stats
        Total backfilled jobs (since last slurm start): 40
        Total backfilled jobs (since last stats cycle start): 40
        Total cycles: 206
        Last cycle when: Fri Jul 31 09:54:39 2020
        Last cycle: 1265995
        Max cycle:  3962871
        Mean cycle: 2262471
        Last depth cycle: 170
        Last depth cycle (try sched): 170
        Depth Mean: 144
        Depth Mean (try depth): 144
        Last queue length: 170
        Queue length mean: 147
 
Remote Procedure Call statistics by message type
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:1892   ave_time:209653 total_time:396664632
        REQUEST_PING                            ( 1008) count:254    ave_time:666    total_time:169329
        REQUEST_PARTITION_INFO                  ( 2009) count:211    ave_time:3043   total_time:642105
        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:133    ave_time:29134  total_time:3874919
        REQUEST_NODE_INFO                       ( 2007) count:132    ave_time:8891   total_time:1173707
        REQUEST_JOB_INFO                        ( 2003) count:127    ave_time:26634  total_time:3382531
        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:80     ave_time:27204  total_time:2176324
        REQUEST_JOB_USER_INFO                   ( 2039) count:69     ave_time:34717  total_time:2395527
        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:63     ave_time:60636  total_time:3820129
        REQUEST_JOB_ALLOCATION_INFO_LITE        ( 4016) count:55     ave_time:71047  total_time:3907585
        REQUEST_JOB_STEP_CREATE                 ( 5001) count:55     ave_time:2163   total_time:119010
        REQUEST_STEP_COMPLETE                   ( 5016) count:50     ave_time:56030  total_time:2801504
        REQUEST_KILL_JOB                        ( 5032) count:4      ave_time:2171   total_time:8684
        REQUEST_CANCEL_JOB_STEP                 ( 5005) count:4      ave_time:860    total_time:3442
        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:2      ave_time:558    total_time:1116
        ACCOUNTING_REGISTER_CTLD                (10003) count:1      ave_time:169572 total_time:169572
        REQUEST_STATS_INFO                      ( 2035) count:0      ave_time:0      total_time:0
 
Remote Procedure Call statistics by user
 
        root            (       0) count:2775   ave_time:148291 total_time:411509098
        zhonghua        (    1343) count:85     ave_time:15131  total_time:1286167
        kyuan           (    1331) count:80     ave_time:6255   total_time:500427
        dsahsah         (    1233) count:53     ave_time:32806  total_time:1738750
        ammals          (    1007) count:33     ave_time:52698  total_time:1739064
        adr1            (    1193) count:24     ave_time:7565   total_time:181573
        rajbansh        (    1111) count:20     ave_time:146034 total_time:2920692
        klepov          (    1376) count:14     ave_time:16930  total_time:237025
        yh5             (    1236) count:12     ave_time:45469  total_time:545632
        jwk             (    1098) count:10     ave_time:1175   total_time:11752
        bh25            (    1398) count:8      ave_time:20134  total_time:161075
        lin9            (    1418) count:4      ave_time:14759  total_time:59039
        doranrm         (    1081) count:4      ave_time:15512  total_time:62049
        jr23            (    1362) count:3      ave_time:21321  total_time:63964
        richards        (    1058) count:2      ave_time:15539  total_time:31078
        jshu            (    1326) count:2      ave_time:14347  total_time:28694
        mclauchc        (    1249) count:2      ave_time:32232  total_time:64465
        slurm           (     450) count:1      ave_time:169572 total_time:169572
[root@hn1-hyperion ~]#

[root@hn1-hyperion ~]# sdiag
*******************************************************
sdiag output at Fri Jul 31 09:57:09 2020
Data since      Fri Jul 31 07:51:32 2020
*******************************************************
Server thread count: 3
Agent queue size:    0
 
Jobs submitted: 91
Jobs started:   77
Jobs completed: 59
Jobs canceled:  4
Jobs failed:    0
 
Main schedule statistics (microseconds):
        Last cycle:   36010
        Max cycle:    318046
        Total cycles: 211
        Mean cycle:   5195
        Mean depth cycle:  129
        Cycles per minute: 1
        Last queue length: 170
 
Backfilling stats
        Total backfilled jobs (since last slurm start): 40
        Total backfilled jobs (since last stats cycle start): 40
        Total cycles: 209
        Last cycle when: Fri Jul 31 09:56:56 2020
        Last cycle: 1388147
        Max cycle:  3962871
        Mean cycle: 2248662
        Last depth cycle: 170
        Last depth cycle (try sched): 170
        Depth Mean: 144
        Depth Mean (try depth): 144
        Last queue length: 170
        Queue length mean: 148
 
Remote Procedure Call statistics by message type
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:1892   ave_time:209653 total_time:396664632
        REQUEST_PING                            ( 1008) count:260    ave_time:790    total_time:205602
        REQUEST_PARTITION_INFO                  ( 2009) count:214    ave_time:3003   total_time:642731
        REQUEST_NODE_INFO                       ( 2007) count:135    ave_time:8867   total_time:1197055
        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:133    ave_time:29134  total_time:3874919
        REQUEST_JOB_INFO                        ( 2003) count:130    ave_time:26996  total_time:3509554
        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:80     ave_time:27204  total_time:2176324
        REQUEST_JOB_USER_INFO                   ( 2039) count:69     ave_time:34717  total_time:2395527
        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:63     ave_time:60636  total_time:3820129
        REQUEST_JOB_ALLOCATION_INFO_LITE        ( 4016) count:55     ave_time:71047  total_time:3907585
        REQUEST_JOB_STEP_CREATE                 ( 5001) count:55     ave_time:2163   total_time:119010
        REQUEST_STEP_COMPLETE                   ( 5016) count:50     ave_time:56030  total_time:2801504
        REQUEST_KILL_JOB                        ( 5032) count:4      ave_time:2171   total_time:8684
        REQUEST_CANCEL_JOB_STEP                 ( 5005) count:4      ave_time:860    total_time:3442
        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:2      ave_time:558    total_time:1116
        ACCOUNTING_REGISTER_CTLD                (10003) count:1      ave_time:169572 total_time:169572
        REQUEST_STATS_INFO                      ( 2035) count:1      ave_time:307    total_time:307
 
Remote Procedure Call statistics by user
 
        root            (       0) count:2791   ave_time:147508 total_time:411696675
        zhonghua        (    1343) count:85     ave_time:15131  total_time:1286167
        kyuan           (    1331) count:80     ave_time:6255   total_time:500427
        dsahsah         (    1233) count:53     ave_time:32806  total_time:1738750
        ammals          (    1007) count:33     ave_time:52698  total_time:1739064
        adr1            (    1193) count:24     ave_time:7565   total_time:181573
        rajbansh        (    1111) count:20     ave_time:146034 total_time:2920692
        klepov          (    1376) count:14     ave_time:16930  total_time:237025
        yh5             (    1236) count:12     ave_time:45469  total_time:545632
        jwk             (    1098) count:10     ave_time:1175   total_time:11752
        bh25            (    1398) count:8      ave_time:20134  total_time:161075
        lin9            (    1418) count:4      ave_time:14759  total_time:59039
        doranrm         (    1081) count:4      ave_time:15512  total_time:62049
        jr23            (    1362) count:3      ave_time:21321  total_time:63964
        richards        (    1058) count:2      ave_time:15539  total_time:31078
        jshu            (    1326) count:2      ave_time:14347  total_time:28694
        mclauchc        (    1249) count:2      ave_time:32232  total_time:64465
        slurm           (     450) count:1      ave_time:169572 total_time:169572
[root@hn1-hyperion ~]# free -hlm
              total        used        free      shared  buff/cache   available
Mem:           125G         27G         95G         14M        2.8G         96G
Low:           125G         30G         95G
High:            0B          0B          0B
Swap:           15G         15G        1.4M
[root@hn1-hyperion ~]#

I have been slowly reducing the purge period; last night I reduced it to 800 days. Is there some logged indication of whether the purge completed successfully or encountered errors? Or do I just need to query the db every time to make sure older records are being pruned?

The slurmctld process stays stable throughout all of these slurmdbd issues. At the time of this crash it was using around 10GB of memory. I will look into testing a bulk submit. The few users that I know employ these methods have not even been on the cluster during this latest period of crashes. I have been looking at all of the scheduled jobs as well as user histories and I do not see any serious abuse of sacct. Most users just call sacct and squeue every so often to check job status.

We have plenty of users that make use of arrays, but our scheduler just runs in FIFO backfill mode, so we have limited the overall size of the arrays to prevent a few users from monopolizing the queue with array jobs.

"Normally this shouldn't take that long. I'm guessing because the database is so large even simple mysql queries are taking awhile. Also what kind of hardware and filesystem is the database on? (ssd or hdd, local vs shared filesystem)."

The headnode is a dual-socket Xeon 2650 with 126GB RAM. The drive is a local 2TB RAID5 XFS filesystem.
Comment 19 Marshall Garey 2020-08-11 09:06:22 MDT
Apologies for the delay in response.

> I have been slowly reducing the purge period, last night I reduced to 800
> days. Is there some logged indication of whether the purge completed
> successfully or encountered errors? Do I just need to query the db every time
> to make sure older records are being pruned?

There isn't a "success" message at all without debug flags, though there are error messages if something goes wrong. Querying the database is a fine way to check if it succeeded.
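One way to check via a query is to look at the oldest remaining job record. For example, assuming ClusterName=slurm_cluster (as your slurmdbd log suggests) and using the job table's time_start column:

```sql
-- Oldest remaining started-job record; it should fall inside your purge window.
-- time_start > 0 skips jobs that never started.
SELECT FROM_UNIXTIME(MIN(time_start))
FROM slurm_cluster_job_table
WHERE time_start > 0;
```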

You can turn on the db_archive debug flag in slurmdbd.conf (DebugFlags=db_archive), though this will be very verbose.

There are other debug flags we can set in slurmdbd.conf, though I'm hesitant to do so because it is so verbose. We might need to resort to doing that for a short time, though.

https://slurm.schedmd.com/archive/slurm-16.05-latest/slurmdbd.conf.html

Let's get some other information first and then see where we're at.

Can you upload the dmesg (or similar) log from a recent time when slurmdbd died, to confirm whether it was OOM-killed and what else happened?



How many rows are in the job and step tables in the database?

SELECT COUNT(*) FROM <table_name>;

table_name is <clustername>_job_table and <clustername>_step_table where <clustername> is ClusterName in slurm.conf.
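For instance, your slurmdbd log shows CLUSTER:slurm_cluster, so presumably:

```sql
SELECT COUNT(*) FROM slurm_cluster_job_table;
SELECT COUNT(*) FROM slurm_cluster_step_table;
```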


Though it may not be relevant, can you also upload the database log (/var/log/mariadb/mariadb.log, or wherever the log file is)?


Can you tell if Bright is querying the database at any regular interval? I think we may have seen issues with this in the past.

We do also encourage you to make a plan to upgrade Slurm to 19.05 or 20.02, as we can give you the best support when you're on a supported release. As I said before, though, upgrading MariaDB and reducing the size of the database are both good first steps.
Comment 20 Nathan Elger 2020-08-11 09:52:06 MDT
Created attachment 15396 [details]
mariadb log
Comment 21 Nathan Elger 2020-08-11 10:07:47 MDT
>Though it may not be relevant, can you also upload the db log (/var/log/mariadb/mariadb.log or wherever the log file is).
attached to this case

>Can you upload dmesg (or similar) log during a recent time when slurmdbd died to confirm if it was OOM-killed and what else happened?

[657500.790955] DOM Worker invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[657500.790965] DOM Worker cpuset=/ mems_allowed=0-1
[657500.790968] CPU: 10 PID: 25133 Comm: DOM Worker Not tainted 3.10.0-514.2.2.el7.x86_64 #1
[657500.790970] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 10/17/2018
[657500.790971]  ffff88119f902f10 000000002947a177 ffff8810f0053a78 ffffffff816861cc
[657500.790974]  ffff8810f0053b08 ffffffff81681177 ffffffff810eabac ffff88074cef4fa0
[657500.790976]  ffff88074cef4fb8 0000000000000202 ffff88119f902f10 ffff8810f0053af8
[657500.790979] Call Trace:
[657500.790987]  [<ffffffff816861cc>] dump_stack+0x19/0x1b
[657500.790992]  [<ffffffff81681177>] dump_header+0x8e/0x225
[657500.791004]  [<ffffffff810eabac>] ? ktime_get_ts64+0x4c/0xf0
[657500.791009]  [<ffffffff8113ccbf>] ? delayacct_end+0x8f/0xb0
[657500.791016]  [<ffffffff8118476e>] oom_kill_process+0x24e/0x3c0
[657500.791020]  [<ffffffff8118420d>] ? oom_unkillable_task+0xcd/0x120
[657500.791022]  [<ffffffff811842b6>] ? find_lock_task_mm+0x56/0xc0
[657500.791027]  [<ffffffff810937ee>] ? has_capability_noaudit+0x1e/0x30
[657500.791030]  [<ffffffff81184fa6>] out_of_memory+0x4b6/0x4f0
[657500.791034]  [<ffffffff81681c80>] __alloc_pages_slowpath+0x5d7/0x725
[657500.791037]  [<ffffffff8118b0b5>] __alloc_pages_nodemask+0x405/0x420
[657500.791043]  [<ffffffff811d221a>] alloc_pages_vma+0x9a/0x150
[657500.791047]  [<ffffffff811b14df>] handle_mm_fault+0xc6f/0xfe0
[657500.791054]  [<ffffffff81691c94>] __do_page_fault+0x154/0x450
[657500.791057]  [<ffffffff81691fc5>] do_page_fault+0x35/0x90
[657500.791061]  [<ffffffff8168e288>] page_fault+0x28/0x30
[657500.791063] Mem-Info:
[657500.791070] active_anon:30964943 inactive_anon:1394152 isolated_anon:0
 active_file:0 inactive_file:0 isolated_file:0
 unevictable:0 dirty:5 writeback:8 unstable:0
 slab_reclaimable:32367 slab_unreclaimable:34663
 mapped:13736 shmem:14244 pagetables:79492 bounce:0
 free:131784 free_pcp:7721 free_cma:0
[657500.791073] Node 0 DMA free:15508kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[657500.791077] lowmem_reserve[]: 0 1653 64115 64115
[657500.791080] Node 0 DMA32 free:253216kB min:3372kB low:4212kB high:5056kB active_anon:1037664kB inactive_anon:366116kB active_file:24kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1948156kB managed:1695140kB mlocked:0kB dirty:0kB writeback:0kB mapped:96kB shmem:364kB slab_reclaimable:2764kB slab_unreclaimable:3736kB kernel_stack:336kB pagetables:5088kB unstable:0kB bounce:0kB free_pcp:3892kB local_pcp:100kB free_cma:0kB writeback_tmp:0kB pages_scanned:496 all_unreclaimable? yes
[657500.791084] lowmem_reserve[]: 0 0 62461 62461
[657500.791087] Node 0 Normal free:127284kB min:127280kB low:159100kB high:190920kB active_anon:60227064kB inactive_anon:2545224kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:65011712kB managed:63961056kB mlocked:0kB dirty:20kB writeback:32kB mapped:16820kB shmem:16708kB slab_reclaimable:56716kB slab_unreclaimable:79132kB kernel_stack:10192kB pagetables:190088kB unstable:0kB bounce:0kB free_pcp:16288kB local_pcp:740kB free_cma:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
[657500.791091] lowmem_reserve[]: 0 0 0 0
[657500.791094] Node 1 Normal free:131128kB min:131452kB low:164312kB high:197176kB active_anon:62595044kB inactive_anon:2665268kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:67108864kB managed:66057968kB mlocked:0kB dirty:0kB writeback:0kB mapped:38028kB shmem:39904kB slab_reclaimable:69988kB slab_unreclaimable:55784kB kernel_stack:11648kB pagetables:122792kB unstable:0kB bounce:0kB free_pcp:10816kB local_pcp:672kB free_cma:0kB writeback_tmp:0kB pages_scanned:3377 all_unreclaimable? yes
[657500.791098] lowmem_reserve[]: 0 0 0 0
[657500.791103] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 0*64kB 1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15508kB
[657500.791111] Node 0 DMA32: 387*4kB (UEM) 351*8kB (UEM) 502*16kB (UEM) 796*32kB (UEM) 879*64kB (UEM) 320*128kB (UEM) 141*256kB (UEM) 84*512kB (UEM) 32*1024kB (UEM) 3*2048kB (M) 0*4096kB = 253092kB
[657500.791122] Node 0 Normal: 13735*4kB (UE) 8664*8kB (UEM) 76*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 125468kB
[657500.791129] Node 1 Normal: 10066*4kB (UEM) 9046*8kB (UEM) 1226*16kB (UEM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 132248kB
[657500.791136] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[657500.791138] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[657500.791139] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[657500.791141] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[657500.791143] 645724 total pagecache pages
[657500.791145] 630656 pages in swap cache
[657500.791146] Swap cache stats: add 17771306, delete 17140650, find 6891068/7649844
[657500.791147] Free swap  = 0kB
[657500.791148] Total swap = 16777212kB
[657500.791150] 33521181 pages RAM
[657500.791151] 0 pages HighMem/MovableOnly
[657500.791152] 588664 pages reserved
[657500.791152] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[657500.791162] [  537]     0   537    56805    21489     114       73             0 systemd-journal
[657500.791164] [  558]     0   558    29723        0      26       83             0 lvmetad
[657500.791166] [  579]     0   579    11210        2      22      421         -1000 systemd-udevd
[657500.791170] [  916]     0   916     4287        0      12       59             0 rpc.idmapd
[657500.791171] [  927]     0   927   100444      192      48      112             0 accounts-daemon
[657500.791174] [  931]    81   931     7544      358      19       80          -900 dbus-daemon
[657500.791176] [  932]     0   932     1742       11       7       15             0 mdadm
[657500.791177] [  950]     0   950     1092       21       6       18             0 rngd
[657500.791179] [  956]     0   956   104831       88      70      265             0 ModemManager
[657500.791181] [  959]     0   959    50303       10      35      113             0 gssproxy
[657500.791182] [  962]     0   962     5704        8      13       94             0 ipmievd
[657500.791184] [ 1000]   172  1000    41162       26      16       27             0 rtkit-daemon
[657500.791186] [ 1003]   422  1003     2131        7       9       29             0 lsmd
[657500.791187] [ 1004]     0  1004     4858       57      13       54             0 irqbalance
[657500.791189] [ 1007]     0  1007    53192      118      56      332             0 abrtd
[657500.791190] [ 1011]     0  1011    52566       11      55      327             0 abrt-watch-log
[657500.791192] [ 1016]     0  1016    52566        2      55      337             0 abrt-watch-log
[657500.791193] [ 1025]     0  1025    31968       12      17      113             0 smartd
[657500.791195] [ 1029]     0  1029     7110       39      18       53             0 systemd-logind
[657500.791196] [ 1030]     0  1030     4202        4      12       42             0 alsactl
[657500.791198] [ 1037]     0  1037     1639        1       6       33             0 mcelog
[657500.791199] [ 1040]   999  1040   136478      473      62     1427             0 polkitd
[657500.791201] [ 1062]     0  1062    28876       92      10       22             0 ksmtuned
[657500.791202] [ 1509]     0  1509   119578     9459     128      128             0 rsyslogd
[657500.791204] [ 1526]    32  1526    16237       35      34      103             0 rpcbind
[657500.791206] [ 1528]     0  1528    20617       23      42      192         -1000 sshd
[657500.791207] [ 1534]   415  1534   532688    57842     225     9293             0 slapd
[657500.791209] [ 1535]     0  1535    10271      161      24      141             0 rpc.mountd
[657500.791210] [ 1589]     0  1589   118840       91      51      681             0 gdm
[657500.791212] [ 1591]     0  1591     6461        3      17       48             0 atd
[657500.791214] [ 1662]    27  1662    28314        0      10       70             0 mysqld_safe
[657500.791215] [ 2198]    25  2198   475718     9924     128    15317             0 named
[657500.791217] [ 2344]    27  2344 30032258  7460571   22758  3899669         -1000 mysqld
[657500.791219] [ 2497]     0  2497    94056     1486     183      173             0 httpd
[657500.791220] [ 2535]    29  2535    11647       72      26      170             0 rpc.statd
[657500.791223] [ 2543]    65  2543   112506      347      53      327             0 nslcd
[657500.791226] [ 2562]     0  2562   128354     1461     153     7647             0 Xorg
[657500.791227] [ 2579]     0  2579    31554       23      17      130             0 crond
[657500.791229] [ 2581]     0  2581    26852        0      20      130             0 cm-nfs-checker
[657500.791239] [ 3223]     2  3223    56326      267      21       93             0 munged
[657500.791240] [ 3282]     0  3282    92985      107      61      301             0 upowerd
[657500.791242] [ 3333]   418  3333   104128        0      57      944             0 colord
[657500.791243] [ 3420]     0  3420    12728        0      28      145             0 wpa_supplicant
[657500.791245] [ 3433]     0  3433   101298       96      51      172             0 packagekitd
[657500.791246] [ 3444]     0  3444    95821      156      48      222             0 udisksd
[657500.791248] [ 3566]     0  3566    28812        2      12       60         -1000 safe_cmd
[657500.791249] [ 3579]     0  3579  2230126   888362    2462   152621         -1000 cmd
[657500.791251] [ 4067]     0  4067    94924      106      71     1276             0 gdm-session-wor
[657500.791253] [ 4360]    38  4360     8938       55      22      117             0 ntpd
[657500.791254] [ 5582]     0  5582   115881        0      56      289             0 gnome-keyring-d
[657500.791256] [ 5670]     0  5670   136711      131     109      357             0 gnome-session
[657500.791257] [ 5677]     0  5677     3486        0      10       47             0 dbus-launch
[657500.791259] [ 5678]     0  5678     7443       64      19      246             0 dbus-daemon
[657500.791260] [ 5745]     0  5745    90481        0      38      169             0 gvfsd
[657500.791262] [ 5750]     0  5750   118525        0      49      205             0 gvfsd-fuse
[657500.791263] [ 5835]     0  5835    13215       10      28      131             0 ssh-agent
[657500.791264] [ 5871]     0  5871    86495        0      49      216             0 at-spi-bus-laun
[657500.791266] [ 5876]     0  5876     7380       16      19      326             0 dbus-daemon
[657500.791267] [ 5878]     0  5878    47245        0      34      174             0 at-spi2-registr
[657500.791269] [ 5886]     0  5886   263022     1255     214     2608             0 gnome-settings-
[657500.791271] [ 5906]     0  5906   118364       85      90      302             0 pulseaudio
[657500.791272] [ 5918]     0  5918   420263    44027     418    18937             0 gnome-shell
[657500.791274] [ 5924]   419  5924   108555      111      61     1302             0 geoclue
[657500.791275] [ 5929]     0  5929   140934        0      97      412             0 gsd-printer
[657500.791277] [ 5974]     0  5974   111913       33      48      503             0 ibus-daemon
[657500.791278] [ 5979]     0  5979   147502        0      93      783             0 gnome-shell-cal
[657500.791280] [ 5984]     0  5984   316938      225     181      807             0 evolution-sourc
[657500.791281] [ 5986]     0  5986    94198        0      45      217             0 ibus-dconf
[657500.791282] [ 5988]     0  5988   110550        0     105      529             0 ibus-x11
[657500.791284] [ 5999]     0  5999   171300        0     146     1223             0 goa-daemon
[657500.791285] [ 6008]     0  6008    96895       60      91      330             0 goa-identity-se
[657500.791287] [ 6015]     0  6015   113241       65      68      304             0 mission-control
[657500.791288] [ 6020]     0  6020   137418      150      94      685             0 caribou
[657500.791290] [ 6027]     0  6027    95523       56      47      253             0 gvfs-udisks2-vo
[657500.791291] [ 6033]     0  6033    88773        0      34      164             0 gvfs-goa-volume
[657500.791293] [ 6038]     0  6038    94238        0      43      233             0 gvfs-gphoto2-vo
[657500.791294] [ 6043]     0  6043   117755        0      55      320             0 gvfs-afc-volume
[657500.791296] [ 6049]     0  6049    91970        0      39      183             0 gvfs-mtp-volume
[657500.791297] [ 6053]     0  6053   189834      113     155     1972             0 nautilus
[657500.791298] [ 6074]     0  6074   140384      186      67     1977             0 tracker-store
[657500.791300] [ 6080]     0  6080   139644        0      91      821             0 tracker-extract
[657500.791301] [ 6082]     0  6082   113424        0      73      729             0 tracker-miner-a
[657500.791303] [ 6086]     0  6086   151442      163      84      698             0 tracker-miner-f
[657500.791304] [ 6087]     0  6087   113373        0      72      651             0 tracker-miner-u
[657500.791306] [ 6094]     0  6094   130347      722     143      795             0 abrt-applet
[657500.791308] [ 6165]     0  6165   112276       74      50      158             0 gvfsd-trash
[657500.791309] [ 6187]     0  6187    75068        0      36      160             0 gvfsd-metadata
[657500.791310] [ 6191]     0  6191    77285        0      42      204             0 ibus-engine-sim
[657500.791312] [ 6195]     0  6195    41555        0      24      113             0 dconf-service
[657500.791313] [ 6204]     0  6204   261179        0     193     9506             0 evolution-calen
[657500.791315] [ 6258]   450  6258  3164656   187959     604    32700             0 slurmctld
[657500.791316] [ 6533]     0  6533   136499        0     124     3020             0 gnome-terminal-
[657500.791318] [ 6541]     0  6541     2120        1       9       29             0 gnome-pty-helpe
[657500.791320] [ 6542]     0  6542    29076        2      13      345             0 bash
[657500.791322] [17656]     0 17656     4868       50      13      131             0 lmgrd
[657500.791324] [17663]     0 17663    20519      161      13       88             0 INTEL
[657500.791325] [ 3747]     0  3747    37965       23      75      312             0 sshd
[657500.791327] [ 3750]     0  3750    29091      261      12      102             0 bash
[657500.791328] [13232]     0 13232    22766       21      42      238             0 master
[657500.791330] [13234]    89 13234    23359       40      45      241             0 qmgr
[657500.791332] [ 4275]     0  4275    37291      242      74       78             0 sshd
[657500.791333] [ 4278]  1242  4278    37359      309      72       77             0 sshd
[657500.791335] [ 4279]  1242  4279    30628      155      14      139             0 bash
[657500.791336] [ 6457]  1242  6457     5603       23      14       49             0 dbus-launch
[657500.791337] [ 6547]  1242  6547     7147       11      18       75             0 dbus-daemon
[657500.791339] [ 8057]  1242  8057    37965       99      29        6             0 gconfd-2
[657500.791341] [15393]    48 15393    95102     1459     179      174             0 httpd
[657500.791342] [15394]    48 15394    95102     1464     179      170             0 httpd
[657500.791344] [15395]    48 15395    95102     1455     179      171             0 httpd
[657500.791346] [15396]    48 15396    95102     1455     179      171             0 httpd
[657500.791347] [15397]    48 15397    95102     1455     179      171             0 httpd
[657500.791349] [11907]   450 11907 23014313 22705300   44702       24             0 slurmdbd
[657500.791350] [24347]     0 24347    37291      320      74        0             0 sshd
[657500.791352] [24350]  1002 24350    37794      823      73        0             0 sshd
[657500.791353] [24351]  1002 24351    30630      278      14       18             0 bash
[657500.791355] [24433]  1002 24433    28282       60      10        0             0 cmgui
[657500.791356] [24441]  1002 24441   408706   165061     724      388             0 firefox
[657500.791358] [24689]  1002 24689     5604       71      14        0             0 dbus-launch
[657500.791359] [24735]  1002 24735     7148       82      18        0             0 dbus-daemon
[657500.791361] [25010]  1002 25010    37966      105      29        0             0 gconfd-2
[657500.791362] [29487]  1002 29487   117284      353      88        0             0 pulseaudio
[657500.791363] [32481]  1242 32481   117282      347      88        0             0 pulseaudio
[657500.791365] [ 9040]  1242  9040    28282       57      10        0             0 cmgui
[657500.791366] [ 9048]  1242  9048   424659   181245     751      441             0 firefox
[657500.791368] [13081]     0 13081     3990      704      13        0             0 dhcpd
[657500.791370] [ 9240]    89  9240    23317      253      45        0             0 pickup
[657500.791374] [11656]     0 11656    26974       21       8        0             0 sleep
[657500.791376] [12268]     0 12268     5255       76      15        0             0 sacct
[657500.791378] [12596]     0 12596     5163       23      12        0             0 systemctl
[657500.791380] Out of memory: Kill process 11907 (slurmdbd) score 612 or sacrifice child
[657500.791383] Killed process 11907 (slurmdbd) total-vm:92057252kB, anon-rss:90821200kB, file-rss:0kB, shmem-rss:0kB

>How many rows are in the job and step tables in the database?

MariaDB [(none)]> use slurm_acct_db
Database changed
MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM SLURM_CLUSTER_job_table;
ERROR 1146 (42S02): Table 'slurm_acct_db.SLURM_CLUSTER_job_table' doesn't exist
MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM slurm_cluster_job_table;
+----------+
| COUNT(*) |
+----------+
|  4655245 |
+----------+
1 row in set (3.60 sec)

MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM slurm_cluster_step_table;
+----------+
| COUNT(*) |
+----------+
| 30034605 |
+----------+
1 row in set (13.05 sec)

Looks like jobs are definitely purging, as we are at almost 9M jobs lifetime on this cluster. ibdata1 is still at 24 GB in /var/lib/mysql.
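
For reference, per-table on-disk usage can be checked with a standard information_schema query (illustrative; assumes the default slurm_acct_db schema name shown above). Note that ibdata1 never shrinks on its own even after rows are purged; InnoDB reuses the freed space internally.

```sql
-- Report on-disk size per table in the Slurm accounting database,
-- largest first (data + indexes, in GB).
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.TABLES
WHERE table_schema = 'slurm_acct_db'
ORDER BY (data_length + index_length) DESC;
```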

>Can you tell if Bright is querying the database at any regular interval? I think we may have seen issues with this in the past.

Yes, the cluster manager daemon runs an sacct query every minute. I am going to turn off the Bright job collection entirely this Friday during my maintenance window.

>We do also encourage you to make a plan to upgrade Slurm to 19.05 or 20.02, as we can give you the best support when you're on a supported release. As I said before though, upgrading mariadb and also reducing the size of the database are good ideas before upgrading.

That is my intention, but Bright can't verify that this version of Bright will run OK on MariaDB 10. I suspect it probably will, but they won't confirm. Basically, I would have to forklift-upgrade the entire cluster to Bright 8 to use MariaDB 10, and I am not eager to do that, as it entails a bunch of kernel updates that will break a lot of researcher software. I think what I will need to do is limp along in the current config for the rest of the year, slowly recompiling our software catalog for the new kernel, until our longer annual maintenance window in December. slurmdbd did crash again early this morning, and I am at 400 days on the purge settings. The huge mariadb tmp folders have not been an issue in over 2 weeks, however. Hopefully we are heading in the right direction.
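
For reference, the purge retention discussed here is controlled by the Purge* parameters in slurmdbd.conf; an illustrative fragment (not the site's actual file) corresponding to the 400-day setting mentioned above might look like:

```
# slurmdbd.conf (illustrative fragment)
# Archive/purge records older than the given age; the site walked
# these values down gradually rather than jumping straight to the target.
PurgeEventAfter=400days
PurgeJobAfter=400days
PurgeStepAfter=400days
PurgeSuspendAfter=400days
```

Lowering these values gradually limits how many rows each purge pass has to delete at once, which keeps the transaction (and the temporary table space in /var/tmp) from ballooning.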
Comment 22 Marshall Garey 2020-08-17 14:48:44 MDT
> Yes the cluster manager daemon runs an sacct query every minute. I am going to
> turn off the bright job collection entirely this friday during my maintenance
> window.

What exact sacct query was bright running? Have you noticed a difference since turning it off?


Mariadb log:

Lots of this message:

200715 18:07:57 [Warning] mysqld: Disk is full writing '/var/tmp/#sql_8e8_49.MAI' (Errcode: 28). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)

This is basically what you already told us - that the disk was filling up. But you did say you don't seem to have had this issue (of mariadb tmp folders filling up) recently. Have these warnings happened recently in the log?
Comment 23 Marshall Garey 2020-09-04 09:40:08 MDT
Do you have any updates on this?
Comment 24 Nathan Elger 2020-09-08 13:00:37 MDT
Hi Marshall,

I think the database purge was the answer. I walked it down by 100 days at a time, to currently 600 days on all options, and everything has been good since then. I did disable that Bright sacct query (I'm not sure of the exact options they were using), but the Slurm service had already been running fine for over a week at that point. As of today, slurmdbd has been running for 4 weeks, with no more issues with mariadb filling up /var/tmp. Thanks very much for all your help; I think we can close this case.
Comment 25 Marshall Garey 2020-09-08 13:02:10 MDT
Thanks for the update. I'm glad things are working.