We're dealing with an issue where slurmdbd crashes frequently, sometimes as often as every 15 minutes. This has been going on for about 3 weeks and has been steadily getting worse. We are running CentOS 7.3 and Slurm 16.05.08, with slurmdbd backed by MariaDB 5.5, as part of a Bright Cluster Manager 7.3 installation. This configuration has run with very little intervention for almost 4 years. There are about 8 million jobs in the database and we usually run around 2k-5k jobs a day.

The headnode has 128G of system RAM and runs the Bright Cluster Manager infrastructure as well as slurmctld and slurmdbd. The typical course of the issue: slurmdbd slowly (or, recently, quickly) gobbles up all available memory, gets killed, and restarts cleanly. Recently I have also seen the MariaDB tmp space in /var/tmp grow incredibly large; I have seen it as big as 120GB. Several times this has filled up the 200G /var mount, and the headnode crashes and requires a reboot.

I bumped slurmdbd logging up to debug3, but I haven't seen any errors being logged, just service restart info and generic connection information. We've stood up XDMoD to handle our historical data reporting, so I've recently started purging the database slowly to get down to at most a year's worth of records. I have also set innodb_buffer_pool_size to ~80% of system RAM as recommended.

Here are my password-sanitized my.cnf, my.cnf.d/innodb.cnf, slurmdbd.conf, and slurm.conf. Most of these settings are the default Bright configs.

#
## my.cnf
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
port=3306
user=mysql
key_buffer_size=1500M
delay_key_write=on
myisam_sort_buffer_size=400M
max_allowed_packet=1000M
bulk_insert_buffer_size=256M
# Default to using old password format for compatibility with mysql 3.x
# clients (those using the mysqlclient10 compatibility package).
old_passwords=1
myisam-recover=FORCE
skip-external-locking
table_open_cache=64
sort_buffer_size=512K
net_buffer_length=8K
read_buffer_size=256K
read_rnd_buffer_size=512K
log-bin=mysql-bin
binlog_format=mixed
server-id=1
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd
# some suggestions from wget http://mysqltuner.pl/ -O mysqltuner.pl
query_cache_size=8M
tmp_table_size=16M
max_heap_table_size=16M
thread_cache_size=4
join_buffer_size=128M

[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid
socket=/var/lib/mysql/mysql.sock

[mysqldump]
socket=/var/lib/mysql/mysql.sock
quick
max_allowed_packet=16M

[mysql]
no-auto-rehash

[myisamchk]
key_buffer_size=20M
sort_buffer_size=20M
read_buffer=2M
write_buffer=2M

[mysqlhotcopy]
interactive-timeout

[mysqld_multi]
mysqld=/usr/bin/mysqld_safe
mysqladmin=/usr/bin/mysqladmin
log=/var/log/mysqld_multi.log

#
# include all files from the config directory
#
!includedir /etc/my.cnf.d
----------------------------------------------------
#
## innodb.cnf
[mysqld]
innodb_buffer_pool_size=100000M
innodb_lock_wait_timeout=900
----------------------------------------------------
#
# slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
PurgeEventAfter=1000days
PurgeJobAfter=1000days
PurgeResvAfter=1000days
PurgeStepAfter=1000days
PurgeSuspendAfter=1000days
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=DBD_ADDR
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=debug3
#DefaultQOS=normal,standby
LogFile=/var/log/slurmdbd
PidFile=/var/run/slurmdbd.pid
#PluginDir=/cm/shared/apps/slurm/current/lib64:/usr/lib64:/cm/shared/apps/slurm/current/lib64/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
StorageHost=master
#StoragePort=1234
StorageUser=slurm
StorageLoc=slurm_acct_db
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
DbdHost=hn1-hyperion
# END AUTOGENERATED SECTION -- DO NOT REMOVE
----------------------------------------------------
#
# See the slurm.conf man page for more information.
#
ClusterName=SLURM_CLUSTER
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
AccountingStorageEnforce=limits
#JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Scheduler
SchedulerType=sched/backfill
# Master nodes
ControlMachine=hn1-hyperion
ControlAddr=hn1-hyperion
AccountingStorageHost=hn1-hyperion
# Nodes
NodeName=node[003-010] CoresPerSocket=10 Sockets=4
NodeName=node[363-382] CoresPerSocket=12 Sockets=4 Gres=gpu:2
NodeName=hn1-thoth,node[017-238] CoresPerSocket=14 Sockets=2
NodeName=node[011-016,241,242] CoresPerSocket=14 Sockets=2 Gres=gpu:2
NodeName=ultra CoresPerSocket=22 Sockets=24 NodeAddr=10.16.255.249
NodeName=node[243-362] CoresPerSocket=24 Sockets=2
NodeName=node[383-406] CoresPerSocket=24 Sockets=2 Gres=gpu:2
NodeName=minksy1 CoresPerSocket=8 Sockets=2 Gres=gpu:4
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=GANG,SUSPEND ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[026-172,174-182,184,187-189,200,209-213,219,220]
PartitionName=business Default=NO MinNodes=1 AllowGroups=business DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[219,220]
PartitionName=BigMem Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=GANG,SUSPEND ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[003-010]
PartitionName=soph Default=NO MinNodes=1 AllowGroups=soph DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[217,218]
PartitionName=neset-lab Default=NO MinNodes=1 AllowGroups=neset-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node210
PartitionName=gpu Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[011-013,015,241,242]
PartitionName=varshna-lab Default=NO MinNodes=1 AllowGroups=varshna-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node209
PartitionName=vatch-lab Default=NO MinNodes=1 AllowGroups=vatch-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[014,016]
PartitionName=farouk-lab Default=NO MinNodes=1 AllowGroups=farouk-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node211
PartitionName=heyden-lab Default=NO MinNodes=1 AllowGroups=heyden-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[190-199,221-238]
PartitionName=scopatz-lab Default=NO MinNodes=1 AllowGroups=scopatz-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[007-010]
PartitionName=hoy-lab Default=NO MinNodes=1 AllowGroups=hoy-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[184,212,213]
PartitionName=no-kill Default=NO MinNodes=1 AllowGroups=nokill DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=hn1-thoth,node[017-025]
PartitionName=flora-lab Default=NO MinNodes=1 AllowGroups=flora-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[214-216]
PartitionName=jerlab-gpu Default=NO MinNodes=1 AllowGroups=jer-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node015
PartitionName=jerlab Default=NO MinNodes=1 AllowGroups=jer-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=GANG,SUSPEND ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[183,185,186,201-208]
PartitionName=microbiome Default=NO MinNodes=1 AllowGroups=microbiome DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node200
PartitionName=test Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO
PartitionName=msmoms Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=ultra
PartitionName=ddlab Default=NO MinNodes=1 AllowGroups=ddlab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node026
PartitionName=vendemia-lab Default=NO MinNodes=1 AllowGroups=vendemia-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[187-189]
PartitionName=gcai-lab Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node178
PartitionName=minsky Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=minksy1
PartitionName=ziehl-lab Default=NO MinNodes=1 AllowGroups=ziehl-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node173
PartitionName=defq-48core Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[281,321-362]
PartitionName=gpu-v100-16gb Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[384-386,389-406]
PartitionName=gpu-v100-32gb Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[363-380]
PartitionName=jerlab-48core Default=NO MinNodes=1 AllowGroups=jer-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[243,279,280,282-286,311]
PartitionName=hulab-48core Default=NO MinNodes=1 AllowGroups=hu-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[254-278]
PartitionName=caicedolab-48core Default=NO MinNodes=1 AllowGroups=caicedo-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[245,317]
PartitionName=besmannlab-48core Default=NO MinNodes=1 AllowGroups=besmann-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[246-253]
PartitionName=grosslab-48core Default=NO MinNodes=1 AllowGroups=gross-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node244
PartitionName=heydenlab-48core Default=NO MinNodes=1 AllowGroups=heyden-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[287-306]
PartitionName=qiwang-gpu Default=NO MinNodes=1 AllowGroups=qiwang-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node383
PartitionName=hannolab-48core Default=NO MinNodes=1 AllowGroups=hanno-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[307-310]
PartitionName=nagarkatti-lab Default=NO MinNodes=1 AllowGroups=nagarkatti-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node172
PartitionName=ge-lab Default=NO MinNodes=1 AllowGroups=gelab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[312-316]
PartitionName=AI_Center Default=NO MinNodes=1 AllowGroups=aic DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[381,382]
PartitionName=floralab-48core Default=NO MinNodes=1 AllowGroups=flora-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node320
PartitionName=yiwang-48core Default=NO MinNodes=1 AllowGroups=yiwang-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[318,319]
PartitionName=jamshidi-lab Default=NO MinNodes=1 AllowGroups=jamshidi-lab DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO Nodes=node[387,388]
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=0
# Power Saving
SuspendTime=-1  # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
# END AUTOGENERATED SECTION -- DO NOT REMOVE
SelectType=select/cons_res
SelectTypeParameters=CR_Core
PreemptMode=GANG,SUSPEND
PreemptType=preempt/partition_prio
MaxArraySize=2000
FirstJobId=4600000
#NodeName=ultra NodeAddr=10.16.255.249 CoresPerSocket=22 Sockets=24
#PartitionName=msmoms Default=NO MinNodes=1 AllowGroups=all DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=REQUEUE ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1000 PriorityTier=1000 OverSubscribe=NO State=UP Nodes=ultra
Hi Nathan,

I'm looking at this issue. I'm still thinking about the problem and may ask for more information, but let's start with this:

Even though you said the slurmdbd log doesn't have errors, can you upload about an hour of the slurmdbd log file covering a period where slurmdbd got killed, restarted, and got killed again? Can you also upload the slurmctld log for that same time frame, so I can get some more context about what's going on?

Is anybody running sacct queries that request lots of jobs? Have you seen any sacct queries fail?

The way Slurm implements List structures is that when memory is allocated for a list and later free'd, Slurm may not release the memory back to the OS but instead keeps a pointer to it, essentially caching the memory for later use. This means the memory footprint of slurmdbd (or slurmctld, for that matter) can grow whenever a large query requires a list bigger than any of the currently cached lists. This is very old code that was found to be unnecessary, and even slower, on modern systems, so it was removed in commit fc952769, which is in Slurm 20.02. This is one likely explanation for the growing memory footprint of slurmdbd.

Another potential issue: Slurm limits the amount of data it can send back to a client (like sacct) to 3 GB. However, when someone requests more data than this with an sacct query, slurmdbd still loads all of that data into memory with a mysql query. It then checks the size of the data, sees that it's too big, and sends an error to the client (sacct). But that memory might not be free'd back to the OS; it can remain cached in slurmdbd's memory for later use. This is described in bug 5817, which is a public bug you can view. This particular problem was experienced by another site in bug 5632; the workaround was to prevent users from doing overly aggressive queries with sacct.

Can you also run:

sacctmgr show runaway

If it reports runaway jobs and asks whether you would like to fix them, type N for no, but upload the output. I'd like to see what's going on first before fixing any runaway jobs.

I also wonder if 80% of the RAM for innodb_buffer_pool_size is too much - I'm wondering whether the remaining RAM on the server is sufficient for slurmdbd, slurmctld, Bright, the OS, and anything else going on. Don't worry about changing this for now, but it's something we might consider depending on the diagnosis of the problem.

Slurm 16.05 is very old and isn't technically supported right now. I understand your site has a temporary support contract, so I'll do my best to help diagnose and fix (or find a good workaround for) this issue, but I won't be able to provide any patches for 16.05.
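One mitigation pattern for the oversized-query problem described above is to make sure sacct invocations are always bounded by a time window. This is just an illustrative sketch, not SchedMD tooling: the wrapper function name and the default window are hypothetical, though -S/-E are real sacct options that bound the query by start/end time.

```shell
# Hypothetical helper: build a time-bounded sacct command (last N hours)
# rather than an unbounded one, so the underlying mysql result set (and
# slurmdbd's memory use) stays small. Printing the command rather than
# running it keeps the sketch self-contained.
bounded_sacct_cmd() {
    hours="${1:-24}"
    start=$(date -u -d "${hours} hours ago" +%Y-%m-%dT%H:%M:%S)
    echo "sacct -S ${start} -E now --format=JobID,User,State,Elapsed"
}

# Example: show what a query bounded to the last 24 hours would look like
bounded_sacct_cmd 24
```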
Created attachment 15149 [details] slurmctld log 7-21-2020
Created attachment 15150 [details] slurmdbd log 7-21-2020
Hey Nathan,

I'm looking through the slurmdbd log file you uploaded. I noticed that the slurmctld log file is empty (0 bytes) - can you try getting that again and uploading it?

Are you able to answer my other questions from comment 1?

> Is anybody running sacct queries that request lots of jobs? Have you seen any sacct queries fail?
> Can you also run sacctmgr show runaway? (and press N for no)

I'll ask more things once I'm done studying the slurmdbd log file.
Created attachment 15177 [details] slurmctld log 7-21-2020
That's strange, I see it as the correct size:

slurmctld log 7-21-2020 (4.78 MB, text/plain) 2020-07-23 12:59 MDT, Nathan Elger

...and my comment is gone/never posted as well. I have uploaded the file again and checked to make sure it is complete.

"Is anybody running sacct queries that request lots of jobs? Have you seen any sacct queries fail?"

No, I don't see any users running sacct for any slurm info. The only notable use of sacct on our side is running xdmod's shredder once a week to gather data.

show runaway still shows 0 jobs. I cleared the runaways a few weeks ago when the problem first manifested; it was ~50 jobs.

I did not actually set the innodb buffer size option until a few days before I submitted this bug. After setting it, it seemed like the slurmdbd crashes increased in frequency, but the large writes to /var/tmp/systemdxxx-mariadb/tmp actually slowed down and stopped growing larger than 1-2GB. In fact, after a day or two the slurmdbd service stopped crashing entirely. At the time of this comment, the service has not crashed in 5 days. Prior to this, it had been crashing reliably dozens of times a day for over a month.

The only changes I had made since the last headnode crash/reboot were setting innodb_buffer_pool_size to 100G and stepping the slurmdbd purge settings from 1200 down to 1000 days, as shown in my posted slurmdbd.conf.
Is there a good slurm-side way to track sacct usage by users other than combing histories and submit scripts?
RE the 0 size log file, I think I copied it to another location before it finished downloading and thus it didn't have any data in it, so your original file wasn't empty. User error on my part.

> How to see usage of sacct other than combing histories and submit scripts?

There is the command sacctmgr show stats, but it doesn't exist in 16.05; it only exists in 17.02 and up. It's very much like the command sdiag - sdiag shows stats for slurmctld, while sacctmgr show stats shows stats for slurmdbd. You'd be interested in the bottom section, which shows RPC stats by type and by user (though not which user did which RPCs). Sample from my command in 20.02:

Remote Procedure Call statistics by message type
        SLURM_PERSIST_INIT  ( 6500) count:8 ave_time:1054  total_time:8438
        DBD_FINI            ( 1401) count:7 ave_time:621   total_time:4348
        DBD_GET_FEDERATIONS ( 1494) count:5 ave_time:864   total_time:4321
        DBD_REGISTER_CTLD   ( 1434) count:3 ave_time:28434 total_time:85303
        DBD_GET_ASSOCS      ( 1410) count:3 ave_time:7087  total_time:21263
        DBD_GET_RES         ( 1478) count:3 ave_time:1119  total_time:3357
        DBD_GET_USERS       ( 1415) count:3 ave_time:575   total_time:1727
        DBD_CLUSTER_TRES    ( 1407) count:3 ave_time:516   total_time:1548
        DBD_GET_TRES        ( 1486) count:3 ave_time:501   total_time:1505
        DBD_GET_QOS         ( 1448) count:3 ave_time:451   total_time:1354
        DBD_GET_JOBS_COND   ( 1444) count:2 ave_time:3462  total_time:6925
        DBD_GET_STATS       ( 1489) count:2 ave_time:94    total_time:188

Remote Procedure Call statistics by user
        marshall            ( 1000) count:45 ave_time:3117 total_time:140277

Unfortunately this isn't available in 16.05, so you'd have to upgrade to get it.

sacct is fine to run and shouldn't be problematic even when run fairly often, but it does cause problems when requesting large amounts of data that result in large mysql queries (as I explained).

> I did not actually set the innodb buffer size option until a few days before I submitted this bug.
> After setting it, it seemed like the slurmdb crashes increased in frequency, but the large writes to
> /var/tmp/systemdxxx-mariadb/tmp actually slowed down and stopped growing larger than 1-2GB. In fact,
> after a day or two the slurmdb service stopped crashing entirely. At the time of this comment, the
> service has not crashed in 5 days.

Since you haven't seen any slurmdbd crashes since increasing innodb_buffer_pool_size, I assume that fixed whatever problems you were seeing. Let's keep this bug open for a bit so you can monitor slurmdbd. If it crashes again, I'd be interested in the slurmdbd log file again and also the output of show engine innodb status.

You might also look at our accounting web page (https://slurm.schedmd.com/accounting.html). In addition to increasing innodb_buffer_pool_size and innodb_lock_wait_timeout (which you've already done), we also recommend increasing innodb_log_file_size.

RE the log files - I didn't see anything very relevant to this particular bug. Here are my thoughts on what I saw anyway:

I didn't see anything too interesting in the slurmctld log file besides the abundance of this set of errors:

[2020-07-21T10:17:31.786] error: slurm_receive_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential
[2020-07-21T10:17:31.786] error: slurm_receive_msg: Protocol authentication error
[2020-07-21T10:17:31.796] error: slurm_receive_msg [10.16.2.34:33698]: Protocol authentication error
[2020-07-21T10:17:31.796] error: invalid type trying to be freed 65534
[2020-07-21T10:17:32.807] error: Munge decode failed: Invalid credential

That looks like some of your nodes are having an issue with munge (MESSAGE_NODE_REGISTRATION_STATUS is a node "registering" with slurmctld). That would be good to look into, but it doesn't have anything to do with the database. It also looks like the slurmctld log level is probably at info (3), so there wasn't much to go off of.
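For reference, those InnoDB tuning recommendations would land in the site's innodb.cnf roughly as follows. The innodb_log_file_size value here is purely illustrative, not a prescribed number; note also that on MariaDB 5.5 (to the best of my knowledge) the log file size cannot be changed in place - it requires a clean shutdown and removal of the old ib_logfile* files so they can be regenerated at the new size.

```ini
[mysqld]
# Already in place per the posted innodb.cnf:
innodb_buffer_pool_size=100000M
innodb_lock_wait_timeout=900
# Recommended addition; 64M is an illustrative value only:
innodb_log_file_size=64M
```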
slurmdbd log file:

* 12 of these at the beginning of the log file (looks like after a restart of slurmdbd) and one each hour (at rollup):

[2020-07-21T09:52:26.711] error: id_assoc 321 doesn't have any tres

I've seen these errors before and I know we've done some fixes where this error has shown up, so it's possible whatever caused this has been fixed in a newer Slurm version.

* Invalid permissions for the log file directory:

[2020-07-21T10:17:23.141] error: chdir(/var/log): Permission denied
[2020-07-21T10:17:23.141] chdir to /var/tmp

Make sure that slurmdbd has the correct permissions for the file specified by LogFile in slurmdbd.conf.

* There are users in the database for which slurmdbd can't find UIDs:

[2020-07-21T10:17:23.248] debug: post user: couldn't get a uid for user guoshuai
[2020-07-21T10:17:23.253] debug: post user: couldn't get a uid for user hulb
[2020-07-21T10:17:23.261] debug: post user: couldn't get a uid for user jturner

It might be good to delete those users with sacctmgr del user if they aren't active users anymore.

* These happen once per minute, on the minute:

[2020-07-21T09:54:00.877] debug: DBD_INIT: CLUSTER:slurm_cluster VERSION:7680 UID:0 IP:10.16.255.254 CONN:13

This indicates something connecting to slurmdbd. Is there some cron job doing a query (running sacct, sacctmgr, or sreport) every minute? If so, what is it doing? If not, then I guess it's something within slurmctld in 16.05. I didn't see slurmctld connecting to slurmdbd every minute anywhere, but that said, that RPC hasn't been used since 16.05 (it was replaced).
Marshall,

Thanks for the update and the advice. I will continue to monitor the slurmdbd crash and update this bug if it crops up again.

"I didn't see anything too interesting in the slurmctld log file besides the abundance of this set of errors:
[2020-07-21T10:17:31.786] error: slurm_receive_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential"

Yeah, I recently upgraded BCM and had to wait to reboot a few nodes, which makes them complain about the munge credential. I have fixed these since the log upload.

"slurmdbd log file:
* 12 of these at the beginning of the log file (looks like after a restart of slurmdbd) and one each hour (at rollup):
[2020-07-21T09:52:26.711] error: id_assoc 321 doesn't have any tres
I've seen these errors before and I know we've done some fixes where this error has shown up, so it's possible whatever caused this has been fixed in a newer Slurm version."

Yeah, I have always seen this error and I've never been able to figure out the source, but it's never seemed to cause any problems, so I just left it alone.

"* Invalid permissions for log file directory:
[2020-07-21T10:17:23.141] error: chdir(/var/log): Permission denied
[2020-07-21T10:17:23.141] chdir to /var/tmp
Make sure that slurmdbd has the correct permissions for the file specified by LogFile in slurmdbd.conf."

Fixed

"* There are users in the database where slurmdbd can't find UID's:
[2020-07-21T10:17:23.248] debug: post user: couldn't get a uid for user guoshuai
[2020-07-21T10:17:23.253] debug: post user: couldn't get a uid for user hulb
[2020-07-21T10:17:23.261] debug: post user: couldn't get a uid for user jturner
It might be good to delete those users with sacctmgr del user if they aren't active users anymore."

Fixed; the del user command took a very long time to complete, 2+ minutes.
"* These happen once per minute on the minute: [2020-07-21T09:54:00.877] debug: DBD_INIT: CLUSTER:slurm_cluster VERSION:7680 UID:0 IP:10.16.255.254 CONN:13 This indicates something connecting to slurmdbd. Is there some cron job doing a query (running sacct, sacctgmr, or sreport) every minute? If so, what is it doing? If not, then I guess it's something within slurmctld in 16.05. I didn't see slurmctld connecting to slurmdbd every minute anywhere, but that said that RPC hasn't been used since 16.05 (it was replaced)." There is nothing that should be running that command with such regularity, unless it is something in BCM 7.3 that is monitoring the scheduler. I will open a ticket with them to check if this is expected. I know we need to look into updating slurm, I just feel like it will be tricky as it is built into this bright cluster manager deployment. I read that you can update the slurmdb up to 2 releases ahead of the slurm controller. If that is true, I will try to update the db to the latest 18 release. I also read that it is problematic to attempt the database migration on mariadb 5.5. Do you recommend that we update to mariadb 10 first? Thanks again for your help
Created attachment 15219 [details] slurmdbd log 7-29
Created attachment 15220 [details] slurmctld log 7-29
Hi Marshall,

The crash has cropped up again, along with the huge mariadb dir in /var/tmp:

[root@hn1-hyperion ~]# du -hs /var/tmp/*
0       /var/tmp/abrt
0       /var/tmp/lockINTEL
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-colord.service-MyKJRd
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-httpd.service-FaeVn8
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-named.service-sSxR6v
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-ntpd.service-aZ2kcV
0       /var/tmp/systemd-private-62293ed4114f48e8ab6a2090fb1349c1-rtkit-daemon.service-qxag1p
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-colord.service-31tvnD
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-httpd.service-a2Ww7T
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-mariadb.service-1p6QSm
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-named.service-ymF7BB
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-ntpd.service-17zESF
0       /var/tmp/systemd-private-7d7ca43e4cf84a5c9fa4b14a7058114f-rtkit-daemon.service-9Vg3t8
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-colord.service-3lldCn
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-httpd.service-8EcJwr
106G    /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-mariadb.service-SBzvv6
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-named.service-eXG5jQ
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-ntpd.service-Eui6Ap
0       /var/tmp/systemd-private-b77aedda9d2e485192f9d3000ba37997-rtkit-daemon.service-UGIMkA
20K     /var/tmp/yum-root-oRMbnC

I bumped the slurmctld logging up to debug3 and captured a crash in the attached logs. The crash occurred at 11:12am. Please let me know if you need any further info.

I have been crawling user submit scripts and histories to see if anyone has been making use of sacct or sacctmgr, but so far I haven't found anything obvious.
I have several users that use bulk submit scripts with some logic that queries squeue to get job info. Could a large number of rapid squeue or sinfo calls cause this problem? Some of these scripts have been in use for a long time and I've never seen any problems with them.
I'll look at your logs and get back to you. Has the slurmdbd been continuing to crash regularly like it was before, or did it only crash once?

squeue and sinfo don't communicate with slurmdbd at all, so they shouldn't affect slurmdbd. However, job submissions do cause slurmctld to send RPCs to slurmdbd so that slurmdbd can insert a new row into the job table. When the job runs, slurmdbd inserts additional rows into the step table for each job step (srun) in the job. Normally I wouldn't expect job submissions to cause a huge spike in memory usage in the database. However, there are a couple of things about this situation that make me suspicious:

(1) Since "sacctmgr del user" took 2+ minutes to complete (and it results in a relatively simple set of queries), it's apparent that even single queries are slow. If a few queries are that slow, then perhaps many job submissions could cause big issues. Reducing the size of the database further might help - if possible, I'd recommend purging more jobs, but I don't know what your site's policy on database records is. Another thing we have done to make the database faster is add more indexes to the database. I know that some indexes have been added since Slurm 16.05, and I'm not sure which indexes exist in 16.05.

(2) Since slurmctld and slurmdbd are on the same node, memory/CPU usage in one will affect the other. So I also wonder if CPU and memory usage in slurmctld spiked and caused some issues in slurmdbd. Having a loop that continuously calls squeue and sinfo without any sleep in between will definitely be problematic for slurmctld, since it generates tons of RPCs. We strongly recommend against that approach. If users need to poll slurmctld, they ought to space the calls at least a second apart, perhaps more.

Could you run the command sdiag and upload the output next time this happens? I'll need sdiag output from before restarting slurmctld, since the stats are reset when slurmctld restarts.
(The stats also reset every day at midnight.)

Do your users know about job arrays? A job array is a way to submit multiple jobs with a single command, which results in a single RPC to slurmctld and is much faster than submitting many jobs in parallel. Job arrays are ideal for a workload that runs the same program many times with different parameters.

https://slurm.schedmd.com/sbatch.html#OPT_array

If you're willing to risk another slurmdbd crash, could you (or one of your users) run a bulk submit script similar to what your users do, and monitor the slurmctld/slurmdbd node and the slurmctld and slurmdbd processes? I'd also like the output of sdiag run once every 5 minutes during this experiment.

Now responding to your previous comment:

> Fixed, the del user command took a very long time to complete, 2+ minutes.

Normally this shouldn't take that long. I'm guessing that because the database is so large, even simple mysql queries are taking a while. Also, what kind of hardware and filesystem is the database on? (SSD or HDD, local vs. shared filesystem.)

> I know we need to look into updating slurm, I just feel like it will be tricky
> as it is built into this bright cluster manager deployment. I read that you can
> update the slurmdb up to 2 releases ahead of the slurm controller. If that is
> true, I will try to update the db to the latest 18 release. I also read that it
> is problematic to attempt the database migration on mariadb 5.5. Do you
> recommend that we update to mariadb 10 first?

Yes, we recommend updating MariaDB first. Also yes, you can update by 2 major releases. However, you can't upgrade directly to 18.08. The major releases are numbered year.month (like Ubuntu) and are on a 9-month release cycle. So you are currently on 16.05; the next major release is 17.02, then 17.11, then 18.08. 17.02 and 17.11 are different major releases.
So 18.08 is actually 3 major releases above 16.05, and you may lose data if you try to upgrade directly to 18.08. Our current supported releases are 19.05 and 20.02. To get to 19.05, you'd first have to upgrade to 17.11 and then could upgrade to 19.05.

Also, the slurmdbd must be upgraded first, then slurmctld, then slurmd and slurmstepd, then all client commands. Or, everything can be upgraded at once. All binaries must be within 2 major releases of each other, so if you wanted to upgrade to 19.05, you would have to upgrade everything to 17.11 first, then could upgrade everything to 19.05.

Definitely create backups of the database and the slurmctld state files (which reside in "StateSaveLocation", defined in slurm.conf) before upgrading. This is documented in our upgrading guide - I recommend reading it before you plan an upgrade:

https://slurm.schedmd.com/quickstart_admin.html#upgrade

The rest of that web page (quickstart_admin.html) may have useful information for you, too. If/when you do plan an upgrade, I'd recommend upgrading all the way to 19.05 or 20.02.
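The polling guidance above (space status queries at least a second apart rather than hammering slurmctld in a tight loop) can be sketched as a small wrapper. POLL_CMD, INTERVAL and N_POLLS are placeholders, not anything from this thread; a real script would substitute something like `squeue -u "$USER" -h` and loop until the job finishes:

```shell
#!/bin/sh
# Sketch of rate-limited polling (assumed names, not a real site script).
POLL_CMD=${POLL_CMD:-"date"}   # stand-in for squeue/sacct
INTERVAL=${INTERVAL:-2}        # seconds between polls; never go below 1
N_POLLS=${N_POLLS:-3}          # a real script would loop until the job ends
i=0
while [ "$i" -lt "$N_POLLS" ]; do
    $POLL_CMD                  # one RPC per interval instead of a busy loop
    sleep "$INTERVAL"
    i=$((i + 1))
done
```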
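The job-array suggestion above can be illustrated with a minimal sbatch script. File names and the array bounds here are made up for illustration; the key point is that one sbatch call submits all tasks with a single RPC to slurmctld, and each task reads its index from SLURM_ARRAY_TASK_ID:

```shell
#!/bin/bash
#SBATCH --array=1-100%10          # 100 tasks, at most 10 running at once
#SBATCH --output=sweep_%A_%a.out  # %A = array job ID, %a = task index
# Illustrative sketch only: pick this task's work item from the array index.
TASK_ID=${SLURM_ARRAY_TASK_ID:-1} # fallback so the sketch runs outside Slurm
echo "processing input_${TASK_ID}.dat"
```

Submitted once with `sbatch script.sh`, this replaces 100 separate sbatch invocations.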
Hi Marshall,

We're back to the constant crashes, every hour or so right now. Here is the sdiag output right before and right after a crash:

[root@hn1-hyperion ~]# free -hlm
       total  used  free  shared  buff/cache  available
Mem:   125G   116G  712M  21M     8.7G        8.0G
Low:   125G   124G  712M
High:  0B     0B    0B
Swap:  15G    14G   1.7G

[root@hn1-hyperion ~]# sdiag
*******************************************************
sdiag output at Fri Jul 31 09:54:49 2020
Data since      Fri Jul 31 07:51:32 2020
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 91
Jobs started:   77
Jobs completed: 59
Jobs canceled:  4
Jobs failed:    0

Main schedule statistics (microseconds):
    Last cycle:   3434
    Max cycle:    318046
    Total cycles: 209
    Mean cycle:   5056
    Mean depth cycle:  128
    Cycles per minute: 1
    Last queue length: 170

Backfilling stats
    Total backfilled jobs (since last slurm start): 40
    Total backfilled jobs (since last stats cycle start): 40
    Total cycles: 206
    Last cycle when: Fri Jul 31 09:54:39 2020
    Last cycle: 1265995
    Max cycle:  3962871
    Mean cycle: 2262471
    Last depth cycle: 170
    Last depth cycle (try sched): 170
    Depth Mean: 144
    Depth Mean (try depth): 144
    Last queue length: 170
    Queue length mean: 147

Remote Procedure Call statistics by message type
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:1892 ave_time:209653 total_time:396664632
    REQUEST_PING                     ( 1008) count:254  ave_time:666    total_time:169329
    REQUEST_PARTITION_INFO           ( 2009) count:211  ave_time:3043   total_time:642105
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:133  ave_time:29134  total_time:3874919
    REQUEST_NODE_INFO                ( 2007) count:132  ave_time:8891   total_time:1173707
    REQUEST_JOB_INFO                 ( 2003) count:127  ave_time:26634  total_time:3382531
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:80   ave_time:27204  total_time:2176324
    REQUEST_JOB_USER_INFO            ( 2039) count:69   ave_time:34717  total_time:2395527
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:63   ave_time:60636  total_time:3820129
    REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:55   ave_time:71047  total_time:3907585
    REQUEST_JOB_STEP_CREATE          ( 5001) count:55   ave_time:2163   total_time:119010
    REQUEST_STEP_COMPLETE            ( 5016) count:50   ave_time:56030  total_time:2801504
    REQUEST_KILL_JOB                 ( 5032) count:4    ave_time:2171   total_time:8684
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:4    ave_time:860    total_time:3442
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:2    ave_time:558    total_time:1116
    ACCOUNTING_REGISTER_CTLD         (10003) count:1    ave_time:169572 total_time:169572
    REQUEST_STATS_INFO               ( 2035) count:0    ave_time:0      total_time:0

Remote Procedure Call statistics by user
    root     (    0) count:2775 ave_time:148291 total_time:411509098
    zhonghua ( 1343) count:85   ave_time:15131  total_time:1286167
    kyuan    ( 1331) count:80   ave_time:6255   total_time:500427
    dsahsah  ( 1233) count:53   ave_time:32806  total_time:1738750
    ammals   ( 1007) count:33   ave_time:52698  total_time:1739064
    adr1     ( 1193) count:24   ave_time:7565   total_time:181573
    rajbansh ( 1111) count:20   ave_time:146034 total_time:2920692
    klepov   ( 1376) count:14   ave_time:16930  total_time:237025
    yh5      ( 1236) count:12   ave_time:45469  total_time:545632
    jwk      ( 1098) count:10   ave_time:1175   total_time:11752
    bh25     ( 1398) count:8    ave_time:20134  total_time:161075
    lin9     ( 1418) count:4    ave_time:14759  total_time:59039
    doranrm  ( 1081) count:4    ave_time:15512  total_time:62049
    jr23     ( 1362) count:3    ave_time:21321  total_time:63964
    richards ( 1058) count:2    ave_time:15539  total_time:31078
    jshu     ( 1326) count:2    ave_time:14347  total_time:28694
    mclauchc ( 1249) count:2    ave_time:32232  total_time:64465
    slurm    (  450) count:1    ave_time:169572 total_time:169572

[root@hn1-hyperion ~]# sdiag
*******************************************************
sdiag output at Fri Jul 31 09:57:09 2020
Data since      Fri Jul 31 07:51:32 2020
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 91
Jobs started:   77
Jobs completed: 59
Jobs canceled:  4
Jobs failed:    0

Main schedule statistics (microseconds):
    Last cycle:   36010
    Max cycle:    318046
    Total cycles: 211
    Mean cycle:   5195
    Mean depth cycle:  129
    Cycles per minute: 1
    Last queue length: 170

Backfilling stats
    Total backfilled jobs (since last slurm start): 40
    Total backfilled jobs (since last stats cycle start): 40
    Total cycles: 209
    Last cycle when: Fri Jul 31 09:56:56 2020
    Last cycle: 1388147
    Max cycle:  3962871
    Mean cycle: 2248662
    Last depth cycle: 170
    Last depth cycle (try sched): 170
    Depth Mean: 144
    Depth Mean (try depth): 144
    Last queue length: 170
    Queue length mean: 148

Remote Procedure Call statistics by message type
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:1892 ave_time:209653 total_time:396664632
    REQUEST_PING                     ( 1008) count:260  ave_time:790    total_time:205602
    REQUEST_PARTITION_INFO           ( 2009) count:214  ave_time:3003   total_time:642731
    REQUEST_NODE_INFO                ( 2007) count:135  ave_time:8867   total_time:1197055
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:133  ave_time:29134  total_time:3874919
    REQUEST_JOB_INFO                 ( 2003) count:130  ave_time:26996  total_time:3509554
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:80   ave_time:27204  total_time:2176324
    REQUEST_JOB_USER_INFO            ( 2039) count:69   ave_time:34717  total_time:2395527
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:63   ave_time:60636  total_time:3820129
    REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:55   ave_time:71047  total_time:3907585
    REQUEST_JOB_STEP_CREATE          ( 5001) count:55   ave_time:2163   total_time:119010
    REQUEST_STEP_COMPLETE            ( 5016) count:50   ave_time:56030  total_time:2801504
    REQUEST_KILL_JOB                 ( 5032) count:4    ave_time:2171   total_time:8684
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:4    ave_time:860    total_time:3442
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:2    ave_time:558    total_time:1116
    ACCOUNTING_REGISTER_CTLD         (10003) count:1    ave_time:169572 total_time:169572
    REQUEST_STATS_INFO               ( 2035) count:1    ave_time:307    total_time:307

Remote Procedure Call statistics by user
    root     (    0) count:2791 ave_time:147508 total_time:411696675
    zhonghua ( 1343) count:85   ave_time:15131  total_time:1286167
    kyuan    ( 1331) count:80   ave_time:6255   total_time:500427
    dsahsah  ( 1233) count:53   ave_time:32806  total_time:1738750
    ammals   ( 1007) count:33   ave_time:52698  total_time:1739064
    adr1     ( 1193) count:24   ave_time:7565   total_time:181573
    rajbansh ( 1111) count:20   ave_time:146034 total_time:2920692
    klepov   ( 1376) count:14   ave_time:16930  total_time:237025
    yh5      ( 1236) count:12   ave_time:45469  total_time:545632
    jwk      ( 1098) count:10   ave_time:1175   total_time:11752
    bh25     ( 1398) count:8    ave_time:20134  total_time:161075
    lin9     ( 1418) count:4    ave_time:14759  total_time:59039
    doranrm  ( 1081) count:4    ave_time:15512  total_time:62049
    jr23     ( 1362) count:3    ave_time:21321  total_time:63964
    richards ( 1058) count:2    ave_time:15539  total_time:31078
    jshu     ( 1326) count:2    ave_time:14347  total_time:28694
    mclauchc ( 1249) count:2    ave_time:32232  total_time:64465
    slurm    (  450) count:1    ave_time:169572 total_time:169572

[root@hn1-hyperion ~]# free -hlm
       total  used  free  shared  buff/cache  available
Mem:   125G   27G   95G   14M     2.8G        96G
Low:   125G   30G   95G
High:  0B     0B    0B
Swap:  15G    15G   1.4M
[root@hn1-hyperion ~]#

I have been slowly reducing the purge period; last night I reduced it to 800 days. Is there some logged indication of whether the purge completed successfully or encountered errors? Do I just need to query the db every time to make sure older records are being pruned?

The slurmctld process stays stable throughout all of these slurmdbd issues. At the time of this crash it was using around 10GB of memory.

I will look into testing a bulk submit. The few users that I know employ these methods have not even been on the cluster during this latest period of crashes. I have been looking at all of the scheduled jobs as well as user histories, and I do not see any serious abuse of sacct. Most users just call sacct and squeue every so often to check job status.

We have plenty of users that make use of arrays, but our scheduler just runs in FIFO backfill mode, so we have limited the overall size of arrays to prevent a few users from monopolizing the queue with array jobs.
> Normally this shouldn't take that long. I'm guessing because the database is so
> large even simple mysql queries are taking a while. Also what kind of hardware
> and filesystem is the database on? (ssd or hdd, local vs shared filesystem).

The headnode is a dual-socket Xeon 2650 with 126GB RAM. The database is on a local 2TB RAID-5 XFS filesystem.
Apologies for the delay in response.

> I have been slowly reducing the purge period, last night I reduced to 800
> days. Is there some logged indication of whether the purge completed
> successfully or encountered errors? Do I just need to query the db every time
> to make sure older records are being pruned?

There isn't a "success" message at all without debug flags, though there are error messages if something goes wrong. Querying the database is a fine way to check whether it succeeded. You can turn on the db_archive debug flag in slurmdbd.conf (DebugFlags=db_archive), though this will be very verbose. There are other debug flags we can set in slurmdbd.conf, though I'm hesitant to do so because they are so verbose. We might need to resort to doing that for a short time, though.

https://slurm.schedmd.com/archive/slurm-16.05-latest/slurmdbd.conf.html

Let's get some other information first and then see where we're at.

Can you upload the dmesg (or similar) log from a recent time when slurmdbd died, to confirm whether it was OOM-killed and what else happened?

How many rows are in the job and step tables in the database?

SELECT COUNT(*) FROM <table_name>;

table_name is <clustername>_job_table and <clustername>_step_table, where <clustername> is ClusterName in slurm.conf.

Though it may not be relevant, can you also upload the db log (/var/log/mariadb/mariadb.log, or wherever the log file is)?

Can you tell if Bright is querying the database at any regular interval? I think we may have seen issues with this in the past.

We do also encourage you to make a plan to upgrade Slurm to 19.05 or 20.02, as we can give you the best support when you're on a supported release. As I said before, though, upgrading MariaDB and reducing the size of the database are good ideas before upgrading.
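For reference, the purge window being walked down in this thread is controlled by settings in slurmdbd.conf. A sketch for keeping roughly a year of records (option names from the slurmdbd.conf man page; the exact retention period is a site policy decision):

```ini
# slurmdbd.conf fragment (sketch): keep ~12 months of accounting records.
# Units may be hours, days, or months; purges run incrementally.
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeStepAfter=12months
PurgeSuspendAfter=12months
```

Walking these values down gradually, as described above, avoids one enormous delete transaction against a large database.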
Created attachment 15396 [details] mariadb log
> Though it may not be relevant, can you also upload the db log
> (/var/log/mariadb/mariadb.log or wherever the log file is).

Attached to this case.

> Can you upload dmesg (or similar) log during a recent time when slurmdbd died
> to confirm if it was OOM-killed and what else happened?

[657500.790955] DOM Worker invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[657500.790965] DOM Worker cpuset=/ mems_allowed=0-1
[657500.790968] CPU: 10 PID: 25133 Comm: DOM Worker Not tainted 3.10.0-514.2.2.el7.x86_64 #1
[657500.790970] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 10/17/2018
[657500.790971] ffff88119f902f10 000000002947a177 ffff8810f0053a78 ffffffff816861cc
[657500.790974] ffff8810f0053b08 ffffffff81681177 ffffffff810eabac ffff88074cef4fa0
[657500.790976] ffff88074cef4fb8 0000000000000202 ffff88119f902f10 ffff8810f0053af8
[657500.790979] Call Trace:
[657500.790987] [<ffffffff816861cc>] dump_stack+0x19/0x1b
[657500.790992] [<ffffffff81681177>] dump_header+0x8e/0x225
[657500.791004] [<ffffffff810eabac>] ? ktime_get_ts64+0x4c/0xf0
[657500.791009] [<ffffffff8113ccbf>] ? delayacct_end+0x8f/0xb0
[657500.791016] [<ffffffff8118476e>] oom_kill_process+0x24e/0x3c0
[657500.791020] [<ffffffff8118420d>] ? oom_unkillable_task+0xcd/0x120
[657500.791022] [<ffffffff811842b6>] ? find_lock_task_mm+0x56/0xc0
[657500.791027] [<ffffffff810937ee>] ? has_capability_noaudit+0x1e/0x30
[657500.791030] [<ffffffff81184fa6>] out_of_memory+0x4b6/0x4f0
[657500.791034] [<ffffffff81681c80>] __alloc_pages_slowpath+0x5d7/0x725
[657500.791037] [<ffffffff8118b0b5>] __alloc_pages_nodemask+0x405/0x420
[657500.791043] [<ffffffff811d221a>] alloc_pages_vma+0x9a/0x150
[657500.791047] [<ffffffff811b14df>] handle_mm_fault+0xc6f/0xfe0
[657500.791054] [<ffffffff81691c94>] __do_page_fault+0x154/0x450
[657500.791057] [<ffffffff81691fc5>] do_page_fault+0x35/0x90
[657500.791061] [<ffffffff8168e288>] page_fault+0x28/0x30
[657500.791063] Mem-Info:
[657500.791070] active_anon:30964943 inactive_anon:1394152 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:0 dirty:5 writeback:8 unstable:0 slab_reclaimable:32367 slab_unreclaimable:34663 mapped:13736 shmem:14244 pagetables:79492 bounce:0 free:131784 free_pcp:7721 free_cma:0
[657500.791073] Node 0 DMA free:15508kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[657500.791077] lowmem_reserve[]: 0 1653 64115 64115
[657500.791080] Node 0 DMA32 free:253216kB min:3372kB low:4212kB high:5056kB active_anon:1037664kB inactive_anon:366116kB active_file:24kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1948156kB managed:1695140kB mlocked:0kB dirty:0kB writeback:0kB mapped:96kB shmem:364kB slab_reclaimable:2764kB slab_unreclaimable:3736kB kernel_stack:336kB pagetables:5088kB unstable:0kB bounce:0kB free_pcp:3892kB local_pcp:100kB free_cma:0kB writeback_tmp:0kB pages_scanned:496 all_unreclaimable? yes
[657500.791084] lowmem_reserve[]: 0 0 62461 62461
[657500.791087] Node 0 Normal free:127284kB min:127280kB low:159100kB high:190920kB active_anon:60227064kB inactive_anon:2545224kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:65011712kB managed:63961056kB mlocked:0kB dirty:20kB writeback:32kB mapped:16820kB shmem:16708kB slab_reclaimable:56716kB slab_unreclaimable:79132kB kernel_stack:10192kB pagetables:190088kB unstable:0kB bounce:0kB free_pcp:16288kB local_pcp:740kB free_cma:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
[657500.791091] lowmem_reserve[]: 0 0 0 0
[657500.791094] Node 1 Normal free:131128kB min:131452kB low:164312kB high:197176kB active_anon:62595044kB inactive_anon:2665268kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:67108864kB managed:66057968kB mlocked:0kB dirty:0kB writeback:0kB mapped:38028kB shmem:39904kB slab_reclaimable:69988kB slab_unreclaimable:55784kB kernel_stack:11648kB pagetables:122792kB unstable:0kB bounce:0kB free_pcp:10816kB local_pcp:672kB free_cma:0kB writeback_tmp:0kB pages_scanned:3377 all_unreclaimable? yes
[657500.791098] lowmem_reserve[]: 0 0 0 0
[657500.791103] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 0*64kB 1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15508kB
[657500.791111] Node 0 DMA32: 387*4kB (UEM) 351*8kB (UEM) 502*16kB (UEM) 796*32kB (UEM) 879*64kB (UEM) 320*128kB (UEM) 141*256kB (UEM) 84*512kB (UEM) 32*1024kB (UEM) 3*2048kB (M) 0*4096kB = 253092kB
[657500.791122] Node 0 Normal: 13735*4kB (UE) 8664*8kB (UEM) 76*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 125468kB
[657500.791129] Node 1 Normal: 10066*4kB (UEM) 9046*8kB (UEM) 1226*16kB (UEM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 132248kB
[657500.791136] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[657500.791138] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[657500.791139] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[657500.791141] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[657500.791143] 645724 total pagecache pages
[657500.791145] 630656 pages in swap cache
[657500.791146] Swap cache stats: add 17771306, delete 17140650, find 6891068/7649844
[657500.791147] Free swap  = 0kB
[657500.791148] Total swap = 16777212kB
[657500.791150] 33521181 pages RAM
[657500.791151] 0 pages HighMem/MovableOnly
[657500.791152] 588664 pages reserved
[657500.791152] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[657500.791162] [  537]     0   537    56805    21489     114       73             0 systemd-journal
[657500.791164] [  558]     0   558    29723        0      26       83             0 lvmetad
[657500.791166] [  579]     0   579    11210        2      22      421         -1000 systemd-udevd
[657500.791170] [  916]     0   916     4287        0      12       59             0 rpc.idmapd
[657500.791171] [  927]     0   927   100444      192      48      112             0 accounts-daemon
[657500.791174] [  931]    81   931     7544      358      19       80          -900 dbus-daemon
[657500.791176] [  932]     0   932     1742       11       7       15             0 mdadm
[657500.791177] [  950]     0   950     1092       21       6       18             0 rngd
[657500.791179] [  956]     0   956   104831       88      70      265             0 ModemManager
[657500.791181] [  959]     0   959    50303       10      35      113             0 gssproxy
[657500.791182] [  962]     0   962     5704        8      13       94             0 ipmievd
[657500.791184] [ 1000]   172  1000    41162       26      16       27             0 rtkit-daemon
[657500.791186] [ 1003]   422  1003     2131        7       9       29             0 lsmd
[657500.791187] [ 1004]     0  1004     4858       57      13       54             0 irqbalance
[657500.791189] [ 1007]     0  1007    53192      118      56      332             0 abrtd
[657500.791190] [ 1011]     0  1011    52566       11      55      327             0 abrt-watch-log
[657500.791192] [ 1016]     0  1016    52566        2      55      337             0 abrt-watch-log
[657500.791193] [ 1025]     0  1025    31968       12      17      113             0 smartd
[657500.791195] [ 1029]     0  1029     7110       39      18       53             0 systemd-logind
[657500.791196] [ 1030]     0  1030     4202        4      12       42             0 alsactl
[657500.791198] [ 1037]     0  1037     1639        1       6       33             0 mcelog
[657500.791199] [ 1040]   999  1040   136478      473      62     1427             0 polkitd
[657500.791201] [ 1062]     0  1062    28876       92      10       22             0 ksmtuned
[657500.791202] [ 1509]     0  1509   119578     9459     128      128             0 rsyslogd
[657500.791204] [ 1526]    32  1526    16237       35      34      103             0 rpcbind
[657500.791206] [ 1528]     0  1528    20617       23      42      192         -1000 sshd
[657500.791207] [ 1534]   415  1534   532688    57842     225     9293             0 slapd
[657500.791209] [ 1535]     0  1535    10271      161      24      141             0 rpc.mountd
[657500.791210] [ 1589]     0  1589   118840       91      51      681             0 gdm
[657500.791212] [ 1591]     0  1591     6461        3      17       48             0 atd
[657500.791214] [ 1662]    27  1662    28314        0      10       70             0 mysqld_safe
[657500.791215] [ 2198]    25  2198   475718     9924     128    15317             0 named
[657500.791217] [ 2344]    27  2344 30032258  7460571   22758  3899669         -1000 mysqld
[657500.791219] [ 2497]     0  2497    94056     1486     183      173             0 httpd
[657500.791220] [ 2535]    29  2535    11647       72      26      170             0 rpc.statd
[657500.791223] [ 2543]    65  2543   112506      347      53      327             0 nslcd
[657500.791226] [ 2562]     0  2562   128354     1461     153     7647             0 Xorg
[657500.791227] [ 2579]     0  2579    31554       23      17      130             0 crond
[657500.791229] [ 2581]     0  2581    26852        0      20      130             0 cm-nfs-checker
[657500.791239] [ 3223]     2  3223    56326      267      21       93             0 munged
[657500.791240] [ 3282]     0  3282    92985      107      61      301             0 upowerd
[657500.791242] [ 3333]   418  3333   104128        0      57      944             0 colord
[657500.791243] [ 3420]     0  3420    12728        0      28      145             0 wpa_supplicant
[657500.791245] [ 3433]     0  3433   101298       96      51      172             0 packagekitd
[657500.791246] [ 3444]     0  3444    95821      156      48      222             0 udisksd
[657500.791248] [ 3566]     0  3566    28812        2      12       60         -1000 safe_cmd
[657500.791249] [ 3579]     0  3579  2230126   888362    2462   152621         -1000 cmd
[657500.791251] [ 4067]     0  4067    94924      106      71     1276             0 gdm-session-wor
[657500.791253] [ 4360]    38  4360     8938       55      22      117             0 ntpd
[657500.791254] [ 5582]     0  5582   115881        0      56      289             0 gnome-keyring-d
[657500.791256] [ 5670]     0  5670   136711      131     109      357             0 gnome-session
[657500.791257] [ 5677]     0  5677     3486        0      10       47             0 dbus-launch
[657500.791259] [ 5678]     0  5678     7443       64      19      246             0 dbus-daemon
[657500.791260] [ 5745]     0  5745    90481        0      38      169             0 gvfsd
[657500.791262] [ 5750]     0  5750   118525        0      49      205             0 gvfsd-fuse
[657500.791263] [ 5835]     0  5835    13215       10      28      131             0 ssh-agent
[657500.791264] [ 5871]     0  5871    86495        0      49      216             0 at-spi-bus-laun
[657500.791266] [ 5876]     0  5876     7380       16      19      326             0 dbus-daemon
[657500.791267] [ 5878]     0  5878    47245        0      34      174             0 at-spi2-registr
[657500.791269] [ 5886]     0  5886   263022     1255     214     2608             0 gnome-settings-
[657500.791271] [ 5906]     0  5906   118364       85      90      302             0 pulseaudio
[657500.791272] [ 5918]     0  5918   420263    44027     418    18937             0 gnome-shell
[657500.791274] [ 5924]   419  5924   108555      111      61     1302             0 geoclue
[657500.791275] [ 5929]     0  5929   140934        0      97      412             0 gsd-printer
[657500.791277] [ 5974]     0  5974   111913       33      48      503             0 ibus-daemon
[657500.791278] [ 5979]     0  5979   147502        0      93      783             0 gnome-shell-cal
[657500.791280] [ 5984]     0  5984   316938      225     181      807             0 evolution-sourc
[657500.791281] [ 5986]     0  5986    94198        0      45      217             0 ibus-dconf
[657500.791282] [ 5988]     0  5988   110550        0     105      529             0 ibus-x11
[657500.791284] [ 5999]     0  5999   171300        0     146     1223             0 goa-daemon
[657500.791285] [ 6008]     0  6008    96895       60      91      330             0 goa-identity-se
[657500.791287] [ 6015]     0  6015   113241       65      68      304             0 mission-control
[657500.791288] [ 6020]     0  6020   137418      150      94      685             0 caribou
[657500.791290] [ 6027]     0  6027    95523       56      47      253             0 gvfs-udisks2-vo
[657500.791291] [ 6033]     0  6033    88773        0      34      164             0 gvfs-goa-volume
[657500.791293] [ 6038]     0  6038    94238        0      43      233             0 gvfs-gphoto2-vo
[657500.791294] [ 6043]     0  6043   117755        0      55      320             0 gvfs-afc-volume
[657500.791296] [ 6049]     0  6049    91970        0      39      183             0 gvfs-mtp-volume
[657500.791297] [ 6053]     0  6053   189834      113     155     1972             0 nautilus
[657500.791298] [ 6074]     0  6074   140384      186      67     1977             0 tracker-store
[657500.791300] [ 6080]     0  6080   139644        0      91      821             0 tracker-extract
[657500.791301] [ 6082]     0  6082   113424        0      73      729             0 tracker-miner-a
[657500.791303] [ 6086]     0  6086   151442      163      84      698             0 tracker-miner-f
[657500.791304] [ 6087]     0  6087   113373        0      72      651             0 tracker-miner-u
[657500.791306] [ 6094]     0  6094   130347      722     143      795             0 abrt-applet
[657500.791308] [ 6165]     0  6165   112276       74      50      158             0 gvfsd-trash
[657500.791309] [ 6187]     0  6187    75068        0      36      160             0 gvfsd-metadata
[657500.791310] [ 6191]     0  6191    77285        0      42      204             0 ibus-engine-sim
[657500.791312] [ 6195]     0  6195    41555        0      24      113             0 dconf-service
[657500.791313] [ 6204]     0  6204   261179        0     193     9506             0 evolution-calen
[657500.791315] [ 6258]   450  6258  3164656   187959     604    32700             0 slurmctld
[657500.791316] [ 6533]     0  6533   136499        0     124     3020             0 gnome-terminal-
[657500.791318] [ 6541]     0  6541     2120        1       9       29             0 gnome-pty-helpe
[657500.791320] [ 6542]     0  6542    29076        2      13      345             0 bash
[657500.791322] [17656]     0 17656     4868       50      13      131             0 lmgrd
[657500.791324] [17663]     0 17663    20519      161      13       88             0 INTEL
[657500.791325] [ 3747]     0  3747    37965       23      75      312             0 sshd
[657500.791327] [ 3750]     0  3750    29091      261      12      102             0 bash
[657500.791328] [13232]     0 13232    22766       21      42      238             0 master
[657500.791330] [13234]    89 13234    23359       40      45      241             0 qmgr
[657500.791332] [ 4275]     0  4275    37291      242      74       78             0 sshd
[657500.791333] [ 4278]  1242  4278    37359      309      72       77             0 sshd
[657500.791335] [ 4279]  1242  4279    30628      155      14      139             0 bash
[657500.791336] [ 6457]  1242  6457     5603       23      14       49             0 dbus-launch
[657500.791337] [ 6547]  1242  6547     7147       11      18       75             0 dbus-daemon
[657500.791339] [ 8057]  1242  8057    37965       99      29        6             0 gconfd-2
[657500.791341] [15393]    48 15393    95102     1459     179      174             0 httpd
[657500.791342] [15394]    48 15394    95102     1464     179      170             0 httpd
[657500.791344] [15395]    48 15395    95102     1455     179      171             0 httpd
[657500.791346] [15396]    48 15396    95102     1455     179      171             0 httpd
[657500.791347] [15397]    48 15397    95102     1455     179      171             0 httpd
[657500.791349] [11907]   450 11907 23014313 22705300   44702       24             0 slurmdbd
[657500.791350] [24347]     0 24347    37291      320      74        0             0 sshd
[657500.791352] [24350]  1002 24350    37794      823      73        0             0 sshd
[657500.791353] [24351]  1002 24351    30630      278      14       18             0 bash
[657500.791355] [24433]  1002 24433    28282       60      10        0             0 cmgui
[657500.791356] [24441]  1002 24441   408706   165061     724      388             0 firefox
[657500.791358] [24689]  1002 24689     5604       71      14        0             0 dbus-launch
[657500.791359] [24735]  1002 24735     7148       82      18        0             0 dbus-daemon
[657500.791361] [25010]  1002 25010    37966      105      29        0             0 gconfd-2
[657500.791362] [29487]  1002 29487   117284      353      88        0             0 pulseaudio
[657500.791363] [32481]  1242 32481   117282      347      88        0             0 pulseaudio
[657500.791365] [ 9040]  1242  9040    28282       57      10        0             0 cmgui
[657500.791366] [ 9048]  1242  9048   424659   181245     751      441             0 firefox
[657500.791368] [13081]     0 13081     3990      704      13        0             0 dhcpd
[657500.791370] [ 9240]    89  9240    23317      253      45        0             0 pickup
[657500.791374] [11656]     0 11656    26974       21       8        0             0 sleep
[657500.791376] [12268]     0 12268     5255       76      15        0             0 sacct
[657500.791378] [12596]     0 12596     5163       23      12        0             0 systemctl
[657500.791380] Out of memory: Kill process 11907 (slurmdbd) score 612 or sacrifice child
[657500.791383] Killed process 11907 (slurmdbd) total-vm:92057252kB, anon-rss:90821200kB, file-rss:0kB, shmem-rss:0kB

> How many rows are in the job and step tables in the database?
MariaDB [(none)]> use slurm_acct_db
Database changed
MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM SLURM_CLUSTER_job_table;
ERROR 1146 (42S02): Table 'slurm_acct_db.SLURM_CLUSTER_job_table' doesn't exist
MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM slurm_cluster_job_table;
+----------+
| COUNT(*) |
+----------+
|  4655245 |
+----------+
1 row in set (3.60 sec)

MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM slurm_cluster_step_table;
+----------+
| COUNT(*) |
+----------+
| 30034605 |
+----------+
1 row in set (13.05 sec)

Looks like jobs are definitely purging, as we are at almost 9m lifetime jobs on this cluster. ibdata1 is still at 24GB in /var/lib/mysql.

> Can you tell if Bright is querying the database at any regular interval? I
> think we may have seen issues with this in the past.

Yes, the cluster manager daemon runs an sacct query every minute. I am going to turn off the Bright job collection entirely this Friday during my maintenance window.

> We do also encourage you to make a plan to upgrade Slurm to 19.05 or 20.02, as
> we can give you the best support when you're on a supported release. As I said
> before though, upgrading mariadb and also reducing the size of the database
> are good ideas before upgrading.

That is my intention, but Bright can't verify that this version of Bright will run OK on MariaDB 10. I suspect it probably will, but they won't confirm. Basically, I would have to forklift-upgrade the entire cluster to Bright 8 to use MariaDB 10, and I am not anxious to do that, as it entails a bunch of kernel updates that will break a lot of researcher software. I think what I will need to do is limp along in the current config for the rest of the year, slowly recompiling our software catalog for the new kernel, until our longer annual maintenance window in December.

slurmdbd did crash again early this morning, and I am at 400 days on the purge settings. The huge mariadb tmp folders have not been an issue in over 2 weeks, however.
Hopefully we are heading in the right direction.
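To complement the row counts above, per-table on-disk size can be pulled from information_schema (the schema name slurm_acct_db matches the session above). Note that for InnoDB, table_rows is an estimate, and a shared ibdata1 file never shrinks after purges unless innodb_file_per_table is in use:

```sql
-- Sketch: ten largest tables in the accounting database by on-disk size.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb,
       table_rows                       -- estimate only for InnoDB tables
FROM information_schema.TABLES
WHERE table_schema = 'slurm_acct_db'
ORDER BY data_length + index_length DESC
LIMIT 10;
```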
> Yes the cluster manager daemon runs an sacct query every minute. I am going to
> turn off the bright job collection entirely this friday during my maintenance
> window.

What exact sacct query was Bright running? Have you noticed a difference since turning it off?

The mariadb log contains lots of this message:

200715 18:07:57 [Warning] mysqld: Disk is full writing '/var/tmp/#sql_8e8_49.MAI' (Errcode: 28). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)

This is basically what you already told us - that the disk was filling up. But you did say you don't seem to have had this issue (of mariadb tmp folders filling up) recently. Have these warnings appeared recently in the log?
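Since the "Disk is full writing '/var/tmp/#sql...'" warnings show MariaDB's on-disk temporary tables landing under /var/tmp (the systemd-private paths in the du output suggest systemd's PrivateTmp is redirecting them there), one possible mitigation, sketched here with a made-up path, is to point mysqld's tmpdir at a filesystem with more headroom:

```ini
# my.cnf fragment (sketch): relocate MariaDB's on-disk temp tables.
# /data/mysqltmp is a made-up path; it must exist, be writable by the
# mysql user, and sit on a filesystem large enough for worst-case sorts.
[mysqld]
tmpdir=/data/mysqltmp
```

This doesn't fix whatever query is building a 100GB+ temp table, but it keeps that temp table from filling /var and taking down the headnode.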
Do you have any updates on this?
Hi Marshall,

I think the database purge was the answer. I walked it down 100 days at a time, to currently 600 days on all options, and everything has been good since then. I did disable that Bright sacct query (I'm not sure of the exact options they were using), but the slurm service had already been running fine for over a week at that point. As of today, slurmdbd has been running for 4 weeks, and there have been no more issues with mariadb filling up /var/tmp.

Thanks very much for all your help. I think we can close this case.
Thanks for the update. I'm glad things are working.