Ticket 16167 - Slurm gets stuck for 25k+ array jobs
Summary: Slurm gets stuck for 25k+ array jobs
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 21.08.8
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-03-02 05:21 MST by Kapil Sawate
Modified: 2024-07-30 06:31 MDT

See Also:
Site: -Other-
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: Ubuntu
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Kapil Sawate 2023-03-02 05:21:35 MST
Hi,
We have a cluster of around 50k cores on which almost every job is an array job. Execution times range from 30 s up to a maximum of 24 hrs (80% of jobs run 5-10 min). Slurm responsiveness (squeue, sacct, sinfo, sbatch) degrades whenever there is a burst of Slurm calls, e.g. mass job submission, cancellation, or preemption. Can you please help us fine-tune our cluster?

The slurmctld and slurmdbd run on the same Linux server (64 cores, 256 GB of RAM), and all Slurm communication goes over Mellanox high-speed Ethernet (25 Gb/s).
When Slurm operations slow down, we see the slurmctld server thread count hit 256 and the slurmdbd queue length grow very large.

The munge auth service is running with 10 threads.
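To correlate the slowdowns with RPC pressure, we sample the two sdiag fields mentioned above over time. A minimal sketch (the `parse_sdiag` helper name is ours, not a Slurm command):

```shell
#!/bin/sh
# parse_sdiag: extract the slurmctld server thread count and the slurmdbd
# agent queue size from `sdiag` output read on stdin.
parse_sdiag() {
    awk '/Server thread count:/  { t = $NF }
         /DBD Agent queue size:/ { q = $NF }
         END { printf "threads=%s dbd_queue=%s\n", t, q }'
}

# On a live controller one would run, e.g.:
#   while true; do sdiag | parse_sdiag; sleep 30; done
# Demo against a captured sdiag fragment:
printf 'Server thread count:  256\nDBD Agent queue size: 2\n' | parse_sdiag
```

Logging these two numbers alongside timestamps makes it easy to see whether the thread count pins at 256 (the slurmctld hard limit) exactly when commands become unresponsive.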

slurm.conf details

sdiag output

*******************************************************
sdiag output at Thu Mar 02 17:35:57 2023 (1677758757)
Data since      Thu Mar 02 17:35:13 2023 (1677758713)
*******************************************************
Server thread count:  256
Agent queue size:     0
Agent count:          2
Agent thread count:   6
DBD Agent queue size: 2

Jobs submitted: 506
Jobs started:   700
Jobs completed: 1861
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Thu Mar 02 17:35:43 2023 (1677758743)
Jobs pending:   238
Jobs running:   23207

Main schedule statistics (microseconds):
	Last cycle:   133286
	Max cycle:    1868295
	Total cycles: 18
	Mean cycle:   294357
	Mean depth cycle:  43
	Last queue length: 40147

Backfilling stats
	Total backfilled jobs (since last slurm start): 0
	Total backfilled jobs (since last stats cycle start): 0
	Total backfilled heterogeneous job components: 0
	Total cycles: 0
	Last cycle when: Mon Dec 19 15:01:35 2022 (1671442295)
	Last cycle: 0
	Max cycle:  0
	Last depth cycle: 0
	Last depth cycle (try sched): 0
	Last queue length: 0
	Last table size: 0

Latency for 1000 calls to gettimeofday(): 24 microseconds

Remote Procedure Call statistics by message type
	REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:1861   ave_time:5142346 total_time:9569907374
	MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:1019   ave_time:34973  total_time:35638356
	REQUEST_HET_JOB_ALLOC_INFO              ( 4027) count:17     ave_time:6019   total_time:102323
	REQUEST_JOB_STEP_CREATE                 ( 5001) count:17     ave_time:28601  total_time:486223
	REQUEST_STEP_COMPLETE                   ( 5016) count:15     ave_time:101770 total_time:1526550
	REQUEST_PARTITION_INFO                  ( 2009) count:13     ave_time:16787  total_time:218240
	REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:11     ave_time:172282 total_time:1895108
	REQUEST_JOB_INFO                        ( 2003) count:8      ave_time:349078 total_time:2792624
	REQUEST_NODE_INFO                       ( 2007) count:5      ave_time:37640  total_time:188203
	MESSAGE_EPILOG_COMPLETE                 ( 6012) count:2      ave_time:16552  total_time:33105
	REQUEST_SHARE_INFO                      ( 2022) count:1      ave_time:327    total_time:327
	REQUEST_RECONFIGURE                     ( 1003) count:1      ave_time:859557 total_time:859557
	REQUEST_UPDATE_PARTITION                ( 3005) count:1      ave_time:46670  total_time:46670
	REQUEST_STATS_INFO                      ( 2035) count:1      ave_time:237    total_time:237
	REQUEST_RESERVATION_INFO                ( 2024) count:1      ave_time:15515  total_time:15515

Remote Procedure Call statistics by user
	root            (       0) count:2928   ave_time:3282522 total_time:9611226758
	sourabh.basutkar(    1047) count:39     ave_time:32370  total_time:1262444
	keshav.malpani  (    1104) count:5      ave_time:113163 total_time:565815
	sathvik.reddy   (    1069) count:1      ave_time:655395 total_time:655395

Pending RPC statistics
	No pending RPCs
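For reference, the ave_time column in the RPC tables above is simply total_time divided by count, in microseconds (sdiag truncates to an integer). The slow outlier here is REQUEST_COMPLETE_BATCH_SCRIPT, averaging roughly 5.1 s per call:

```shell
# ave_time = total_time / count, in microseconds, using the
# REQUEST_COMPLETE_BATCH_SCRIPT figures from the sdiag output above.
awk 'BEGIN { us = int(9569907374 / 1861); printf "%d us (%.1f s)\n", us, us / 1e6 }'
```

That matches the ave_time:5142346 shown by sdiag, i.e. each batch-script completion RPC is holding a slurmctld thread for about five seconds.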



-------------------------

scontrol show config output

Configuration data as of 2023-03-02T17:46:36
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost   = sim-s1a2
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreFlags    = job_comment
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 30 sec
BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters         = (null)
BOOT_TIME               = 2023-03-02T17:45:10
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = gtsims
CommunicationParameters = keepalivetime=300
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = OnDemand,Performance,UserSpace
CredType                = cred/munge
DebugFlags              = NO_CONF_HASH
DefMemPerNode           = UNLIMITED
DependencyParameters    = (null)
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 0
JobAcctGatherType       = jobacct_gather/none
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = 250 sec
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 4000000
MaxDBDMsgs              = 1002128
MaxJobCount             = 500000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 90 sec
MinJobAge               = 420 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 25579778
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/:/usr/lib/slurm/
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = REQUEUE
PreemptType             = preempt/qos
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 600 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SchedulerParameters     = FastSchedule=1,batch_sched_delay=6,sched_min_interval=2000000,sched_max_job_start=500,default_queue_depth=1000,preempt_youngest_first,max_rpc_cnt=300,defer,sched_interval=2
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/builtin
ScronParameters         = (null)
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE,CR_LLN
SlurmUser               = slurm(1001)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = sim-s1a2(10.100.0.3)
SlurmctldLogFile        = /var/log/slurmctl.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 300 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 21.08.8
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurmd
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = INFINITE
SuspendTimeout          = 30 sec
SwitchParameters        = (null)
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = cores
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
CgroupPlugin            = (null)
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = no
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

------------------------------------------------
slurm.conf 

SlurmctldHost=server-ctl
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=Cores,SlurmdOffSpec
MailProg=/bin/mail
InactiveLimit=0
KillWait=30
MinJobAge=420
SlurmctldTimeout=300
SlurmdTimeout=300
UnkillableStepTimeout=120
MessageTimeout=90
ResumeTimeout=600
KeepAliveTime=250
BatchStartTimeout=30
Waittime=0
SchedulerType=sched/builtin
PreemptType=preempt/qos
PreemptMode=REQUEUE
SchedulerParameters=FastSchedule=1,batch_sched_delay=6,sched_min_interval=2000000,sched_max_job_start=500,default_queue_depth=1000,preempt_youngest_first,max_rpc_cnt=300,defer,sched_interval=2
CommunicationParameters=keepalivetime=300
SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_LLN
DebugFlags=NO_CONF_HASH
MaxArraySize=4000000

AccountingStorageHost=sim-s1a2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=root
AccountingStoreFlags=job_comment
AccountingStorageEnforce=limits,qos
ClusterName=gtsims
MaxJobCount=500000
JobCompType=jobcomp/none
JobAcctGatherFrequency=0
JobAcctGatherType=jobacct_gather/none

SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctl.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm.log
include /etc/slurm/host
include /etc/slurm/partition
PluginDir=/usr/:/usr/lib/slurm/

------------------------------------
partition file

PartitionName=med   Nodes=server-k1a1,server-k1a4,server-k1a5,server-k1a6,server-k1a7,server-k1a8,server-k1a9,server-k1a10,server-k1a12,server-k1b1,server-k1b2,server-k1b3,server-k1b4,server-k1b5,server-k1b6,server-k1b8,server-k1b9,server-k1b10,server-k1b11,server-k1b12,server-k1c1,server-k1c2,server-k1c3,server-k1c4,server-k1c5,server-k1c6,server-k1c7,server-k1c8,server-k1c9,server-k1c10,server-k1c11,server-k1c12,server-k2a1,server-k2a2,server-k2a3,server-k2a4,server-k2a5,server-k2a6,server-k2a7,server-k2a8,server-k2a9,server-k2a10,server-k2a11,server-k2a12,server-k2b1,server-k2b2,server-k2b3,server-k2b4,server-k2b5,server-k2b6,server-k2b7,server-k2b8,server-k2b9,server-k2b10,server-k2b11,server-k2b12,server-k3a1,server-k3a2,server-k3a3,server-k3a4,server-k3a5,server-k3a6,server-k3a7,server-k3a8,server-k3a9,server-k3a10,server-k3a11,server-k3a12,server-k3b1,server-k3b2,server-k3b3,server-k3b4,server-k3b5,server-k3b6,server-k3b7,server-k3b8,server-k3b9,server-k3b10,server-k3b11,server-k3b12,server-k3c1,server-k3c2,server-k3c3,server-k3c4,server-k3c5,server-k3c6,server-k3c7,server-k3c8,server-k3c9,server-k3c10,server-k3c11,server-k3c12,server-k4a1,server-k4a3,server-k4a4,server-k4a5,server-k4a6,server-k4a7,server-k4a8,server-k4a9,server-k4a10,server-k4a11,server-k4a12,server-k4b1,server-k4b2,server-k4b3,server-k4b4,server-k4b5,server-k4b6,server-k4b7,server-k4b8,server-k4b9,server-k4b10,server-k4b11,server-k4b12,server-k4c1,server-k4c2,server-k4c3,server-k4c4,server-k4c5,server-k4c6,server-k4c7,server-k4c8,server-k4c9,server-k4c10,server-k4c11,server-k4c12,server-k5a1,server-k5a2,server-k5a3,server-k5a4,server-k5a5,server-k5a6,server-k5a7,server-k5a8,server-k5a9,server-k5a10,server-k5a11,server-k5a12,server-k5b1,server-k5b2,server-k5b3,server-k5b4,server-k5b5,server-k5b6,server-k5b7,server-k5b8,server-k5b9,server-k5b10,server-k5b11,server-k5b12,server-k5c1,server-k5c2,server-k5c3,server-k5c4,server-k5c5,server-k5c6,server-k5c7,server-k5c8,server-k5c9,server-k5c10,server-k5c11,server-k5c12,server-k6a1,server-k6a2,server-k6a3,server-k6a4,server-k6a5,server-k6a6,server-k6a7,server-k6a8,server-k6a9,server-k6a10,server-k6a11,server-k6a12,server-k6b1,server-k6b2,server-k6b3,server-k6b4,server-k6b5,server-k6b6,server-k6b7,server-k6b8,server-k6b9,server-k6b10,server-k6b11,server-k6b12,server-k6c1,server-k6c2,server-k6c3,server-k6c4,server-k6c5,server-k6c6,server-k6c7,server-k6c8,server-k6c9,server-k6c10,server-k6c11,server-k6c12,server-k7a1,server-k7a2,server-k7a3,server-k7a4,server-k7a5,server-k7a6,server-k7a7,server-k7a8,server-k7a9,server-k7a10,server-k8a1,server-k8a2,server-k8a3,server-k8a4,server-k8a5,server-k8a6,server-k8a7,server-k8a8,server-k8a9,server-k8a10,server-k8a11,server-k8a12,server-k8b1,server-k8b2,server-k8b3,server-k8b4,server-k8b5,server-k8b6,server-k8b7,server-k8b8,server-k8b9,server-k8b10,server-k8b11,server-k8b12,server-k8c1,server-k8c2,server-k8c3,server-k8c4,server-k8c5,server-k8c6,server-k8c7,server-k8c8,server-k8c9,server-k8c10,server-k8c11,server-k8c12,server-k9a1,server-k9a2,server-k9a3,server-k9a4,server-k9a5,server-k9a6,server-k9a7,server-k9a8,server-k9a9,server-k9a10,server-k9a11,server-k9a12,server-k9b1,server-k9b2,server-k9b3,server-k9b4,server-k9b5,server-k9b6,server-k9b7,server-k9b8,server-k9b9,server-k9b10,server-k9b11,server-k9b12,server-k9c1,server-k9c2,server-k9c3,server-k9c4,server-k9c5,server-k9c6,server-k9c7,server-k9c8,server-k9c9,server-k9c10,server-k9c11,server-k9c12,server-k10a1,server-k10a2,server-k10a3,server-k10a4,server-k10a5,server-k10a6,server-k10a7,server-k10a8,server-k10a9,server-k10a10,server-k10a11,server-k10a12,server-k10b5,server-k10b6,server-k10b7,server-k10b8,server-k10b9,server-k10b10,server-k10b11,server-k10b12,server-k10c1,server-k10c2,server-k10c3,server-k10c4,server-k10c5,server-k10c6,server-k10c7,server-k10c9,server-k10c10,server-k10c11,server-k10c12,server-k11a1,server-k11a2,server-k11a3,server-k11a4,server-k11a5,server-k11a6,server-k11a7,server-k11a8,server-k11b1,server-k11b2,server-k11b3,server-k11b4,server-k13a1,server-k13a2,server-k13a3,server-k13a4,server-k13a5,server-k13a6,server-k13a7,server-k13a8,server-k13a9,server-k13a10,server-k13a11,server-k13a12,server-k13b1,server-k13b2,server-k13b3,server-k13b4,server-k13b5,server-k13b6,server-k13b7,server-k13b8,server-k13b9,server-k13b10,server-k13b11,server-k13b12,server-k13c1,server-k13c2,server-k13c3,server-k13c4,server-k13c5,server-k13c6,server-k13c7,server-k13c8,server-k13c9,server-k13c10,server-k13c11,server-k13c12,server-k14a1,server-k14a2,server-k14a4,server-k14a5,server-k14a6,server-k14a7,server-k14a8,server-k14a9,server-k14a10,server-k14a11,server-k14a12,server-k14b1,server-k14b2,server-k14b3,server-k14b4,server-k14b5,server-k14b6,server-k14b7,server-k14b8,server-k14b9,server-k14b10,server-k14b11,server-k14b12,server-k14c1,server-k14c2,server-k14c3,server-k14c4,server-k14c5,server-k14c6,server-k14c7,server-k14c9,server-k14c10,server-k14c11,server-k14c12,server-k15a1,server-k15a2,server-k15a3,server-k15a4,server-k15a5,server-k15a6,server-k15a7,server-k15a8,server-k15a9,server-k15a10,server-k15a11,server-k15a12,server-k15b1,server-k15b2,server-k15b3,server-k15b4,server-k17a1,server-k17a2,server-k17a3,server-k17a4,server-k17a5,server-k17a6,server-k17a7,server-k17a8,server-k17a9,server-k17a10,server-k17a11,server-k17a12,server-k17b1,server-k17b2,server-k17b3,server-k17b4,server-k17b5,server-k17b6,server-k17b7,server-k17b8,server-k17b9,server-k17b10,server-k17b11,server-k17b12,server-k17c1,server-k17c2,server-k17c3,server-k17c4,server-k17c5,server-k17c6,server-k17c7,server-k17c8,server-k17c9,server-k17c10,server-k17c11,server-k17c12,server-k18a1,server-k18a2,server-k18a3,server-k18a4,server-k18a5,server-k18a6,server-k18a7,server-k18a8,server-k18a9,server-k18a10,server-k18a11,server-k18a12,server-k18b1,server-k18b3,server-k18b4,server-k18b5,server-k18b6,server-k18b7,server-k18b8,server-k18b9,server-k18b10,server-k18b11,server-k18b12,server-k18c1,server-k18c2,server-k18c3,server-k18c4,server-k18c5,server-k18c6,server-k18c7,server-k18c8,server-k18c9,server-k18c10,server-k18c11,server-k18c12,server-k19a1,server-k19a2,server-k19a3,server-k19a4,server-k19a5,server-k19a6,server-k19a7,server-k19a8,server-k19a9,server-k19a10,server-k19a11,server-k19a12,server-k19b2,server-k19b3,server-k19b4,server-k19b5,server-k19b6,server-k19b7,server-k19b8,server-k19b9,server-k19b10,server-k19b11,server-k19b12,server-k19c1,server-k19c2,server-k19c3,server-k19c4,server-k19c5,server-k19c6,server-k19c7,server-k19c8,server-k19c9,server-k19c10,server-k19c11,server-k19c12,server-k20a1,server-k20a2,server-k20a3,server-k20a4,server-k20a5,server-k20a6,server-k20a7,server-k20a8,server-k20a9,server-k20a10,server-k20a11,server-k20a12,server-k20b1,server-k20b2,server-k20b3,server-k20b4  PriorityTier=20 Default=YES MaxTime=INFINITE State=UP

---------------------------------------------------

slurmdbd.conf

ArchiveEvents=yes 
ArchiveJobs=yes 
ArchiveResvs=yes 
ArchiveSteps=no 
ArchiveSuspend=no 
ArchiveTXN=no 
ArchiveUsage=no 
#ArchiveScript=/usr/sbin/slurm.dbd.archive 
AuthInfo=/var/run/munge/munge.socket.2 
AuthType=auth/munge 
DbdHost=sim-s1a2
DbdPort=6819
DebugLevel=verbose

PurgeEventAfter=3days 
PurgeJobAfter=3days 
PurgeResvAfter=3days 
PurgeStepAfter=1days 
PurgeSuspendAfter=3days 
PurgeTXNAfter=3days
PurgeUsageAfter=3days 
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid 
CommitDelay=1
SlurmUser=slurm
StorageType=accounting_storage/mysql 
StorageUser=slurm
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=strpass@c0de
StoragePort=3306
PluginDir=/usr/:/usr/lib/slurm/


-------------------------------------------
cgroup.conf

CgroupAutomount=yes 
ConstrainCores=yes 

--------------------------------------------

Kindly help; we have been stuck with this issue for the last few months.