Hi,

We have a cluster of roughly 50k cores on which almost every job is an array job. Run times range from 30 s up to a maximum of 24 h, and about 80% of jobs finish in 5-10 minutes. Slurm responsiveness (squeue, sacct, sinfo, sbatch) degrades whenever there is a burst of Slurm RPC traffic -- mass job submission, cancellation, preemption, and so on. Could you please help us fine-tune the cluster?

slurmctld and slurmdbd run on the same Linux server (64 cores, 256 GB RAM), and all Slurm communication goes over Mellanox high-speed Ethernet (25 Gb/s). During slowdowns we see the slurmctld server thread count pinned at 256 and the slurmdbd queue length growing very large. The munge auth service is running with 10 threads.

The sdiag output, scontrol show config output, slurm.conf, partition file, slurmdbd.conf, and cgroup.conf follow.

*******************************************************
sdiag output at Thu Mar 02 17:35:57 2023 (1677758757)
Data since      Thu Mar 02 17:35:13 2023 (1677758713)
*******************************************************
Server thread count:  256
Agent queue size:     0
Agent count:          2
Agent thread count:   6
DBD Agent queue size: 2

Jobs submitted: 506
Jobs started:   700
Jobs completed: 1861
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Thu Mar 02 17:35:43 2023 (1677758743)
Jobs pending:   238
Jobs running:   23207

Main schedule statistics (microseconds):
 Last cycle:        133286
 Max cycle:         1868295
 Total cycles:      18
 Mean cycle:        294357
 Mean depth cycle:  43
 Last queue length: 40147

Backfilling stats
 Total backfilled jobs (since last slurm start): 0
 Total backfilled jobs (since last stats cycle start): 0
 Total backfilled heterogeneous job components: 0
 Total cycles: 0
 Last cycle when: Mon Dec 19 15:01:35 2022 (1671442295)
 Last cycle:  0
 Max cycle:   0
 Last depth cycle: 0
 Last depth cycle (try sched): 0
 Last queue length: 0
 Last table size:   0

Latency for 1000 calls to gettimeofday(): 24 microseconds

Remote Procedure Call statistics by message type
 REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:1861 ave_time:5142346 total_time:9569907374
 MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:1019 ave_time:34973 total_time:35638356
 REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:17 ave_time:6019 total_time:102323
 REQUEST_JOB_STEP_CREATE          ( 5001) count:17 ave_time:28601 total_time:486223
 REQUEST_STEP_COMPLETE            ( 5016) count:15 ave_time:101770 total_time:1526550
 REQUEST_PARTITION_INFO           ( 2009) count:13 ave_time:16787 total_time:218240
 REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:11 ave_time:172282 total_time:1895108
 REQUEST_JOB_INFO                 ( 2003) count:8 ave_time:349078 total_time:2792624
 REQUEST_NODE_INFO                ( 2007) count:5 ave_time:37640 total_time:188203
 MESSAGE_EPILOG_COMPLETE          ( 6012) count:2 ave_time:16552 total_time:33105
 REQUEST_SHARE_INFO               ( 2022) count:1 ave_time:327 total_time:327
 REQUEST_RECONFIGURE              ( 1003) count:1 ave_time:859557 total_time:859557
 REQUEST_UPDATE_PARTITION         ( 3005) count:1 ave_time:46670 total_time:46670
 REQUEST_STATS_INFO               ( 2035) count:1 ave_time:237 total_time:237
 REQUEST_RESERVATION_INFO         ( 2024) count:1 ave_time:15515 total_time:15515

Remote Procedure Call statistics by user
 root             (    0) count:2928 ave_time:3282522 total_time:9611226758
 sourabh.basutkar ( 1047) count:39 ave_time:32370 total_time:1262444
 keshav.malpani   ( 1104) count:5 ave_time:113163 total_time:565815
 sathvik.reddy    ( 1069) count:1 ave_time:655395 total_time:655395

Pending RPC statistics
 No pending RPCs

-------------------------
scontrol show config output

Configuration data as of 2023-03-02T17:46:36
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost   = sim-s1a2
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreFlags    = job_comment
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 30 sec
BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters         = (null)
BOOT_TIME               = 2023-03-02T17:45:10
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = gtsims
CommunicationParameters = keepalivetime=300
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = OnDemand,Performance,UserSpace
CredType                = cred/munge
DebugFlags              = NO_CONF_HASH
DefMemPerNode           = UNLIMITED
DependencyParameters    = (null)
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 0
JobAcctGatherType       = jobacct_gather/none
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = 250 sec
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 4000000
MaxDBDMsgs              = 1002128
MaxJobCount             = 500000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 90 sec
MinJobAge               = 420 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 25579778
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/:/usr/lib/slurm/
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = REQUEUE
PreemptType             = preempt/qos
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 600 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SchedulerParameters     = FastSchedule=1,batch_sched_delay=6,sched_min_interval=2000000,sched_max_job_start=500,default_queue_depth=1000,preempt_youngest_first,max_rpc_cnt=300,defer,sched_interval=2
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/builtin
ScronParameters         = (null)
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE,CR_LLN
SlurmUser               = slurm(1001)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = sim-s1a2(10.100.0.3)
SlurmctldLogFile        = /var/log/slurmctl.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 300 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 21.08.8
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurmd
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = INFINITE
SuspendTimeout          = 30 sec
SwitchParameters        = (null)
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = cores
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
CgroupPlugin            = (null)
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = no
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

------------------------------------------------
slurm.conf

SlurmctldHost=server-ctl
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=Cores,SlurmdOffSpec
MailProg=/bin/mail
InactiveLimit=0
KillWait=30
MinJobAge=420
SlurmctldTimeout=300
SlurmdTimeout=300
UnkillableStepTimeout=120
MessageTimeout=90
ResumeTimeout=600
KeepAliveTime=250
BatchStartTimeout=30
Waittime=0
SchedulerType=sched/builtin
PreemptType=preempt/qos
PreemptMode=REQUEUE
SchedulerParameters=FastSchedule=1,batch_sched_delay=6,sched_min_interval=2000000,sched_max_job_start=500,default_queue_depth=1000,preempt_youngest_first,max_rpc_cnt=300,defer,sched_interval=2
CommunicationParameters=keepalivetime=300
SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_LLN
DebugFlags=NO_CONF_HASH
MaxArraySize=4000000
AccountingStorageHost=sim-s1a2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=root
AccountingStoreFlags=job_comment
AccountingStorageEnforce=limits,qos
ClusterName=gtsims
MaxJobCount=500000
JobCompType=jobcomp/none
JobAcctGatherFrequency=0
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctl.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm.log
include /etc/slurm/host
include /etc/slurm/partition
PluginDir=/usr/:/usr/lib/slurm/

------------------------------------
partition file

PartitionName=med
Nodes=server-k1a1,server-k1a4,server-k1a5,server-k1a6,server-k1a7,server-k1a8,server-k1a9,server-k1a10,server-k1a12,server-k1b1,server-k1b2,server-k1b3,server-k1b4,server-k1b5,server-k1b6,server-k1b8,server-k1b9,server-k1b10,server-k1b11,server-k1b12,server-k1c1,server-k1c2,server-k1c3,server-k1c4,server-k1c5,server-k1c6,server-k1c7,server-k1c8,server-k1c9,server-k1c10,server-k1c11,server-k1c12,server-k2a1,server-k2a2,server-k2a3,server-k2a4,server-k2a5,server-k2a6,server-k2a7,server-k2a8,server-k2a9,server-k2a10,server-k2a11,server-k2a12,server-k2b1,server-k2b2,server-k2b3,server-k2b4,server-k2b5,server-k2b6,server-k2b7,server-k2b8,server-k2b9,server-k2b10,server-k2b11,server-k2b12,server-k3a1,server-k3a2,server-k3a3,server-k3a4,server-k3a5,server-k3a6,server-k3a7,server-k3a8,server-k3a9,server-k3a10,server-k3a11,server-k3a12,server-k3b1,server-k3b2,server-k3b3,server-k3b4,server-k3b5,server-k3b6,server-k3b7,server-k3b8,server-k3b9,server-k3b10,server-k3b11,server-k3b12,server-k3c1,server-k3c2,server-k3c3,server-k3c4,server-k3c5,server-k3c6,server-k3c7,server-k3c8,server-k3c9,server-k3c10,server-k3c11,server-k3c12,server-k4a1,server-k4a3,server-k4a4,server-k4a5,server-k4a6,server-k4a7,server-k4a8,server-k4a9,server-k4a10,server-k4a11,server-k4a12,server-k4b1,server-k4b2,server-k4b3,server-k4b4,server-k4b5,server-k4b6,server-k4b7,server-k4b8,server-k4b9,server-k4b10,server-k4b11,server-k4b12,server-k4c1,server-k4c2,server-k4c3,server-k4c4,server-k4c5,server-k4c6,server-k4c7,server-k4c8,server-k4c9,server-k4c10,server-k4c11,server-k4c12,server-k5a1,server-k5a2,server-k5a3,server-k5a4,server-k5a5,server-k5a6,server-k5a7,server-k5a8,server-k5a9,server-k5a10,server-k5a11,server-k5a12,server-k5b1,server-k5b2,server-k5b3,server-k5b4,server-k5b5,server-k5b6,server-k5b7,server-k5b8,server-k5b9,server-k5b10,server-k5b11,server-k5b12,server-k5c1,server-k5c2,server-k5c3,server-k5c4,server-k5c5,server-k5c6,server-k5c7,server-k5c8,server-k5c9,server-k5c10,server-k5c11,server-k5c
12,server-k6a1,server-k6a2,server-k6a3,server-k6a4,server-k6a5,server-k6a6,server-k6a7,server-k6a8,server-k6a9,server-k6a10,server-k6a11,server-k6a12,server-k6b1,server-k6b2,server-k6b3,server-k6b4,server-k6b5,server-k6b6,server-k6b7,server-k6b8,server-k6b9,server-k6b10,server-k6b11,server-k6b12,server-k6c1,server-k6c2,server-k6c3,server-k6c4,server-k6c5,server-k6c6,server-k6c7,server-k6c8,server-k6c9,server-k6c10,server-k6c11,server-k6c12,server-k7a1,server-k7a2,server-k7a3,server-k7a4,server-k7a5,server-k7a6,server-k7a7,server-k7a8,server-k7a9,server-k7a10,server-k8a1,server-k8a2,server-k8a3,server-k8a4,server-k8a5,server-k8a6,server-k8a7,server-k8a8,server-k8a9,server-k8a10,server-k8a11,server-k8a12,server-k8b1,server-k8b2,server-k8b3,server-k8b4,server-k8b5,server-k8b6,server-k8b7,server-k8b8,server-k8b9,server-k8b10,server-k8b11,server-k8b12,server-k8c1,server-k8c2,server-k8c3,server-k8c4,server-k8c5,server-k8c6,server-k8c7,server-k8c8,server-k8c9,server-k8c10,server-k8c11,server-k8c12,server-k9a1,server-k9a2,server-k9a3,server-k9a4,server-k9a5,server-k9a6,server-k9a7,server-k9a8,server-k9a9,server-k9a10,server-k9a11,server-k9a12,server-k9b1,server-k9b2,server-k9b3,server-k9b4,server-k9b5,server-k9b6,server-k9b7,server-k9b8,server-k9b9,server-k9b10,server-k9b11,server-k9b12,server-k9c1,server-k9c2,server-k9c3,server-k9c4,server-k9c5,server-k9c6,server-k9c7,server-k9c8,server-k9c9,server-k9c10,server-k9c11,server-k9c12,server-k10a1,server-k10a2,server-k10a3,server-k10a4,server-k10a5,server-k10a6,server-k10a7,server-k10a8,server-k10a9,server-k10a10,server-k10a11,server-k10a12,server-k10b5,server-k10b6,server-k10b7,server-k10b8,server-k10b9,server-k10b10,server-k10b11,server-k10b12,server-k10c1,server-k10c2,server-k10c3,server-k10c4,server-k10c5,server-k10c6,server-k10c7,server-k10c9,server-k10c10,server-k10c11,server-k10c12,server-k11a1,server-k11a2,server-k11a3,server-k11a4,server-k11a5,server-k11a6,server-k11a7,server-k11a8,server-k11b1,server-k11b2,server-k11b
3,server-k11b4,server-k13a1,server-k13a2,server-k13a3,server-k13a4,server-k13a5,server-k13a6,server-k13a7,server-k13a8,server-k13a9,server-k13a10,server-k13a11,server-k13a12,server-k13b1,server-k13b2,server-k13b3,server-k13b4,server-k13b5,server-k13b6,server-k13b7,server-k13b8,server-k13b9,server-k13b10,server-k13b11,server-k13b12,server-k13c1,server-k13c2,server-k13c3,server-k13c4,server-k13c5,server-k13c6,server-k13c7,server-k13c8,server-k13c9,server-k13c10,server-k13c11,server-k13c12,server-k14a1,server-k14a2,server-k14a4,server-k14a5,server-k14a6,server-k14a7,server-k14a8,server-k14a9,server-k14a10,server-k14a11,server-k14a12,server-k14b1,server-k14b2,server-k14b3,server-k14b4,server-k14b5,server-k14b6,server-k14b7,server-k14b8,server-k14b9,server-k14b10,server-k14b11,server-k14b12,server-k14c1,server-k14c2,server-k14c3,server-k14c4,server-k14c5,server-k14c6,server-k14c7,server-k14c9,server-k14c10,server-k14c11,server-k14c12,server-k15a1,server-k15a2,server-k15a3,server-k15a4,server-k15a5,server-k15a6,server-k15a7,server-k15a8,server-k15a9,server-k15a10,server-k15a11,server-k15a12,server-k15b1,server-k15b2,server-k15b3,server-k15b4,server-k17a1,server-k17a2,server-k17a3,server-k17a4,server-k17a5,server-k17a6,server-k17a7,server-k17a8,server-k17a9,server-k17a10,server-k17a11,server-k17a12,server-k17b1,server-k17b2,server-k17b3,server-k17b4,server-k17b5,server-k17b6,server-k17b7,server-k17b8,server-k17b9,server-k17b10,server-k17b11,server-k17b12,server-k17c1,server-k17c2,server-k17c3,server-k17c4,server-k17c5,server-k17c6,server-k17c7,server-k17c8,server-k17c9,server-k17c10,server-k17c11,server-k17c12,server-k18a1,server-k18a2,server-k18a3,server-k18a4,server-k18a5,server-k18a6,server-k18a7,server-k18a8,server-k18a9,server-k18a10,server-k18a11,server-k18a12,server-k18b1,server-k18b3,server-k18b4,server-k18b5,server-k18b6,server-k18b7,server-k18b8,server-k18b9,server-k18b10,server-k18b11,server-k18b12,server-k18c1,server-k18c2,server-k18c3,server-k18c4,server-k18c5
,server-k18c6,server-k18c7,server-k18c8,server-k18c9,server-k18c10,server-k18c11,server-k18c12,server-k19a1,server-k19a2,server-k19a3,server-k19a4,server-k19a5,server-k19a6,server-k19a7,server-k19a8,server-k19a9,server-k19a10,server-k19a11,server-k19a12,server-k19b2,server-k19b3,server-k19b4,server-k19b5,server-k19b6,server-k19b7,server-k19b8,server-k19b9,server-k19b10,server-k19b11,server-k19b12,server-k19c1,server-k19c2,server-k19c3,server-k19c4,server-k19c5,server-k19c6,server-k19c7,server-k19c8,server-k19c9,server-k19c10,server-k19c11,server-k19c12,server-k20a1,server-k20a2,server-k20a3,server-k20a4,server-k20a5,server-k20a6,server-k20a7,server-k20a8,server-k20a9,server-k20a10,server-k20a11,server-k20a12,server-k20b1,server-k20b2,server-k20b3,server-k20b4 PriorityTier=20 Default=YES MaxTime=INFINITE State=UP

---------------------------------------------------
slurmdbd.conf

ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive
AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
DbdHost=sim-s1a2
DbdPort=6819
DebugLevel=verbose
PurgeEventAfter=3days
PurgeJobAfter=3days
PurgeResvAfter=3days
PurgeStepAfter=1days
PurgeSuspendAfter=3days
PurgeTXNAfter=3days
PurgeUsageAfter=3days
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
CommitDelay=1
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageUser=slurm
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=strpass@c0de
StoragePort=3306
PluginDir=/usr/:/usr/lib/slurm/

-------------------------------------------
cgroup.conf

CgroupAutomount=yes
ConstrainCores=yes

--------------------------------------------
Kindly help; we have been stuck with this issue for the last few months.
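For reference, this is roughly how we pull the three counters we watch during a slowdown (server thread count, DBD agent queue size, and the main scheduler's last queue length) from a saved sdiag dump. The filename sdiag.out is just an example; in practice we pipe `sdiag` straight into the awk.

```shell
# Sketch: extract the contention indicators from a captured sdiag dump.
# Assumes the dump was saved with `sdiag > sdiag.out` beforehand.
# `exit` stops at the first "Last queue length" (the main scheduler's),
# so the identically named counter in the backfill section is ignored.
awk -F': *' '
    /Server thread count/  { printf "threads=%s\n", $2 }
    /DBD Agent queue size/ { printf "dbd_queue=%s\n", $2 }
    /Last queue length/    { printf "queue_len=%s\n", $2; exit }
' sdiag.out
```

Running this every few seconds during a submission burst lets us correlate the slow squeue/sbatch responses with RPC pressure on slurmctld and slurmdbd.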