| Summary: | error: Slurm job queue full, sleeping and retrying | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Derek Fox <foxd4> |
| Component: | Scheduling | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | mcglow2, mcmullan |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | RPI/CCNI - Rensselaer Polytechnic Institute | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | sdiag as requested | ||
|
Description
Derek Fox
2023-08-14 09:16:24 MDT
[CCNIdrfx@dcsfen01 ~]$ scontrol show config
Configuration data as of 2023-08-14T11:14:37
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost = slurmdb06
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/nvme
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes = (null)
AuthAltParameters = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BOOT_TIME = 2023-06-03T09:07:16
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = dcs
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand,UserSpace
CredType = cred/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DependencyParameters = (null)
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = NO
Epilog = /etc/slurm/slurm.epilog
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 2
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu,nvme
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
InteractiveStepOptions = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/none
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = /var/log/slurm/slurmjobs.log
JobCompPort = 0
JobCompType = jobcomp/filetxt
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = require_timelimit,lua
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Licenses = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /etc/slurm/mailprog_wrapper.py
MaxArraySize = 1001
MaxDBDMsgs = 21080
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 10 sec
MinJobAge = 14400 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 751362
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = (null)
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PreemptExemptTime = 00:00:00
PrEpParameters = (null)
PrEpPlugins = prep/script
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 60-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = Yes
PriorityFlags =
PriorityMaxAge = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 150000000
PriorityWeightAssoc = 0
PriorityWeightFairShare = 2400000000
PriorityWeightJobSize = 4000
PriorityWeightPartition = 0
PriorityWeightQOS = 1000000000
PriorityWeightTRES = CPU=0,gres/gpu=12000
PrivateData = jobs,usage
ProctrackType = proctrack/cgroup
Prolog = /etc/slurm/slurm.prolog
PrologEpilogTimeout = 65534
PrologSlurmctld = /etc/slurm/slurmctld.prolog
PrologFlags = Alloc,Contain
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = /etc/slurm/node-reboot.bash
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 1800 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 0
RoutePlugin = route/default
SbcastParameters = (null)
SchedulerParameters = bf_max_job_test=2000,bf_window=5760,bf_resolution=300,defer,kill_invalid_depend
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
ScronParameters = (null)
SelectType = select/cons_tres
SelectTypeParameters = CR_CORE
SlurmUser = slurm(188)
SlurmctldAddr = (null)
SlurmctldDebug = info
SlurmctldHost[0] = dcssn01
SlurmctldHost[1] = dcssn02
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmctldPort = 6817-6818
SlurmctldSyslogDebug = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 300 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurm/d
SlurmdSyslogDebug = unknown
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 20.11.8
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /gpfs/u/slurm/dcs
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/affinity,task/cgroup
TaskPluginParam = (null type)
TaskProlog = /etc/slurm/slurm.task.prolog
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = No
UnkillableStepProgram = /etc/slurm/unkillable-killer.sh
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
X11Parameters = (null)

Cgroup Support Configuration:
AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace = (null)
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 0.0%
CgroupAutomount = yes
CgroupMountpoint = /sys/fs/cgroup
ConstrainCores = yes
ConstrainDevices = yes
ConstrainKmemSpace = no
ConstrainRAMSpace = yes
ConstrainSwapSpace = no
MaxKmemPercent = 100.0%
MaxRAMPercent = 100.0%
MaxSwapPercent = 100.0%
MemorySwappiness = (null)
MinKmemSpace = 30 MB
MinRAMSpace = 30 MB
TaskAffinity = no

Tim McMullan
Please run sdiag and attach that output to this ticket.

Derek Fox
Created attachment 31759 [details]
sdiag as requested
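
Note: the attached sdiag output can be captured with a plain invocation redirected to a file, along these lines (the filename here is just an example, not one used in this ticket):

    sdiag > sdiag-dcs.out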
Tim McMullan
Hey Derek! There are a couple of things I'm noticing in the output provided.

> MinJobAge = 14400 sec

This setting is the minimum amount of time that the Slurm controller will keep a job in memory after it has completed.

> MaxJobCount = 10000

This one, which you already referenced, is the maximum number of jobs that can be held in slurmctld's memory at once.

Depending on how quickly jobs are cycling through, you may be bumping into the MaxJobCount limit, since at most 10000 jobs can be held over that 4-hour window. Would you be able to look through the slurmctld.log file for errors like "error: job_allocate: MaxJobCount limit from slurm.conf reached (10000)"? That would help confirm that this limit is what you are running into.

Something else stands out in the sdiag output: a couple of users seem to be generating an awful lot of RPCs.

> PTFMqngp ( 8458) count:31042771 ave_time:2115 total_time:65671629389
> root ( 0) count:21243554 ave_time:18168 total_time:385962453805
> LSMCgnjn ( 8372) count:6362971 ave_time:12241 total_time:77892519333

Notice that PTFMqngp generated about 31,000,000 RPCs, a cool 10,000,000 more than root, and the next highest user is at roughly a quarter of root's count. I can infer that the majority of those calls are REQUEST_PARTITION_INFO and REQUEST_JOB_INFO, which suggests that PTFMqngp might be running something like squeue in a loop. If possible, I'd see if they can reduce the number of those kinds of calls, since they can slow the system down.

Let me know what you find in the slurmctld log file!

--Tim

Derek Fox
Thank you for that analysis. I do see a very large number of "error: job_allocate: MaxJobCount limit from slurm.conf reached (10000)" messages.

Tim McMullan
(In reply to foxd4 from comment #7)
> Thank you for that analysis. I do see a very large number of "error:
> job_allocate: MaxJobCount limit from slurm.conf reached (10000)" messages.

Sure thing! That pretty much confirms the issue, so my suggestion would be to tweak MinJobAge and/or MaxJobCount to strike a new balance for this workload. Adjusting either one, or both, should be safe to do at the moment.

Let me know if there is any more information or help I can provide on this!

Thanks,
--Tim

Tim McMullan
Hey Derek, I just wanted to check and see if you needed anything else on this one!

Thanks,
--Tim

Derek Fox
I think you answered the question, so you can go ahead and close. Thanks for the help.

Tim McMullan
Sounds good! I'll close this now; let us know if you have any other issues!

Thanks!
--Tim
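
Note: as a rough sketch of the log check and the tuning Tim suggests above, not taken from the ticket itself, the two steps might look like this (the replacement values are illustrative assumptions only, not recommendations from SchedMD):

    # Count how often the controller hit the job table ceiling
    grep -c 'MaxJobCount limit from slurm.conf reached' /var/log/slurm/slurmctld.log

    # slurm.conf: example rebalance only; pick values that fit the workload
    MaxJobCount=50000   # ticket shows the default of 10000
    MinJobAge=300       # ticket shows 14400 sec; 300 sec is the Slurm default

Raising MaxJobCount increases slurmctld memory use, while lowering MinJobAge shortens how long completed jobs stay visible to tools like squeue, so the right balance depends on how quickly jobs cycle through the queue.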