Ticket 17423

Summary: error: Slurm job queue full, sleeping and retrying
Product: Slurm
Reporter: Derek Fox <foxd4>
Component: Scheduling
Assignee: Tim McMullan <mcmullan>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: mcglow2, mcmullan
Version: - Unsupported Older Versions
Hardware: Linux
OS: Linux
Site: RPI/CCNI - Rensselaer Polytechnic Institute
Attachments: sdiag as requested

Description Derek Fox 2023-08-14 09:16:24 MDT
We are seeing the error "Slurm temporarily unable to accept job, sleeping and retrying" when a user submits a job. We have over 1,000 jobs on the system, the majority of them running, with MaxJobCount set to 10,000 and MaxArraySize=1001. I tried an salloc earlier and got that message first, then it proceeded to allocate. I tried again just now and did not see the message.
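
(For reference, a quick way to compare the live job count against those limits, assuming standard Slurm client tools on a submit host; --states=all also counts completed jobs still held in slurmctld memory, which is what MaxJobCount applies to:

$ squeue -h --states=all | wc -l
$ scontrol show config | grep -E 'MaxJobCount|MinJobAge|MaxArraySize')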
Comment 1 Derek Fox 2023-08-14 09:47:25 MDT
[CCNIdrfx@dcsfen01 ~]$ scontrol show config
Configuration data as of 2023-08-14T11:14:37
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost   = slurmdb06
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/nvme
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2023-06-03T09:07:16
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = dcs
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = (null)
DefMemPerNode           = UNLIMITED
DependencyParameters    = (null)
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = /etc/slurm/slurm.epilog
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 2
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu,nvme
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/none
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm/slurmjobs.log
JobCompPort             = 0
JobCompType             = jobcomp/filetxt
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = require_timelimit,lua
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /etc/slurm/mailprog_wrapper.py
MaxArraySize            = 1001
MaxDBDMsgs              = 21080
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 14400 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 751362
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 60-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = Yes
PriorityFlags           =
PriorityMaxAge          = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 150000000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 2400000000
PriorityWeightJobSize   = 4000
PriorityWeightPartition = 0
PriorityWeightQOS       = 1000000000
PriorityWeightTRES      = CPU=0,gres/gpu=12000
PrivateData             = jobs,usage
ProctrackType           = proctrack/cgroup
Prolog                  = /etc/slurm/slurm.prolog
PrologEpilogTimeout     = 65534
PrologSlurmctld         = /etc/slurm/slurmctld.prolog
PrologFlags             = Alloc,Contain
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = /etc/slurm/node-reboot.bash
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 1800 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 0
RoutePlugin             = route/default
SbcastParameters        = (null)
SchedulerParameters     = bf_max_job_test=2000,bf_window=5760,bf_resolution=300,defer,kill_invalid_depend
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
ScronParameters         = (null)
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CORE
SlurmUser               = slurm(188)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = dcssn01
SlurmctldHost[1]        = dcssn02
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6817-6818
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 300 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurm/d
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 20.11.8
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /gpfs/u/slurm/dcs
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = /etc/slurm/slurm.task.prolog
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = /etc/slurm/unkillable-killer.sh
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = yes
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no
Comment 2 Jason Booth 2023-08-14 10:48:49 MDT
Please run sdiag and attach that output to this ticket.
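
Something like the following captures it in one file (the filename is just a suggestion):

$ sdiag > sdiag.out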
Comment 4 Derek Fox 2023-08-14 10:53:58 MDT
Created attachment 31759 [details]
sdiag as requested
Comment 6 Tim McMullan 2023-08-14 11:45:53 MDT
Hey Derek!

There are a couple of things I'm noticing in the output provided.

> MinJobAge               = 14400 sec
This setting is the minimum amount of time that the Slurm controller will keep a job in memory after it has completed.

> MaxJobCount             = 10000
This one, which you already referenced, is the maximum number of jobs that can be in slurmctld's memory at one time.

Depending on how quickly jobs are cycling through, you may be bumping into the MaxJobCount limit: with MinJobAge at 4 hours, completed jobs linger in slurmctld's memory, so at most 10,000 jobs can pass through in any 4-hour window.  Would you be able to look through the slurmctld.log file for errors like "error: job_allocate: MaxJobCount limit from slurm.conf reached (10000)"?  This would help confirm that this limit is what you are running into.
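
A minimal sketch of that check, using the SlurmctldLogFile path from the config above:

$ grep -c 'MaxJobCount limit from slurm.conf reached' /var/log/slurm/slurmctld.log
$ grep 'MaxJobCount limit from slurm.conf reached' /var/log/slurm/slurmctld.log | tail -5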


Something else sticks out in the sdiag output: a couple of users seem to be generating an awful lot of RPCs.

> PTFMqngp        (    8458) count:31042771 ave_time:2115   total_time:65671629389
> root            (       0) count:21243554 ave_time:18168  total_time:385962453805
> LSMCgnjn        (    8372) count:6362971 ave_time:12241  total_time:77892519333

Notice that PTFMqngp generated about 31,000,000 RPCs, a cool 10,000,000 more than root is generating, and the next highest user is at roughly a quarter of root's count.  I can infer that the majority of those calls are REQUEST_PARTITION_INFO and REQUEST_JOB_INFO, which suggests that PTFMqngp might be running something like squeue in a loop.  If possible, I'd see if they can reduce the number of those kinds of calls, since they can slow the system down.
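
One quick way to pull those per-user counts back out of sdiag (just one approach, assuming the standard "statistics by user" section header in sdiag's output):

$ sdiag | grep -A 5 'statistics by user'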

Let me know what you find in the slurmctld log file!
--Tim
Comment 7 Derek Fox 2023-08-14 12:32:24 MDT
Thank you for that analysis. I do see a very large number of "error: job_allocate: MaxJobCount limit from slurm.conf reached (10000)" messages.
Comment 8 Tim McMullan 2023-08-14 12:41:10 MDT
(In reply to foxd4 from comment #7)
> Thank you for that analysis. I do see a very large number of "error:
> job_allocate: MaxJobCount limit from slurm.conf reached (10000)" messages.

Sure thing!  That pretty much confirms the issue, so my suggestion would be to tweak MinJobAge and/or MaxJobCount to strike a new balance for this workload.  Adjusting either or both should be safe to do at the moment.
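
As a sketch only, with illustrative values rather than a recommendation for this site, the relevant slurm.conf lines might look like:

MinJobAge=300        # keep finished jobs in memory 5 minutes instead of 4 hours
MaxJobCount=50000    # raise the ceiling on jobs held in slurmctld memory

MinJobAge should take effect with an 'scontrol reconfigure'; a MaxJobCount change generally requires restarting slurmctld.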

Let me know if there is any more information or help I can provide on this!

Thanks,
--Tim
Comment 9 Tim McMullan 2023-08-18 07:23:59 MDT
Hey Derek,

I just wanted to check and see if you needed anything else on this one!

Thanks,
--Tim
Comment 10 Derek Fox 2023-08-21 07:35:22 MDT
I think you answered the question so you can go ahead and close. Thanks for the help.
Comment 11 Tim McMullan 2023-08-21 09:08:21 MDT
Sounds good!  I'll close this now, let us know if you have any other issues!

Thanks!
--Tim