Ticket 1390

Summary: Understanding Priority for Pending Jobs
Product: Slurm
Reporter: Will French <will>
Component: Scheduling
Assignee: Moe Jette <jette>
Status: RESOLVED INFOGIVEN
Severity: 2 - High Impact
CC: brian, da, simran
Version: 14.11.3
Hardware: Linux
OS: Linux
Site: Vanderbilt
Attachments: squeue output
slurm.conf
output from sprio
output from squeue --start
squeue --start
2nd output from sprio
sdiag output
associations
qos
Users
sdiag output

Description Will French 2015-01-21 01:02:46 MST
Created attachment 1563 [details]
squeue output

Hello,

As we are moving more and more of our users over to our SLURM-managed cluster, we are finally pushing SLURM hard enough to get a better sense of how it handles job scheduling. 

So far, one thing that is unclear to us is the "Priority" reason listed for some pending jobs. Attached is the result of running squeue on our cluster this morning. As you will see, there are a number of jobs (242) that are pending due to "Priority". However, if you look at a summary of the nodes, there are a large number that are completely idle (forgive the alias):

[frenchwr@vmps11 ~]$ sinfofeatures 
NODELIST                                                                                FEATURES      AVAIL NODES(A/I)
vmp[101-103,105-110,112-120]                                                                 amd         up       18/0
vmp[301-380]                                                                  intel,sandy_bridge         up       80/0
vmp[502-548,552-574,602-648,652-653,659-662,664-690,1041-1054,1081-109                     intel         up      92/87
vmp[801-803,805-809,813,815,817-819,821-830,835,838,840]                                  cuda42         up       8/18
vmp[831-834]


As you can see, there are 105 nodes that are completely idle. We understand that jobs might be blocked due to Priority if they are asking for an extremely long wall time while other, higher-priority jobs are waiting for resources to become available. But even if I submit an extremely small test script requesting 15 minutes of walltime, the job will be pending with "Priority" listed as the reason. The job will eventually run, but only after sitting in the queue for 20-25 minutes (all while those 105 nodes sit idle). Here's the SLURM batch script for this test:

--------------------------------
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=400mb
#SBATCH --time=00:15:00
#SBATCH --output=testjob5.output
echo "Hello world!"
--------------------------------

Note that we do have backfill configured. Can you explain why this happens and what we can do to expedite scheduling of jobs? Here is our configuration:

[frenchwr@vmps11 ~]$ scontrol show config
Configuration data as of 2015-01-21T08:56:56
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,safe
AccountingStorageHost   = slurmdb
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = YES
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthInfo                = (null)
AuthType                = auth/munge
BackupAddr              = 10.0.0.50
BackupController        = slurmsched2
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2015-01-20T14:57:32
CacheGroups             = 0
CheckpointType          = checkpoint/none
ChosLoc                 = (null)
ClusterName             = accre
CompleteWait            = 0 sec
ControlAddr             = 10.0.0.49
ControlMachine          = slurmsched1
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = OnDemand
CryptoType              = crypto/munge
DebugFlags              = NO_CONF_HASH
DefMemPerNode           = UNLIMITED
DisableRootJobs         = NO
DynAllocPort            = 0
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GroupUpdateForce        = 0
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = slurmdb
JobCompLoc              = slurm_jobcomp_db
JobCompPort             = 0
JobCompType             = jobcomp/mysql
JobCompUser             = slurm
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = (null)
LicensesUsed            = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxJobCount             = 10000
MaxJobId                = 4294901760
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 128
MemLimitEnforce         = yes
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 17482
OverTimeLimit           = 0 min
PluginDir               = /usr/scheduler/slurm/lib/slurm
PlugStackConfig         = /usr/scheduler/slurm-14.11.3/etc/plugstack.conf
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityParameters      = (null)
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = 0
PriorityFlags           = 
PriorityMaxAge          = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 1000
PriorityWeightFairShare = 1000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS       = 1000
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = (null)
SallocDefaultCommand    = (null)
SchedulerParameters     = (null)
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(59229)
SlurmctldDebug          = debug3
SlurmctldLogFile        = (null)
SlurmctldPort           = 6817
SlurmctldTimeout        = 120 sec
SlurmdDebug             = debug2
SlurmdLogFile           = (null)
SlurmdPidFile           = /var/run/slurm/slurmd.pid
SlurmdPlugstack         = (null)
SlurmdPort              = 6818
SlurmdSpoolDir          = /usr/spool/slurm
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurm/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /usr/scheduler/slurm-14.11.3/etc/slurm.conf
SLURM_VERSION           = 14.11.3
SrunEpilog              = (null)
SrunProlog              = (null)
StateSaveLocation       = /usr/scheduler/state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /tmp
TopologyPlugin          = topology/none
TrackWCKey              = 0
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec

Slurmctld(primary/backup) at slurmsched1/slurmsched2 are UP/UP
Comment 1 Will French 2015-01-21 03:55:10 MST
I'm escalating this ticket to high impact because we have several users whose jobs are not being scheduled as needed. From doing some digging, it appears that the problem might be arising from one of our users (glow), who has 172 jobs pending due to AssocGrpCPURunMinsLimit. All of these pending jobs have a higher priority than the other jobs in the queue and appear to be blocking other jobs from beginning... at least that's our best guess. We currently have 1092 jobs pending due to "Priority" but 143 completely idle nodes. Help (even a quick temporary fix!) would be greatly appreciated! I've tried bumping accounts' fairshare up to arbitrarily high values, but this only appears to help new jobs, not jobs that are already queued.
Comment 2 David Bigagli 2015-01-21 04:34:09 MST
Hello,
      we analyzed the data you sent us and we have a couple of initial suggestions that could improve throughput:

1) Configure SchedulerParameters for backfill. By default, backfill only looks at 100 pending jobs, so we can increase this to 500.

SchedulerParameters=bf_max_job_test=500

Then add bf_max_job_user so that backfill only checks a limited number of jobs per user instead of all of them.

SchedulerParameters=bf_max_job_test=500,bf_max_job_user=10

For other parameters see:

http://slurm.schedmd.com/slurm.conf.html

2) Your PriorityWeightFairShare gives fairshare the same weight as the other factors, so fairshare is not playing a major role in determining job priorities. We suggest increasing the value from 1000 to 10000 so that fairshare becomes a dominant factor in the priority calculation.
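As a sketch, the relevant slurm.conf lines would then read as follows (the other weights are left at their current values from the configuration above; the exact numbers are site-specific):

```shell
PriorityType=priority/multifactor
PriorityWeightFairShare=10000
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=1000
```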

You can also see the output of squeue with the %S format which will give you the expected start time of the pending jobs.

david@prometeo ~/slurm/work $ squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R %S"
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) START_TIME
       6778_[5-20]    markab  sleepme    david PD       0:00      1 (Resources) 2016-01-21T10:22:29
            6778_1    markab  sleepme    david  R       1:42      1 prometeo 2015-01-21T10:22:29

A recommended tutorial about how to tune scheduling can be found here:

http://slurm.schedmd.com/SUG14/sched_tutorial.pdf

David
Comment 3 Will French 2015-01-21 08:07:36 MST
Hi David,

Thanks for the response. We weren't aware of all these backfill parameters, so we have started tweaking them here and there. Your suggestions seemed to help but we are still seeing some oddities that make me think there is an additional parameter that needs to be adjusted. Reading through the SchedulerParameters section, nothing jumps out.

We have set bf_max_job_test=1000 and bf_max_job_user=50. Currently we have 815 jobs in the pending state:

[frenchwr@vmps65 slurm]$ squeue --states=pending | wc -l
815

Many of these are pending due to group limits, but several are also pending due to low priority. For instance, I submitted a short Hello World batch script about half an hour ago that has still not run, even though there are 111 nodes that are completely idle:

[frenchwr@vmps65 slurm]$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
production*    up 14-00:00:0    185    mix vmp[101,105,108,112-114,117-118,120,301,303,305-308,310-313,318-319,321-322,325-326,328-330,332-346,349,351-377,379-380,502-505,508,510-511,513,515-517,519,523,527,529,533,610-648,652-653,659-662,664-690,1001-1003,1041-1044,1047-1053,1082-1083,1086,1088-1093,1095]
production*    up 14-00:00:0     35  alloc vmp[102-103,106-107,109-110,115-116,119,302,304,309,314-317,320,323-324,327,331,347-348,350,378,506-507,509,609,1046,1054,1081,1084,1087,1094]
production*    up 14-00:00:0    111   idle vmp[512,514,518,520-522,524-526,528,530-532,534-548,552-574,602-608,1007-1039,1045,1055-1059,1061-1073,1085]
gpu            up 14-00:00:0      3  down* vmp[810,814,839]
gpu            up 14-00:00:0      8  alloc vmp[801-803,805-809]
gpu            up 14-00:00:0     27   idle vmp[811-813,815-819,821-838,840]




In fact, it appears that SLURM has not even attempted to backfill schedule my job. I gather this from trying to get the projected start time of the job, which (from http://slurm.schedmd.com/slurm.conf.html) is supposed to have a value assigned if SLURM has attempted to schedule it:

[frenchwr@vmps65 slurm]$ squeue --start | head -n 1
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
[frenchwr@vmps65 slurm]$ squeue --start | grep frenchwr
             19253 productio testjob5 frenchwr PD                 N/A      1 (null)               (Priority)



It's unclear why SLURM has not attempted to schedule this job if there are a total of 814 pending jobs and bf_max_job_test=1000. Additionally, shouldn't bf_max_job_user=50 ensure that SLURM attempts to backfill schedule up to 50 of my pending jobs? Are we missing something?
Comment 4 David Bigagli 2015-01-21 08:25:22 MST
Can you please send us your slurm.conf and the output of sprio and squeue --start for all jobs? Do the jobs use the runtime limit?

David
Comment 5 Will French 2015-01-21 08:43:19 MST
Created attachment 1568 [details]
slurm.conf
Comment 6 Will French 2015-01-21 08:44:09 MST
Created attachment 1569 [details]
output from sprio
Comment 7 Will French 2015-01-21 08:45:03 MST
Created attachment 1570 [details]
output from squeue --start
Comment 8 Will French 2015-01-21 08:46:38 MST
Not sure what you mean when you say runtime limit. Do you mean the default walltime? If so, we have a 15 minute default set for both our partitions.
Comment 9 David Bigagli 2015-01-21 08:58:36 MST
Hi, yes I meant walltime; we do indeed see 15 minutes as the default and 20160 minutes as the max.

After you changed the configuration, did you restart slurmctld and the slurmds? The sprio command still shows the fairshare priority as 0, and most jobs don't have a predicted start time.

Also, please add bf_continue to the SchedulerParameters:

SchedulerParameters=bf_max_job_test=1000,bf_max_job_user=50,bf_continue

so the backfill scheduler will continue from where it stopped last time instead of restarting from the top of the queue.

David
Comment 10 Will French 2015-01-21 09:52:38 MST
Yes, we did a:

service slurmctld restart
and
scontrol reconfigure

but I did it again just to be sure. I waited several minutes and saw no changes to the queue.

However, adding the bf_continue parameter appears to be helping. More jobs are being scheduled, but the rate at which previously "Priority"-classified jobs get a projected start time and then start up is slow. When I submit a very small test script (15 minutes of walltime), the job takes a while (more than 10 minutes) to get scheduled and run. Is this expected? It seems like the job is not being considered for backfill scheduling for a prolonged period of time.
Comment 11 David Bigagli 2015-01-21 09:59:40 MST
Is the number of jobs with N/A decreasing now? Can we see again please the 
output of squeue --start, sprio and this time also sdiag.

David
Comment 12 Will French 2015-01-21 10:07:38 MST
Created attachment 1571 [details]
squeue --start
Comment 13 Will French 2015-01-21 10:08:25 MST
Created attachment 1572 [details]
2nd output from sprio
Comment 14 Will French 2015-01-21 10:08:54 MST
Created attachment 1573 [details]
sdiag output
Comment 15 Will French 2015-01-21 10:11:17 MST
Yes, the number of lines with N/A decreased gradually. More jobs are getting submitted, so it's a little hard to gauge the effect exactly. Note that I also decreased bf_interval to 5 and bf_max_job_user to 10.
Comment 16 David Bigagli 2015-01-21 10:50:39 MST
Thanks for the data. We are having our scheduling developer analyze it.
Meanwhile, I would like to reproduce the AssocGrpMemoryLimit pending jobs
in your cluster to see if it affects things in any way.

Could you please send us the output of:

'sacctmgr show assoc'
'sacctmgr show qos'
'sacctmgr show users'

Thanks,

  David
Comment 17 Will French 2015-01-21 11:31:14 MST
Created attachment 1574 [details]
associations
Comment 18 Will French 2015-01-21 11:31:43 MST
Created attachment 1575 [details]
qos
Comment 19 Will French 2015-01-21 11:32:10 MST
Created attachment 1576 [details]
Users
Comment 20 Will French 2015-01-21 11:33:49 MST
Thanks, David. We're not as concerned about the jobs being blocked due to AssocGrpMemoryLimit; we have that configured in the SLURM db, as you'll see. Our concern is with low-priority jobs being scheduled slowly while lots of nodes are idle.
Comment 21 David Bigagli 2015-01-21 11:35:00 MST
Yes I understand. I am just trying to see if that may be somehow related.

David
Comment 22 Moe Jette 2015-01-22 04:07:26 MST
All of these jobs hitting association group limits count against the 1000 jobs that the backfill scheduler looks at ("bf_max_job_test=1000"). Since you have over 1000 pending jobs, the scheduler is definitely not looking at them all.

Unless there is a lot of churn in your jobs, running the backfill scheduler really frequently ("bf_interval=5") just wastes time.

This is what we have at Harvard U, which works well for their workload:
SchedulerParameters=bf_interval=600,bf_continue,bf_resolution=300,max_job_bf=5000,bf_max_job_part=5000,bf_max_job_user=100

This is what I would recommend for you, a slight variation of the above:
SchedulerParameters=bf_interval=60,bf_continue,bf_resolution=300,max_job_bf=5000,bf_max_job_user=100

Notes:
bf_interval=60       Running the backfill scheduler once a minute is probably sufficient in most cases
bf_continue          Needed for any system with large job counts
bf_resolution=300    Decreases backfill scheduler overhead
max_job_bf=5000      You want to test most if not all jobs
bf_max_job_user=100  Don't spend too much time on any single user
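After editing slurm.conf with these parameters, a sketch of applying and verifying the change (this requires a running Slurm cluster; the output will vary by site):

```shell
# Propagate the updated slurm.conf, then tell the controller to re-read it:
scontrol reconfigure

# Confirm the new values took effect:
scontrol show config | grep SchedulerParameters
```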

I have other recommendations for your configuration, but let's get jobs running first.
Comment 23 Will French 2015-01-22 05:01:59 MST
Thanks for the reply, Joe.

Things improved considerably starting last night and through this morning. Of our now ~350 non-GPU nodes, none are currently completely idle. I will make the changes you suggested and continue monitoring queued jobs to see how they respond.
Comment 24 Moe Jette 2015-01-22 05:23:00 MST
(In reply to Will French from comment #23)
> Thanks for the reply, Joe.
> 
> Things improved considerably starting last night and through this morning.
> Of our now ~350 non-GPU nodes, none are currently completely idle. I will
> make the changes you suggested and continue monitoring queued jobs to see
> how they respond.

Actually, it's "Moe".

If you are going to be making configuration changes, here are some other suggestions:

DebugFlags=NO_CONF_HASH
This disables testing that your configuration files are consistent across the cluster. This may be fine while you are tuning scheduling, but in general it is a bad idea: if your configurations get out of sync across the cluster, very difficult-to-diagnose communication problems can occur, so I would recommend removing it.

JobCompType=jobcomp/mysql
This is storing redundant accounting information already being stored in the slurmdbd (AccountingStorageType=accounting_storage/slurmdbd) and should be removed.

MaxJobCount=10000
You might want to bump this up.

PriorityWeight*
Consider how you want to prioritize the workload. Weighting all of the factors the same is probably not really what you want.

SelectTypeParameters=CR_CORE_MEMORY
Are you setting default memory limits in the partition configuration?

SlurmctldDebug=debug3
SlurmdDebug=debug2
These are really verbose, to the point of likely impacting performance.
Comment 25 Will French 2015-01-22 05:37:36 MST
Moe, not Joe. Sorry about that!

I went ahead and implemented the first set of changes. I'll wait several hours to see what sort of effect these changes have, then implement the next set of recommended changes. One clarification: I'm assuming that when you wrote "max_job_bf=5000" you meant "bf_max_job_test=5000"?
Comment 26 Moe Jette 2015-01-22 05:40:30 MST
(In reply to Will French from comment #25)
> Moe, not Joe. Sorry about that!
> 
> I went ahead and implemented the first set of changes. I'll wait several
> hours to see what sort of effect these changes have, then implement the next
> set of recommended changes. One clarification: I'm assuming that when you wrote
> "max_job_bf=5000" you meant "bf_max_job_test=5000"?

Sorry, "max_job_bf=5000" is the old form of "bf_max_job_test=5000", but both work. I added a bunch of backfill scheduling parameters and wanted them all to start with "bf_" for better clarity.
Comment 27 Will French 2015-01-22 08:58:34 MST
A few hours later and scheduling is much better than yesterday. Thank you.


> SelectTypeParameters=CR_CORE_MEMORY
> Are you setting default memory limits in the partition configuration?


Do you mean the DefMemPerCPU option in slurm.conf? If so, no, we have not configured that option yet. Or do you mean listing the RealMemory option on the NodeName lines? If so, yes, we do that (usually by logging into the node and running free -g; we then let SLURM drain any nodes that have missing memory and reduce the RealMemory value accordingly).


> SlurmctldDebug=debug3
> SlurmdDebug=debug2
> These is really verbose, to the point of likely impacting performance.

I set both of these to 3.

SlurmctldDebug=3
SlurmdDebug=3
Comment 28 Moe Jette 2015-01-22 09:10:33 MST
(In reply to Will French from comment #27)
> > SelectTypeParameters=CR_CORE_MEMORY
> > Are you setting default memory limits in the partition configuration?
> 
> Do you mean the DefMemPerCPU option in slurm.conf? If so, no, we have not
> configured that option yet. Or do you mean listing the RealMemory option on
> the NodeName lines? If so, yes, we do that (usually by just logging into the
> node and running free -g...we then let SLURM drain any nodes that have
> missing memory and we then reduce the RealMemory value down accordingly) 

The issue is that Slurm is configured to allocate memory to jobs, but without defining DefMemPerCPU/Node and MaxMemPerCPU/Node on a system wide or per partition/queue basis, Slurm has no information as to how much memory each job should be allocated.

You might also use a job_submit plugin for this purpose as was discussed earlier today on the slurm-dev mailing list, but that would be a more complex approach that I would not recommend at this point.
Comment 29 Will French 2015-01-22 09:50:17 MST
Okay, for our CPU-only partition I set:

DefMemPerCPU=1000
DefMemPerNode=2000
MaxMemPerCPU=15375
MaxMemPerNode=123000

The GPU partition is similar, only with different Max values based on the amount of RAM available on these nodes:

MaxMemPerCPU=5625
MaxMemPerNode=45000
Comment 30 Moe Jette 2015-01-22 09:56:56 MST
(In reply to Will French from comment #29)
> Okay, for our CPU-only partition I set:
> 
> DefMemPerCPU=1000
> DefMemPerNode=2000
> MaxMemPerCPU=15375
> MaxMemPerNode=123000
> 
> The GPU partition is similar, only with different Max values based on the
> amount of RAM available on these nodes:
> 
> MaxMemPerCPU=5625
> MaxMemPerNode=45000

See "man slurm.conf":
DefMemPerCPU and DefMemPerNode are  mutually  exclusive.
MaxMemPerCPU and MaxMemPerNode are  mutually  exclusive.

Pick one or the other.
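For example, keeping only the per-CPU forms on each partition line might look like this (a sketch only; the Nodes= lists are placeholders, not the site's actual node ranges):

```shell
# slurm.conf -- per-CPU and per-node memory limits are mutually exclusive,
# so each partition sets only the per-CPU variants:
PartitionName=production Nodes=vmp101 DefMemPerCPU=1000 MaxMemPerCPU=15375
PartitionName=gpu Nodes=vmp801 DefMemPerCPU=1000 MaxMemPerCPU=5625
```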
Comment 31 Moe Jette 2015-01-23 03:18:32 MST
How are things running now?
Can I downgrade the severity of this bug or close it?
Comment 32 Will French 2015-01-23 03:20:38 MST
Yes, you can close this ticket. We're happy with scheduling at this point and have a much better understanding of which knobs to turn for tuning. Thank you for all the assistance.

Will
Comment 33 Moe Jette 2015-01-23 03:28:05 MST
Closed per client request.
Comment 34 Will French 2015-02-03 07:54:04 MST
I'm re-opening this ticket because we're experiencing some issues that appear to be related to our backfill parameters.

Our users have reported seeing messages like the following when attempting to run sbatch, squeue, or other SLURM commands:

sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

We gather that this is related to the SLURM controller being busy due to computationally demanding tasks within the backfill algorithm. We don't consider this to be a big problem for users submitting from the command line, but many of our users have automated pipelines that are failing because of this error. At the moment we are using:

SchedulerParameters=bf_interval=60,bf_continue,bf_resolution=300,bf_max_job_test=5000,bf_max_job_user=300

Our general understanding is that we can improve responsiveness by increasing bf_interval and decreasing the other parameters. Is responsiveness more sensitive to one of these parameters than to the others?

Would you recommend we try the "defer" option? Relatedly, could you explain what the process is for a new job that is submitted to the scheduler? I had assumed that SLURM only attempted to schedule jobs every bf_interval (60 in our case) seconds. So if the controller attempted to schedule jobs at 2 AM (2:00:00), and a job is submitted at 2:00:01, SLURM would not attempt to schedule that job until 2:01:00. The "defer" option leads me to believe that my understanding of how the backfill algorithm works is wrong.

One last thing: bf_window should be in minutes, correct? That's what I see in old documentation (https://computing.llnl.gov/linux/slurm/slurm.conf.html), but the docs on the SchedMD site do not specify units (http://slurm.schedmd.com/sched_config.html). Could increasing bf_window improve responsiveness? I understand it will increase the computational load when initially attempting to schedule a job, but I'm wondering whether, once a job is scheduled, SLURM will stop trying to schedule it each backfill iteration?
Comment 35 Moe Jette 2015-02-03 08:39:45 MST
(In reply to Will French from comment #34)
> I'm re-opening this ticket because we're experiencing some issues that
> appear to be related to our backfill parameters.
> 
> Our users have reported seeing messages like the following when attempting
> to run sbatch, squeue, or other SLURM commands:
> 
> sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation
> sbatch: error: Batch job submission failed: Socket timed out on send/recv
> operation
> 
> We gather that this is related to the SLURM controller being busy due to
> computationally demanding tasks within the backfill algorithm. We don't
> consider this to be a big problem for users submitting from the command
> line, but many of our users have automated pipelines that are failing
> because of this error. At the moment we are using:
> 
> SchedulerParameters=bf_interval=60,bf_continue,bf_resolution=300,
> bf_max_job_test=5000,bf_max_job_user=300
> 
> Our general understanding is that we can improve responsiveness by
> increasing bf_interval and decreasing the other parameters. Is
> responsiveness most sensitive to one of these parameters more so than the
> others? 

There are a lot of variables involved. Could you send a current output of the "sdiag" command so we can best advise how to proceed.


> Would you recommend we try the "defer" option?

Probably not unless you are submitting 100+ jobs/second.


> Relatedly, could you explain
> what the process is for a new job that is submitted to the scheduler? I had
> assumed that SLURM only attempted to schedule jobs every bf_interval (60 in
> our case) seconds. So if the controller attempted to schedule jobs at 2 AM
> (2:00:00), and a job is submitted at 2:00:01, SLURM would not attempt to
> schedule that job until 2:01:00. The "defer" option leads me to believe that
> my understanding of how the backfill algorithm works is wrong.

I'll reference the tutorial here:
http://slurm.schedmd.com/SUG14/sched_tutorial.pdf

The scheduler will run immediately at job submit time (see pages 8 and 9). If resources are available, and the scheduler goes far enough down the list of jobs to reach it, and it is the highest priority pending job in its queue, it will start immediately.

Once each minute, all of the jobs get checked (see page 10), but only the highest priority jobs in each queue can be started.

Once each "bf_interval" (the timer starts after the previous backfill scheduler run completes), the backfill scheduler will go as deep into the queues as needed to determine when and where each pending job will start (see pages 14 - 25). Depending upon your configuration and workload, the backfill scheduler might take several minutes to complete a cycle.
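The backfill cycle statistics described above can be inspected with sdiag; a sketch (requires a running Slurm cluster, and the exact section header may vary by version):

```shell
# Show how long backfill cycles take and how deep into the queue they reach:
sdiag | grep -A 10 'Backfilling stats'
```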

It might be helpful for you to contact Jacob Jenson (jacob@schedmd.com) to be notified when we have our next training. We cover this sort of thing in great detail in a 2-day training session.


> One last thing, bf_window should be in minutes, correct? That's what I see
> in old documentation
> (https://computing.llnl.gov/linux/slurm/slurm.conf.html) but docs on the
> SchedMD site does not specify units
> (http://slurm.schedmd.com/sched_config.html).

That will be fixed soon.

> Could increasing bf_window
> improve responsiveness?

That may make the sluggish responses less frequent rather than eliminating them.

> I understand that it will increase the computational
> load when initially attempting to schedule a job, but I'm wondering if once
> a job is scheduled, SLURM will stop trying to schedule that job each
> backfill iteration?

Every pending job (per your configuration) gets checked on each backfill cycle.
Comment 36 Will French 2015-02-04 05:51:28 MST
Hi Moe,

Thanks for the quick reply and detailed explanation. It's very helpful. About the training -- is that on-site or is there an option to remote in?

I'm attaching the output from sdiag.

Note we made a few changes to our backfill parameters since yesterday:

SchedulerParameters     = bf_interval=120,bf_continue,bf_resolution=300,bf_max_job_test=1000,bf_max_job_user=50

One other question: how does the backfill algorithm handle job arrays? Really, what I want to know is whether job arrays are scheduled more efficiently (improving responsiveness) than the equivalent of submitting n jobs by invoking sbatch n times. We have a lot of users who do the latter.

Best,

Will
Comment 37 Will French 2015-02-04 05:52:09 MST
Created attachment 1612 [details]
sdiag output
Comment 38 Moe Jette 2015-02-04 05:57:25 MST
(In reply to Will French from comment #36)
> Hi Moe,
> 
> Thanks for the quick reply and detailed explanation. It's very helpful.
> About the training -- is that on-site or is there an option to remote in?

Both are options.


> I'm attaching the output from sdiag.
> 
> Note we made a few changes to our backfill parameters since yesterday:
> 
> SchedulerParameters     =
> bf_interval=120,bf_continue,bf_resolution=300,bf_max_job_test=1000,
> bf_max_job_user=50
> 
> One other question -- how does the backfill algorithm handle job arrays?
> Really what I want to know is if job arrays are scheduled more efficiently
> (improves responsiveness) than the equivalent of submitting n jobs by
> invoking sbatch n times. We have a lot of users who do the latter.

Job arrays are MUCH more efficient with respect to scheduling and general system overhead. For most logic, the entire array is treated as a single job record.
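A minimal sketch of the difference (the script name is hypothetical):

```shell
# 500 separate submissions -- 500 job records for the scheduler to walk:
for i in $(seq 1 500); do sbatch myscript.slurm; done

# One job array -- treated as a single job record for most scheduler logic:
sbatch --array=1-500 myscript.slurm
# Inside the script, $SLURM_ARRAY_TASK_ID distinguishes the tasks.
```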


> Best,
> 
> Will
Comment 39 Moe Jette 2015-02-04 07:10:02 MST
(In reply to Will French from comment #37)
> Created attachment 1612 [details]
> sdiag output

I'm going to suggest an addition to your SchedulerParameters option that should greatly reduce worst-case delays, but probably not make much difference to average delays:
max_sched_time=2

I see your current worst case consuming over 5 seconds (half of MessageTimeout):
> Main schedule statistics (microseconds):
> 	Max cycle:    5363032
>	Mean cycle:   259843

As I understand it, that would give you:
SchedulerParameters=bf_interval=120,bf_continue,bf_resolution=300,bf_max_job_test=1000,bf_max_job_user=50,max_sched_time=2

There is also one other thing in the sdiag output that I'll need to have someone investigate:
ACCOUNTING_UPDATE_MSG (10001) count:9 ave_time:8451672 total_time:76065050
Comment 40 Moe Jette 2015-02-04 07:42:54 MST
One more thing about your configuration:

SlurmctldDebug          = debug3
SlurmdDebug             = debug2

These are so detailed, that it will definitely adversely impact performance, especially when it comes to accounting and scheduling logic. You probably want to normally run with them both set to "info" or "verbose".
Comment 41 Will French 2015-02-04 07:50:12 MST
We actually have:

SlurmctldDebug=3
SlurmdDebug=3

which appears to correspond to "info":

root@vmps11:~# scontrol show config | grep -i debug
DebugFlags              = (null)
SlurmctldDebug          = info
SlurmdDebug             = info

You may have been looking at the older version of the slurm.conf attached to this thread.

I went ahead and added max_sched_time=2 per your suggestion. Thanks for looking into this.