Ticket 2235

Summary:	Lower priority jobs skips the queue
Product:	Slurm	Reporter:	Akmal Madzlan <akmalm>
Component:	Scheduling	Assignee:	Tim Wickberg <tim>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	phils
Version:	14.11.10
Hardware:	Linux
OS:	Linux
Site:	DownUnder GeoSolutions	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	slurmctld log

Description Akmal Madzlan 2015-12-09 13:49:18 MST

Created attachment 2496 [details]
slurmctld log

teamfraser,team 500        999_migprep_dz4_filter                           phils PD       0:00      1 (Resources)     17119018_[3077-3492]

* It's eligible to run on many partitions (specifically including teamrottoGpu and idleGpu, which will be relevant in a moment):

Partition=teamfraser,teamfraserGpu,idle,idleGpu,teamrottoGpu,teamregentGpu,tomo

* It was submitted 45 minutes ago, so there should be no other tasks, on any of those partitions, started in the last 45 minutes. And there aren't. Except for all of these:

idleGpu         190        wa512Pr_KDM                                      phils  R      30:01      1 pnod0804        17100520_700
teamrottoGpu    190        wa512Pr_KDM                                      phils  R      26:41      1 pnod0787        17100520_701
idleGpu         190        wa512Pr_KDM                                      phils  R      26:41      1 clus290         17100520_702
idleGpu         190        wa512Pr_KDM                                      phils  R      23:21      1 pnod0822        17100520_703
idleGpu         190        wa512Pr_KDM                                      phils  R      20:00      1 pnod0805        17100520_704
idleGpu         190        wa512Pr_KDM                                      phils  R      16:40      1 clus361         17100520_705
idleGpu         190        wa512Pr_KDM                                      phils  R      15:49      1 pnod0871        17100520_706
idleGpu         190        wa512Pr_KDM                                      phils  R      14:59      1 pnod0799        17100520_707
teamrottoGpu    190        wa512Pr_KDM                                      phils  R      12:29      1 clus145         17100520_708
idleGpu         190        wa512Pr_KDM                                      phils  R      12:29      1 pnod0798        17100520_709
teamrottoGpu    190        wa512Pr_KDM                                      phils  R       9:59      1 pnod0753        17100520_710
teamrottoGpu    190        wa512Pr_KDM                                      phils  R       9:09      1 pnod0751        17100520_711
idleGpu         190        wa512Pr_KDM                                      phils  R       9:09      1 clus282         17100520_712
idleGpu         190        wa512Pr_KDM                                      phils  R       8:19      1 clus281         17100520_713
teamrottoGpu    190        wa512Pr_KDM                                      phils  R       7:28      1 pnod0779        17100520_714
idleGpu         190        wa512Pr_KDM                                      phils  R       7:28      1 clus627         17100520_715
idleGpu         190        wa512Pr_KDM                                      phils  R       6:38      1 clus359         17100520_716
idleGpu         190        wa512Pr_KDM                                      phils  R       4:08      1 pnod0792        17100520_717
teamrottoGpu    190        wa512Pr_KDM                                      phils  R       1:38      1 clus271         17100520_718
teamrottoGpu    190        wa512Pr_KDM                                      phils  R      10:01      1 pnod0764        17100520_719
idleGpu         190        wa512Pr_KDM                                      phils  R       5:51      1 clus158         17100520_720
idleGpu         190        wa512Pr_KDM                                      phils  R      30:52      1 pnod0827        17100520_738
idleGpu         190        wa512Pr_KDM                                      phils  R      20:00      1 pnod0821        17100520_739
idleGpu         190        wa512Pr_KDM                                      phils  R      14:59      1 pnod0806        17100520_740
idleGpu         190        wa512Pr_KDM                                      phils  R      12:29      1 pnod0809        17100520_741

I keep requeueing them, and they keep slowly coming back on.

Furthermore, it's only those tasks. 

Most of those jobs is scheduled by backfill. Here's sdiag output if it could be any help

[root@pud22 ~]# sdiag
*******************************************************
sdiag output at Thu Dec 10 09:13:30 2015
Data since      Thu Dec 10 08:00:00 2015
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 105
Jobs started:   367
Jobs completed: 361
Jobs canceled:  7
Jobs failed:    0

Main schedule statistics (microseconds):
	Last cycle:   29442
	Max cycle:    85066
	Total cycles: 423
	Mean cycle:   29755
	Mean depth cycle:  25
	Cycles per minute: 5
	Last queue length: 18

Backfilling stats
	Total backfilled jobs (since last slurm start): 16224
	Total backfilled jobs (since last stats cycle start): 42
	Total cycles: 85
	Last cycle when: Thu Dec 10 09:13:23 2015
	Last cycle: 27023
	Max cycle:  105529
	Mean cycle: 60974
	Last depth cycle: 19
	Last depth cycle (try sched): 19
	Depth Mean: 32
	Depth Mean (try depth): 25
	Last queue length: 18
	Queue length mean: 32


[root@pud22 ~]# scontrol show config | grep SchedulerPa
SchedulerParameters     = bf_continue,bf_max_job_test=1000,bf_interval=50,bf_resolution=600,defer,max_depend_depth=3,sched_interval=20,batch_sched_delay=3,bf_max_job_part=100,nohold_on_prolog_fail

Comment 1 Tim Wickberg 2015-12-10 08:28:46 MST

Can you describe the problem in a bit more detail? 

It's not clear to me what may be happening - from the description and snippets of log it looks like those jobs are simply backfilling as intended ahead of other jobs on the system?

Can you attach output from sinfo, squeue, and your slurm.conf file? sdiag doesn't tell me much here, and the reading through log without the slurm.conf and squeue status to refer to is difficult.

cheers,
- Tim

Comment 2 Phil Schwan 2015-12-10 14:09:43 MST

(In reply to Tim Wickberg from comment #1)
> Can you describe the problem in a bit more detail? 
> 
> It's not clear to me what may be happening - from the description and
> snippets of log it looks like those jobs are simply backfilling as intended
> ahead of other jobs on the system?

Well, it may simply be backfill behaviour.  Indeed, I have long believed that backfill is totally broken as designed.  I will explain.

My understanding is that the backfill scheduler is basically continuously walking the queue, trying every task to see if it can be scheduled yet.  If a node becomes available while it's halfway through the queue, it will schedule the next task -- even if it's nowhere near the top priority.  This is, I think we can all agree, broken behaviour for a queueing system.

You might then argue "Then you don't want backfill!  Stop using it!"

The problem with that argument is that without backfill, a single task at the top of the queue that can't be scheduled -- e.g. because it requires a licence or feature or amount of memory not currently available -- will prevent anything else from being scheduled.  Even if there are 100s of available nodes, and 100s of tasks that _can_ run, the high-priority task waiting for a specific resource blocks them all.

So with a heterogeneous cluster, where some tasks can only run on certain nodes, we need backfill (or something like it).

Normally I'm willing to put up with backfill's nonsense, because it tends to be fairly rare.  The problem that Akmal described in comment 0, however, was that it was happening A LOT.  Way more than I've ever seen.  Fully 50% of my nodes were consumed by low-priority tasks!

And it's not like those 50% of nodes all came available at once, and were scheduled in one giant backfill orgasm.  It was one at a time, the same backfill queue-jumping behaviour replayed in single file, over and over.

Furthermore, I would expect the backfill queue-jumping to be largely random: i.e. at whatever point in the queue it's reached when the node comes available, that's the task that get scheduled.  But that's not what we see in Akmal's data.  Despite there being dozens of jobs in the queue, it was always choosing consecutive tasks from the same job to jump the queue.

This suggested to me that something worse than its normal misbehaviour was at play.

> Can you attach output from sinfo, squeue, and your slurm.conf file? sdiag
> doesn't tell me much here, and the reading through log without the
> slurm.conf and squeue status to refer to is difficult.

Normally yes; I would have this all in my copious scrollback, but something crashed my machine yesterday.

I can tell you that all partitions were full (no idle nodes), and that as nodes finished these 5-10 minute tasks, they were either scheduling from a priority-600 job (correctly) or from this priority-190 job (incorrectly).

I'm willing to accept that there's not enough information at this point to pursue this with fervor, and simply note it as a data point for the future.  I'd also like to embark on a longer conversation about why backfill's design doesn't work for us, but that's probably not best done in bugzilla.

Comment 3 Tim Wickberg 2015-12-11 06:49:21 MST

(In reply to Phil Schwan from comment #2)
> My understanding is that the backfill scheduler is basically continuously
> walking the queue, trying every task to see if it can be scheduled yet.  If
> a node becomes available while it's halfway through the queue, it will
> schedule the next task -- even if it's nowhere near the top priority.  This
> is, I think we can all agree, broken behaviour for a queueing system.

That's not how backfill works, at least not with Slurm. (I can't speak for other schedulers, but if they do such a thing I would agree that it is broken behavior.)

There's a set of locks used while running through the backfill scheduling pass. While running through node and job changes are seen by the backfill scheduler - there's no "jumping" the queue if a node was returned in the middle of schedule planning.

The bf_continue setting can lead to something vaguely related to what you suggest, but that would only be for high-priority jobs submitted while the backfill scheduler is going through the schedule plan, and would be unlikely in most circumstances.

The backfill scheduler specifically will never schedule a lower priority job if that job would impact the scheduled start time of a higher priority job.

My guess is that you may be running either with very high default time limits (or worse, no time limit), and this is leading to the backfill scheduler opportunistically running anything with a relatively short time limit. With high, or no limits, it may be that the expected start time for most of your jobs is significantly far out in the future.

> You might then argue "Then you don't want backfill!  Stop using it!"
> 
> The problem with that argument is that without backfill, a single task at
> the top of the queue that can't be scheduled -- e.g. because it requires a
> licence or feature or amount of memory not currently available -- will
> prevent anything else from being scheduled.  Even if there are 100s of
> available nodes, and 100s of tasks that _can_ run, the high-priority task
> waiting for a specific resource blocks them all.
> 
> So with a heterogeneous cluster, where some tasks can only run on certain
> nodes, we need backfill (or something like it).
> 
> Normally I'm willing to put up with backfill's nonsense, because it tends to
> be fairly rare.  The problem that Akmal described in comment 0, however, was
> that it was happening A LOT.  Way more than I've ever seen.  Fully 50% of my
> nodes were consumed by low-priority tasks!
> 
> And it's not like those 50% of nodes all came available at once, and were
> scheduled in one giant backfill orgasm.  It was one at a time, the same
> backfill queue-jumping behaviour replayed in single file, over and over.
> 
> Furthermore, I would expect the backfill queue-jumping to be largely random:
> i.e. at whatever point in the queue it's reached when the node comes
> available, that's the task that get scheduled.  But that's not what we see
> in Akmal's data.  Despite there being dozens of jobs in the queue, it was
> always choosing consecutive tasks from the same job to jump the queue.
> 
> This suggested to me that something worse than its normal misbehaviour was
> at play.

Again, this would be best described by some other mix of settings. The "queue-jumping" behavior you describe does not exist in Slurm.

> > Can you attach output from sinfo, squeue, and your slurm.conf file? sdiag
> > doesn't tell me much here, and the reading through log without the
> > slurm.conf and squeue status to refer to is difficult.
> 
> Normally yes; I would have this all in my copious scrollback, but something
> crashed my machine yesterday.
> 
> I can tell you that all partitions were full (no idle nodes), and that as
> nodes finished these 5-10 minute tasks, they were either scheduling from a
> priority-600 job (correctly) or from this priority-190 job (incorrectly).
> 
> I'm willing to accept that there's not enough information at this point to
> pursue this with fervor, and simply note it as a data point for the future. 
> I'd also like to embark on a longer conversation about why backfill's design
> doesn't work for us, but that's probably not best done in bugzilla.

If you can provide better data to look through I'm happy to verify the behavior you're seeing. Otherwise I'll go ahead and mark this as resolved/infogiven for now - you can always reopen or submit a new bug later with further details.

If you'd like to discuss some potential alternative scheduling logic here is fine, or you're welcome to email me directly (tim@schedmd.com).

- Tim

Comment 5 Phil Schwan 2015-12-20 11:41:28 MST

Thanks, Tim, for your detailed reply.

(In reply to Tim Wickberg from comment #3)
> 
> The bf_continue setting can lead to something vaguely related to what you
> suggest, but that would only be for high-priority jobs submitted while the
> backfill scheduler is going through the schedule plan, and would be unlikely
> in most circumstances.

Good to know; that's definitely not this situation.

> The backfill scheduler specifically will never schedule a lower priority job
> if that job would impact the scheduled start time of a higher priority job.
> 
> My guess is that you may be running either with very high default time
> limits (or worse, no time limit), and this is leading to the backfill
> scheduler opportunistically running anything with a relatively short time
> limit. With high, or no limits, it may be that the expected start time for
> most of your jobs is significantly far out in the future.

Our time limits are 1 hour for all jobs.  We can't estimate how long a task will actually take, if for no other reason than because we have a heterogeneous cluster -- i.e. runtime strongly depends on which node gets assigned!  As we're unwilling to terminate tasks after a fixed arbitrary time, it means that everything ends up set to 1h.

Would this lead to the behaviour that I describe?

I had it happen again just now -- literally as I was writing this -- albeit in a much more limited form than in comment 0.  Fortunately, the situation was fairly static at the time, with relatively few jobs on the system, and no new ones being submitted.

The three nodes onto which tasks end up incorrectly scheduled (pnod0813, pnod0815, pnod0817) are covered by two partitions (teamswanGpu and idleGpu).  teamswanGpu has a higher priority (21) than idleGpu (11).

There are three relevant jobs in this example:

100_job - priority 100 on partitions teamswanGpu,idleGpu
150_job - priority 150 on idleGpu
 50_job - priority  50 on idleGpu

In this situation, with teamswanGpu being the partition of higher priority, I would expect three tasks from the 100_job to be scheduled when those three nodes came available.

The teamswanGpu job:

> JobId=17249010 ArrayJobId=17249010 ArrayTaskId=2190 JobName=100_job
>    Priority=100 Nice=0 Account=user QOS=normal
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
>    SubmitTime=2015-12-16T16:01:37 EligibleTime=2015-12-16T16:01:38
>    StartTime=2016-12-20T08:28:54 EndTime=2015-12-21T09:16:04
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=teamswan,teamswanGpu,idle,tomo AllocNode:Sid=pud178:29605

The high-priority idleGpu job:

> JobId=17305346 ArrayJobId=17305346 ArrayTaskId=2822 JobName=150_job
>    Priority=150 Nice=0 Account=user QOS=normal
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
>    SubmitTime=2015-12-19T11:33:42 EligibleTime=2015-12-19T11:33:45
>    StartTime=2016-12-20T08:14:36 EndTime=2015-12-21T09:14:36
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=idleGpu,teamcottGpu AllocNode:Sid=pud80:1348

The low-priority idleGpu job:

> JobId=17115775 ArrayJobId=17115775 ArrayTaskId=104 JobName=50_job
>    Priority=50 Nice=0 Account=user QOS=normal
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
>    SubmitTime=2015-12-09T18:57:19 EligibleTime=2015-12-09T18:57:19
>    StartTime=Unknown EndTime=2015-12-21T09:03:43
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=idleGpu AllocNode:Sid=pud155:2818

All were submitted more than a day ago.

When those three nodes came free, these three tasks were all scheduled out-of-order (and even within job 17115775, if it were operating in proper priority order, why would it not take 3 consecutive tasks?  93-101 remained pending.)

> idleGpu         50         50_job       user  R       0:08      1 pnod0813        17115775_92
> idleGpu         50         50_job       user  R       0:08      1 pnod0815        17115775_102
> idleGpu         50         50_job       user  R       0:08      1 pnod0817        17115775_103

When I requeued them, the nodes were then assigned to tasks from the 150_job:

> idleGpu         150        150_job      user  R      25:31      1 pnod0813        17305346_2036
> idleGpu         150        150_job      user  R      25:31      1 pnod0815        17305346_2037
> idleGpu         150        150_job      user  R      25:31      1 pnod0817        17305346_2038

From slurmctld.log:

> [2015-12-21T08:03:43.014] backfill: Started JobId=17115775_92 (17305337) on pnod0813
> [2015-12-21T08:03:43.017] backfill: Started JobId=17115775_102 (17317739) on pnod0815
> [2015-12-21T08:03:43.024] backfill: Started JobId=17115775_103 (17317740) on pnod0817
> ...
> [2015-12-21T08:04:02.313] Requeuing JobID=17115775_103(17317740) State=0x0 NodeCnt=0
> [2015-12-21T08:04:02.314] Requeuing JobID=17115775_92(17305337) State=0x0 NodeCnt=0
> [2015-12-21T08:04:02.314] Requeuing JobID=17115775_102(17317739) State=0x0 NodeCnt=0
> ...
> [2015-12-21T08:04:33.073] backfill: Started JobId=17305346_2036 (17317749) on pnod0813
> [2015-12-21T08:04:33.078] backfill: Started JobId=17305346_2037 (17317750) on pnod0815
> [2015-12-21T08:04:33.084] backfill: Started JobId=17305346_2038 (17317751) on pnod0817

I'm very happy to send sinfo, squeue, and slurm.conf, but as they contain some potentially-sensitive information about users and jobs, I'd prefer to send them outside of the public bugzilla.

Is there anything else I should gather while the event is fresh?

Comment 6 Tim Wickberg 2015-12-21 04:05:51 MST

There are two DebugFlags that may shed some light - "backfill" and "backfillmap". Note that these will generate a considerable amount of output, you'd likely only want them on briefly using "scontrol setdebugflags +backfill" (then -backfill to disable).

If you see this happen again, output from `sinfo`, `squeue`, `scontrol show nodes`, `scontrol show jobs` plus the logs for slurmctld would be helpful. Emailing them to me is fine.

Comment 7 Tim Wickberg 2016-01-06 06:22:39 MST

I'm checking back in on open bugs - has this recurred in the past few weeks?

Comment 8 Phil Schwan 2016-01-06 09:06:17 MST

It definitely has, but so unpredictably that I'm not sure yet how to capture it in the logs (short of enabling backfill{,map} full-time, which I'm not sure is practical...)

I may go ahead and enable them the next time our cluster is a little quieter.  It's chockablock right now.

Comment 9 Moe Jette 2016-01-14 03:58:11 MST

If you have "group" limits by association or QOS (e.g. total number of running jobs by group), a heuristic in Slurm's backfills scheduler could have been responsible for this. The relevant patch is here:
https://github.com/SchedMD/slurm/commit/3ee1632f96e943227365429db4e58d5467bc1a2a.patch

Comment 10 Phil Schwan 2016-01-16 12:35:30 MST

(In reply to Moe Jette from comment #9)
> If you have "group" limits by association or QOS (e.g. total number of
> running jobs by group), a heuristic in Slurm's backfills scheduler could
> have been responsible for this.

A good one to fix, but nope, we're not doing any group limits.

I'll spend some time this week trying to reproduce it on one of our quieter clusters where I can turn the logging all the way up.

Comment 11 Tim Wickberg 2016-02-02 07:11:37 MST

Any luck reproducing this?

Comment 12 Tim Wickberg 2016-02-05 08:38:25 MST

I'm going to go ahead and mark this as infogiven for now. Please re-open when/if this reoccurs.

Comment 13 Phil Schwan 2016-02-05 09:41:20 MST

Yep, fair enough.  I wasn't able to reproduce on a different, more-idle cluster, so I suspect I'll just have to do it on our big one and cope with the massive log output.