Ticket 2217 - pending jobs with available resources
Summary: pending jobs with available resources
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 15.08.8
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-12-02 23:54 MST by steven fellini
Modified: 2016-04-14 04:43 MDT
3 users

See Also:
Site: NIH
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 16.05.0-pre2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
output of spend command (11.91 KB, text/plain)
2015-12-02 23:54 MST, steven fellini
Details
slurm.conf (12.65 KB, text/plain)
2015-12-02 23:55 MST, steven fellini
Details
squeue output (144.31 KB, text/plain)
2015-12-03 07:34 MST, steven fellini
Details
spend output (20.16 KB, text/plain)
2015-12-03 07:36 MST, steven fellini
Details
cluster state 2016-01-13 (465.68 KB, application/x-gzip)
2016-01-19 23:24 MST, steven fellini
Details
cluster state 2016-01-28 (415.68 KB, application/octet-stream)
2016-01-28 06:10 MST, steven fellini
Details
cluster state 2016-03-04 (360.64 KB, application/x-gzip)
2016-03-03 23:03 MST, steven fellini
Details
slurm.conf (13.75 KB, text/plain)
2016-03-04 01:07 MST, steven fellini
Details
slurmdbd.conf (1.18 KB, text/plain)
2016-03-04 04:17 MST, steven fellini
Details
ctld.log (14.17 MB, text/plain)
2016-03-04 04:35 MST, rl303f
Details
sdiag part 1 (674.39 KB, image/tiff)
2016-03-04 05:21 MST, steven fellini
Details
sdiag part 2 (717.12 KB, image/tiff)
2016-03-04 05:22 MST, steven fellini
Details
sdiag part 3 (806.41 KB, image/tiff)
2016-03-04 05:22 MST, steven fellini
Details
Backport of bug_2565 patch to v15.08.10 (7.03 KB, patch)
2016-04-11 04:47 MDT, Moe Jette
Details | Diff

Description steven fellini 2015-12-02 23:54:31 MST
Created attachment 2478 [details]
output of spend command

We have an ongoing problem with jobs in a specific partition which
appear in PENDING state when there are sufficient resources to run them.  
Enclosed are: 
(1) output from an NIH-written program ("spend") which summarizes pending 
jobs with reasons of “Resources” or “Priority”.  The second section reports 
available resources in that partition, and the third section identifies some 
specific nodes which have the most free resources (by cpu & memory).
(2) slurm.conf

Looking at the ‘spend’ output: the few dozen jobs of user patidarr are
waiting for resources.  These jobs are requesting exclusive 65/240GB
nodes, and none are available, so these should be pending.

However users buhuleod, davisbw and frankg have pending
jobs that are requesting 1 or 8 cpus and 10GB or less.  There are 
hundreds of free nodes meeting these requirements, yet they are in a 
PENDING/Priority state.  These jobs will generally remain queued several 
hours before they eventually get scheduled to run.  If we bump the 
priority of these jobs they will generally start.
Comment 1 steven fellini 2015-12-02 23:55:30 MST
Created attachment 2479 [details]
slurm.conf
Comment 2 Tim Wickberg 2015-12-03 06:51:53 MST
Can you confirm that those stuck jobs aren't requesting any other resources such as licenses or features?

Actually - looking at the spend.txt output I think things are working as expected. The highest priority jobs are all exclusive-node, so Slurm is trying to drain out nodes to provide to those jobs.

Without seeing the current node usage and remaining run times on the nodes, I'm guessing that the jobs that could fit in the limited available memory all have run times that don't allow them to be backfilled in at present. The backfill scheduler is conservative - if backfilling a job would delay the launch of the job expected to launch next on that node by even a second it will not schedule that job to launch.

If you're manually increasing the priority, then those jobs can launch: they're no longer subject to that constraint, since they've become the highest priority jobs in the queue.

If you still think something else is at play, can you provide output from something like:

squeue --format "%.18i %.9P %10S %.7Q %.10l %.2t %.10M %.6D %.4C %R"

rather than the spend command?

Increasing the debuglevel and looking through slurmctld.log may also be helpful. ("scontrol setdebug verbose" and maybe also "scontrol setdebugflags +backfill")

- Tim
Comment 3 steven fellini 2015-12-03 07:34:46 MST
Created attachment 2480 [details]
squeue output
Comment 4 steven fellini 2015-12-03 07:35:37 MST
Hi Tim,
No licenses or features were requested by those pending jobs.

Here are the jobs in question:
User                         Jobid   Part     Prio          Reason Excl   Nnodes  Ncpus        Mem
--------------------------------------------------------------------------------------------------
buhuleod            7095988[0-199]     b1       51        Priority    -        1      1  10GB/node
buhuleod            7095993[0-199]     b1       51        Priority    -        1      1  10GB/node
davisbw             7091501[0-899]     b1       47       Resources    -        1      1   2GB/node
frankg              7096084[0-848]     b1       32        Priority    -        1      8   9GB/node


The job pending for Resources is asking for 1 cpu/2 GB.  There are over 500 nodes that the jobs 
could run on.  All of the higher priority pending jobs are waiting on nodes with more memory (to
satisfy 65/240GB requests).  The higher priority jobs could never run on the 500 free nodes we're
talking about.

I've just run your squeue and another spend.  Look at job 7114232.  It's been queued since yesterday
and wants 1 node, 16 cpu, 12 GB.  There are at least 40 nodes on which it could fit and on which
none of the higher priority jobs could fit.
Comment 5 steven fellini 2015-12-03 07:36:25 MST
Created attachment 2481 [details]
spend output
Comment 6 Tim Wickberg 2015-12-03 08:13:23 MST
(In reply to steven fellini from comment #4)
> I've just run your squeue and another spend.  Look at job 7114232.  It's
> been queued since yesterday
> and wants 1 node, 16 cpu, 12 GB.  There are at least 40 nodes on which it
> could fit and on which
> none of the higher priority jobs could fit.

There are 40 nodes that it could fit on right now, but only for a brief window until more nodes free up and the higher priority job is expected to run. The accuracy of time limits has a huge impact on our ability to backfill correctly (or at all).

Job 7114232 requires 12 days of wall-clock time - Slurm can't backfill that job in without disrupting a higher priority job in the future.

If the runtime limit on that was decreased to a day or so I'd expect it to be able to launch immediately.
Comment 7 steven fellini 2015-12-03 08:44:02 MST
Sorry but I still don't understand.  What higher priority jobs are you talking about? NONE of the currently pending higher priority jobs could possibly run on nodes that 7114232 can run on now (the 40 nodes we're talking about have 19 GB; the higher priority jobs all want 200+ GB).
Comment 8 Tim Wickberg 2015-12-04 02:51:31 MST
Slurm isn't considering just the resources available immediately; it's also building out a model of when additional nodes become available in the future based on the TimeLimits of the active jobs. The nodes that are currently free right now aren't always going to be free - they have jobs that Slurm expects to launch on them sometime in the future. (There is an open bug somewhere about marking such nodes as "earmarked" or something - calling them "idle" leads to a lot of confusion. They may be idle for the moment, but they are expected to be busy in the future.)

Slurm then works out when the highest priority jobs will be able to start in the future. This is all done with the normal scheduling logic. Higher priority jobs will then have nodes set aside for them to ensure they're able to start - without this mechanism larger jobs would "starve" and never be able to launch as long as smaller jobs were around to keep consuming idle resources.

The backfill scheduling portion is then able to come in and find gaps - resources that would need to sit idle for some time before the higher priority job starts. Jobs that fit within those gaps - gaps of specific resources *for a particular duration of time* can have jobs back-filled in and started immediately.

If a job does not fit that gap - in your case, any job that could fit on the available resources has too long a time limit set - it cannot be run now. If that time limit exceeds the projected window during which those resources are available by even a second, the job will not be scheduled to launch.
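That gap check can be sketched as a toy rule (illustrative Python only, not Slurm's implementation; the function name and numbers are hypothetical):

```python
# Toy illustration (not Slurm source) of the backfill constraint described
# above: a pending job may start on currently idle resources only if its
# time limit fits entirely inside the gap before the higher-priority
# job's reserved start time.

def fits_backfill_gap(now, time_limit, reserved_start):
    """True if the job is guaranteed (by its time limit) to finish
    before the higher-priority job is expected to start."""
    return now + time_limit <= reserved_start

DAY = 24 * 3600
# Job 7114232 requests 12 days; suppose the gap on those nodes is 1 day.
print(fits_backfill_gap(0, 12 * DAY, 1 * DAY))   # False: exceeds the gap
print(fits_backfill_gap(0, 6 * 3600, 1 * DAY))   # True: 6 hours fits
```

This is why shortening a job's requested time limit, as suggested in comment 6, can let it launch immediately.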
Comment 9 Susan Chacko 2015-12-04 03:56:01 MST
(In reply to Tim Wickberg from comment #8)

Hi Tim,

The problem is that there are _no_ higher priority jobs that could possibly run on the free nodes. All the higher priority jobs require larger memory than is available on those free nodes, therefore these nodes should not logically be earmarked for any jobs. The first job that could fit on those nodes has remained queued for 2 days now, while 150+ nodes that could have run this job sit idle. 

We can't explain this behaviour, and are hoping there is some Slurm config parameter that could ameliorate this problem. Could it have something to do with the fact that this partition is heterogeneous with respect to memory sizes?

Susan.
Comment 10 Tim Wickberg 2015-12-08 04:16:29 MST
I think I understand what you're saying - that the nodes that are currently idle would never be run on by the highest priority jobs (too little memory primarily), and the other jobs should be launching.

But I'm having a hard time piecing this together from the various outputs to visualize where the issue may be coming from.

If you don't mind, can you send in a snapshot of the current system:

scontrol show nodes
scontrol show partitions
scontrol show jobs

sinfo
squeue --format "%.18i %.9P %10S %.7Q %.10l %.2t %.10M %.6D %.4C %.9m %R"

- Tim

(Everything in one file is fine, no need to split up commands. If you'd rather not have that attached to the bug you can email it to me directly - tim@schedmd.com.)
Comment 11 Tim Wickberg 2015-12-08 12:50:58 MST
I don't see an issue on b1 at the moment - there are only three jobs pending in the dump you sent, and they all are asking for equivalent resources which aren't currently available:

tim@bluedwarf:~$ grep PD\  scontrol.txt |grep -v Depe | grep b1\ 
   7391176_[0-199]        b1 N/A             35 7-00:00:00 PD       0:00      1    1       10G (Priority)
     7391085_[199]        b1 N/A             35 7-00:00:00 PD       0:00      1    1       10G (Resources)
       7381792_193        b1 N/A             32 7-00:00:00 PD       0:00      1    1       10G (Priority)

I'm going to switch my commentary to the ibfdr partition here; I suspect the same has applied to the other partitions at varying points, and I'm hoping this will highlight how the backfill scheduler is behaving on your system.

When you took that snapshot, you had these pending on ibfdr:

           6394506     ibfdr N/A         987731 10-00:00:00 PD       0:00     64 1024     1024M (Resources)
           6400501     ibfdr N/A           1020 7-00:00:00 PD       0:00     32  512     1024M (QOSMaxCpusPerUserLimit)
           7108835     ibfdr N/A            864 7-00:00:00 PD       0:00      8  128     1024M (QOSMaxCpusPerUserLimit)
           7258475     ibfdr N/A            567 10-00:00:00 PD       0:00      8  128     1024M (Priority)
           7265324     ibfdr N/A            558 10-00:00:00 PD       0:00     12  192     1024M (Priority)
           7276712     ibfdr N/A            535 10-00:00:00 PD       0:00     10  160     1024M (Priority)
           7286572     ibfdr N/A            429 10-00:00:00 PD       0:00     10  160     1024M (Priority)
           7286587     ibfdr N/A            420 10-00:00:00 PD       0:00     10  160     1024M (Priority)
           7296561     ibfdr N/A            337 10-00:00:00 PD       0:00     10  160     1024M (Resources)
           7340048     ibfdr N/A            149 10-00:00:00 PD       0:00      5   65        4G (Resources)

At that point in time 138 out of 192 nodes were in use by 15 running jobs[1]. Leaving 54 nodes idle. Those 54 nodes are being held idle so that job 6394506 can start on 64 nodes. No other pending job will run in a short enough window of time to finish before 6394506 needs to launch, so those are expected to sit vacant at the moment. If we scheduled any of the other jobs - 7340048 for instance - then the expected start time of 6394506 would have to be moved out into the future again.

Without that constraint, several of those other pending jobs could launch now within the 54 free nodes - but then 6394506 might never be able to launch.

I've verified that Slurm does not prevent jobs from running by (incorrectly) assuming that idle nodes could run the highest priority job when that job has memory/cpu requirements that exceed the resources on those nodes.

- Tim

[1] grep R\  scontrol.txt | grep ibfdr |awk '{jobs+=1;nodes+=$8}END {print jobs,nodes}'
Comment 12 steven fellini 2015-12-09 02:30:58 MST
We were aware that the "b1" partition problem was not occurring when we sent you the snapshot.  We can however reproduce the problem by submitting a particular mix of jobs.  Would you like us to do that and send you another snapshot?

Re the "ibfdr" partition, as you point out it's well behaved as are the partitions _other_ than "b1".  We suspect the problem may have to do with "b1" being heterogeneous with respect to cpu/memory resources (the other partitions are homogeneous).

Also want to point out the following comment from src/plugins/sched/backfill:

****************************************************************************\
*  backfill.c - simple backfill scheduler plugin.
*
*  If a partition does not have root only access and nodes are not shared
*  then raise the priority of pending jobs if doing so does not adversely
*  effect the expected initiation of any higher priority job. We do not alter
*  a job's required or excluded node list, so this is a conservative
*  algorithm.
*
*  For example, consider a cluster "lx[01-08]" with one job executing on
*  nodes "lx[01-04]". The highest priority pending job requires five nodes
*  including "lx05". The next highest priority pending job requires any
*  three nodes. Without explicitly forcing the second job to use nodes
*  "lx[06-08]", we can't start it without possibly delaying the higher
*  priority job.
*****************************************************************************
Comment 13 Tim Wickberg 2015-12-09 03:03:50 MST
On 12/09/2015 11:30 AM, bugs@schedmd.com wrote:
> http://bugs.schedmd.com/show_bug.cgi?id=2217
>
> --- Comment #12 from steven fellini <sfellini@nih.gov> ---
> We were aware that the "b1" partition problem was not occurring when we sent
> you the snapshot.  We can however reproduce the problem by submitting a
> particular mix of jobs.  Would you like us to do that and send you another
> snapshot?

Yes, please.

> Re the "ibfdr" partition, as you point out it's well behaved as are the
> partitions _other_ than "b1".  We suspect the problem may have to do with "b1"
> being heterogeneous with respect to cpu/memory resources (the other partitions
> are homogeneous).

I've done some testing with mixed nodes and haven't caught that behavior yet. I'm interested in seeing if you have a mix that triggers this - if you can give me some hints as to what you submit to provoke the behavior, that'd help as well.

My notes from attempting to reproduce this follow.

- Tim

As an example, I have three nodes defined here. bessie[08-09] have RealMemory=15737, bessie10 has RealMemory=10000.

scontrol create partitionname=mixed nodename=bessie[08-10]

sbatch --wrap "sleep 1000" -p mixed --exclusive -N 1 --mem=12000 -t 10 --job-name=job_1
sbatch --wrap "sleep 1000" -p mixed --exclusive -N 1 --mem=12000 -t 5 --job-name=job_2
sleep 10
sbatch --wrap "sleep 1000" -p mixed --exclusive -N 2 --mem=12000 -t 10 --job-name=job_3
sbatch --wrap "sleep 1000" -p mixed --exclusive -N 1 --mem=1000 -t 10 --job-name=job_4 --nice

job_1 and job_2 start immediately on bessie08 and bessie09.

job_3 is blocked waiting on bessie08 and bessie09 becoming free, which won't happen for 10 minutes. If the backfill algorithm were not aware of the heterogeneous nodes and only considered cpu cores, it would instead have assumed bessie09 and bessie10 would suffice, and job_4 would be stalled behind the higher-priority job_3.

Right after submission:
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               812     mixed    job_3      tim PD       0:00      2 (Resources)
               813     mixed    job_4      tim PD       0:00      1 (Priority)
               810     mixed    job_1      tim  R       0:21      1 bessie08
               811     mixed    job_2      tim  R       0:21      1 bessie09

This shows job_4 held up for a few seconds - we need a backfill scheduler pass to run. Waiting another 30 seconds (varies based on your SchedulerParameters) then shows the correct behavior:
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               812     mixed    job_3      tim PD       0:00      2 (Resources)
               810     mixed    job_1      tim  R       1:01      1 bessie08
               811     mixed    job_2      tim  R       1:01      1 bessie09
               813     mixed    job_4      tim  R       0:27      1 bessie10

scontrol show job 812 (job_3) shows SchedNodeList=bessie[08-09].
Comment 14 steven fellini 2015-12-11 10:59:19 MST
Tim - we are a bit slammed right now.  Still owe you some outputs.  Please keep this open.
Comment 15 Tim Wickberg 2016-01-06 06:21:28 MST
I'm running through and reviewing outstanding bugs - have things stabilized for you in the past few weeks or are you still seeing issues?

cheers, and happy new year,
- Tim
Comment 16 steven fellini 2016-01-06 23:48:30 MST
Tim,

We haven't seen instances of the problem lately, so you
can probably close the ticket and we'll reopen if we 
start seeing it again.

We did make a number of sched parameter changes which may
or may not have made a difference:

SchedulerParameters=bf_continue,defer,max_sched_time=5
Comment 17 Tim Wickberg 2016-01-07 01:55:22 MST
Alright, closing for now. Let us know if you run into this again.

cheers,
- Tim
Comment 18 steven fellini 2016-01-19 23:21:57 MST
Tim,

We're seeing the problem again.  User pickardfc's jobs are apparently being "blocked" by user patidarr even though there are free nodes that could run the pickardfc jobs (but not the patidarr jobs).

Tarball attached.
Comment 19 steven fellini 2016-01-19 23:24:46 MST
Created attachment 2626 [details]
cluster state 2016-01-13
Comment 20 Moe Jette 2016-01-21 06:05:39 MST
Tim has been out sick for the past week.

Judging from your logs, the backfill scheduler isn't even attempting to determine resources and a start time for pickardfc's jobs. I believe this problem will go away if you add "bf_max_job_test=10000" to SchedulerParameters in slurm.conf. Documentation below:

bf_max_job_test=#
    The maximum number of jobs to attempt backfill scheduling for (i.e. the queue depth). Higher values result in more overhead and less responsiveness. Until an attempt is made to backfill schedule a job, its expected initiation time value will not be set. The default value is 100. In the case of large clusters, configuring a relatively small value may be desirable. This option applies only to SchedulerType=sched/backfill.

Assuming that fixes the problem, I'd like for you to capture the sdiag output periodically (say hourly) through a typical workday and I can help tune the system for better responsiveness and throughput. There are about 20 different scheduling parameters available and getting the values tuned can help a great deal.
Comment 21 Moe Jette 2016-01-21 06:07:41 MST
PS. You can change the slurm.conf file and run "scontrol reconfig" for the scheduling parameters to take effect without restarting daemons.
Comment 22 Moe Jette 2016-01-21 07:31:42 MST
I've had a chance to study your configuration and logs some more. Here are some more suggestions for SchedulerParameters in your slurm.conf. First of all, your current "SchedulerParameters=bf_continue,defer" is a fine start, but given the number of jobs in your system, more options would provide better performance. See the slurm.conf man page for detailed descriptions of the parameters.

bf_max_job_test=10000
Number of jobs to test, as already described.

bf_resolution=600
This will reduce the overhead of the backfill scheduler. It will track resource allocations at 10-minute intervals rather than 1-minute intervals, potentially decreasing the overhead by up to about 80%.

bf_window=#
You _might_ consider increasing how far into the future the backfill scheduler looks, but that does increase its overhead, which is high to begin with (you are seeing 24 seconds of run time per cycle today). If you do look more than 24 hours into the future, that should be balanced by limiting the number of jobs examined per user and/or partition using bf_max_job_user=# and/or bf_max_job_part=#
Comment 23 Moe Jette 2016-01-22 02:00:13 MST
Is your system running fine now?
Comment 24 steven fellini 2016-01-22 02:05:36 MST
Moe,

Thanks for your suggestions.  We're going to hold off config changes until after the weekend.  We'll let you know.
Comment 25 Moe Jette 2016-01-27 02:01:25 MST
Is your system running fine now?
Comment 26 steven fellini 2016-01-28 05:33:44 MST
Moe,

We're still having problems, even with

SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600
Comment 27 Moe Jette 2016-01-28 06:04:20 MST
Could you attach the output of the following commands plus your slurmctld log file:

mkdir slurm_01_28_16_logs
cd slurm_01_28_16_logs
sdiag >sdiag.out
squeue >squeue.out
sinfo >sinfo.out
scontrol show job >scontrol_job.out
scontrol show node >scontrol_node.out
scontrol show partition >scontrol_part.out
scontrol show reservation >scontrol_resv.out
tar -cvf slurm_logs.tar *.out

 (Then attach the file "slurm_logs.tar", you should probably compress the slurmctld log file before sending)
Comment 28 steven fellini 2016-01-28 06:10:44 MST
Created attachment 2665 [details]
cluster state 2016-01-28
Comment 29 Moe Jette 2016-01-28 07:27:13 MST
Are there some specific jobs or partitions that I should look at?

There are several thousand pending jobs in these logs.
All of the jobs I have checked so far either need more memory than currently available (in the "ccr" partition) or are newly submitted jobs (the "norm" partition).
Comment 30 Moe Jette 2016-01-28 08:29:12 MST
When I noticed your sdiag output didn't seem to match your configured SchedulerParameters, I checked the version of Slurm you are running and found that it's 14.11.9, which lacks many scalability improvements for large job counts and job arrays. For one example of the enhancements, each task of a job array occupies a separate job table record in version 14.11 rather than one record for the entire job array, so the number of job records Slurm is managing in your case is on the order of 60,000. Also, the backfill scheduler in your environment is running for at most 30 seconds and only getting through 200 pending jobs (with thousands of running jobs, it's likely trying each one at hundreds of different start times). If possible, I would strongly recommend upgrading to version 15.08. One of our customers with a similar number of nodes and overlapping queues is running >1 million jobs per day with version 15.08.7.

In terms of what you can do with version 14.11, here are some suggested changes to SchedulerParameters, currently: 
bf_continue,defer,bf_max_job_test=10000,bf_resolution=600

Add:
bf_max_job_part=50
Backfill scheduler will only test the first 50 jobs per partition

bf_max_job_array_resv=10
Backfill scheduler will only reserve resources for the first 10 elements of a pending job array

bf_interval=60
Run the backfill scheduler once every 60 seconds, this will check more jobs less frequently

The net result will be this:
SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=50,bf_max_job_array_resv=10,bf_interval=60

I'd suggest that you give that a try and if things aren't running a lot better, send another set of logs in a day or two.
Comment 31 steven fellini 2016-01-28 09:56:04 MST
>> Are there some specific jobs or partitions that I should look at?

It's the 'b1' partition that we notice the problem most often.

This partition has a small number of 256GB nodes and a much larger number of 24GB nodes.  If there are a large number of jobs requiring large memory (i.e. 256GB nodes) such that most are pending, then any jobs submitted with lesser memory requirements (which could run on the 24GB nodes) stay pending.

We could probably solve this by simply pulling out the 256GB nodes from that partition, but our feeling is that slurm should handle this?

Thanks, Steve.
Comment 32 Moe Jette 2016-01-29 02:31:09 MST
(In reply to steven fellini from comment #31)
> >> Are there some specific jobs or partitions that I should look at?
> 
> It's the 'b1' partition that we notice the problem most often.
> 
> This partition has a small number of 256GB nodes and a much larger number of
> 24GB nodes.  If there are a large number of jobs requiring large memory
> (i.e. 256GB nodes) such that most are pending, then any jobs submitted with
> lesser memory requirements (which could run on the 24GB nodes) stay pending.
> 
> We could probably solve this by simply pulling out the 256GB nodes from that
> partition, but our feeling is that slurm should handle this?

It could, but the overhead in the backfill scheduler would be huge for the number of jobs that you have. Assuming the top priority job in the partition/queue can't run, the backfill scheduler will need to determine when and where the pending jobs can run. This is how the algorithm works at a high level.

1. Determine which pending jobs could possibly be run (e.g. dependencies and limits OK) and put them into a sorted list. With Slurm version 14.11, each task of a job array will have a separate job record (not so in v15.08)
2. For every running job, put them into a list sorted by expected end time (based upon time limit).
3. For each pending job
     test if it can start immediately
     for each running job
         simulate the termination of that running job
         test if the pending job can start then
           if so then mark those resources off limits for that job's run time
     end
   end

Given the number of your jobs (you've got on the order of 20,000 pending jobs and 5,000 running jobs), that means 100 million tests may need to be performed, each of which is not particularly fast because so many things need to be checked.
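The steps above can be sketched as a toy model (illustrative Python, not Slurm source; it reduces node/memory/feature matching to a single CPU counter and omits the pending jobs' own time limits):

```python
# Toy model of the backfill loop outlined above: sort pending jobs by
# priority, sort running jobs by expected end time (from their time
# limits), then walk each pending job forward through simulated
# completions until it fits, reserving those cpus for it from that
# point on so lower-priority jobs see fewer resources.
from dataclasses import dataclass

@dataclass
class Running:
    end_time: int   # expected end, derived from the job's time limit
    cpus: int

@dataclass
class Pending:
    priority: int
    cpus: int

def earliest_starts(total_cpus, running, pending):
    events = sorted(running, key=lambda j: j.end_time)
    # Timeline of (time, cpus available): now, then after each completion.
    timeline = [(0, total_cpus - sum(j.cpus for j in running))]
    for r in events:
        timeline.append((r.end_time, timeline[-1][1] + r.cpus))
    out = []
    for p in sorted(pending, key=lambda j: j.priority, reverse=True):
        for t, avail in timeline:
            if avail >= p.cpus:
                # Reserve: mark these cpus off limits from t onward.
                timeline = [(tt, a - p.cpus) if tt >= t else (tt, a)
                            for tt, a in timeline]
                out.append((p.priority, t))
                break
        # (jobs that never fit are simply omitted in this toy)
    return out

# 10 cpus total; a 6-cpu job runs until t=100. The high-priority 8-cpu
# job must wait for t=100, while the small 2-cpu job backfills at t=0.
print(earliest_starts(10, [Running(end_time=100, cpus=6)],
                      [Pending(priority=50, cpus=8),
                       Pending(priority=10, cpus=2)]))
# [(50, 100), (10, 0)]
```

The nested loop over pending x running jobs is where the 100-million-test cost above comes from, which is why the bf_max_job_* parameters cap how deep the scheduler looks.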

Here are some ways to improve matters:
* Upgrading to Slurm version 15.08 will streamline handling the job arrays, get you improved scheduling algorithms, and more tuning options (those will buy you perhaps 30% improvement)
* Configuring bf_interval=300 (higher than suggested in comment #30) would run the backfill scheduler less frequently, but going deeper in the queue.
* Configuring bf_max_job_user may be helpful (see slurm.conf man page for details).
* Putting the larger memory nodes in their own partition would keep the small and large memory jobs in separate queues.
Comment 33 Moe Jette 2016-02-05 01:38:31 MST
Do you have any update on this?
Comment 34 steven fellini 2016-02-07 23:41:15 MST
As of Friday afternoon, we were still seeing the problem in the 'b1' partition.  Currently no pending jobs in that partition (early in the week), but we'll continue to watch.

SchedulerParameters currently at:

SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=50,bf_max_job_array_resv=10,bf_interval=300

Steve.
Comment 35 Moe Jette 2016-02-08 01:58:38 MST
(In reply to steven fellini from comment #34)
> As of Friday afternoon, we were still seeing the problem in the 'b1'
> partition.  Currently no pending jobs in that partition (early in the week),
> but we'll continue to watch.
> 
> SchedulerParameters currently at:
> 
> SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,
> bf_resolution=600,bf_max_job_part=50,bf_max_job_array_resv=10,bf_interval=300

Given the number of running and pending jobs, the backfill scheduler will need to be tuned to look at a smaller number of them.

You've got some options with version 14.11:
* Configuring "bf_max_job_user" should be helpful (see slurm.conf man page for details).
* The "bf_min_age_reserve" configuration parameter may also be helpful (see slurm.conf man page for details).
* Putting the larger memory nodes in their own partition would keep the small and large memory jobs in separate queues, but would make the system more difficult to use.

Upgrading to version 15.08 should give you some additional relief:
* Upgrading to Slurm version 15.08 will streamline handling the job arrays, get you improved scheduling algorithms, and more tuning options (those will buy you perhaps 30% improvement).
Comment 36 Moe Jette 2016-02-12 08:51:48 MST
Any updates?
Comment 37 steven fellini 2016-02-12 10:49:45 MST
Moe,

We've prevailed on the one or two users who were submitting large numbers of large memory jobs to the 'b1' partition even when they didn't need that much memory.

So we haven't been seeing any problems this week since the piling up of jobs for the larger memory nodes was what was preventing other jobs from getting scheduled on smaller memory ones.

We are planning to move to 15.08 asap but are not confident that will resolve the issue.
We really don't want to have to be creating lots of partitions to suit hardware characteristics, just the opposite, we want to have as few partitions as possible.

So, we're still thinking of this as a deficiency, but for the time being not a hardship.  That could change tomorrow.  

Thanks for the time you've put into this.
Comment 38 steven fellini 2016-03-03 23:03:56 MST
Created attachment 2817 [details]
cluster state 2016-03-04

We upgraded to 15.08 a few days ago and our scheduling problems are worse than ever.  I've raised impact to "High Impact" as we are getting complaints from our users.

We have IDLE nodes in the 'norm' and 'gpu' partitions with many jobs pending.

Enclosing cluster state as of a few minutes ago.
Comment 39 Moe Jette 2016-03-04 00:43:23 MST
Do you have any advanced reservations? (see "scontrol show res")
Comment 40 steven fellini 2016-03-04 00:45:36 MST
[steve@biowulf ~]$ scontrol show res
No reservations in the system
Comment 41 steven fellini 2016-03-04 00:46:11 MST
[steve@biowulf ~]$ scontrol show res
No reservations in the system
Comment 42 Moe Jette 2016-03-04 00:57:14 MST
Same slurm.conf as attached, correct?
If not, please attach new one.
Comment 43 steven fellini 2016-03-04 01:07:29 MST
Created attachment 2818 [details]
slurm.conf

current slurm.conf attached
Comment 44 Moe Jette 2016-03-04 01:08:52 MST
I noticed that your squeue output does not include the interactive jobs, while they clearly exist based upon the sinfo and "scontrol show job" output. Since you have overlapping partitions, I really need that information to perform an analysis.

The way Slurm handles overlapping partitions is that once the first job in a partition/queue can't be scheduled due to lack of resources, the nodes associated with that partition are removed from consideration for jobs in other partitions. This prevents jobs in lower priority partitions from being allocated those resources and delaying jobs in higher priority partitions. (The jobs can still use nodes outside of the blocked higher priority partition.) That can trigger a cascade effect with respect to scheduling resources, which might be what you see.
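A minimal sketch of that withholding rule (hypothetical code; the function, partition, and node names are illustrative, not from this cluster or from Slurm source):

```python
# Toy model of the overlapping-partition behavior described above: once
# a higher-priority partition's top job is blocked on Resources, that
# partition's nodes are withheld from jobs in lower-priority partitions,
# even nodes the blocked job could never actually use.

def nodes_open_to_lower_partitions(all_nodes, partitions):
    """partitions: dicts with 'nodes' (set) and 'blocked' (bool),
    ordered highest partition priority first."""
    withheld = set()
    for part in partitions:
        if part["blocked"]:
            withheld |= part["nodes"]
    return all_nodes - withheld

cluster = {f"cn{i:04d}" for i in range(1, 7)}
partitions = [
    {"nodes": {"cn0001", "cn0002"}, "blocked": True},   # e.g. 'b1'
    {"nodes": {"cn0003"}, "blocked": False},
]
print(sorted(nodes_open_to_lower_partitions(cluster, partitions)))
# ['cn0003', 'cn0004', 'cn0005', 'cn0006']
```

When several overlapping partitions block at once, the withheld sets union together, which is the cascade effect described above.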
Comment 45 steven fellini 2016-03-04 01:44:32 MST
Moe,
Not sure I understand.  The squeue output I sent does include several jobs in the interactive partition, all of them in R state.  None are pending for resources.
Comment 46 Moe Jette 2016-03-04 01:50:14 MST
(In reply to steven fellini from comment #45)
> Moe,
> Not sure I understand.  The squeue output I sent does include several jobs
> in the interactive partition, all of them in R state.  None are pending for
> resources.

Mea culpa. The partition field was truncated:
$ grep inter squeue.out
          15397361 interacti 2016-03-03  992060 1-12:00:00  R   19:10:04      1    2        8G cn1738

Exactly which version of 15.08 did you install?
Comment 47 Moe Jette 2016-03-04 02:21:12 MST
Here are a few more things that may be relevant:

The jobs are sorted for scheduling first by partition priority, then by job priority. That means all of the jobs in partitions interactive, niddk, nimh, ccr, ccrclin, and b1 get tested for resources before any jobs in partitions gpu and norm, so the ordering of squeue output may be misleading.

Your configuration has "bf_max_job_user=100", so only the first 100 jobs for each user get tested (based upon the ordering described above). The attached squeue output does not include user information and while the output of "scontrol show job" does contain that information, that's a lot of data to parse through. Perhaps "bf_max_job_user" should be configured higher.

Each task of a job array counts against the "bf_max_job_user" limit.
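The ordering and per-user cap just described can be illustrated with a small sketch (made-up job records, not Slurm source):

```python
from collections import Counter

def backfill_test_order(jobs, bf_max_job_user):
    """jobs: list of dicts with 'user', 'part_prio', 'job_prio'.
    Returns the jobs the backfill loop would actually test: ordered by
    partition priority, then job priority, with at most bf_max_job_user
    jobs tested per user per cycle."""
    ordered = sorted(jobs, key=lambda j: (-j["part_prio"], -j["job_prio"]))
    tested, per_user = [], Counter()
    for job in ordered:
        if per_user[job["user"]] >= bf_max_job_user:
            continue   # user already hit the cap; job skipped this cycle
        per_user[job["user"]] += 1
        tested.append(job)
    return tested
```

A user with many jobs in a high-priority partition consumes their cap before any of their jobs in lower-priority partitions are reached.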
Comment 48 Moe Jette 2016-03-04 02:37:21 MST
I'm not sure exactly what is happening, but would like to suggest increasing the values for bf_max_job_part and bf_max_job_user as shown below, then run "scontrol reconfig" and see if that gets jobs on those idle nodes.

Current values:
SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=50,bf_max_job_array_resv=10,bf_interval=300,bf_max_job_user=100

Suggested values:
SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=300,bf_max_job_array_resv=10,bf_interval=300,bf_max_job_user=200
Comment 49 steven fellini 2016-03-04 02:39:03 MST
(In reply to Moe Jette from comment #46)
> (In reply to steven fellini from comment #45)
> > Moe,
> > Not sure I understand.  The squeue output I sent does include several jobs
> > in the interactive partition, all of them in R state.  None are pending for
> > resources.
> 
> Mea culpa. The partition field was truncated:
> $ grep inter squeue.out
>           15397361 interacti 2016-03-03  992060 1-12:00:00  R   19:10:04    
> 1    2        8G cn1738
> 
> Exactly which version of 15.08 did you install?

15.08.8
Comment 50 Moe Jette 2016-03-04 04:03:03 MST
It may also be helpful if I can get a copy of your slurmctld log file (at /var/log/slurm/ctld.log).
Comment 51 steven fellini 2016-03-04 04:17:27 MST
Created attachment 2819 [details]
slurmdbd.conf
Comment 52 steven fellini 2016-03-04 04:29:47 MST
(In reply to Moe Jette from comment #50)
> It may also be helpful if I can get a copy of your slurmctld log file (at
> /var/log/slurm/ctld.log).

Sorry, misread your msg, log file on its way.
Comment 53 rl303f 2016-03-04 04:35:19 MST
Created attachment 2820 [details]
ctld.log
Comment 54 Moe Jette 2016-03-04 04:40:21 MST
In your job logs, I see the top priority pending job in the "gpu" queue is waiting with "Reason=QOSMaxCpuPerUserLimit" and that same user has a large number of pending jobs in the queue. Can you tell me what is happening with respect to that limit?

That one user's jobs could block the entire queue due to "bf_max_job_part=50".

Here's the job:
JobId=15329061 JobName=job-9.sh
   UserId=joehanesr(34049) GroupId=joehanesr(34049)
   Priority=470137 Nice=0 Account=munson QOS=gpu
   JobState=PENDING Reason=QOSMaxCpuPerUserLimit Dependency=(null)
Comment 55 Moe Jette 2016-03-04 05:03:30 MST
FYI there are thousands of entries in the log about users trying to cancel each others jobs. I'm hoping that's accidental, but that's a lot of accidents. The messages look like this:

[2016-03-01T12:21:13.727] error: Security violation, JOB_CANCEL RPC for jobID 15315274 from uid 34966
[2016-03-02T10:02:01.785] error: Security violation, JOB_CANCEL RPC for jobID 15343675 from uid 36777
[2016-03-03T14:17:50.220] error: Security violation, JOB_CANCEL RPC for jobID 15329225 from uid 34385
Comment 56 Susan Chacko 2016-03-04 05:09:21 MST
(In reply to Moe Jette from comment #55)
> FYI there are thousands of entries in the log about users trying to cancel
> each others jobs. I'm hoping that's accidental, but that's a lot of
> accidents. The messages look like this:

Yes, we're aware of that and trying to figure out what's going on.
Comment 57 Susan Chacko 2016-03-04 05:14:43 MST
(In reply to Moe Jette from comment #54)
> In your job logs, I see the top priority pending job in the "gpu" queue is
> waiting due with "Reason=QOSMaxCpuPerUserLimit" and that same user has a
> large number of pending jobs in the queue. Can you tell me what is happening
> with respect to that limit? 
> 
> That one user's jobs could block the entire queue due to
> "bf_max_job_part=50".
> 
> Here's the job:
> JobId=15329061 JobName=job-9.sh
>    UserId=joehanesr(34049) GroupId=joehanesr(34049)
>    Priority=470137 Nice=0 Account=munson QOS=gpu
>    JobState=PENDING Reason=QOSMaxCpuPerUserLimit Dependency=(null)

Hi Moe,

Right now, 
SchedulerParameters     = bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=300,bf_max_job_array_resv=10,bf_interval=300,bf_max_job_user=200

The user joehanesr has 1 job queued for QOSMaxCpuPerUserLimit and 363 jobs queued for Priority, on this partition. 

Shouldn't the scheduler look through 200 of joehanesr's jobs, and then move on to the next user? The next user's jobs could run because we have idle GPU nodes available. 

Secondly, we had bf_max_job_part=50 until this morning. We routinely have users with 1000s of jobs or job-array-subjobs queued. Shouldn't we have seen this problem before now?
Comment 58 steven fellini 2016-03-04 05:21:48 MST
Created attachment 2824 [details]
sdiag part 1

Moe,

I'm attaching 3 files, each of which plots 'sdiag' output over the past 24 hours.  Don't know if it will help, but...
Comment 59 steven fellini 2016-03-04 05:22:15 MST
Created attachment 2825 [details]
sdiag part 2
Comment 60 steven fellini 2016-03-04 05:22:57 MST
Created attachment 2826 [details]
sdiag part 3
Comment 61 Moe Jette 2016-03-04 06:17:52 MST
(In reply to Susan Chacko from comment #57)
> (In reply to Moe Jette from comment #54)
> > In your job logs, I see the top priority pending job in the "gpu" queue is
> > waiting due with "Reason=QOSMaxCpuPerUserLimit" and that same user has a
> > large number of pending jobs in the queue. Can you tell me what is happening
> > with respect to that limit? 
> > 
> > That one user's jobs could block the entire queue due to
> > "bf_max_job_part=50".
> > 
> > Here's the job:
> > JobId=15329061 JobName=job-9.sh
> >    UserId=joehanesr(34049) GroupId=joehanesr(34049)
> >    Priority=470137 Nice=0 Account=munson QOS=gpu
> >    JobState=PENDING Reason=QOSMaxCpuPerUserLimit Dependency=(null)
> 
> Hi Moe,
> 
> Right now, 
> SchedulerParameters     =
> bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,
> bf_max_job_part=300,bf_max_job_array_resv=10,bf_interval=300,
> bf_max_job_user=200
> 
> The user joehanesr has 1 job queued for QOSMaxCpuPerUserLimit and 363 jobs
> queued for Priority, on this partition. 
> 
> Shouldn't the scheduler look through 200 of joehanesr's jobs, and then move
> on to the next user? The next user's jobs could run because we have idle GPU
> nodes available. 

Correct, but with "bf_max_job_part=50" user "joehanesr" would basically block the queue.

That's the conclusion that I came to based upon log records like that below:
[2016-03-03T09:31:08.215] debug2: job 15329062 being held, if allowed the job request will exceed QOS gpu max tres(cpu) per user limit 128 with already used 128 + requested 16
[2016-03-03T09:31:08.215] debug3: backfill: Failed to start JobId=15329062: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)


> Secondly, we had bf_max_job_part=50 until this morning. We routinely have
> users with 1000s of jobs or job-array-subjobs queued. Shouldn't we have seen
> this problem before now?

There is definitely a distinction between jobs and job arrays here. Job arrays are a single job record and based upon your "bf_max_job_array_resv=10" configuration, only 10 of the tasks will be examined and counted against the limit of 50. User "joehanesr" has a few hundred separate jobs rather than a job array.
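The distinction can be sketched numerically (illustrative helper, not Slurm source; the record format is made up):

```python
def jobs_counted_toward_limit(queue, bf_max_job_array_resv):
    """queue: list of dicts, each either a job array record
    ({'is_array': True, 'tasks': N}) or a single job ({'is_array': False}).
    A job array is one job record, and only the first bf_max_job_array_resv
    tasks are examined; separate jobs each count individually."""
    count = 0
    for rec in queue:
        if rec["is_array"]:
            count += min(rec["tasks"], bf_max_job_array_resv)
        else:
            count += 1
    return count
```

So a 1000-task array consumes far less of the bf_max_job_part/bf_max_job_user budget than the same work submitted as hundreds of separate jobs.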
Comment 62 Susan Chacko 2016-03-04 06:21:29 MST
(In reply to Susan Chacko from comment #56)
> (In reply to Moe Jette from comment #55)
> > FYI there are thousands of entries in the log about users trying to cancel
> > each others jobs. I'm hoping that's accidental, but that's a lot of
> > accidents. The messages look like this:
> 
> Yes, we're aware of that and trying to figure out what's going on.

Ok, we talked to a couple of users and figured out what was going on. One user had 'scancel' with no arguments, in an attempt to delete all his own jobs. Another user had 'scancel --jobname somejob' with a generic-enough jobname that it matched some of another user's jobs. 

So user error in all cases that we've investigated. We're telling them to use '-u user' for all scancels, to keep our logs cleaner.
Comment 63 Moe Jette 2016-03-04 06:29:22 MST
> Secondly, we had bf_max_job_part=50 until this morning.

How are things working with the higher value?

Judging from what I know of your workload now, I would expect utilization to be much better.
Comment 64 Susan Chacko 2016-03-04 06:50:34 MST
(In reply to Moe Jette from comment #63)
> > Secondly, we had bf_max_job_part=50 until this morning.
> 
> How are things working with the higher value?
> 
> Judging from what I know of your workload now, I would expect utilization to
> be much better.

Better in the norm partition and the ccr partition. 

In the GPU partition, we still have a clog, but based on your explanation, that's because of joehanesr's 360+ jobs. We have MaxSubmit set to 4000, and MaxArraySize set to 1001 so any user could potentially submit 4000 independent jobs, or an array of 1000 subjobs. We want users to be able to submit large numbers of jobs, as that's the nature of our workload. 

What would be the implications of setting bf_max_job_part to a really large value, like 5000, to prevent any single user from blocking the queue?
Comment 65 Moe Jette 2016-03-04 07:10:44 MST
(In reply to Susan Chacko from comment #64)
> (In reply to Moe Jette from comment #63)
> > > Secondly, we had bf_max_job_part=50 until this morning.
> > 
> > How are things working with the higher value?
> > 
> > Judging from what I know of your workload now, I would expect utilization to
> > be much better.
> 
> Better in the norm partition and the ccr partition. 
> 
> In the GPU partition, we still have a clog, but based on your explanation,
> that's because of joehanesr's 360+ jobs. We have MaxSubmit set to 4000, and
> MaxArraySize set to 1001 so any user could potentially submit 4000
> independent jobs, or an array of 1000 subjobs. We want users to be able to
> submit large numbers of jobs, as that's the nature of our workload. 
> 
> What would be the implications of setting bf_max_job_part to a really large
> value, like 5000, to prevent any single user from blocking the queue?

The fundamental issue is that backfill scheduling can be a very time consuming operation (see comment #32 for the details). The intent of the bf_max_job_part and bf_max_job_user configuration parameters is to reduce the overhead of backfill scheduling so that a greater variety of jobs can be considered for starting. The idea is that a user's jobs will tend to be similar or subject to similar limits. In this particular case, determining exactly when and where joehanesr's 360+ jobs should start is probably not beneficial. Setting bf_max_job_part to a value much higher than bf_max_job_user would be beneficial in your case. Rather than increasing bf_max_job_part to 5000, would decreasing bf_max_job_user handle your workload better? Perhaps some combination of changing them both to keep the user limit well below the partition limit?

The downside is that if a user submits large numbers of jobs to multiple partitions, the jobs in the lower priority partitions would never be considered by the backfill scheduler. Realistically, you'll want to keep the total number of pending jobs that the backfill scheduler considers on the order of 1000 for performance reasons (as described in comment #32, each of those jobs is being considered for starting at about 5000 different times for a total of about 5,000,000 fairly heavy weight operations).
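Those numbers can be checked with a back-of-the-envelope cost model (the per-operation cost below is an assumed figure for illustration, not a measurement):

```python
def backfill_cost(pending_jobs, eval_points_per_job, seconds_per_op=1e-4):
    """Rough cost model from the discussion above: each pending job is
    evaluated at roughly eval_points_per_job points in time (about one per
    simulated running-job completion).  seconds_per_op is a made-up cost
    per evaluation used only to show the scale."""
    ops = pending_jobs * eval_points_per_job
    return ops, ops * seconds_per_op
```

With 1000 pending jobs each evaluated at about 5000 points, that is 5,000,000 operations per backfill cycle, which is why the pending-job count needs to stay on the order of 1000.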
Comment 66 Moe Jette 2016-03-09 00:52:43 MST
Do you have any updates on this?
Comment 67 rl303f 2016-03-09 02:14:11 MST
(In reply to Moe Jette from comment #66)
> Do you have any updates on this?

So far we seem to be doing okay with most of the partitions except for the
gpu partition where one user's jobs seem to get in the way of scheduling other
users' gpu jobs.  We have been manually bumping the priority on those stuck
jobs when we see gpu resources available.

OTOH, we are currently at:

SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=300,bf_max_job_array_resv=10,bf_interval=300,bf_max_job_user=200

And we are thinking about increasing bf_max_job_part=300 to bf_max_job_part=600.

Do you feel that should be okay or have any benefit?

Thanks for your help with this!
Comment 68 Moe Jette 2016-03-09 02:21:03 MST
(In reply to rl303f from comment #67)
> (In reply to Moe Jette from comment #66)
> > Do you have any updates on this?
> 
> So far we seem to be doing okay with most of the partitions except for the
> gpu partition where one user's jobs seem to get in the way of scheduling
> other
> users' gpu jobs.  We have been manually bumping the priority on those stuck
> jobs when we see gpu resources available.
> 
> OTOH, we are currently at:
> 
> SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,
> bf_resolution=600,bf_max_job_part=300,bf_max_job_array_resv=10,
> bf_interval=300,bf_max_job_user=200
> 
> And we are thinking about increasing bf_max_job_part=300 to
> bf_max_job_part=600.
> 
> Do you feel that should be okay or have any benefit?
> 
> Thanks for your help with this!

The downside is that the backfill scheduler will likely be running continuously. Even then, it may not get through all of the jobs since it will start over after 300 seconds (bf_interval=300). If you do increase those limits (user and partition), I would also increase bf_interval and monitor the output of the sdiag command to see how much time the backfill scheduler is running and how many of the jobs it is processing. System responsiveness (i.e. time responding to any Slurm command) may be slightly adversely affected.
Comment 69 Moe Jette 2016-03-17 02:35:32 MDT
I wanted to check in with you to see how scheduling was working with your current configuration and propose a change to the backfill scheduling algorithm.

As previously discussed, the backfill scheduling algorithm is very heavy weight due to a nested loop in which the expected start time of each pending job is determined by simulating the termination of each running job at the end of its time limit (or for however many jobs need to release resources for the pending job under consideration). The proposed change would short-circuit the logic for some jobs. Specifically, if a pending job's priority is below some configured threshold OR if its time in pending state is below some threshold, then only determine if the job can start immediately. If it can't start immediately, the job will be left pending with no resources reserved for it. The idea is that if most jobs don't satisfy the configured threshold, the backfill algorithm changes from order NxM to order N (where N is the number of pending jobs and M is the number of running jobs), which permits far more jobs to be considered. Do you think that would be helpful in the case of your workload?
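The effect on cost can be sketched with a simplified model keyed only on a priority threshold (illustrative only, not the actual patch, which also considers pending time):

```python
def backfill_ops_estimate(pending_prios, n_running, prio_thresh):
    """pending_prios: list of pending-job priorities.  Jobs at or above
    prio_thresh get the full O(M) start-time simulation over all running
    jobs; the rest get only a single immediate-start check.  Returns an
    approximate operation count for one backfill cycle."""
    ops = 0
    for prio in pending_prios:
        if prio >= prio_thresh:
            ops += n_running   # full simulation over all running jobs
        else:
            ops += 1           # immediate-start check only
    return ops
```

If only a handful of jobs clear the threshold, the cycle cost collapses from N×M toward N.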
Comment 70 Moe Jette 2016-03-18 08:42:18 MDT
We have recently done some work on another bug that may dramatically increase the number of jobs Slurm's backfill scheduler can handle. See:
https://bugs.schedmd.com/show_bug.cgi?id=2565#c2

With that change, a full analysis of when and where pending jobs can start will be performed only for jobs above some particular priority or jobs that have been pending for at least some threshold. After that, it will do a quick check to see if any of the other jobs can be started immediately on remaining resources.

You would also want to remove or greatly increase these parameters:
bf_max_job_part=50,bf_max_job_user=100

NERSC will be testing in the week of March 21.
Comment 71 steven fellini 2016-03-21 05:46:57 MDT
(In reply to Moe Jette from comment #70)
> We recently done some work on another bug that may dramatically increase the
> number of jobs Slurm's backfill scheduler can handle. See:
> https://bugs.schedmd.com/show_bug.cgi?id=2565#c2
> 
> With that change, a full analysis of when and where pending jobs can start
> will be performed only for jobs above some particular priority or jobs that
> have been pending for at least some threshold. After that, it will do a
> quick check to see if any of the other jobs can be started immediately on
> remaining resources.
> 
> You would also look to remove or greatly increase these parameters:
> bf_max_job_part=50,bf_max_job_user=100
> 
> NERSC will be testing in the week of March 21.

Thanks Moe, if NERSC reports success we'll apply the patch.
Comment 72 Moe Jette 2016-03-25 02:44:07 MDT
(In reply to steven fellini from comment #71)
> Thanks Moe, if NERSC reports success we'll apply the patch.

NERSC reports very good results. See:
https://bugs.schedmd.com/show_bug.cgi?id=2565#c8
Comment 73 Moe Jette 2016-04-05 07:52:43 MDT
Do you have any update on this?
Comment 74 steven fellini 2016-04-05 23:27:17 MDT
We are running 15.08.9 with the NERSC patch on our development cluster; we'll get it going on production early next week.
Comment 75 rl303f 2016-04-11 03:36:09 MDT
(In reply to steven fellini from comment #74)
> We are running 15.08.9 with the NERSC patch on our development cluster;
> we'll get it going on production early next week.

We went ahead and moved our development cluster up to 15.08.10 due to
"backfill scheduler race condition that could cause invalid pointer in
select/cons_res plugin. Bug introduced in 15.08.9."

However, our attempt to apply the NERSC patch (bug_2565.patch) from
https://bugs.schedmd.com/attachment.cgi?id=2886 resulted in an error
and it would not apply.  (It did successfully apply to 15.08.9)

Below is the error:

$ git apply /usr/local/src/slurm-15.08/bug_2565.patch  
/usr/local/src/slurm-15.08/bug_2565.patch:53: space before tab in indent.
        if (sched_params &&
error: patch failed: src/plugins/sched/backfill/backfill.c:964
error: src/plugins/sched/backfill/backfill.c: patch does not apply

We just want to confirm that this is because 15.08.10 already includes
the NERSC bug_2565 code thereby making the patch unnecessary?  Or does
15.08.10 still need that patch but there is some code incompatibility
preventing successful application of the patch? 

Thank you!
Comment 76 Moe Jette 2016-04-11 04:47:05 MDT
Created attachment 2992 [details]
Backport of bug_2565 patch to v15.08.10

We do not plan to add this functionality to version 15.08 for the sake of improved stability of that version, relatively late in its release cycle. I've attached a version of the patch that will apply cleanly to v15.08.10.
Comment 77 rl303f 2016-04-11 06:17:54 MDT
(In reply to Moe Jette from comment #76)
> Created attachment 2992 [details]
> Backport of bug_2565 patch to v15.08.10
> 
> We do not plan to add this functionality to version 15.08 for the sake of
> improved stability of that version, relatively late in it's release cycle.
> I've attached a version of the patch that will apply cleanly to v15.08.10.

Many thanks, Moe.  The new patch applied successfully.  I guess we better
start thinking about moving up to 16.05 soon.
Comment 78 Moe Jette 2016-04-14 03:49:56 MDT
Have you found the bf_min_prio_reserve option helpful?
How are things running now?
Comment 79 steven fellini 2016-04-14 04:37:44 MDT
(In reply to Moe Jette from comment #78)
> Have you found the bf_min_prio_reserve option helpful?
> How are things running now?

Moe, 

We put the patch into effect yesterday morning, setting bf_min_prio_reserve to 100000, and we think there has been an improvement. We're seeing two things: (1) backfill cycle times have dropped drastically, from means near 300 s during peak hours to around 40 s, and (2) almost all jobs with prio>100000 have a start time.

We haven't run with the patch long enough to be sure, but I think you can close the ticket and if problems reemerge we can always open it again.

Thanks for your help.

Steve.
Comment 80 Moe Jette 2016-04-14 04:43:06 MDT
(In reply to steven fellini from comment #79)
> We haven't run with the patch long enough to be sure, but I think you can
> close the ticket and if problems reemerge we can always open it again.

Excellent! That's about what NERSC found.

WARNING:
You will need to treat this as a local patch until upgrading to version 16.05. If you do upgrade to a newer version of 15.08, the patch may not apply properly (it may put some new code in the wrong place) and the result may crash slurmctld. There is a ticket related to the bad patch apply here:
https://bugs.schedmd.com/show_bug.cgi?id=2634