Ticket 2762

Summary: backfill parameter not working as expected
Product: Slurm Reporter: Satrajit Ghosh <satrajit.ghosh>
Component: slurmctld    Assignee: Tim Wickberg <tim>
Status: OPEN --- QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 15.08.11   
Hardware: Linux   
OS: Linux   
Site: -Other-
Attachments: slurm.conf
sdiag 160524T131000-5

Description Satrajit Ghosh 2016-05-24 01:47:22 MDT
this looks very related to: https://bugs.schedmd.com/show_bug.cgi?id=2588

we have situations where users would like to submit thousands of jobs, forget about them, and just leave it to the scheduler to figure things out. we also have TRES enabled for qos, which ensures that no user occupies more than 1/3 of the resources.

however, backfilling appears to go through all jobs of a given user even when bf_max_job_user is set to something like 50.

we would have expected the backfill scheduler to:

1. sort by priority
2. ignore any jobs for users already at TRES@qos limit

for remaining jobs:
3. loop through jobs, backfilling as necessary 
  a. but only up to bf_max_job_user number of jobs per user.
  b. if one has already seen bf_max_job_user jobs from that user, ignore any more jobs during the current round of backfill

this would be very helpful in our scenario.

primary question: is the role of bf_max_job_user to limit the number of jobs for a given user that slurmctld has to cycle through? and if so, is the above observation a bug?

secondary question: when qos TRES is set and user is at limit, does the backfill process ignore the user as we would have expected?
Comment 1 Tim Wickberg 2016-05-24 03:06:08 MDT
(In reply to Satrajit Ghosh from comment #0)
> this looks very related to: https://bugs.schedmd.com/show_bug.cgi?id=2588
> 
> we have situations where users would like to submit thousands of jobs and
> forget about it and just leave it to the scheduler to figure things out. we
> also have TRES enabled for qos which ensures that no users occupies more
> than 1/3 of the resources.
> 
> however, backfilling appears to go through all jobs of a given user even
> when bf_max_job_user is set to something like 50.

Yes, those jobs will still be considered, albeit rather briefly to establish that the user is no longer allowed to submit further jobs.

> we would have expected the backfill scheduler to:
> 
> 1. sort by priority
> 2. ignore any jobs for users already at TRES@qos limit

You have to test a job to "ignore" it; the backfill scheduler makes a single pass through the priority-sorted list, it's not "filtering" it on successive passes as you've implied here.

So the implementation is closer to:

(1) For each partition, sort jobs by priority.
(2) For each job, test it (until we yield the current backfill loop, or bf_max_job_test/bf_max_job_part is exceeded). Per-user limits and QOS/TRES limits are all tested individually when considering each job.
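The single-pass behavior described in (1)-(2) can be sketched roughly as follows. This is illustrative Python, not the Slurm source; the job fields and the "fits" test are stand-ins for the real resource/QOS/TRES checks. The key point it shows: jobs over the per-user limit are still touched and still count against bf_max_job_test.

```python
# Hypothetical sketch of one backfill pass (not Slurm source code).
# Jobs over bf_max_job_user are still examined briefly and still
# count toward bf_max_job_test; they are only skipped for scheduling.

def backfill_pass(jobs, bf_max_job_test, bf_max_job_user):
    jobs = sorted(jobs, key=lambda j: j["priority"], reverse=True)
    tested = 0
    per_user = {}            # jobs seen so far per user in this pass
    scheduled = []
    for job in jobs:
        if tested >= bf_max_job_test:
            break
        tested += 1          # every job touched counts, over-limit or not
        user = job["user"]
        per_user[user] = per_user.get(user, 0) + 1
        if per_user[user] > bf_max_job_user:
            continue         # disqualified, but it was still "tested"
        if job["fits"]:      # stand-in for QOS/TRES + resource checks
            scheduled.append(job["id"])
    return scheduled, tested
```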

> for remaining jobs:
> 3. loop through jobs, backfilling as necessary 
>   a. but only up to bf_max_job_user number of jobs per user.
>   b. if one has already seen bf_max_job_user jobs from that user, ignore any
> more jobs during the current round of backfill

I wouldn't say they're "ignored"; it'd be more accurate to say that, when tested, they're immediately disqualified from further consideration. They still have to be tested to match them up with that user, though.

> this would be very helpful in our scenario.
> 
> primary question: is the role of the bf_max_job_user to limit the number of
> jobs for a given user that slurmctld has to cycle through? and if so is the
> above observation a bug.

It limits the number of jobs from a user that are considered; they'll still be briefly looked at during each backfill cycle, and still count against bf_max_job_test/bf_max_job_part.

> secondary question: when qos TRES is set and user is at limit, does the
> backfill process ignore the user as we would have expected?

No, the TRES limit is tested on a job (not per-user), and each job tested would count against that bf_max_job_user limit.
Comment 2 Satrajit Ghosh 2016-05-24 03:45:45 MDT
Thank you Tim for the clarification. 

In that case we were having some severe slowdown issues in the backfill operation.

backfill parameters:  
bf_continue,bf_window=11520,bf_max_job_test=10000,bf_interval=300,bf_max_job_user=50
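For reference, assuming these are configured via the standard SchedulerParameters option, the corresponding slurm.conf fragment would look like this (sketch):

```
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_window=11520,bf_max_job_test=10000,bf_interval=300,bf_max_job_user=50
```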

one distribution of pending jobs that we were looking at was similar to the following:

2000 user1
300 user2
50 user3
10 user4
and then a bunch of users with 1 or 2 pending jobs. 

user1 was at TRES limit.

jobs were being backfilled really slowly even though the cluster had many resources available. to test, i had a job in there that was simply a single-core job with minimal memory. the cluster definitely had the resources to run this job. my priority was lower than the person with 2000 jobs, but it took the backfill process almost 20 mins to schedule my job.

since such scenarios are hard to replicate, is there a way to capture the state of the system when we observed issues for debugging later? or to create artificial multi-user job distributions to test?
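One way to approximate such a scenario is to generate a synthetic multi-user submission pattern. The snippet below is a sketch: the user names and job shape are illustrative, it only prints the sbatch commands rather than running them, and submitting as other users (e.g. via sbatch --uid) requires root or some per-user substitute.

```python
# Sketch: print sbatch commands that would rebuild the pending-job
# distribution described above. User names and job parameters are
# hypothetical; adjust for your site before actually submitting.

distribution = {"user1": 2000, "user2": 300, "user3": 50, "user4": 10}

def submission_commands(dist):
    cmds = []
    for user, njobs in sorted(dist.items()):
        for _ in range(njobs):
            cmds.append(
                f"sbatch --uid={user} -n1 --mem=100 --wrap='sleep 600'"
            )
    return cmds

cmds = submission_commands(distribution)
print(len(cmds))   # 2360 commands in total
```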

in general we are trying to balance:
1. ease of use, a user can blindly submit as many jobs
2. resource constraints using TRES and qos
3. fairshare

but i wasn't expecting backfill to take that much time given the per user job test limit and the distribution above.
Comment 3 Tim Wickberg 2016-05-24 03:54:52 MDT
Can you attach your slurm.conf file?

How many users are active on the system? I've noticed that the implementation of bf_max_job_user may not work well for high user counts, as it's somewhat naive in approach and results in an additional O(num_users) factor in the backfill loop.
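A rough illustration (not the Slurm source) of where such an O(num_users) factor can come from: if the per-user job counter is kept in a flat list that is scanned linearly for every job tested, the pass costs on the order of num_jobs * num_users comparisons, versus O(1) expected per lookup with a hash table.

```python
# Illustrative comparison of a naive list-scan per-user counter
# (O(num_users) per job tested) against a dict-based counter
# (O(1) expected per job). Not Slurm code.

def count_with_list(job_uids):
    records = []                 # list of [uid, count] pairs
    scans = 0
    for uid in job_uids:
        for rec in records:      # linear scan for every job tested
            scans += 1
            if rec[0] == uid:
                rec[1] += 1
                break
        else:
            records.append([uid, 1])
    return scans                 # total comparisons performed

def count_with_dict(job_uids):
    counts = {}
    for uid in job_uids:         # single hash lookup per job
        counts[uid] = counts.get(uid, 0) + 1
    return counts
```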

I'd also be curious to know roughly how many jobs you're running per day, and if you can attach the output from 'sdiag' that may help me understand the workload better.
Comment 4 Satrajit Ghosh 2016-05-24 04:10:24 MDT
Created attachment 3131 [details]
slurm.conf
Comment 5 Satrajit Ghosh 2016-05-24 04:11:39 MDT
Created attachment 3132 [details]
sdiag 160524T131000-5
Comment 6 Satrajit Ghosh 2016-05-24 04:13:18 MDT
added both files. 

unfortunately the cluster is much quieter this week relative to two weeks back (pre-conference deadline). but in general we typically have 30-50 users, and most jobs are small jobs.
Comment 8 Tim Wickberg 2016-06-09 20:13:18 MDT
The sdiag shows "Depth Mean: 2", which is very low... that matches up with your perception of sluggish performance.

It looks like setting a higher bf_resolution value may help here - 300 or 600 (seconds) instead of the default of 60 would significantly improve the performance for the backfill algorithm (by roughly 5x or 10x), especially since you have bf_window=11520 (minutes). I'd also recommend lowering that as well to start.