Ticket 10271 - Reduced backfill performance when enabling job preemption
Status: RESOLVED DUPLICATE of ticket 9365
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 19.05.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Ben Roberts
 
Reported: 2020-11-23 11:36 MST by Steve Ford
Modified: 2020-12-16 09:54 MST
Site: MSU


Attachments
slurm.conf (69.30 KB, text/plain)
2020-11-23 11:36 MST, Steve Ford
slurm.conf with preemption enabled (69.37 KB, text/plain)
2020-11-23 11:37 MST, Steve Ford

Description Steve Ford 2020-11-23 11:36:57 MST
Created attachment 16780 [details]
slurm.conf

Hello,

We are looking to implement a 'scavenger' type queue where users can run preemptable jobs to fill otherwise idle resources. QOS preemption looked like the best way to do this given our many partitions with various PriorityTiers.

We enabled preemption in slurm.conf with PreemptType=preempt/qos and PreemptMode=Requeue and created a scavenger QOS with PreemptMode=Requeue and a separate partition to hold these jobs. Upon enabling this configuration, we saw a considerable impact to backfill scheduler performance. Previously, the average backfill cycle was 250s. After the change, the next two cycles were 1400s and 1600s. We quickly reverted to the old configuration.
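The configuration described above corresponds roughly to the following slurm.conf fragment (a sketch based on the description; partition names other than 'scavenger', node lists, and PriorityTier values are assumptions, not taken from the attached file):

```
# slurm.conf sketch (values are assumptions based on the description above)
PreemptType=preempt/qos
PreemptMode=REQUEUE

# Low-priority partition holding the preemptable scavenger jobs
PartitionName=scavenger Nodes=ALL PriorityTier=1 QOS=scavenger
```

On the accounting side, the QOS itself would be created with something like `sacctmgr add qos scavenger` followed by `sacctmgr modify qos scavenger set PreemptMode=requeue`, and with preempt/qos the higher-priority QOS also needs the scavenger QOS in its Preempt list (e.g. `sacctmgr modify qos normal set Preempt=scavenger`).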

Is reduced backfill performance expected when enabling job preemption? Are there any additional configuration changes we can make to mitigate this issue? 

Thanks,
Steve
Comment 1 Steve Ford 2020-11-23 11:37:35 MST
Created attachment 16781 [details]
slurm.conf with preemption enabled
Comment 3 Ben Roberts 2020-11-24 13:15:03 MST
Hi Steve,

Adding preemption to the mix does add some overhead, as the scheduler also has to evaluate running jobs to see which ones are preemptable.  If the primary concern is the amount of time spent on the backfill cycle, you can lower bf_max_time to somewhere in the 250 - 300 second range.  This will cause the backfill scheduler to stop evaluating jobs once that time limit has been reached.  Since you have the bf_continue flag enabled, the backfill scheduler will pick up from where it left off the next time it starts.  The downside is that it will take longer to make it through the entire job queue, so a job at the bottom of the queue could take several iterations before it is evaluated.

It's possible that some of the other parameters could be adjusted a little to meet the needs of your cluster.  I see that you have the bf_window set to 10081 minutes.  Was this configured to be just a little longer than your longest running job?  How large is your job queue typically?  Do you tend to have a lot of jobs from a few users or is it usually spread pretty evenly across your user base?
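The tuning suggested above would amount to a SchedulerParameters change along these lines (a sketch only; the 300-second cap is an assumed value in the suggested range, and the site's actual SchedulerParameters line in the attached slurm.conf likely carries additional flags):

```
# slurm.conf sketch: cap each backfill cycle at ~300s (bf_max_time) and let
# bf_continue resume the scan from where it stopped instead of restarting
# at the top of the queue.  bf_window is the existing 10081-minute setting.
SchedulerParameters=bf_continue,bf_max_time=300,bf_window=10081
```

With bf_continue set, a cycle that hits the bf_max_time limit picks up at the same queue position on the next run, so jobs deep in the queue are still reached, just spread over several cycles.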

Another thing I will mention is that there was recently work done in bug 9365 to increase the efficiency of the backfill scheduler with PreemptType=preempt/qos and SelectType=select/cons_tres.  This work was checked in to the 20.11 code, so when you are able to upgrade to this version you should see some performance improvements.  

If you have any questions about this please let me know.

Thanks,
Ben
Comment 4 Ben Roberts 2020-12-03 15:48:14 MST
Hi Steve,

I wanted to check in and see whether you have any additional questions about this.  Let me know if so or I'll go ahead and close the ticket.

Thanks,
Ben
Comment 5 Ben Roberts 2020-12-16 09:54:50 MST
Hi Steve,

I haven't heard any follow up questions about this so I assume the information I sent helped.  I'll go ahead and mark this as a duplicate of bug 9365.  If you do have any additional questions feel free to respond to the ticket.

Thanks,
Ben

*** This ticket has been marked as a duplicate of ticket 9365 ***