We use PriorityTier settings on our partitions to give certain users priority access to hardware. The high-priority partitions are PriorityTier=10, while the low-priority partitions are PriorityTier=5. The hardware in the high-priority partitions is a subset of the hardware in the low-priority partition. When a user has more than 20 (bf_max_job_user_part) jobs queued in a high-priority partition, that user's jobs beyond the first 20 are not considered by the backfill scheduler, and low-priority partition jobs are started instead. We would increase bf_max_job_user_part, but our backfill scheduler cycles are already long (200-500 seconds). We would like a way to set bf_max_job_user_part on each partition instead of globally for the entire cluster.
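To make the reported behaviour concrete, here is a minimal toy model of a per-user, per-partition cap on backfill candidates. It is only a sketch under stated assumptions, not Slurm's actual backfill code: the function name, the job dictionaries, the user names, and the ordering rule (partition tier first, then job priority) are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def backfill_candidates(pending_jobs, bf_max_job_user_part):
    """Toy model: return the jobs one backfill pass would look at when
    each (user, partition) pair is capped at bf_max_job_user_part jobs."""
    seen = defaultdict(int)  # jobs already taken per (user, partition)
    considered = []
    # Assumed ordering: higher partition tier first, then higher job priority.
    ordered = sorted(pending_jobs, key=lambda j: (-j["tier"], -j["priority"]))
    for job in ordered:
        key = (job["user"], job["partition"])
        if seen[key] >= bf_max_job_user_part:
            continue  # over the cap: this job is skipped in this pass
        seen[key] += 1
        considered.append(job)
    return considered


# Hypothetical queue: 25 jobs from one user in the high-priority partition
# (tier 10) and 3 jobs from another user in the low-priority partition (tier 5).
pending = [
    {"id": i, "user": "alice", "partition": "high", "tier": 10, "priority": 1000 - i}
    for i in range(25)
]
pending += [
    {"id": 100 + i, "user": "bob", "partition": "low", "tier": 5, "priority": 500 - i}
    for i in range(3)
]

considered = backfill_candidates(pending, bf_max_job_user_part=20)
considered_ids = {j["id"] for j in considered}
skipped_ids = sorted(j["id"] for j in pending if j["id"] not in considered_ids)
print(skipped_ids)  # [20, 21, 22, 23, 24]
```

In this model the five high-priority jobs past the cap are never examined, while the three low-priority partition jobs are, which mirrors the situation described above.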
Hi Steve - as you mentioned, this is a SchedulerParameters option and it is enforced identically for every partition. Your request, as I read it, would have us add bf_max_job_user_part=# functionality at the partition level so that the limit can differ for each partition. This would require us to scope out the work involved, which may not be trivial to implement. Are you interested in sponsoring development for this NRE work?
Jason, We may be interested in sponsoring development for this feature. How much development work would this take? Thanks, Steve
Hi Steve - While discussing this internally, we found that you may be able to tune your settings to optimize backfill. We provide a configuration parameter, "bf_min_prio_reserve", which limits which jobs are considered in backfill based on a minimum priority. From the documentation: "bf_min_prio_reserve=# The backfill and main scheduling logic will not reserve resources for pending jobs unless they have a priority equal to or higher than the specified value. In addition, jobs with a lower priority will not prevent a newly submitted job from starting immediately, even if the newly submitted job has a lower priority. This can be valuable if one wished to maximize system utilization without regard for job priority below a certain threshold. The default value is zero, which will reserve resources for any pending job and delay initiation of lower priority jobs. Also see bf_job_part_count_reserve and bf_min_age_reserve. Default: 0, Min: 0, Max: 2^63." Setting this would also require you to tweak some other settings, such as bf_max_job_user_part; however, it creates a subset of jobs and should not cause as much of a performance hit as you would see by just increasing bf_max_job_user_part. We would ask that you try tuning these settings to see if this produces a more desirable effect. Note that you will also need to consider the priority of jobs below bf_min_prio_reserve and how that priority changes/increases over time to meet your SLAs.
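The threshold rule quoted above can be illustrated with a small toy model. This is a sketch of the documented rule only (reserve resources when priority is equal to or higher than the value), not the scheduler's real logic; the function name, job IDs, priorities, and the threshold of 10000 are hypothetical.

```python
def plan_reservations(pending_jobs, bf_min_prio_reserve=0):
    """Toy model: split pending jobs into those that would have resources
    reserved for them and those that would not, using the documented rule
    'priority equal to or higher than bf_min_prio_reserve'."""
    reserve, no_reserve = [], []
    for job in sorted(pending_jobs, key=lambda j: -j["priority"]):
        if job["priority"] >= bf_min_prio_reserve:
            reserve.append(job["id"])
        else:
            # No reservation: this job does not hold back other jobs.
            no_reserve.append(job["id"])
    return reserve, no_reserve


# Hypothetical pending jobs and priorities.
jobs = [
    {"id": "a", "priority": 12000},
    {"id": "b", "priority": 10000},
    {"id": "c", "priority": 4000},
    {"id": "d", "priority": 100},
]

# Default of 0: every pending job gets a reservation.
default_plan = plan_reservations(jobs)

# Hypothetical threshold, as if slurm.conf carried something like
# SchedulerParameters=bf_min_prio_reserve=10000 (value is an assumption).
tuned_plan = plan_reservations(jobs, bf_min_prio_reserve=10000)

print(default_plan)  # (['a', 'b', 'c', 'd'], [])
print(tuned_plan)    # (['a', 'b'], ['c', 'd'])
```

Fewer reservations mean less planning work per cycle, which is why the threshold is expected to cost less than simply raising bf_max_job_user_part. The right threshold depends on how job priorities are calculated and how they age on a given cluster.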
Hey Jason, Setting bf_min_prio_reserve and bf_min_age_reserve has our backfill scheduler running much more quickly. We also applied this patch: https://bugs.schedmd.com/attachment.cgi?id=11398&action=diff. We've been able to comfortably increase bf_max_job_user_part from 20 to 50. The same issue is still possible when a user has more than 50 jobs queued in a partition; however, we decided to instruct our users to re-order their jobs by adjusting their nice values to work with this limit. Thanks, Steve
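One way users might re-order their own jobs with nice values is sketched below. The helper only builds `scontrol update JobId=... Nice=...` command strings and does not run them; the function name, job IDs, and step size are hypothetical. It assumes that a larger nice value lowers a job's priority, so jobs listed later receive larger values.

```python
def nice_commands(job_ids_in_desired_order, step=100):
    """Build scontrol commands that give each later job a larger nice
    value, so earlier jobs in the list keep the higher priority."""
    return [
        f"scontrol update JobId={job_id} Nice={rank * step}"
        for rank, job_id in enumerate(job_ids_in_desired_order)
    ]


# Hypothetical job IDs, listed in the order the user wants them to run.
cmds = nice_commands([4103, 4101, 4102])
for cmd in cmds:
    print(cmd)
# scontrol update JobId=4103 Nice=0
# scontrol update JobId=4101 Nice=100
# scontrol update JobId=4102 Nice=200
```

Whether a given step size is enough to change the ordering depends on the site's priority configuration, so the value here is a placeholder.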
Steve - It is good to hear that you have made progress with those tuning parameters. Shall I proceed to close out this request?
Hello Jason, I think it's safe to close out this request. Best, Steve
Marking as resolved.