Coming from bug 5176. It is desired to add something equivalent to Moab's XFACTOR priority component: http://www.adaptivecomputing.com/blog-hpc/using-moab-job-priorities-exploring-priority-sub-components/ XFACTOR = 1 + (EffQueueTime / WallClockLimit) which seems that could be easily accomplished by adding an AGE_RELATIVE_TO_TIME to the PriorityFlags, which if set would divide the Age Factor by the job's TimeLimit, continuing taking into account ACCRUE_ALWAYS and/or PriorityMaxAge.
*** Ticket 5947 has been marked as a duplicate of this ticket. ***
*** Ticket 5964 has been marked as a duplicate of this ticket. ***
We think we can add this feature to 19.05 without (a priori) too much effort. We're also leveraging different implementation approaches: 1. The most simple. Just add a new AGE_RELATIVE_TO_TIME flag to PriorityFlags so that the AgeFactor is divided by the job's TimeLimit (or partition MaxTime if no job request). Then if the result < PriorityMaxAge, divide by PriorityMaxAge; otherwise set the factor to 1.0. 2. More elaborated and would time a few more time. Add a new and independent PriorityWeightAgeRelative factor and a new PriorityMaxAgeRelative value. Not sure if it would make sense to have both (PriorityWeightAge, PriorityMaxAge) and (PriorityWeightAgeRelative, PriorityMaxAgeRelative) duples enabled at the same time or if they should be mutually exclusive. Thoughts?
(In reply to Alejandro Sanchez from comment #3) > We think we can add this feature to 19.05 without (a priori) too much > effort. We're also leveraging different implementation approaches: > > 1. The most simple. Just add a new AGE_RELATIVE_TO_TIME flag to > PriorityFlags so that the AgeFactor is divided by the job's TimeLimit (or > partition MaxTime if no job request). Then if the result < PriorityMaxAge, > divide by PriorityMaxAge; otherwise set the factor to 1.0. Sounds nice and simple, but we want independent job age as well as job relative age parameters. > 2. More elaborated and would time a few more time. Add a new and independent > PriorityWeightAgeRelative factor and a new PriorityMaxAgeRelative value. Not > sure if it would make sense to have both (PriorityWeightAge, PriorityMaxAge) > and (PriorityWeightAgeRelative, PriorityMaxAgeRelative) duples enabled at > the same time or if they should be mutually exclusive. I think that job age and job relative are two quite independent weights, both of which can be very useful. Please do not make them mutually exclusive. > Thoughts? The function AGE_RELATIVE_TO_TIME can be simply job wait time / TimeLimit. But it will be nice to have something like the Maui/Moab parameter XFMINWCLIMIT (see the Moab documentation) so that jobs submitted with a *very* short TimeLimit (say, 1 minute) don't jump to the top of the queue too quickly. If you plot the mathematical function with and without XFMINWCLIMIT, you'll see a different curve. For our cluster I'd configure something like XFMINWCLIMIT = 10 minutes.
I think the PriorityAgeMinTimeLimit (XFMINWCLIMIT) can easily be taken into account by almost the same effort. If let's say you configure it to 10 minutes and the job requests a time limit of 1 minute (job TimeLimit < PriorityAgeMinTimeLimit) then the logic should divide the job age / 10 , right?
(In reply to Alejandro Sanchez from comment #5) > I think the PriorityAgeMinTimeLimit (XFMINWCLIMIT) can easily be taken into > account by almost the same effort. If let's say you configure it to 10 > minutes and the job requests a time limit of 1 minute (job TimeLimit < > PriorityAgeMinTimeLimit) then the logic should divide the job age / 10 , > right? According to the Moab documentation http://www.adaptivecomputing.com/blog-hpc/using-moab-job-priorities-exploring-priority-sub-components/ the formula would be something like these pseudo-statements: tmp = Wait-time / max (PriorityAgeMinTimeLimit, TimeLimit) PriorityAgeRelative = min (tmp, PriorityMaxAgeRelative) job_priority += PriorityWeightAgeRelative * PriorityAgeRelative Does this make sense?
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7) > tmp = Wait-time / max (PriorityAgeMinTimeLimit, TimeLimit) > PriorityAgeRelative = min (tmp, PriorityMaxAgeRelative) > job_priority += PriorityWeightAgeRelative * PriorityAgeRelative > > Does this make sense? Yes, although not sure if the MaxAgeTime should be MIN()'d before or after multiplying by the Weight. If it is before, then PriorityMaxAgeRelative should be handled internally as a double C type with range [0.0, 1.0]. Do you agree with that?
(In reply to Alejandro Sanchez from comment #8) > (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7) > > tmp = Wait-time / max (PriorityAgeMinTimeLimit, TimeLimit) > > PriorityAgeRelative = min (tmp, PriorityMaxAgeRelative) > > job_priority += PriorityWeightAgeRelative * PriorityAgeRelative > > > > Does this make sense? > > Yes, although not sure if the MaxAgeTime should be MIN()'d before or after > multiplying by the Weight. If it is before, then PriorityMaxAgeRelative > should be handled internally as a double C type with range [0.0, 1.0]. Do > you agree with that? Yes, I think you're right! The PriorityMaxAgeRelative should probably be a large integer (say, 100000) defining the cap on the PriorityAgeRelative priority factor. So perhaps the pseudo-code should be changed into: tmp = Wait-time / max (PriorityAgeMinTimeLimit, TimeLimit) job_priority += min (PriorityMaxAgeRelative, PriorityWeightAgeRelative * tmp) Do you agree that this is the correct formula to use?
I like the previous one more. If we use the second one I think the MaxAgeRelative would be redundant since tmp at most can be 1.0 and if you want to limit the factor by the weight multiplication, you just need to adjust the weight to the maximum you want, so the MaxAgeRelative wouldn't buy us anything new. If instead MaxAgeRelative limits the maximum tmp division, perhaps it makes more sense? Another thing that I'm questioning is if the already available PriorityMaxAge should also limit the maximum age for PriorityMaxAgeRelative before doing the division by the time limit. What do you think?
Note that for Moab the limit is applied before being weighted: XFACTORCAP – The maximum number of points that XFACTOR sub-component value. can provide to the calculation, which will then be multiplied by the component and sub-component weights.
I'm even thinking we don't need MaxAgeRelative at all. The tmp division will always be normalized between 0.0 .. 1.0 and the result after weight multiplication will then be at most PriorityWeightAgeTime. Then we can avoid using an extra unneeded param. Do you agree with that? The MinTime can make sense though. And I'd like to know your opinion on the question about the mentioned PriorityMaxAge if it should also limit this new multifactor factor age before dividing by time limit.
(In reply to Alejandro Sanchez from comment #12) > I'm even thinking we don't need MaxAgeRelative at all. The tmp division will > always be normalized between 0.0 .. 1.0 and the result after weight > multiplication will then be at most PriorityWeightAgeTime. Then we can avoid > using an extra unneeded param. Do you agree with that? I believe the tmp value: tmp = Wait-time / max (PriorityAgeMinTimeLimit, TimeLimit) can be any float number between 0.0 and a very large number. For example, a job with TimeLimit = 1 minute which has waited for 24 hours has tmp=1440.0 > The MinTime can make sense though. And I'd like to know your opinion on the > question about the mentioned PriorityMaxAge if it should also limit this new > multifactor factor age before dividing by time limit. I guess I'm getting confused here. The PriorityAgeMinTimeLimit above ensures that very short jobs don't get a top priority too fast. The capping must ensure that there is a ceiling to the priority achieved. How can we express this in Slurm terminology so that the nice XFACTOR functionality of Moab is achieved? Could you perhaps describe your intended functionality mathematically using appropriate new Slurm parameters similar to what I propose? For me, reading math is sometimes simpler than words :-)
True, we need a ceiling cap. Let's put expressions. Let: double priority_age_time be the priority object factor member holding age relative to time. uint32_t PriorityWeightAgeTime be the configurable weight for such factor. uint32_t PriorityAgeMinTime be the minimum time limit (expressed in time format and internally converted to a number representing minutes) that can be used to divide the age. double PriorityMaxAgeTime be the age/time which will be given the maximum (1.0) priority_age_time factor in computing priority. All priority object factors are floating point numbers normalized between 0.0 and 1.0 as documented[1]. Calculating the factors is done in set_priority_factors() function[2][3]. So in set_priority_factors() we should add logic that would calculate the new priority_age_time factor as follows: uint32_t tmp_time = MAX(JobTimeLimit, PriorityAgeMinTime); double tmp = (double) QueueWaitTime / (double) tmp_time; if (tmp < PriorityMaxAgeTime) { priority_age_time = tmp / PriorityMaxAgeTime; } else { priority_age_time = 1.0; } Once calculated, we know that priority_age_time is within [0.0, 1.0]. Then in _get_priority_internal() we multiply[4] such factor by PriorityWeightAgeTime. So after weighted, the AgeTime contribution can at most be PriorityWeightAgeTime since the maximum factor is 1.0. My question is if the already available PriorityMaxAge, which currently only applies to the priority_age factor, if it should also limit the QueueWaitTime in the algorithm above for the calculation of priority_age_time. Does it make sense now? [1] https://slurm.schedmd.com/priority_multifactor.html#general [2] https://github.com/SchedMD/slurm/blob/slurm-18.08/src/plugins/priority/multifactor/priority_multifactor.c#L526 [3] https://github.com/SchedMD/slurm/blob/slurm-18.08/src/plugins/priority/multifactor/priority_multifactor.c#L2003 [4] https://github.com/SchedMD/slurm/blob/slurm-18.08/src/plugins/priority/multifactor/priority_multifactor.c#L541
(In reply to Alejandro Sanchez from comment #14) > True, we need a ceiling cap. Let's put expressions. Let: > > double priority_age_time be the priority object factor member holding age > relative to time. > uint32_t PriorityWeightAgeTime be the configurable weight for such factor. > uint32_t PriorityAgeMinTime be the minimum time limit (expressed in time > format and internally converted to a number representing minutes) that can > be used to divide the age. > double PriorityMaxAgeTime be the age/time which will be given the maximum > (1.0) priority_age_time factor in computing priority. > > All priority object factors are floating point numbers normalized between > 0.0 and 1.0 as documented[1]. Calculating the factors is done in > set_priority_factors() function[2][3]. > > So in set_priority_factors() we should add logic that would calculate the > new priority_age_time factor as follows: > > uint32_t tmp_time = MAX(JobTimeLimit, PriorityAgeMinTime); > double tmp = (double) QueueWaitTime / (double) tmp_time; > if (tmp < PriorityMaxAgeTime) { > priority_age_time = tmp / PriorityMaxAgeTime; > } else { > priority_age_time = 1.0; > } > > Once calculated, we know that priority_age_time is within [0.0, 1.0]. Then > in _get_priority_internal() we multiply[4] such factor by > PriorityWeightAgeTime. So after weighted, the AgeTime contribution can at > most be PriorityWeightAgeTime since the maximum factor is 1.0. Reading your algorithm, everything seems to be clarified now. I agree with the algorithm and code above! > My question is if the already available PriorityMaxAge, which currently only > applies to the priority_age factor, if it should also limit the > QueueWaitTime in the algorithm above for the calculation of > priority_age_time. That's a good question. After thinking about it, my opinion is that PriorityMaxAge should *not* apply to the priority_age_time calculation. My reasoning is that PriorityMaxAge and priority_age_time are used for completely different purposes, and that it's important to keep things simple and understandable. It would be confusing to Slurm admins and users if these two priority factors were intertwined. > Does it make sense now? Yes, as far as I'm concerned this is an excellent design! Documentation of slurm.conf and a new column next to the AGE column in the output of the sprio command would seem to complete your suggested implementation.
I'm out of the office until November 19th. Jeg er ikke på kontoret, tilbage igen 19. november. Best regards / Venlig hilsen, Ole Holm Nielsen
I'm not seeing this new feature being added to the NEWS page for 19.05. Could I ask you to push for acceptance of the patch? Thanks, Ole
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #28) > I'm not seeing this new feature being added to the NEWS page for 19.05. > Could I ask you to push for acceptance of the patch? > Thanks, Ole We're still having some internal discussion around this bug. We'll update you as soon as possible. Thanks.
Hi Ole and David, I want to update you on the state of this bug as well as apologizing for the long delay. This has taken longer than initially expected due to the extensive internal discussion around different approaches. As proven by the 'prio_agetime' branch, I already had a working proof of concept addition of a new factor AgeTime + the min and max bounding options as initially discussed in this bug. After the mentioned internal and long talks, we've decided to discard that work (that branch will eventually be removed) in favor of a Tim's innovate idea, already committed[1] to master branch. The idea has been expressed as a new sub-plugin to multifactor that lets a site develop their own way to manipulate a site factor. This factor can be set by an administrator using scontrol, through a job submit plugin or through a site_factor plugin. To that end, two new options have been added: PrioritySiteFactorPlugin (defining the enabled plugin type) PrioritySiteFactorParameters (for internal use by the plugin) An API has been exposed and documented, with the following functions: init fini site_factor_p_reconfig site_factor_p_set site_factor_p_update The site factor can be updated every PriorityCalcPeriod and the factor parameters refreshed upon slurmctld reconfiguration. There's a very basic example (and non-sense random) implementation of this plugin available under: src/plugins/site_factor/none/site_factor_none.c While this interface will be intended for sites to develop their own implementations (same way sites already develop their own job submit or SPANK plugins nowadays) we committed to deliver an equivalent method to emulate Moab's XFACTOR for Slurm. So we will attach a site_factor/xfactor implementation for you as a promised exceptional concession. The plugin interface is already available in master branch as mentioned, so at this point, I'm not in a hurry anymore due to this bug since adding a new implementation such as site_factor/xfactor won't require extra changes to the API. I'll try to attach it in the following days/weeks, subject to my current constraints working on other higher priority things close to 19.05 release. Please, let us know if you have any questions and thanks for your patience and understanding. [1] Currently committed changes related to site factor plugin: 5884527c - Add PrioritySiteFactorParameters / PrioritySiteFactorPlugin options. d2c71a41 - Add site_factor plugin. 2987c5db - Convert "Admin Priority" to "Site Priority". ccdfa085 - Docs - document new site_factor plugin API. 6be046f0 - Docs - actually add the new page for the site_factor API. bec96f1e - Fix typo.
Hi Alejandro, (In reply to Alejandro Sanchez from comment #41) > The idea has been expressed as a new sub-plugin to multifactor that lets a > site develop their own way to manipulate a site factor. This factor can be > set by an administrator using scontrol, through a job submit plugin or > through a site_factor plugin. > > To that end, two new options have been added: > PrioritySiteFactorPlugin (defining the enabled plugin type) > PrioritySiteFactorParameters (for internal use by the plugin) > > An API has been exposed and documented, with the following functions: > init > fini > site_factor_p_reconfig > site_factor_p_set > site_factor_p_update ... > While this interface will be intended for sites to develop their own > implementations (same way sites already develop their own job submit or > SPANK plugins nowadays) we committed to deliver an equivalent method to > emulate Moab's XFACTOR for Slurm. So we will attach a site_factor/xfactor > implementation for you as a promised exceptional concession. It's a great idea to create a new sub-plugin to multifactor allowing sites more control over job priority calculations. Thanks a lot for developing a site_factor/xfactor plugin for 19.05 which will be very welcome at out site! Once we start using the xfactor plugin, I wonder how we can inquire Slurm about the priority values added to all the other priority factors? I hope that you will update the sprio command so that a new site_factor priority column can be printed? Thanks, Ole
sprio is already updated to print the SITE.
(In reply to Alejandro Sanchez from comment #43) > sprio is already updated to print the SITE. Thanks! I assume that you refer to the 19.05 version of sprio? The site is not in 18.08. /Ole
Yes indeed, master branch future 19.05 only.
Created attachment 10085 [details] 5202_site_factor_xfactor_v1 Hi, please, see attached a patchset composed out of [PATCH 1/2] and [PATCH 2/2]. Clarifying notes: - [PATCH 1/2] contains the actual meat of the site_factor/xfactor plugin. Mod configure.ac Mod src/plugins/site_factor/Makefile.am New src/plugins/site_factor/xfactor/Makefile.am New src/plugins/site_factor/xfactor/site_factor_xfactor.c - [PATCH 2/2] just contains the modifications of running 'autoreconf'. Since we are adding a new plugin it is required to remake the build system files in specified dirs and subdirs. Mod aclocal.m4 Mod configure Mod src/plugins/site_factor/Makefile.in Mod src/plugins/site_factor/xfactor/Makefile.in The patchset can be interactively applied with: $ git am -3 -i /path/to/5202_site_factor_xfactor_v1.patch I've consciously decided to split the patchset in these two patches, since 1/2 will probably always cleanly apply without conflicts. Perhaps the only one file that can conflict is configure.ac. Patch 2/2 is more prone to eventually conflict. If that happens you can always just apply 1/2 and skip applying 2/2, then manually run 'autoreconf' and rebuild the patchset yourself if desired: $ git format-patch --stdout origin/<branch_name> > ~/path/to/file.patch To enable the plugin I've configured: PrioritySiteFactorPlugin=site_factor/xfactor PrioritySiteFactorParameters=xfactor_min_time=2,xfactor_max=10000000,xfactor_weight=100000 xfactor_min_time /* Minimum time limit in minutes. */ xfactor_max /* Maximum weightened xfactor value. */ xfactor_weight /* Weight for xfactor factor. */ I've done a very basic sanitization of the accepted param values in _parse_parameters() function. Feel free to explore/change at will the minimum/maximum accepted values for each of them. When calling slurm_get_priority_site_factor_params() function, the caller MUST xfree() the xstrdup'ed string of options, as I've done after parsing. Both init() and site_factor_p_reconfig() call _parse_parameters(). site_factor_p_[set|update] functions internally call _calc_factor(). The set one just for the given job record and the update iterates over the list of all pending jobs in the slurmctld active in-memory list (those still not purged after MinJobAge). While the whole job_record pointer is available in the plugin, the plugin is meant to _only_ modify the job_record->site_factor and don't touch anything else to reduce the likelihood of corrupting things. Note that once the whole patchset is applied, if you want to do any custom amendments to site_factor_xfactor.c then you just need to recompile with the generated Makefile under .../build/src/plugins/site_factor/xfactor directory. Please, when possible start testing this in a non-production system with a Slurm version >= 19.05.0rc1 + this patchset applied on top. At this point and given the patchset and the above instructions, I assume you are ready to maintain it as a standalone custom patch by yourself. I'll leave the bug open for a while waiting for any follow-up questions and feedback. Thanks to everybody involved in this.
Hi, I'm gonna close this as per the new functionality provided + the standalone implementation provided. Thanks.
I'm out of the office, back on Friday. Jeg er ikke på kontoret, tilbage på fredag. Best regards / Venlig hilsen, Ole Holm Nielsen