Good afternoon SLURM Support,

We have some confusion with our setup for child accounts under different parent accounts.

Our setup:
--
AccountingStorageEnforce = associations,limits,qos
AccountingStorageTRES = cpu,mem,energy,node,billing,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
PriorityParameters = (null)
PriorityDecayHalfLife = 14-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 10000
PriorityWeightFairShare = 40000
PriorityWeightJobSize = 10000
PriorityWeightPartition = 10000
PriorityWeightQOS = 0
PriorityWeightTRES = (null)
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
--

Question 1:

We have seven parent accounts (p001 - p007) and many child accounts.

p001 has a share of 3
p007 has a share of 61

Under p001:
a001
a002

Under p007:
c001
c002
c003
c004
c005
c006
c007

Shouldn't a child account under p007 (for example, c003) always have a higher fairshare value than a child account under p001 (for example, a001)? Since p007 has a share of 61, while p001 has a share of only 3.

Question 2:

If this is not the correct behaviour, and we switch to Fair Tree fairshare (PriorityFlags=FAIR_TREE), will that assumption then hold?

We have just upgraded from 16.05 to 17.11.05.

Kindly advise. Thanks.

Cheers,
Damien
Additionally (related to this), as we have:

- PriorityDecayHalfLife = 14-00:00:00
- Priority will decay over a period of 14 days (a fortnight).

Question: can we see when this period starts or ends in real time, perhaps with 'sdiag', so we can advise end-users when usage will be recalculated again?

Please advise. Many thanks.

Cheers,
Damien
Hi Damien.

Before replying to your questions, I noticed one thing worth commenting on after seeing your setup: as our colleague Marshall explained in bug 5104 comment 27, your PriorityWeight<something> factors are all the same (10000), except for PriorityWeightFairshare (40000). Because they're all the same, none of them are going to affect job priority whatsoever - they might as well be zero. We generally recommend ordering each of the PriorityWeight<something> factors from most to least important, then setting them each an order of magnitude apart. This should help more jobs get scheduled.

(In reply to Damien from comment #0)
> Shouldn't a child account under p007 (for example, c003) always have a
> higher fairshare value than a child account under p001 (for example, a001)?
> Since p007 has a share of 61, while p001 has a share of only 3.

Not necessarily. Take into account that the number of siblings at the same level also influences the formula, as described in https://slurm.schedmd.com/priority_multifactor.html#fairshare

S = (Suser / Ssiblings) * (Saccount / Ssibling-accounts) * (Sparent / Sparent-siblings) * ...
But still, with your setup, even though account p007 has 7 sub-accounts c[001-007] - so there are more siblings to divvy up the assigned shares - their NormShares are higher compared to a[001-002] hanging from p001:

alex@ibiza:~/t$ sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                          1.000000           0      1.000000   0.500000
 p001                                    3    0.044776           0      0.000000   1.000000
  a001                                   1    0.022388           0      0.000000   1.000000
  a002                                   1    0.022388           0      0.000000   1.000000
 p007                                   63    0.940299           0      0.000000   1.000000
  c001                                   1    0.134328           0      0.000000   1.000000
  c002                                   1    0.134328           0      0.000000   1.000000
  c003                                   1    0.134328           0      0.000000   1.000000
  c004                                   1    0.134328           0      0.000000   1.000000
  c005                                   1    0.134328           0      0.000000   1.000000
  c006                                   1    0.134328           0      0.000000   1.000000
  c007                                   1    0.134328           0      0.000000   1.000000
alex@ibiza:~/t$

Note a001 has NormShares = 0.022388, which is (0.044776 / 2), lower than any of the c00X accounts. (The above is _without_ PriorityFlags=FAIR_TREE set.)

Note, though, that this doesn't mean a user from a c00x account who starts monopolizing the cluster resources, thus becoming over-serviced, couldn't at some point end up with a worse FairShare factor than a user hanging from an a00x account that was under-serviced and almost never used the resources - even though the a00x user has fewer shares assigned. Does it make sense?

> Question 2:
>
> If this is not the correct behaviour, and we switch to Fair Tree fairshare
> (PriorityFlags=FAIR_TREE), will that assumption then hold?

PriorityFlags=FAIR_TREE changes several fairshare calculations. Fair Tree prioritizes users such that if accounts A and B are siblings and A has a higher fairshare factor than B, all children of A will have higher fairshare factors than all children of B.

Note that if the FAIR_TREE flag is set, the output of 'sshare -l' will display a new field, Level FS.
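To make the arithmetic behind the NormShares column concrete, here is a small illustrative sketch (not Slurm source code): a child's NormShares is its parent's NormShares split across the raw shares of the child and its siblings. The parent NormShares values are taken directly from the sshare output above.

```python
# Sketch (not Slurm code): how NormShares propagates down the account tree,
# using the numbers from the sshare output above.

def child_norm_shares(parent_norm, child_raw, sibling_raw_total):
    """A child's NormShares is the parent's NormShares scaled by the
    child's fraction of the raw shares among all siblings."""
    return parent_norm * child_raw / sibling_raw_total

# Parent NormShares as reported by sshare:
p001_norm = 0.044776   # 3 raw shares
p007_norm = 0.940299   # 63 raw shares

# a001 is one of 2 equal children under p001 (1 raw share each):
a001 = child_norm_shares(p001_norm, 1, 2)
# c003 is one of 7 equal children under p007 (1 raw share each):
c003 = child_norm_shares(p007_norm, 1, 7)

print(round(a001, 6))  # 0.022388 - matches the sshare column
print(round(c003, 6))  # 0.134328 - higher, despite 7 siblings
```

So even divided among seven siblings, p007's much larger raw share leaves each c00x child with a higher normalized share than either a00x child.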
If an account has a higher Level FS value than any other sibling user or sibling account, all children of that account will have a higher FairShare value than the children of the other account. This is true at every level of the association tree. We usually recommend setting this flag. A more detailed guide is on the web:

https://slurm.schedmd.com/fair_tree.html

(In reply to Damien from comment #1)
> Question: can we see when this period starts or ends in real time, perhaps
> with 'sdiag', so we can advise end-users when usage will be recalculated
> again?

Every time slurmctld is started or reconfigured, it spawns a _decay_thread which is responsible for applying the decay factor to the usage every PriorityCalcPeriod, which is set to 5 minutes in your setup. With DebugFlags=Priority set, a message similar to this is logged (I have PriorityCalcPeriod=00:01:00):

slurmctld: Decay factor over 60 seconds goes from 0.999999732638889 -> 0.999983958459854

There is no information displayed in 'sdiag' related to this, though.

Please let me know if you have further questions, thanks.
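For reference, a per-period decay factor like the one logged above can be derived from the half-life. This is an illustrative sketch only (not Slurm source code, and the exact formula Slurm uses is an assumption here: decay = 0.5 ** (period / half_life)), plugged in with the 14-day half-life and 5-minute calc period from this ticket's setup:

```python
# Illustrative sketch: per-period decay factor implied by a half-life,
# assuming decay = 0.5 ** (period / half_life).

half_life = 14 * 24 * 3600    # PriorityDecayHalfLife = 14-00:00:00, in seconds
calc_period = 5 * 60          # PriorityCalcPeriod = 00:05:00, in seconds

decay = 0.5 ** (calc_period / half_life)
print(round(decay, 6))        # ~0.999828 applied every 5 minutes

# Applying that factor every period for 14 days halves the recorded usage:
periods_in_14_days = half_life // calc_period   # 4032 five-minute periods
print(round(decay ** periods_in_14_days, 6))    # ~0.5
```

(The factors in my log line above come from a different test half-life, so the numbers differ; the shape of the calculation is the same.)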
If the web guide isn't enough, there's a complementary presentation here: https://slurm.schedmd.com/SC14/BYU_Fair_Tree.pdf
Hi Alejandro,

Your explanation does make sense.

With our current configuration:
--
AccountingStorageEnforce = associations,limits,qos
AccountingStorageTRES = cpu,mem,energy,node,billing,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
PriorityParameters = (null)
PriorityDecayHalfLife = 14-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 10000
PriorityWeightFairShare = 40000
PriorityWeightJobSize = 10000
PriorityWeightPartition = 10000
PriorityWeightQOS = 0
PriorityWeightTRES = (null)
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
--

How does SLURM recalculate the fairshare value of any particular account? For example, if an account has heavy usage for two weeks, does that account keep a low fairshare for the rest of the month, or does it reset at a certain time and get recalculated, putting it back at the same starting point as every other account? Or does it keep this fairshare value forever?

Kindly advise. Thanks.

Cheers,
Damien
(In reply to Damien from comment #4)
> For example, if an account has heavy usage for two weeks, does that account
> keep a low fairshare for the rest of the month, or does it reset at a
> certain time and get recalculated, putting it back at the same starting
> point as every other account? Or does it keep this fairshare value forever?

Since you have these parameters:

PriorityDecayHalfLife = 14-00:00:00
PriorityCalcPeriod = 00:05:00

every 5 minutes (PriorityCalcPeriod) the half-life decay is recalculated. PriorityDecayHalfLife, which you have set to 14 days, controls how long prior resource use is considered when determining how over- or under-serviced an association (user, bank account and cluster) is for job priority purposes. The record of usage decays over time, with half of the original value cleared at age PriorityDecayHalfLife.

I'd still adjust the PriorityWeight* options as commented in my previous comment, as well as set PriorityFlags=FAIR_TREE on top of what you already have configured.

Please let me know if you have any more questions. Thank you.
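To make the "forever" part of the question concrete: with a 14-day half-life there is no reset date; the record just shrinks geometrically. An illustrative sketch (not Slurm source code), assuming the remaining fraction of a usage record is 0.5 ** (age / half_life):

```python
# Illustrative only: fraction of a usage record still counted after N weeks
# with PriorityDecayHalfLife = 14 days (usage halves every two weeks).

HALF_LIFE_DAYS = 14

def remaining_fraction(age_days):
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

for weeks in (2, 4, 8, 12):
    frac = remaining_fraction(weeks * 7)
    print(f"after {weeks:2d} weeks: {frac:.4f} of the usage still counts")

# after  2 weeks: 0.5000
# after  4 weeks: 0.2500
# after  8 weeks: 0.0625
# after 12 weeks: 0.0156
```

So a heavy fortnight of usage never resets abruptly; it fades continuously toward zero, and after a couple of months it has essentially no influence on the fairshare factor.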
Good morning Alejandro,

We want to switch to Fair Tree priority as mentioned in the previous notes, but I don't see it under '/opt/slurm-17.11.4/lib/slurm'. Does it need a separate .so file, or is it built in, with no additional binary needed?

Please advise. Thanks.

Cheers,
Damien
To configure it you just need to set:

PriorityFlags=FAIR_TREE

in your slurm.conf and reconfigure Slurm. Almost all of the FAIR_TREE logic lives in the existing .../lib/slurm/priority_multifactor.so, and you don't need any other libraries to be installed.

Please let us know if you have further questions. Thanks.
Hi Damien. Is there anything else you need from here? Thanks.
Hi Damien. I'm closing this as resolved/infogiven. Please re-open if there's anything else you need from here. Thank you.
Thanks. Please close this ticket.