| Summary: | Configuration question on TRESBillingWeights for special partition | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Booth <jbooth> |
| Component: | Scheduling | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | damien.leong |
| Version: | 20.02.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7824 | ||
| Site: | Monash University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Jason Booth
2021-08-19 12:47:02 MDT
Hi Damien,

I'm trying to understand your current configuration. Could you attach your slurm.conf, highlighting these specialized nodes, and explain a bit about your current accounts/users hierarchy? That is, do these "paid users" have one account/assoc that gives them access to the whole cluster, or do they have more than one account/assoc to access paid and regular nodes? How do you prevent regular users from accessing the specialized nodes?

Also, to better understand your goals: when a paid user submits a job, is it important for them to target the specialized nodes, or do they just want the job to run as soon as possible on any node (specialized or not)?

And finally, do you want to keep using fairshare between paid users to prioritize jobs that will run on specialized nodes? Meaning, do you want previous usage of those specialized nodes to count in the fairshare between paid users, or do you want only the usage of regular nodes to count, even for access to the specialized nodes?

My guesses are:

- Paid and regular users want their jobs to be scheduled as soon as possible, no matter what kind of nodes they run on (besides the actual TRES requested, of course).
- Paid users should also have fairshare between them based on their usage of the specialized nodes, and also with regular users but only based on their usage of regular nodes (not aggregating the usage of specialized nodes).
- Your main problem is that usage of the specialized nodes is so high that when a paid user tries to use a regular node, their priority is too low, even if they use those nodes less than regular users do.

Is that correct?

Regards,
Albert

Hi Albert,

Thanks for the replies. The setup structure is represented in https://bugs.schedmd.com/show_bug.cgi?id=7824. We are planning to integrate more specialised servers (DGXs) into our production cluster. These nodes are sponsored by paid partners.

- These specialised nodes are exclusive to these paid users only. They don't want to be penalised for using them.
- The paid users also want their respective (normalised) fair-share values for the normal nodes in the cluster.
- The paid users can choose to submit their jobs to whatever nodes they like, either specialised nodes or normal nodes.

So your guesses are spot on:

- Paid and regular users want their jobs to be scheduled as soon as possible.
- Paid users should have fairshare between them based on their usage of the normal nodes.
- Our main problem is that usage of the specialised nodes is so high that when a paid user tries to use a regular node, their priority is too low, even if they use those nodes less than regular users do.

***Question

We are thinking of putting all these specialised nodes in a 'special' partition for paid users only (using AllowAccounts=paid01 AllowQOS=paid). Can we configure this partition to have TRESBillingWeights=0, or <1, so that whenever users use this special partition it will NOT affect or contribute to their respective accounts' fairshare metrics? Is this a viable solution? Or can you propose a better option?

Kindly advise.

Thanks
Damien

Hi Damien,

> ***Question
>
> We are thinking of putting all these specialised nodes in a 'special'
> partition for paid users only (using AllowAccounts=paid01 AllowQOS=paid).

I don't see any alternative for keeping regular users off those specialized nodes other than putting them into a special partition and using some Allow* or Deny* option. How are you doing it now?

> Can we configure this partition to have TRESBillingWeights=0, or <1, so that
> whenever users use this special partition it will NOT affect or contribute
> to their respective accounts' fairshare metrics?
>
> Is this a viable solution?

Yes, this is a good way to do it.
For example, if we have these two partitions:

    PartitionName=main Nodes=c[1-3] Default=YES
    PartitionName=paid Nodes=c4 Default=NO AllowAccounts=paid TRESBillingWeights=cpu=0

then users in the "paid" Account should submit jobs to multiple partitions, like:

    $ srun -p main,paid sleep 30

Slurm will do its best to schedule the job as soon as possible on any available partition/node. If the job is finally scheduled into the "main" partition because it was faster to schedule there, then the user/account will be charged the normal usage that any regular user gets; but if the job is finally scheduled into the "paid" partition, onto any of the special nodes, then no usage for that job will be charged to the user/account.

Note that I'm assuming users are only in 1 account (a paid one or a regular one). If users are in different accounts, then their usage is independent, meaning that depending on the account used at submission time (-A) they will have a different fairshare priority (you can only specify 1 Account with -A).

Regards,
Albert

Hi Albert,

Thanks for your reply. How are we doing this now? Well, it is not ideal; we use a couple of methods:

1) Reservations. For example, we reserve the whole special partition for the paid customers/accounts only. But we found out that reservations do count towards 'raw_usage', so the respective account's fairshare value suffers when the account is used against the normal partition/nodes after using the reserved special nodes.

2) Separate accounts for the same set of users, one for the special partition and one for the normal partitions, with instructions to use them accordingly. This creates a lot of overhead for both the admins and the users.

Cheers
Damien

Hi Damien,

> How are we doing this now? Well, it is not ideal; we use a couple of
> methods:
>
> 1) Reservations. For example, we reserve the whole special partition for
> the paid customers/accounts only.

I see.
So the special nodes are always reserved for the desired users/accounts only. I hadn't thought about this alternative, because in my mind Reservations are more for temporary setups, but it will prevent regular users from accessing those nodes, yes.

> But we found out that reservations do count towards 'raw_usage', so the
> respective account's fairshare value suffers when the account is used
> against the normal partition/nodes after using the reserved special nodes.

Yes.

> 2) Separate accounts for the same set of users, one for the special
> partition and one for the normal partitions, with instructions to use them
> accordingly.

This point confuses me a bit: if users are able to use two different Accounts, and they use their regular Account to submit jobs to regular nodes, then the usage done in their special Account shouldn't be an issue, because it is totally ignored. Provided they do specify the regular Account at submit time, of course. I guess that they always use the special Account, even to submit jobs to the regular nodes, right?

Anyway, having 2 Accounts allows independent Usages/Fairshares, but it will not achieve the goal of "scheduling as soon as possible, no matter onto which nodes". It wouldn't, because it is the user at submission time who decides which Account to use (only one), and that decides which nodes are taken into account to schedule the job. On the other hand, since users can submit jobs to multiple partitions, it is the scheduler that makes the final decision to "schedule as soon as possible".

> This creates a lot of overhead for both the admins and the users.

Yes. I think that only 1 Account per user, setting up the nodes in different Partitions with the right AllowAccounts or AllowQOS, and TRESBillingWeights=cpu=0 is the best option for you.
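For concreteness, a minimal sketch of what such a setup could look like; the node names, partition names, and account/user names below are illustrative placeholders, not taken from the actual site configuration:

```
# slurm.conf -- hypothetical partition layout:
# regular nodes billed normally, DGX nodes restricted and billed at zero.
PartitionName=main Nodes=c[1-3] Default=YES
PartitionName=paid Nodes=dgx[1-2] Default=NO AllowAccounts=paid01 TRESBillingWeights=cpu=0,gres/gpu=0

# Each paid user keeps a single association, e.g.:
#   sacctmgr add account paid01
#   sacctmgr add user someuser account=paid01
```

Paid users would then submit with `-p main,paid`, letting the scheduler start the job in whichever partition has room first.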
I'm resolving the issue as infogiven, but please don't hesitate to reopen it if you need further support (or open a related bug if you do it after a couple of years ;-)

Regards,
Albert

Hi Albert,

Sorry, I still need more clarifications, which may save me some in-depth testing. As mentioned, the two methods I am currently using aren't good solutions. The paid users don't want separate accounts; there is too much process overhead. Yes, I am inclined to set up a special partition (with paid access / AllowQOS) for these selected users, with TRESBillingWeights=cpu=0,gpu=0 (DGX boxes).

"I think that only 1 Account per user and setting up nodes into different Partitions with the right AllowAccounts or AllowQOS, and TRESBillingWeights=cpu=0 is the best option for you."

Current settings:

    $ scontrol show config | grep -i priority
    PriorityParameters           = (null)
    PrioritySiteFactorParameters = (null)
    PrioritySiteFactorPlugin     = (null)
    PriorityDecayHalfLife        = 14-00:00:00
    PriorityCalcPeriod           = 00:05:00
    PriorityFavorSmall           = Yes
    PriorityFlags                =
    PriorityMaxAge               = 14-00:00:00
    PriorityUsageResetPeriod     = NONE
    PriorityType                 = priority/multifactor
    PriorityWeightAge            = 30000
    PriorityWeightAssoc          = 0
    PriorityWeightFairShare      = 50000
    PriorityWeightJobSize        = 30000
    PriorityWeightPartition      = 30000
    PriorityWeightQOS            = 40000
    PriorityWeightTRES           = (null)

These are my questions:

1) In this case, I will have to set a value for PriorityWeightTRES, correct?

2) In a multi-factor setup, will TRESBillingWeights be taken into consideration during scheduling? We are hoping that if this resource is available, there will be no waiting (i.e. it behaves like a reservation).

3) If I set up this special partition with TRESBillingWeights=cpu=0,gpu=0, does that mean that I am forced to set a value (>1) for the other (normal) partitions as well?

4) In a TRESBillingWeights setup, how will this affect my monthly usage reporting via sreport and sacct?
We didn't keep track of TRESBilling before.

5) If there is a long queue waiting for this special partition (the nodes are already in use) by the paid users, will this affect the overall scheduling response of the whole cluster for the normal users?

Kindly advise.

Many Thanks
Damien

Hi Damien,

> Sorry, I still need more clarifications, which may save me some in-depth
> testing.

No problem at all! ;-)

> 1) In this case, I will have to set a value for PriorityWeightTRES, correct?

Not really. Let me explain it in depth to avoid confusion (sorry if something is too obvious or repetitive).

The key idea that I want to explain is that we need a clear distinction between Priority and Usage. Priority is only used to order pending jobs in the queue, while Usage is in general related to Limits and is tracked per User/Assoc. What creates a relation between Priority and Usage is the Fairshare algorithm: the more Usage you have accumulated in the past, the less Priority your pending jobs will have (depending on the Shares you have been assigned).

With sshare you can see the Usage, Shares, and resulting FairShare factor of each user/assoc. And with sprio you can see all the Priority factors and the resulting Priority of each pending job. Please note these two Priority factors that you can see with sprio (--all): FAIRSHARE and TRES. These two factors are the key to understanding the difference between TRESBillingWeights and PriorityWeightTRES.

What TRESBillingWeights does is control how much *Usage* is accounted to a user/assoc for jobs that use those TRES during their run time (very similar to what UsageFactor does at the QOS level). Will TRESBillingWeights have an impact on Priority? Yes, but indirectly, through Fairshare. TRESBillingWeights directly impacts the Usage columns in sshare, so it also impacts the FairShare factor of each user/assoc in sshare.
And that FairShare factor of a user/assoc will also appear as the FAIRSHARE factor in sprio for all the pending jobs of that user/assoc, so it will contribute to the jobs' Priority (depending on PriorityWeightFairShare).

On the other hand, PriorityWeightTRES does NOT change any Usage; it directly changes the Priority of each individual job depending on the *requested* TRES of that job. It is totally unrelated to the Fairshare algorithm. PriorityWeightTRES will appear as the TRES column in sprio (--all), so it impacts each individual job directly and independently, but it has no impact at all on the values shown by sshare.

In summary:

- TRESBillingWeights is directly related to the Usage of users/assocs (and indirectly to their jobs' Priority through Fairshare), while PriorityWeightTRES is only related to individual jobs' priorities.
- You don't need PriorityWeightTRES to make Fairshare work, and you only need to tune TRESBillingWeights to change how much usage is added to users/assocs for their jobs running on each partition.

> 2) In a multi-factor setup, will TRESBillingWeights be taken into
> consideration during scheduling?

I guess that the above explanation clarifies this too (I hope it doesn't add more confusion ;-). TRESBillingWeights is *only* taken into account to compute the Usage fields of each user/assoc that you can see in sshare. The rest of the scheduling process is exactly the same: that Usage is one of the main elements used to obtain the FairShare factor, and that is one of the priority factors used to compute the Priority of each job (see sprio).

> We are hoping that if this resource is available, there will be no waiting
> (i.e. it behaves like a reservation)

Yes, TRESBillingWeights will NOT add any waiting delay. It is only used to compute the Usage fields of each user/assoc that you can see in sshare. Actually, even if you don't set it, it is always applied with its default value: cpu=1.
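To make the Usage side concrete: the billable amount recorded for a job is essentially the weighted sum of its allocated TRES, so zero weights mean zero accrued usage. A toy arithmetic sketch (the job size and weights below are hypothetical, not real site values):

```shell
# Hypothetical job: 8 CPUs and 2 GPUs.
cpus=8
gpus=2

# "main" partition: default weights (cpu=1; other TRES not counted).
main_billing=$(( cpus * 1 ))

# "paid" partition with TRESBillingWeights=cpu=0,gres/gpu=0: nothing billed.
paid_billing=$(( cpus * 0 + gpus * 0 ))

echo "main: $main_billing  paid: $paid_billing"
# -> main: 8  paid: 0
```

Only the "paid" partition needs an explicit zero weight; the "main" partition can simply rely on the default cpu=1.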
> 3) If I set up this special partition with TRESBillingWeights=cpu=0,gpu=0,
> does that mean that I am forced to set a value (>1) for the other (normal)
> partitions as well?

The default value of TRESBillingWeights is cpu=1; the other TRES are not taken into account by default. Therefore, you only need to set TRESBillingWeights=cpu=0 for the paid partition, and use the default value on the regular one.

> 4) In a TRESBillingWeights setup, how will this affect my monthly usage
> reporting via sreport and sacct? We didn't keep track of TRESBilling before.

TRESBillingWeights will not change the values reported for each TRES in sreport, except for the Billing TRES. That is, in sreport the usage of CPUs in the paid partition will be accounted exactly the same as the usage of CPUs in the regular partition, and the same goes for GPUs or any other TRES, except the Billing one. If you didn't keep track of the Billing TRES before, you can keep ignoring it. I guess that you are mainly interested in job Priority (so in Fairshare factors, so in the Usage accounted per user and per partition, so in TRESBillingWeights per partition).

> 5) If there is a long queue waiting for this special partition (the nodes
> are already in use) by the paid users, will this affect the overall
> scheduling response of the whole cluster for the normal users?

No. Well, one could argue that the more nodes, partitions, jobs, QOSes, reservations, etc. there are, the more calculations the scheduler needs to do, and so the longer it takes to respond... yes. But no: Slurm does a great job of internally keeping independent queues per partition, so one partition can be totally busy, but if a job requests another partition and there is space there, the job will be scheduled immediately. One could say that this was the main goal when partitions were created: to have independent queues. So you won't notice any change in response time.
On the contrary, the scheduler will probably work even faster, because it is likely more optimized for multiple partitions than for reservations (please don't take this last sentence too seriously, though ;-).

Regards,
Albert

Hi Damien,

Do you have more related questions?

Regards,
Albert

Hi Damien,

If this is OK for you, I'm closing the bug again as infogiven. But please don't hesitate to reopen it if you need further support.

Regards,
Albert