Ticket 16756

Summary: Fairshare Accounting
Product: Slurm
Reporter: Jeff Fahnoe <jfahnoe>
Component: Scheduling
Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 22.05.3   
Hardware: Linux   
OS: Linux   
Site: Wistar
Attachments: SSHARE
slurm.conf
squeue
priority

Description Jeff Fahnoe 2023-05-16 14:28:21 MDT
Created attachment 30317 [details]
SSHARE

We are seeing an issue where jobs are not scheduled ahead of others even though, based on their fairshare values, we believe they should be. When we run sprio we do not see any fairshare values, but we do see them when we run the sshare command. It appears that something is not correct in the config, even though we do have PriorityType=priority/multifactor in slurm.conf.

I've attached the relevant output of those commands. We are struggling to confirm that fairshare is working and prioritizing jobs appropriately.
Comment 1 Jeff Fahnoe 2023-05-16 14:29:42 MDT
Created attachment 30319 [details]
slurm.conf
Comment 2 Jeff Fahnoe 2023-05-16 14:30:03 MDT
Created attachment 30320 [details]
squeue
Comment 3 Jeff Fahnoe 2023-05-16 14:30:29 MDT
Created attachment 30321 [details]
priority
Comment 4 Jason Booth 2023-05-16 15:19:29 MDT
It looks like you are missing the weights.

https://slurm.schedmd.com/priority_multifactor.html#configexample


For example:


PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=2

Before enabling these, please review the parameters and adjust them to your site's needs.

We have found that weights in increments of 1000 tend to work better than weights in the 100s.
Comment 5 Jeff Fahnoe 2023-05-16 15:56:39 MDT
Thanks.  Is there not a default value or does it default to 0?  Is this why priority is all equal?  Does this make it really working like FIFO now?

Is there a typical best practice number for these?

I noticed the example that QOS was set to 2.   Was this on purpose?   Does this mean it is valued less because it is a multiplier or more because the weight is a divider?

Aren't Age and JobSize already factored into fairshare, or is fairshare only the resources consumed over the past x days?

Thanks for the additional information - just trying to understand it all.

Jeff

Jeff Fahnoe
Chief Information Officer
The Wistar Institute
Comment 6 Jason Booth 2023-05-17 09:06:16 MDT
> Is there not a default value or does it default to 0?  Is this why priority is 
> all equal?  Does this make it really working like FIFO now?

The defaults are indeed 0 and this is why you do not see values.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_PriorityWeightQOS
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_PriorityWeightPartition
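
To illustrate why zero weights leave every job with the same priority, here is a minimal sketch in Python. It assumes the simplified form of the multifactor sum described in priority_multifactor.html (each factor is a value between 0.0 and 1.0, multiplied by its weight). The real plugin also includes site, association, and TRES terms and subtracts the nice value. The function name and the factor values for the two jobs are hypothetical.

```python
def job_priority(weights, factors):
    """Simplified multifactor sum: each factor (0.0 to 1.0) times its weight."""
    return sum(round(weights.get(name, 0) * value) for name, value in factors.items())


# Hypothetical factor values for two pending jobs (not taken from the attachments).
job_a = {"age": 0.10, "fairshare": 0.90, "jobsize": 0.05}
job_b = {"age": 0.40, "fairshare": 0.20, "jobsize": 0.05}

# No PriorityWeight* lines in slurm.conf: every weight defaults to 0.
no_weights = {}
# Weights from the example in comment 4.
example_weights = {"age": 1000, "fairshare": 10000, "jobsize": 1000}

for label, weights in (("unset", no_weights), ("example", example_weights)):
    print(label, job_priority(weights, job_a), job_priority(weights, job_b))
```

With the weights unset, both jobs come out at 0, so nothing distinguishes them. With the example weights, the job with the higher fairshare factor ranks first.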

> Is there a typical best practice number for these?

There are starting points; however, each site usually ends up adjusting these to
its workload and use cases.

> PriorityWeightAge=1000
> PriorityWeightFairshare=10000
> PriorityWeightJobSize=1000
> PriorityWeightPartition=1000

This mostly depends on what your site wants to prioritize and how much priority 
you want to give groups or users.

> I noticed the example that QOS was set to 2.   Was this on purpose?   Does this 
> mean it is valued less because it is a multiplier or more because the weight is 
> a divider?

This was just an example. I just wanted to draw your attention to the parameters 
that are frequently used. I would highly suggest you look over what is available
and use those to determine what is important to your site, such as QOS or single 
users.
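
On the multiplier question: each weight is multiplied by a factor between 0.0 and 1.0, so the weight acts as a ceiling on how many priority points that factor can contribute. A small Python sketch, using only the example weights from comment 4 (the variable names are illustrative), shows how little a weight of 2 counts next to the others:

```python
# Example weights from comment 4; each is the maximum number of priority
# points its factor can add, because the factor itself never exceeds 1.0.
weights = {
    "PriorityWeightAge": 1000,
    "PriorityWeightFairshare": 10000,
    "PriorityWeightJobSize": 1000,
    "PriorityWeightPartition": 1000,
    "PriorityWeightQOS": 2,
}

total = sum(weights.values())
shares = {name: weight / total for name, weight in weights.items()}

for name, share in shares.items():
    print(f"{name}: at most {weights[name]} points ({share:.2%} of the maximum)")
```

With these numbers, QOS can move a job by at most 2 points out of a possible 13002, so a site that wants QOS to matter would need a weight comparable to the others.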

> Doesn't Age and Jobsize already get calculated in fair share or is fair share 
> only Consumed resources over the past x days?

By default, these are 0 and are not factored in unless set. Both can be used to 
influence job priority.

Regarding time and age: these are also influenced by PriorityCalcPeriod and PriorityDecayHalfLife.

[3] https://slurm.schedmd.com/slurm.conf.html#OPT_PriorityCalcPeriod
[4] https://slurm.schedmd.com/slurm.conf.html#OPT_PriorityDecayHalfLife
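
As a rough illustration of what the half-life means, the sketch below assumes that recorded usage decays exponentially, so that it is halved after each PriorityDecayHalfLife. The function name and the usage numbers are hypothetical. The real plugin applies the decay in steps, once per PriorityCalcPeriod, rather than continuously.

```python
def decayed_usage(usage, elapsed_days, half_life_days):
    """Usage remaining after elapsed_days, assuming exponential half-life decay."""
    return usage * 0.5 ** (elapsed_days / half_life_days)


# Hypothetical: 1000 CPU-hours of past usage with a 7-day half-life.
for days in (0, 7, 14, 28):
    print(days, decayed_usage(1000.0, days, 7.0))
```

A shorter half-life makes fairshare forget heavy past usage sooner; a longer one keeps it counting against the user for longer.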

An older SLUG presentation covers some example use cases.

[5] https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf

Most of what is covered is drawn from the following two web documents.

[6] https://slurm.schedmd.com/priority_multifactor.html
[7] https://slurm.schedmd.com/fair_tree.html
Comment 7 Jason Booth 2023-05-26 13:52:08 MDT
Jeff, I am just following up on this and resolving it. Please feel free to re-open if you need further information regarding this issue.