Ticket 6875

Summary: Resource Limits in percentage instead of absolute numbers
Product: Slurm Reporter: timeu <uemit.seren>
Component: Configuration Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact    
Priority: ---    
Version: 18.08.6   
Hardware: Linux   
OS: Linux   
Site: IMP

Description timeu 2019-04-17 13:58:12 MDT
We have the following situation: 

We have various node types (high-clock, high-core, high-mem and GPU) in quite different quantities (mostly high-core, only a few high-mem and GPU nodes). We created 4 partitions for these node types, with the default being the one with the most nodes (high-core). 

We now want to create 3 QOS that specify the maximum walltime: 
 - short: 8 hours
 - medium: 2 days
 - long: 8 days

At the same time we want to allow jobs in the short QOS to potentially consume 100%, jobs in the medium QOS 50%, and jobs in the long QOS 20% of the resources of each partition.
I checked the documentation and also tried to google this, but as far as I can tell GrpTRES et al. can only be defined in absolute numbers and not in percentages. 
However, because the partitions have unequal amounts of resources, we cannot get by with only 3 QOSes (short, medium and long); we would need to create 3 x 4 (number of partitions) QOSes to cover our use case. 

So my question is: is there any way to specify resource limits as percentage values (50%) instead of absolute numbers (500 cores), and if not, what would be the best workaround/approach for our use case? 

Thanks in advance
Uemit
Comment 2 Marcin Stolarek 2019-05-21 03:10:12 MDT
Uemit, 

As far as I understand you're looking for an option to configure global hard limits on allocated resources based on wall time. 

If this is the case, the most efficient way I see to achieve this would be to configure three partitions - short, medium, long - with different "MaxTime" and assign all of your nodes to the short partition (the default), 50% of them to medium, and only 20% to long. Instead of separate partitions for high-clock, high-mem, etc., I'd recommend using the "Feature" configuration option for nodes. 
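A rough sketch of what this could look like in slurm.conf (node names, counts, and subset sizes here are illustrative, not a tested configuration):

```
# Tag node types via features instead of dedicated per-type partitions
NodeName=node[001-100] Feature=highcore
NodeName=node[101-104] Feature=highmem
# One partition per walltime class; medium/long only contain a subset of nodes
PartitionName=short  Nodes=node[001-104] MaxTime=08:00:00 Default=YES
PartitionName=medium Nodes=node[001-052] MaxTime=2-00:00:00
PartitionName=long   Nodes=node[001-021] MaxTime=8-00:00:00
```

Users would then pick a node type with e.g. --constraint=highmem rather than a partition.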


If this is not the case - please elaborate a little on what you are trying to achieve. Do you want the 50% limit to be applied per user or per account instead of globally? 

Did you consider using the priority plugin to give shorter jobs a boost instead of "hard" limits? This approach may be beneficial since it generally provides higher overall utilization of resources. 

cheers,
Marcin
Comment 3 timeu 2019-05-21 05:26:02 MDT
@Marcin

Thanks for the reply. 
I don't think that your suggested approach would work for us.
We want distinct partitions for the different node types because we want to avoid jobs overflowing onto the expensive nodes (the high-mem nodes). 
Users have to specifically submit to the corresponding partition if they want to target, for example, the high-mem nodes. 

Our current workaround is to create the MxN QOSes (short, medium and long for each partition) when we create the Slurm cluster, and then have a lua script that rewrites the user-submitted QOS (short, medium, long) to the actual QOS/partition pair (c_short, g_medium, etc.). 
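A minimal sketch of what such a job_submit.lua rewrite can look like (hypothetical code, not our actual script; it assumes partitions named c, m, g and QOSes named like c_short):

```lua
-- Rewrite a generic QOS (short/medium/long) to the per-partition QOS pair.
function slurm_job_submit(job_desc, part_list, submit_uid)
    local generic = { short = true, medium = true, long = true }
    if job_desc.qos ~= nil and generic[job_desc.qos] then
        local part = job_desc.partition or "c"  -- assume c is the default partition
        job_desc.qos = part .. "_" .. job_desc.qos
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```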

Service Class   Prio    TimeLimit    Resource/User        TotalQoSLimit
m_short         1000                 cpu=202,mem=4109G    cpu=404,mem=8218G
g_short         1000                 cpu=14,mem=173G      cpu=28,mem=346G
c_short         1000                 cpu=966,mem=4132G    cpu=1932,mem=8265G
m_medium         500                 cpu=80,mem=1643G     cpu=202,mem=4109G
g_medium         500                 cpu=5,mem=69G        cpu=14,mem=173G
c_medium         500                 cpu=386,mem=1653G    cpu=966,mem=4132G
m_long           100                 cpu=40,mem=821G      cpu=80,mem=1643G
g_long           100                 cpu=2,mem=34G        cpu=5,mem=69G
c_long           100                 cpu=193,mem=826G     cpu=386,mem=1653G
short              0    08:00:00
medium             0    2-00:00:00
long               0    14-00:00:00

Users only submit with the short, medium or long QOS plus a partition (c, m, g), and we rewrite that to the actual QOS. 
For example, if a user submits a job with the medium QOS to the high-mem partition (m), we rewrite it to m_medium. 

This approach works; however, if we could define the resource limits as percentages, we could avoid the MxN qos/partition combinations. 

There is another advantage of having percentages instead of absolute values: 
Our Slurm cluster is not static: we might dynamically add and remove nodes from it. With absolute values we always have to re-calculate/re-generate the QOSes; if we could specify the resource limits in percentages, we would not need to do that. 
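For illustration, the regeneration step we would like to avoid looks roughly like this (plain Python using the CPU totals from the table above; a sketch, not our actual tooling):

```python
# Derive per-QOS GrpTRES CPU limits from partition totals and target fractions.
partition_cpus = {"c": 1932, "g": 28, "m": 404}     # total CPUs per partition
share = {"short": 1.0, "medium": 0.5, "long": 0.2}  # desired fraction per QOS

limits = {
    f"{part}_{qos}": int(total * frac)
    for part, total in partition_cpus.items()
    for qos, frac in share.items()
}

print(limits["c_medium"])  # 966
print(limits["m_long"])    # 80
```

Every time a partition grows or shrinks, all of its derived QOS limits have to be recomputed and pushed with sacctmgr.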

I hope this clarifies our use case.
Comment 5 Marcin Stolarek 2019-05-22 03:05:37 MDT
In this case, the option you can try is to configure GrpTRES limits based on "billing", with a command like:
> sacctmgr create qos medium GrpTRES=billing=50 MaxWall=2-0:0:0

Billing is calculated based on the TRESBillingWeights[1] option, defined per partition. For instance, setting TRESBillingWeights="CPU=0.5" will result in a billing of 50 when 100 CPUs are in use. If you'd like to take other parameters like memory into account, you may find the PriorityFlags=MAX_TRES setting useful. The default behavior calculates billing as the sum over all resources; with MAX_TRES, billing for each resource is calculated separately and the highest value is treated as the final result. 
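As a sanity check of that arithmetic (plain Python, not Slurm code; the weights and usage numbers are made up):

```python
# Billing from TRESBillingWeights: the default sums the per-TRES products,
# while PriorityFlags=MAX_TRES takes the largest single per-TRES product.
weights = {"cpu": 0.5, "mem_gb": 0.25}  # assumed TRESBillingWeights="CPU=0.5,Mem=0.25G"
usage = {"cpu": 100, "mem_gb": 40}      # a job allocating 100 CPUs and 40 GB

per_tres = {tres: weights[tres] * usage[tres] for tres in weights}

billing_default = sum(per_tres.values())   # 50 + 10 = 60
billing_max_tres = max(per_tres.values())  # max(50, 10) = 50

print(billing_default, billing_max_tres)  # 60.0 50.0
```

Choosing a CPU weight of 100 divided by the partition's CPU count would make billing read as a percentage of that partition, so a single GrpTRES=billing=50 QOS then means 50% in every partition.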


I believe this is very close to the percentage-based configuration you've been looking for. Let me know if it works for you. 

cheers,
Marcin 

[1] https://slurm.schedmd.com/slurm.conf.html
Comment 6 Marcin Stolarek 2019-05-29 08:50:15 MDT
Since there were no further questions from you within a week, I'll close this ticket as "info given". Should you need any further information, please do not hesitate to reopen.

cheers,
Marcin