I believe Slurm only supports a single TRESBillingWeights per partition.

We have a GPU partition where each node has 4 Nvidia H100s. The problem is that 132 nodes have 384GB of RAM and 24 nodes have 768GB. We are using PriorityFlags=MAX_TRES. With our calculations we charge 100 per node-hour, but if a user lands on a 768GB node and MAX_TRES picks RAM (rather than CPU or GPUs), the user ends up being charged 200 per node-hour (rough arithmetic is sketched at the end of this post). We use the billing values from Slurm to track usage and charge against a user's allocation.

Can a partition called GPU include partitions called gpu-small and gpu-large?

Can we have two PartitionName=gpu lines, one for the nodes with 384GB of RAM and a second for the 768GB nodes?

Currently we have 8 GPU partitions:

PartitionName=debug-gpu Nodes=h100 MaxTime=0-1 QOS=debug-gpu OverSubscribe=NO PriorityTier=3 DefMemPerCPU=1024 TRESBillingWeights="CPU=0.8125,Mem=0.29583G,GRES/gpu=26"  # 154 nodes
PartitionName=debug-gpu-stdby Nodes=h100 MaxTime=0-1 QOS=debug-gpu OverSubscribe=NO PriorityTier=1 DefMemPerCPU=1024 TRESBillingWeights="CPU=0.8125,Mem=0.29583G,GRES/gpu=26"  # 154 nodes
PartitionName=gpu-h100s Nodes=h100 MaxTime=0-4 QOS=p_h100s_shared OverSubscribe=NO PriorityTier=2 DefMemPerCPU=1024 TRESBillingWeights="CPU=0.8125,Mem=0.29583G,GRES/gpu=26"  # 154 nodes
PartitionName=gpu-h100s-stdby Nodes=h100 MaxTime=0-4 QOS=p_h100s_shared OverSubscribe=NO PriorityTier=1 DefMemPerCPU=1024 TRESBillingWeights="CPU=0.8125,Mem=0.29583G,GRES/gpu=26"  # 154 nodes
PartitionName=gpu-h100 Nodes=h100 MaxTime=2-0 QOS=p_h100_shared OverSubscribe=NO PriorityTier=2 DefMemPerCPU=1024 TRESBillingWeights="CPU=0.8125,Mem=0.29583G,GRES/gpu=26"  # 154 nodes
PartitionName=gpu-h100-stdby Nodes=h100 MaxTime=2-0 QOS=p_h100_shared OverSubscribe=NO PriorityTier=1 DefMemPerCPU=1024 TRESBillingWeights="CPU=0.8125,Mem=0.29583G,GRES/gpu=26"  # 154 nodes
PartitionName=gpu-h100l Nodes=h100 MaxTime=10-0 QOS=p_h100l_shared OverSubscribe=NO PriorityTier=2 DefMemPerCPU=1024 TRESBillingWeights="CPU=0.8125,Mem=0.29583G,GRES/gpu=26"  # 154 nodes
PartitionName=vto Nodes=h100 MaxTime=2-0 QOS=vto OverSubscribe=NO PriorityTier=3 DefMemPerCPU=1024 TRESBillingWeights="CPU=0.8125,Mem=0.29583G,GRES/gpu=26"  # 154 nodes

The reasons for the above:

* debug-gpu = max 1 job per user, for debugging
* gpu-h100s = jobs under 4 hours, max all nodes
* gpu-h100 = jobs under 2 days, max half the nodes
* gpu-h100l = jobs under 10 days, max a quarter of the nodes
* <partname>-stdby = for those without a remaining allocation

We could double the above partitions from 8 to 16, with one set for the 384GB GPU nodes and one for the 768GB nodes. However, we also plan to upgrade some nodes to 1.5TB, so we'd need 24 partitions to cover that. Alternatively, we could calculate the billing ourselves, based on the allocated CPU, GPU, and RAM. Any other suggestions?
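To make the doubling concrete, here is rough arithmetic (a sketch, not Slurm's internal calculation; I'm assuming 128 cores per node, which lines up with 0.8125 * 128 = 26 * 4 = 104):

-- Rough illustration only of PriorityFlags=MAX_TRES with our
-- TRESBillingWeights. The 128-core count per node is an assumption.
local weights = { cpu = 0.8125, mem_gb = 0.29583, gpu = 26 }

local function max_tres_billing(alloc)
  -- MAX_TRES charges the largest weighted node-local TRES.
  return math.max(weights.cpu * alloc.cpus,
                  weights.mem_gb * alloc.mem_gb,
                  weights.gpu * alloc.gpus)
end

-- Whole-node job on a 384GB node: GPUs (104) and memory (~113.6) both land near 100.
print(max_tres_billing({ cpus = 128, mem_gb = 384, gpus = 4 }))
-- Same job on a 768GB node: memory (~227.2) dominates and the charge roughly doubles.
print(max_tres_billing({ cpus = 128, mem_gb = 768, gpus = 4 }))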
(In reply to Bill Broadley from comment #0)
> I believe Slurm only supports a single TRESBillingWeights per partition.

This is correct.

> We have a GPU partition where each node has 4 Nvidia H100s. The problem is
> that 132 nodes have 384GB of RAM and 24 nodes have 768GB. We are using
> PriorityFlags=MAX_TRES. With our calculations we charge 100 per node-hour,
> but if a user lands on a 768GB node and MAX_TRES picks RAM (rather than CPU
> or GPUs), the user ends up being charged 200 per node-hour.

It might be useful to get more details about how you're intending the billing calculation to work out. Are jobs frequently requesting all of the memory on a node, or full exclusivity? I'm wondering in particular whether it would work to lessen the memory weight for all nodes or remove it entirely.

In any case, you will probably want to set a higher node weight on the higher-memory nodes (if you don't have one set already). This will prevent smaller jobs from unnecessarily filling up the higher-memory nodes, keeping them available for larger jobs that can only be satisfied by the higher memory capacity, even if the billing weights would be the same either way. (A sketch is at the end of this comment.)

https://slurm.schedmd.com/slurm.conf.html#OPT_Weight

> Can a partition called GPU include partitions called gpu-small and gpu-large?
>
> Can we have two PartitionName=gpu lines, one for the nodes with 384GB of RAM
> and a second for the 768GB nodes?

While partitions can't be nested, you can set up additional partitions with different billing weights, and this would be my suggestion if it is necessary to have different billing weights for these node types. Also note that you can use 'PartitionName=DEFAULT' lines to simplify the lines for each partition.

https://slurm.schedmd.com/slurm.conf.html#OPT_PartitionName

For example:

PartitionName=DEFAULT Nodes=h100-small OverSubscribe=NO DefMemPerCPU=1024 TRESBillingWeights="CPU=0.8125,Mem=0.29583G,GRES/gpu=26"
PartitionName=debug-gpu-small MaxTime=0-1 QOS=debug-gpu PriorityTier=3
PartitionName=debug-gpu-small-stdby MaxTime=0-1 QOS=debug-gpu PriorityTier=1
. . .
PartitionName=DEFAULT Nodes=h100-med TRESBillingWeights="CPU=0.8125,Mem=0.15G,GRES/gpu=26"
PartitionName=debug-gpu-med MaxTime=0-1 QOS=debug-gpu PriorityTier=3
PartitionName=debug-gpu-med-stdby MaxTime=0-1 QOS=debug-gpu PriorityTier=1
. . .

In this case, users may wish to submit to multiple partitions, keeping in mind that the controller will use the partition offering the earliest start time. If you think this might be a difficult change for your users, you could set up a 'job_submit' script to automatically add the extra partitions to job submissions.

https://slurm.schedmd.com/sbatch.html#OPT_partition
https://slurm.schedmd.com/job_submit_plugins.html

I also noticed while looking through the partitions that it might not be optimal to have so many different partitions. Are you using partition-based preemption or some other functionality that is enabled by having so many overlapping partitions? If not, it might be better to streamline these down to just one partition for each unique 'TRESBillingWeights' needed, and use a job QOS to enforce the max time limits and prioritization.
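Returning to the node weight suggestion, a minimal sketch of what the node definitions might look like (the node names, memory values, and GRES strings here are placeholders for whatever you already have configured; the scheduler prefers lower-weight nodes first):

NodeName=h100-small-[001-132] RealMemory=384000 Gres=gpu:h100:4 Weight=10
NodeName=h100-med-[001-024]   RealMemory=768000 Gres=gpu:h100:4 Weight=20

Let me know if you have any further questions.

Best regards,
Stephen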
> It might be useful to get more details about how you're intending the billing
> calculation to work out. Are jobs frequently requesting all of the memory on
> a node, or full exclusivity? I'm wondering in particular whether it would
> work to lessen the memory weight for all nodes or remove it entirely.

We will end up with 384GB, 768GB, and 1536GB nodes, so reducing the cost of memory or removing it entirely is an option. The node weight is a nice complement, so small jobs will end up on smaller nodes by default.

I did consider making partitions for the 3 memory sizes, but that would balloon the number of GPU partitions from 8 to 24. I could hide much of that from users by modifying their partition list to include the eligible partitions automatically (a rough sketch of what I mean is at the end of this post).

> Are you using partition-based preemption or some other functionality that is
> enabled by having so many overlapping partitions?

No preemption. But we do have a short partition (under 4 hours) that can run on the entire pool of 154 GPU nodes, a standard partition (under 2 days) that can run on half the pool, and a long partition (under 10 days) that can run on a quarter of the pool. Otherwise we could merge those partitions. It's not possible to have a QOS that only impacts jobs under 4 hours of runtime, is there? Or maybe the submit lua script could change the job's QOS based on the requested time limit?
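For the partition-list idea, something like this untested job_submit.lua sketch is what I have in mind (the -small/-med partition names and the mapping table are hypothetical):

-- Untested sketch: expand a generic GPU partition request into the
-- memory-size-specific partitions. The mapping table is hypothetical.
local expand = {
  ["gpu-h100s"] = "gpu-h100s-small,gpu-h100s-med",
  ["gpu-h100"]  = "gpu-h100-small,gpu-h100-med",
  ["gpu-h100l"] = "gpu-h100l-small,gpu-h100l-med",
}

function slurm_job_submit(job_desc, part_list, submit_uid)
  if job_desc.partition ~= nil and expand[job_desc.partition] ~= nil then
    -- Submit to both size-specific partitions; the controller starts the job
    -- in whichever partition offers the earliest start time.
    job_desc.partition = expand[job_desc.partition]
  end
  return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, submit_uid)
  return slurm.SUCCESS
end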
(In reply to Bill Broadley from comment #5)
> No preemption. But we do have a short partition (under 4 hours) that can run
> on the entire pool of 154 GPU nodes, a standard partition (under 2 days)
> that can run on half the pool, and a long partition (under 10 days) that can
> run on a quarter of the pool. Otherwise we could merge those partitions.
> It's not possible to have a QOS that only impacts jobs under 4 hours of
> runtime, is there?

I see you already have various partition QOS's in place, so I assume these already enforce the appropriate resource limits. By adding a 'MaxWall' value, these could instead be used as job QOS's, so instead of distinguishing the types of jobs by partition, you would distinguish them by QOS.

https://slurm.schedmd.com/sacctmgr.html#OPT_MaxWall_2

There are a few ways in which this would be a bit different:

1. While multiple partition options on a single job submission have been available for a while, multiple QOS options were only recently added in 24.11. If users are currently submitting to multiple partitions, they would either need to get used to replacing them with a single QOS (and maybe modifying their job with 'scontrol' if they need to change it), or the cluster would need to be upgraded to 24.11 to get this functionality.

https://slurm.schedmd.com/upgrades.html
https://slurm.schedmd.com/sbatch.html#OPT_qos

2. The partition's 'PriorityTier' is evaluated before job priority, while the QOS priority factor can be enabled to impact job priority. However, with a high enough QOS priority weight relative to the other factors, you can pretty closely mimic the PriorityTier behavior if that is preferred.

https://slurm.schedmd.com/priority_multifactor.html
https://slurm.schedmd.com/slurm.conf.html#OPT_PriorityTier

> Or maybe the submit lua script could change the job's QOS based on the
> requested time limit?

This is a great place to do a lot of clever things, although it's best to get as close as possible to the desired behavior with built-in Slurm functionality so that job_submit is only used for things that can't be done otherwise. In this case, changing the QOS automatically could work for jobs that specify a time limit. However, the job_submit script will not see a default time limit, since most fields reflect only what the user explicitly specified. (Add 'JobSubmitPlugins=require_timelimit' if you want time limits to be required.)

https://slurm.schedmd.com/slurm.conf.html#OPT_JobSubmitPlugins

You may also be able to help users adapt to a QOS-based workflow using a cli_filter script or the cli_filter user defaults plugin.

https://slurm.schedmd.com/cli_filter_plugins.html
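For illustration, an untested job_submit.lua sketch along those lines, using the QOS names from your partition QOS's and the 4-hour/2-day cutoffs from your current MaxTime values. This assumes the matching 'MaxWall' limits have already been set on those QOS's, e.g. with something like 'sacctmgr modify qos where name=p_h100s_shared set MaxWall=04:00:00'.

-- Untested sketch: choose a job QOS from the user-requested time limit.
-- QOS names match the existing partition QOS's; time_limit is in minutes.
local NO_VAL = 4294967294  -- sentinel meaning "no time limit was requested"

function slurm_job_submit(job_desc, part_list, submit_uid)
  if job_desc.qos == nil and job_desc.time_limit ~= nil
     and job_desc.time_limit ~= NO_VAL then
    if job_desc.time_limit <= 4 * 60 then          -- up to 4 hours
      job_desc.qos = "p_h100s_shared"
    elseif job_desc.time_limit <= 2 * 24 * 60 then -- up to 2 days
      job_desc.qos = "p_h100_shared"
    else                                           -- longer jobs
      job_desc.qos = "p_h100l_shared"
    end
  end
  return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, submit_uid)
  return slurm.SUCCESS
end

Best regards,
Stephen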
I hope my previous comments were helpful. Do you have any further questions on this topic?

Best regards,
Stephen