Ticket 5770

Summary: Bank Account Allocations, and Node-Based Accounting
Product: Slurm Reporter: Bill Marmagas <zorba>
Component: Accounting Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: VTech BI Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Bill Marmagas 2018-09-24 13:08:51 MDT
For clarification on versions, our central slurmdbd server is running 17.11.5, while slurmctld and slurmd on our clusters are still at 17.02.11.

This ticket has two accounting questions because I believe they are related.

(1)  Is there anything like "Moab Gold" for Slurm that is current and being maintained?  We would like to be able to provide an allocation of “money” that the user would charge to.  I can see how I could set up bank account limits by setting association-level GrpCPUMins for each account, and adding the safe option to AccountingStorageEnforce.  I have also found some wrapper scripts that are supposed to give a Gold-like command interface to such a setup, but not ones that have been recently maintained.

(2)  Our management wants to bill for node hours instead of CPU hours.  Is there a way to do this in Slurm such that we properly account for node utilization for both exclusive node  and non-exclusive node (i.e., packed) jobs?  Also, in relation to question #1 above, I'm not sure if GrpCPUMins would work for us for setting bank account limits under such a scenario, or whether there may be another alternative to allow that type of node-hour-centric bank account limit.

Thanks.
Comment 1 Jason Booth 2018-09-24 13:44:05 MDT
Hi Bill,

 We will look into this and get back to you with suggestions. I would also like to point out that we take our severity levels very seriously and ask that you set the severity accordingly: severity 1 and severity 2 tickets disrupt work we are currently engaged in and are tied to our service level agreements. The severity should reflect the impact on the system only. In this case, you are asking for migration and configuration assistance, which is best suited to a severity 3 or 4.

Below is a link to the support site which describes ticket severity.

https://www.schedmd.com/support.php

SEVERITY LEVELS
Severity 1 — Major Impact

A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and non-functional due to Slurm problem(s) and no procedural workaround exists.
Severity 2 — High Impact

A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.
Severity 3 — Medium Impact

A Severity 3 issue is a medium-to-low impact problem that includes partial non-critical loss of system access or which impairs some operations on the system but allows the end user to continue to function on the system with workarounds.
Severity 4 — Minor Issues

A Severity 4 issue is a minor issue with limited or no loss in functionality within the customer environment. Severity 4 issues may also be used for recommendations for future product enhancements or modifications.
Comment 2 Bill Marmagas 2018-09-24 14:08:07 MDT
Yes, now that I see the definitions of the severity levels you sent I certainly agree this should not be 2 but instead 3 like you changed it to.  Please excuse this oops; I just got support this month, and didn’t realize both 1 and 2 are outage-level severities.
Comment 3 Jason Booth 2018-09-24 14:13:42 MDT
> Please excuse this oops; I just got support this month, and didn’t realize both 1 and 2 are outage-level severities.

Not a problem, Bill. We generally exceed the SLA, so you will almost always see same-day communication on the issues you log. I cannot promise that will be the case every time, but we are fairly responsive even for severity 3 and 4 tickets.

Best regards,
Jason
Comment 5 Jason Booth 2018-09-24 16:18:07 MDT
Hi Bill,

First off, welcome to SchedMD. We are excited to have you working with us. 

> (1)  Is there anything like "Moab Gold" for Slurm that is current and being maintained?

Slurm has an accounting mechanism which is maintained by SchedMD.
https://slurm.schedmd.com/accounting.html


>  We would like to be able to provide an allocation of “money” that the user would charge to.  I can see how I could set up bank account limits by setting association-level GrpCPUMins for each account, and adding the safe option to AccountingStorageEnforce.  

Slurm refers to these as resource limits, which you have already discovered. There is no one-to-one mapping from MAM/Gold to Slurm's accounting features; however, resource limits let you assign an allotment and either reset the values each month or raise the maximum each time funds are allocated.

https://slurm.schedmd.com/resource_limits.html

You can set GrpTRESMins with a TRES of type CPU. The trackable resources are listed here:

https://slurm.schedmd.com/tres.html

       GrpTRESMins=<TRES=max TRES minutes,...>
              The total number of TRES minutes that can possibly be used by past, present and future jobs running from this association and its children.  To clear a  previously set value use the modify command with a new value of -1 for each TRES id.

              NOTE: This limit is not enforced if set on the root association of a cluster.  So even though it may appear in sacctmgr output, it will not be enforced.

              ALSO  NOTE:  This  limit  only  applies  when  using  the Priority Multifactor plugin.  The time is decayed using the value of PriorityDecayHalfLife or PriorityUsageResetPeriod as set in the slurm.conf.  When this limit is reached all associated jobs running will be killed and all future jobs submitted with associations in the group will be delayed until they are able to run inside the limit.
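To tie this back to a bank of "money": one approach is to convert a dollar allocation into CPU-minutes at your site's charge rate and apply the result as the GrpTRESMins=cpu cap. A minimal sketch; the $0.05/CPU-hour rate and the account name are hypothetical examples, not values from this ticket:

```python
# Sketch: translate a dollar allocation into a GrpTRESMins=cpu cap.
# The charge rate and account name below are illustrative, not site defaults.

def dollars_to_cpu_minutes(dollars, rate_per_cpu_hour):
    """Convert a dollar budget into whole CPU-minutes."""
    cpu_hours = dollars / rate_per_cpu_hour
    return int(cpu_hours * 60)

def sacctmgr_limit_command(account, cpu_minutes):
    """Build the sacctmgr command that applies the cap to an account."""
    return ("sacctmgr modify account {} set GrpTRESMins=cpu={}"
            .format(account, cpu_minutes))

cap = dollars_to_cpu_minutes(500.0, rate_per_cpu_hour=0.05)  # $500 at $0.05/CPU-hour
print(cap)                                   # 600000 CPU-minutes
print(sacctmgr_limit_command("biostats", cap))
```

Raising the cap when new funds arrive is then just re-running the modify with a larger number.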



You can report on the usage with sshare. For example: 'sshare -u jason'

$ sshare  -u jason
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          0.000000      120664      1.000000            
 staff                                   1    0.250000           0      0.000000            
 thisisalongac12                         1    0.250000      116812      0.968076            
  thisisalongac12         jason          1    0.166667           0      0.000000   0.750000 
  thisisalongac12         jason          1    0.166667      116755      0.999512   0.125000 
  thisisalongac12         jason          1    0.166667          56      0.000488   0.250000 


https://slurm.schedmd.com/sshare.html


> I have also found some wrapper scripts that are supposed to give a Gold-like command interface to such a setup, but not ones that have been recently maintained.

I assume you are referring to https://github.com/jcftang/slurm-bank

SchedMD does not maintain these, so we are not sure how accurate they are. I did run a few tests and can say that the project shows some promise.

>(2)  Our management wants to bill for node hours instead of CPU hours.  Is there a way to do this in Slurm such that we properly account for node utilization for both exclusive node and non-exclusive node (i.e., packed) jobs?  

It is not clear what you mean by node hours. Is this just the total time a job spends on any given node, in hours, or a fraction of the total CPU time used on the given node, in hours? Accounting is tracked in cpu-seconds by default and is influenced by TRESBillingWeights.

sshare
       Raw Usage
              The number of tres-seconds (cpu-seconds if TRESBillingWeights is not defined) of all the jobs charged to the account or user. This number will decay over time when PriorityDecayHalfLife is defined.
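For reference, when TRESBillingWeights is defined the billable units are, by default (without the MAX_TRES priority flag), a weighted combination of the TRES a job allocates, and the tres-seconds above are those units multiplied by elapsed time. A rough sketch of that arithmetic, using made-up weights:

```python
# Sketch of how TRESBillingWeights turns allocated TRES into billable units.
# The weights here (CPU=1.0, Mem=0.25 per GB) are illustrative only.

def billable_units(allocated, weights):
    """Weighted sum of allocated TRES counts (the default combination)."""
    return sum(weights.get(tres, 0.0) * count
               for tres, count in allocated.items())

job = {"cpu": 4, "mem_gb": 16}                 # a 4-CPU, 16 GB job
weights = {"cpu": 1.0, "mem_gb": 0.25}
units = billable_units(job, weights)           # 4*1.0 + 16*0.25 = 8.0
seconds = 3600
print(units * seconds)                         # billable tres-seconds for one hour
```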


> Also, in relation to question #1 above, I'm not sure if GrpCPUMins would work for us for setting bank account limits under such a scenario, or whether there may be another alternative to allow that type of node-hour-centric bank account limit.

It may be possible but I would need you to define what a node hour would look like and how it changes depending on the number of CPU/Procs.

-Jason
Comment 6 Jason Booth 2018-09-25 09:13:31 MDT
Hi Bill,

Here is another tool the community has created with bank-like features (below). Most sites roll their own solution. Sites coming from MAM have also turned on decay (expiration of the credits) through PriorityDecayHalfLife, or disabled it with the QOS NoDecay flag, and reset usage either manually ("sacctmgr modify <entity> set RawUsage=0") or on a periodic basis with PriorityUsageResetPeriod.


https://groups.google.com/forum/#!searchin/slurm-users/bank%7Csort:date/slurm-users/wX0BC2RF5lY/hZb4K3qdCAAJ

https://github.com/barrymoo/slurm-bank


Let me know if this helps satisfy your request.

-Jason
Comment 7 Bill Marmagas 2018-09-25 11:15:10 MDT
Bill Marmagas
Senior HPC Systems Administrator
Biocomplexity Institute of Virginia Tech

> On Sep 24, 2018, at 6:18 PM, bugs@schedmd.com wrote:
> 
> You can set a GrpTRESMins with a TRES of type CPU.
> 
> ALSO NOTE: This limit only applies when using the Priority Multifactor plugin. The time is decayed using the value of PriorityDecayHalfLife or PriorityUsageResetPeriod as set in the slurm.conf.
We do currently run accounting, and use sacct to report usage, but we are not currently using the account “allocations” or resource limits.  A couple of follow-up questions:

Is GrpTRESMins=cpu equivalent to GrpCPUMins?  I’m interested in adding the AccountingStorageEnforce safe feature to "ensure a job will only be launched when using an association or qos that has a GrpCPUMins limit set if the job will be able to run to completion,” instead of having jobs getting killed when the limit is reached.  In other words, does that safe option also work with GrpTRESMins=cpu?  Conversely, does GrpCPUMins require Priority Multifactor plugin?  (We may want to go to that anyway, just want to be clear on the requirements.)

Also, does “the time is decayed” refer only to priority calculations, or does it also affect the GrpTRESMins=cpu debiting?

> 
> I assume you are referring to https://github.com/jcftang/slurm-bank
> 
> SchedMD does not maintain these so we are not sure how accurate they are. I did run a few tests and can say that the project does have some promise.
Yes, that is the tool I found, but it hasn’t been updated in a couple of years.  (I see in your next reply, though, that you found a tool by the same name that is newer, and looks a lot more promising to me.)

I’d prefer to use SchedMD maintained utilities as long as we can set and report on the usage of allocations, i.e., resource limits, adequately.

> >(2) Our management wants to bill for node hours instead of CPU hours. Is there a way to do this in Slurm such that we properly account for node utilization for both exclusive node and non-exclusive node (i.e., packed) jobs?
> 
> It is not clear what you mean by node hours. Is this just the total time a job spends on any given node in hours or a fraction of the total CPU time used for the given node in hours?
> 
> It may be possible but I would need you to define what a node hour would look like and how it changes depending on the number of CPU/Procs.

I got clarification on how we are trying to use node hours.  Node hours to us means the total time a job spends on any given node, regardless of the number of CPUs allocated.  The tricky part is accounting for packed jobs: where a single user has multiple jobs on a node, we want to only count one node for all the jobs (which themselves may have different elapsed times); in other words, some way to roll up the packed jobs under this concept of node hours.


> 
> -Jason
> 
Comment 8 Bill Marmagas 2018-09-25 11:36:33 MDT
> On Sep 25, 2018, at 11:13 AM, bugs@schedmd.com wrote:
> 
> Here is another tool which the community has created with bank-like features (below).
> 
> https://github.com/barrymoo/slurm-bank
Thanks for sending the link to that tool.

I’d rather not automatically decay since the resource utilization will be tied to real money (our bank accounts represent fund numbers), but it looks like from the readme for that tool that I could use:

PriorityDecayHalfLife=0-00:00:00
PriorityUsageResetPeriod=NONE




Comment 9 Jason Booth 2018-09-25 13:58:33 MDT
> Is GrpTRESMins=cpu equivalent to GrpCPUMins?  
Yes. When you set GrpCPUMins this will also set GrpTRESMins=cpu.

$ sacctmgr show qos test format=GrpCPUMins,Name,GrpTRESMins
 GrpCPUMins       Name   GrpTRESMins 
----------- ---------- ------------- 
        300   test       cpu=300 


$ sacctmgr -i modify qos test set GrpTRESMins=cpu=301
 Modified qos...
  test
$ sacctmgr show qos test format=GrpCPUMins,Name,GrpTRESMins
 GrpCPUMins       Name   GrpTRESMins 
----------- ---------- ------------- 
        301   test       cpu=301 


> I’m interested in adding the AccountingStorageEnforce safe feature to "ensure a job will only be launched when using an association or qos that has a GrpCPUMins limit set if the job will be able to run to completion,” instead of having jobs getting killed when the limit is reached.  In other words, does that safe option also work with GrpTRESMins=cpu? Conversely, does GrpCPUMins require Priority Multifactor plugin?  (We may want to go to that anyway, just want to be clear on the requirements.)

Setting GrpTRESMins will also set GrpCPUMins, and it works the same way with regard to the safe feature. Also, GrpCPUMins does require the Priority Multifactor plugin. See the additional note in the man page, or just below.

       GrpTRESMins=<TRES=max TRES minutes,...>
              The total number of TRES minutes that can possibly be used by past, present and future jobs running from this association and its children.  To clear a previously set value use the modify command with a new value of -1 for each TRES id.

              NOTE: This limit is not enforced if set on the root association of a cluster.  So even though it may appear in sacctmgr output, it will not be enforced.

              ALSO  NOTE:  This  limit  only applies when using the Priority Multifactor plugin.  The time is decayed using the value of PriorityDecayHalfLife or PriorityUsageResetPeriod as set in the slurm.conf.  When this limit is reached all associated jobs running will be killed and all future jobs submitted with associations in the group will be
              delayed until they are able to run inside the limit.



> Also, does “the time is decayed” refer only to priority calculations, or does it also affect the GrpTRESMins=cpu debiting?

Decay refers to both the priority calculation (see the decay factor at https://slurm.schedmd.com/priority_multifactor.html) and the rawusage. GrpTRESMins is just a limit, not the usage.

The parameters that influence this (mentioned previously) are:
PriorityDecayHalfLife
PriorityUsageResetPeriod
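To make the decay concrete: recorded usage halves every PriorityDecayHalfLife period, i.e. the remaining usage after t seconds is rawusage * 0.5 ** (t / halflife). A small sketch of that arithmetic (the half-life and usage values are illustrative):

```python
# Sketch: exponential decay of rawusage under PriorityDecayHalfLife.
# A one-week half-life and a usage of 120000 are example values only.

def decayed_usage(rawusage, elapsed_seconds, halflife_seconds):
    """Usage remaining after exponential decay with the given half-life."""
    if halflife_seconds == 0:          # PriorityDecayHalfLife=0 disables decay
        return rawusage
    return rawusage * 0.5 ** (elapsed_seconds / halflife_seconds)

week = 7 * 24 * 3600
print(decayed_usage(120000, week, week))       # 60000.0 after one half-life
print(decayed_usage(120000, 2 * week, week))   # 30000.0 after two
print(decayed_usage(120000, week, 0))          # 120000 with decay disabled
```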



> I got clarification on how we are trying to use node hours.  Node hours to us means the total time a job spends on any given node, regardless of the number of CPUs allocated.  The tricky part is accounting for packed jobs: where a single user has multiple jobs on a node, we want to only count one node for all the jobs (which themselves may have different elapsed times); in other words, some way to roll up the packed jobs under this concept of node hours.

This feature does not currently exist in Slurm, so you would need to come up with a solution to handle this use case yourself. You could query sacct and parse the information:
sacct --format=JobID,NodeList,Start,End

You may also want to consider writing your own accounting plugin if you need to capture your specific use case.
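One way to script such a rollup: treat each job as a (node, start, end) interval and merge overlapping intervals per node, so packed jobs on the same node are counted once. A sketch assuming pipe-delimited sacct output (e.g. with -P), single-node jobs, and ISO timestamps; the field layout and sample rows are illustrative:

```python
# Sketch: roll up "node hours" from sacct-style rows, merging overlapping
# jobs on the same node so packed jobs are only counted once per node.
from datetime import datetime

def parse_rows(lines):
    """Yield (node, start, end) from 'JobID|NodeList|Start|End' rows."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    for line in lines:
        jobid, node, start, end = line.strip().split("|")
        yield node, datetime.strptime(start, fmt), datetime.strptime(end, fmt)

def node_hours(rows):
    """Merge overlapping intervals per node, then sum the merged lengths."""
    per_node = {}
    for node, start, end in rows:
        per_node.setdefault(node, []).append((start, end))
    total = 0.0
    for intervals in per_node.values():
        intervals.sort()
        cur_s, cur_e = intervals[0]
        for s, e in intervals[1:]:
            if s <= cur_e:                      # overlaps: extend the window
                cur_e = max(cur_e, e)
            else:                               # gap: close out the window
                total += (cur_e - cur_s).total_seconds() / 3600
                cur_s, cur_e = s, e
        total += (cur_e - cur_s).total_seconds() / 3600
    return total

sample = [
    "101|node1|2018-09-25T08:00:00|2018-09-25T10:00:00",
    "102|node1|2018-09-25T09:00:00|2018-09-25T11:00:00",  # packed, overlaps 101
    "103|node2|2018-09-25T08:00:00|2018-09-25T09:00:00",
]
print(node_hours(parse_rows(sample)))  # 4.0 (3h merged on node1 + 1h on node2)
```

Multi-node jobs would need the NodeList expanded to individual nodes first (e.g. with scontrol show hostnames) before feeding rows in.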
Comment 10 Bill Marmagas 2018-09-25 16:16:01 MDT
> On Sep 25, 2018, at 3:58 PM, bugs@schedmd.com wrote:
> 
> Setting GrpTRESMins will also set GrpCPUMins, and it works the same way with regard to the safe feature. Also, GrpCPUMins does require the Priority Multifactor plugin.

Perfect, I was hoping GrpCPUMins and GrpTRESMins=cpu worked the same.

> 
> Decay refers to both the priority calculation (see https://slurm.schedmd.com/priority_multifactor.html) and the rawusage. GrpTRESMins is just a limit, not the usage.

Ah, of course, decay affects the rawusage, not the limit.  That makes sense.  So, for example, if we wanted to prevent accounts from charging above a certain monthly cpu minute usage number, set by the resource limit, we might use something like:

PriorityDecayHalfLife=0
PriorityUsageResetPeriod=MONTHLY


> 
> This feature does not currently exist in Slurm, so you would need to come up with a solution to handle this use case yourself.


Thanks.  I realize node time is a somewhat unusual way to charge compared to CPU time, but management wants to capture something more representative of the value/cost of the nodes.


Comment 11 Jason Booth 2018-09-25 16:37:29 MDT
Hi Bill,

 This is correct. If PriorityDecayHalfLife is set to 0, PriorityUsageResetPeriod must be set to some interval; in this case, MONTHLY is used.

PriorityDecayHalfLife=0
PriorityUsageResetPeriod=MONTHLY

Was there anything else you need for this issue?

-Jason
Comment 12 Bill Marmagas 2018-09-26 07:22:35 MDT
> On Sep 25, 2018, at 6:37 PM, bugs@schedmd.com wrote:
> 
> Was there anything else you need for this issue?
I think I have everything I need now.  Thanks!


> -Jason
> 