Hello,

We have a cluster running in AWS with a few different partitions that currently use specific EC2 instances (cpu/mem/gpu optimized):
--
$ scontrol show partitions | grep -i PartitionName
PartitionName=C-16Cpu-30GB
PartitionName=C-36Cpu-69GB
PartitionName=C-72Cpu-139GB
PartitionName=M-16Cpu-123GB
PartitionName=M-48Cpu-371GB
PartitionName=M-96Cpu-742GB
PartitionName=G-1GPU-8Cpu-58GB
PartitionName=G-4GPU-32Cpu-235GB
PartitionName=G-8GPU-64Cpu-471GB
--
I have to onboard new users that are not part of our business unit, and I need to track and charge back their usage for each of these partitions/instance_types. Each of these partitions maps to a specific instance type and has an AWS cost associated with it. Is there a way in Slurm to get a report of a given user's usage of each of these partitions/instances so that I can then calculate the cost for chargeback purposes?

Currently I can only pull high-level metrics for a given user:
--
$ sreport cluster AccountUtilizationByUser start=01/01/21 end=01/31/21 | egrep -i 'Cluster|petr'
Cluster/Account/User Utilization 2021-01-01T00:00:00 - 2021-01-30T23:59:59 (2592000 secs)
  Cluster   Account     Login  Proper Name     Used   Energy
slurm-ma+  titanium       xxx          xxx  4011033        0

$ sreport cluster AccountUtilizationByUser --tres="gres/gpu" start=01/01/21 end=01/31/21 | egrep -i 'Cluster|petr'
Cluster/Account/User Utilization 2021-01-01T00:00:00 - 2021-01-30T23:59:59 (2592000 secs)
  Cluster   Account     Login  Proper Name  TRES Name    Used
slurm-ma+  titanium       xxx          xxx   gres/gpu  501379
--
However, this does not tell me how many hours this user used in a specific partition. Is there a way to get this data, or a better approach to doing chargeback for users in this type of environment? I would really appreciate your input.

Thanks,
-Simran
Hi Simran,

To get the information you want for jobs that have already run, I would recommend using 'sacct' to pull data on jobs that meet the criteria you're trying to report on. The sacct command retrieves information about jobs that have run and been recorded in the database. You can filter on things like the user and partition, along with a date range, to get just the information you want. Here's an example of what a query might look like on my test system:

sacct --starttime=2021-02-10 --endtime=2021-02-11 --partition debug --user user1

You can also control the fields that are displayed. You can see the fields that are available by looking at 'sacct -e' or 'man sacct'.

Going forward, you may want to consider using a Workload Characterization Key (WCKey) to attach a unique characteristic to jobs that you can then report on with sreport. You can create a WCKey that corresponds to each partition you have, and then jobs with that WCKey/partition combination will be reported correctly. You can rely on users to request the correct WCKey when they submit, but the better option would probably be to create a submit filter that adds the correct WCKey based on the partition requested by a job.

In order to use WCKeys you would need to add a couple of lines to your slurm.conf:

AccountingStorageEnforce=wckey (this may already have other entries)
TrackWCKey=yes

In your slurmdbd.conf file you would also need to add:

TrackWCKey=yes

You would also need to add WCKeys to your users with sacctmgr, like this:

sacctmgr add user user1 wckey=partition1_key

There is more information about using WCKeys in the documentation here:
https://slurm.schedmd.com/wckey.html

Once you have jobs running with a WCKey associated with them, you should be able to run reports that show the data you're looking for.
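[Editor's note: as an illustration of the sacct route above, here is a sketch of how raw per-job rows could be totaled into per-user, per-partition core-hours. The sacct flags shown in the comment are standard (`-X` for allocations only, `--parsable2` for pipe-delimited output); the sample data is made up to keep the sketch self-contained.]

```shell
# Sketch: aggregate per-user, per-partition core-hours from sacct output.
# In production you would pipe real data, e.g.:
#   sacct -a -X -S 2021-01-01 -E 2021-02-01 --parsable2 -n \
#         -o user,partition,alloccpus,elapsedraw
# The sample below stands in for that output (user|partition|cpus|seconds).
sample='petr|C-16Cpu-30GB|16|3600
petr|C-16Cpu-30GB|16|1800
petr|G-1GPU-8Cpu-58GB|8|7200'

echo "$sample" | awk -F'|' '
    { secs[$1 "|" $2] += $3 * $4 }          # core-seconds per user+partition
    END {
        for (k in secs)
            printf "%s core-hours=%.2f\n", k, secs[k] / 3600
    }' | sort
```

With the sample rows this prints 24.00 core-hours for the C partition and 16.00 for the G partition; multiplying each total by a per-partition rate would give a chargeback figure.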
The two reports in particular that should work are:

sreport cluster UserUtilizationByWCKey
sreport cluster WCKeyUtilizationByUser

There are also job reports that don't sound as relevant for what you are asking, but I'll point them out:

sreport job SizesByAccountAndWckey
sreport job SizesByWckey

Let me know if this sounds like it will work for you, or if you have any questions about implementing WCKeys.

Thanks,
Ben
Hi Ben,

Thanks for your response. I like the sreport approach since it aggregates the hours in its output, unlike the sacct output, which makes it a bit easier for me to report in a monthly chargeback model. The WCKey approach seems very interesting and something I would like to pursue. If I understand this correctly, I would have a unique WCKey defined for each partition, enforce this via our submit script, and add the appropriate WCKey when the job is submitted, depending on which partition is being used. Then I should be able to get the aggregated hours spent by a given user with a specific WCKey (EC2 instance)? We can test this in our sandbox and see if it would work for us.

Once we have this implemented and it gives us the info we need, is this something that can be queried via slurmrestd, or do we need to run the sreport commands manually? We need to automate the chargeback, so I'm thinking about the best approach.

Thanks,
-Simran
Also, I am assuming we can use our job_submit.lua script to inject this WCKey based on which partition is being used (e.g., C-16Cpu-30GB injects --wckey=c5.4xlarge). We don't need the user to provide this field; we just enforce it from our Lua script, and we don't need to define any default WCKey when the user is added to Slurm either. Let me know if I am missing something.

Thanks,
-Simran
Hi Simran,

Your understanding is correct: with unique WCKeys assigned to jobs in different partitions, you can use sreport to get reports on hours spent by a specific user with a given WCKey.

The WCKey is an attribute that shows up on jobs, so you should be able to see it when you query jobs via slurmrestd. This page shows the attributes (including WCKey) returned when querying a job:
https://slurm.schedmd.com/rest_api.html#slurmctldGetJob

That's right, you can use your job_submit.lua script to add the WCKey, so you don't need to define a default WCKey for each user.

Thanks,
Ben
Hi Ben,

Would something like this in our job submit Lua script work for what we have discussed:

if (job_desc.partition == 'C-16Cpu-30GB') then
    job_desc.wckey = 'c5.4xlarge'
elseif (job_desc.partition == 'C-36Cpu-69GB') then
    job_desc.wckey = 'c5.9xlarge'
elseif (job_desc.partition == 'C-72Cpu-139GB') then
    job_desc.wckey = 'c5.18xlarge'
elseif (job_desc.partition == 'M-16Cpu-123GB') then
    job_desc.wckey = 'r5.4xlarge'
elseif (job_desc.partition == 'M-48Cpu-371GB') then
    job_desc.wckey = 'r5.12xlarge'
elseif (job_desc.partition == 'M-96Cpu-742GB') then
    job_desc.wckey = 'r5.24xlarge'
elseif (job_desc.partition == 'G-1GPU-8Cpu-58GB') then
    job_desc.wckey = 'p3.2xlarge'
elseif (job_desc.partition == 'G-4GPU-32Cpu-235GB') then
    job_desc.wckey = 'p3.8xlarge'
else (job_desc.partition == 'G-8GPU-64Cpu-471GB')
    job_desc.wckey = 'p3.16xlarge'
end

Not sure if there is an easier way to achieve this.

Regards,
-Simran
We enabled this capability and updated our job submit Lua script, but all jobs seem to be getting * for the WCKey. We might be missing something in our Lua script and would appreciate any feedback you can provide to set this up correctly:
--
simran@spcdp-usw2-1104:~$ sbatch batch-test.sh
Submitted batch job 1452
simran@spcdp-usw2-1104:~$ squeue -l -u simran
Tue Feb 16 00:40:20 2021
  JOBID  PARTITION     NAME    USER     STATE  TIME  TIME_LIMI  NODES  NODELIST(REASON)
   1452  C-16Cpu-3  simtest  simran  CONFIGUR  0:01  UNLIMITED      1  spcdp-usw2-0004
simran@spcdp-usw2-1104:~$ scontrol show job 1452 | grep -i wckey
   Priority=4294901757 Nice=0 Account=palladium QOS=special WCKey=*
--
Here is our full job submit Lua script:
--
$ cat job_submit.lua
--[[
 Custom job_submit script for Apollo Deep Learning Environment
 - only allow users to submit GPU jobs with --gres=gpu:<> flag
--]]

function slurm_job_submit(job_desc, part_list, submit_uid)
    if (job_desc.partition == 'G-1GPU-8Cpu-58GB' or
        job_desc.partition == 'G-4GPU-32Cpu-235GB' or
        job_desc.partition == 'G-8GPU-64Cpu-471GB') and
       (not job_desc.gres or
        (job_desc.gres and not (string.match(job_desc.gres, "gpu:%d+") or
                                string.match(job_desc.gres, "gpu:.-:%d+")))) then
        slurm.log_info("slurm_job_submit: GPU job submitted by user_id:%d rejected - no GPU resources specified", job_desc.user_id)
        slurm.user_msg("Invalid submission of GPU job. Jobs to the GPU partition must specify GPU resources, (for example --gres=gpu:1 or --gres=gpu:4)")
        return 2072
    end

    if (job_desc.partition == 'G-1GPU-8Cpu-58GB' or
        job_desc.partition == 'G-4GPU-32Cpu-235GB' or
        job_desc.partition == 'G-8GPU-64Cpu-471GB') and job_desc.qos then
        job_desc.qos = ''
        slurm.log_info("slurm_job_submit: Set user default QOS")
        slurm.user_msg("QOS is set automatically")
    end

    if (job_desc.partition == 'C-16Cpu-30GB') then
        job_desc.wckey = 'c5.4xlarge'
    elseif (job_desc.partition == 'C-36Cpu-69GB') then
        job_desc.wckey = 'c5.9xlarge'
    elseif (job_desc.partition == 'C-72Cpu-139GB') then
        job_desc.wckey = 'c5.18xlarge'
    elseif (job_desc.partition == 'M-16Cpu-123GB') then
        job_desc.wckey = 'r5.4xlarge'
    elseif (job_desc.partition == 'M-48Cpu-371GB') then
        job_desc.wckey = 'r5.12xlarge'
    elseif (job_desc.partition == 'M-96Cpu-742GB') then
        job_desc.wckey = 'r5.24xlarge'
    elseif (job_desc.partition == 'G-1GPU-8Cpu-58GB') then
        job_desc.wckey = 'p3.2xlarge'
    elseif (job_desc.partition == 'G-4GPU-32Cpu-235GB') then
        job_desc.wckey = 'p3.8xlarge'
    else (job_desc.partition == 'G-8GPU-64Cpu-471GB')
        job_desc.wckey = 'p3.16xlarge'
    end

    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

slurm.log_info("initialized")
return slurm.SUCCESS
--
Thanks,
-Simran
Ben,

Looks like this was related to a typo in the code where we were missing a 'then' statement and using 'else' instead of 'elseif'. We have fixed this, and now I see the correct WCKey being set when the job runs. Please let us know if there is a better way to assign the WCKey than the if/elseif chain.

However, now that each job has the correct key assigned, I still don't see it being reported in sreport. Is there a delay on when sreport will have this information, or am I missing something further?
--
Successfully ran and completed a dummy sleep 60 job:

$ scontrol show job 1466
JobId=1466 JobName=test-script.sh
   UserId=simran(85174) GroupId=dialout(20) MCS_label=N/A
   Priority=4294901743 Nice=0 Account=palladium QOS=special WCKey=p3.2xlarge
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:01 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2021-02-16T19:25:06 EligibleTime=2021-02-16T19:25:06
   AccrueTime=2021-02-16T19:25:06
   StartTime=2021-02-16T19:31:44 EndTime=2021-02-16T19:32:45 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-16T19:25:06
   Partition=G-1GPU-8Cpu-58GB AllocNode:Sid=spcdp-usw2-1104:18165
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=spcdp-usw2-0676
   BatchHost=spcdp-usw2-0676
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=14736M,node=1,billing=2,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=7368M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/simran/test-script.sh
   WorkDir=/home/simran
   StdErr=/home/simran/slurm-1466.out
   StdIn=/dev/null
   StdOut=/home/simran/slurm-1466.out
   Power=
   TresPerNode=gpu:1
   MailUser=(null) MailType=NONE

Confirmed that the correct WCKey is attached:

$ scontrol show job 1466 | grep -i wckey
   Priority=4294901743 Nice=0 Account=palladium QOS=special WCKey=p3.2xlarge

However, none of the sreport outputs are showing the details of this WCKey:

# sreport cluster WCKeyUtilizationByUser start=2/1/21 end=2/30/21
--------------------------------------------------------------------------------
Cluster/WCKey/User Utilization 2021-02-01T00:00:00 - 2021-02-16T19:59:59 (1368000 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster           WCKey     Login     Proper Name     Used
--------- --------------- --------- --------------- --------
slurm-ma+               *                                205
slurm-ma+               *      jhui             Hui      200
slurm-ma+               *    ravih1            Ravi        4
slurm-ma+               *    simran         Hansrai        1

# sreport cluster UserUtilizationByWCKey start=2/1/21 end=2/30/21
--------------------------------------------------------------------------------
Cluster/User/WCKey Utilization 2021-02-01T00:00:00 - 2021-02-16T19:59:59 (1368000 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name           WCKey     Used
--------- --------- --------------- --------------- --------
slurm-ma+      jhui             Hui               *      200
slurm-ma+    ravih1            Ravi               *        4
slurm-ma+    simran         Hansrai               *        1
--
What am I missing here?

Thanks,
-Simran
Hi Simran -

Please be aware that we treat bug severity levels very seriously, since attention to them impacts development on our side.
https://www.schedmd.com/support.php

Severity 2 - High Impact
A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users, with adverse impact to end user interaction with the system.

Ben is currently out of the office due to weather-related issues in the US. If this is production-impacting, I can have someone else look at this today. If not, then please wait for his reply.
Jason,

I have updated the severity based on your feedback. Even though this is not impacting production services right now, it is limiting us from onboarding any new users that are not part of our approved business unit. It is OK if I get a response in the next day or two, but this will eventually become critical for us. For now, I have lowered the priority and will wait for a response from Ben, now that I know he is not in the office. Thanks for the clarification and your support.

Regards,
-Simran
Hi Simran,

I'll try to help while Ben is out. Could you verify that TrackWCKey=yes is set in both slurm.conf and slurmdbd.conf? If they both have it, could you paste the output of:

$ scontrol show config | grep -i wckey
$ sacctmgr show config | grep -i wckey

Although it seems you have waited long enough, please also note that, unlike sacct information, the sreport data is updated/aggregated on an hourly basis. Are you still getting only * WCKeys, or are the right WCKeys shown now?

Regards,
Albert
Hi Simran,

My apologies for the delayed response. As Jason mentioned, I've been dealing with weather-related power outages. It looks like Albert has jumped in to help (thank you). I was going to respond with the same things, but I have one addition to make. Can I have you look at the information reported for a job with sacct?

sacct -j 1466 -o jobid,jobname,partition,account,wckey

Thanks,
Ben
Hi Ben,

No worries, hope you are safe and doing well. Looks like I was missing the hourly-update context for sreport. I am able to see the updated keys now:
--
simran@spcdp-usw2-1104:~$ sacct -j 1466 -o jobid,jobname,partition,account,wckey
       JobID    JobName  Partition    Account      WCKey
------------ ---------- ---------- ---------- ----------
1466         test-scri+ G-1GPU-8C+  palladium p3.2xlarge
1466.batch        batch             palladium

simran@spcdp-usw2-1104:~$ sreport cluster WCKeyUtilizationByUser start=2/1/21 end=2/29/21
--------------------------------------------------------------------------------
Cluster/WCKey/User Utilization 2021-02-01T00:00:00 - 2021-02-17T17:59:59 (1447200 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster           WCKey     Login     Proper Name     Used
--------- --------------- --------- --------------- --------
slurm-ma+               *                                208
slurm-ma+               *      jhui             Hui      200
slurm-ma+               *    ravih1            Ravi        4
slurm-ma+               *    simran         Hansrai        4
slurm-ma+      c5.4xlarge                                 28
slurm-ma+      c5.4xlarge    ravih1            Ravi        9
slurm-ma+      c5.4xlarge    simran         Hansrai       20
slurm-ma+      p3.2xlarge                                242
slurm-ma+      p3.2xlarge      jhui             Hui      202
slurm-ma+      p3.2xlarge    ravih1            Ravi       27
slurm-ma+      p3.2xlarge    simran         Hansrai       13
--
Another question: if a user job gets submitted in Feb but ends in March, where will this usage be reflected if I am running this query on a monthly basis for chargeback? I am hoping that since it ends in March, my monthly March dump will capture that usage.

Thanks,
-Simran
I'm glad to see that things look like they're working as expected after the rollup.

For jobs whose usage spans time periods, the usage should be reported in the time period in which it occurred. In other words, the usage that occurred in Feb will be reported as happening in Feb, and the usage that happened in Mar will be reported for Mar. The hourly rollup looks at the usage for that period of time and accumulates it in a table. The same thing also happens on a daily and monthly basis to keep the usage for a given time period accurate.

Let me know if you have any questions about this.

Thanks,
Ben
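[Editor's note: to close the loop on the chargeback goal, here is a sketch of turning the monthly per-WCKey CPU-minutes from sreport (as in the WCKeyUtilizationByUser output above) into dollar amounts. The rates, vCPU counts, and pro-rating-by-CPU-share model below are placeholders chosen for illustration, not real billing figures; substitute your actual AWS prices and cost model.]

```shell
# Sketch: convert per-user, per-WCKey CPU-minutes (parsed from sreport
# output) into dollar amounts. Input format: user|wckey|cpu_minutes.
usage='simran|c5.4xlarge|20
simran|p3.2xlarge|13'

echo "$usage" | awk -F'|' '
    BEGIN {
        # Hypothetical $/instance-hour; substitute real AWS prices.
        rate["c5.4xlarge"] = 0.68
        rate["p3.2xlarge"] = 3.06
        # vCPUs per instance, to convert CPU-minutes to instance-hours.
        cpus["c5.4xlarge"] = 16
        cpus["p3.2xlarge"] = 8
    }
    {
        # One possible model: pro-rate the instance price by CPU share.
        inst_hours = $3 / (cpus[$2] * 60)
        printf "%s %s $%.4f\n", $1, $2, inst_hours * rate[$2]
    }'
```

Whether to pro-rate by CPU share or bill whole instance-hours depends on how the partitions allocate nodes; if each job gets a dedicated instance, elapsed instance-hours would be the fairer unit.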
Thanks Ben. Would be great to get your input on the following code, in case there is a better way of doing it, or does this look OK to you:
--
if (job_desc.partition == 'C-16Cpu-30GB') then
    job_desc.wckey = 'c5.4xlarge'
elseif (job_desc.partition == 'C-36Cpu-69GB') then
    job_desc.wckey = 'c5.9xlarge'
elseif (job_desc.partition == 'C-72Cpu-139GB') then
    job_desc.wckey = 'c5.18xlarge'
elseif (job_desc.partition == 'M-16Cpu-123GB') then
    job_desc.wckey = 'r5.4xlarge'
elseif (job_desc.partition == 'M-48Cpu-371GB') then
    job_desc.wckey = 'r5.12xlarge'
elseif (job_desc.partition == 'M-96Cpu-742GB') then
    job_desc.wckey = 'r5.24xlarge'
elseif (job_desc.partition == 'G-1GPU-8Cpu-58GB') then
    job_desc.wckey = 'p3.2xlarge'
elseif (job_desc.partition == 'G-4GPU-32Cpu-235GB') then
    job_desc.wckey = 'p3.8xlarge'
elseif (job_desc.partition == 'G-8GPU-64Cpu-471GB') then
    job_desc.wckey = 'p3.16xlarge'
end
--
Thanks,
-Simran
I forgot to address that part of your question. That if statement looks good; I don't know that there is going to be a more efficient way of mapping WCKeys to partitions.

Thanks,
Ben
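[Editor's note: the if/elseif chain works as written; a functionally equivalent alternative, sketched here and not from the thread, is a Lua table lookup, which avoids the else-vs-elseif typo class of bug and keeps the partition-to-instance mapping in one place. Partition names and instance types are taken from the thread; the helper name is hypothetical.]

```lua
-- Sketch: table-driven partition-to-WCKey mapping for job_submit.lua.
local wckey_for_partition = {
    ['C-16Cpu-30GB']       = 'c5.4xlarge',
    ['C-36Cpu-69GB']       = 'c5.9xlarge',
    ['C-72Cpu-139GB']      = 'c5.18xlarge',
    ['M-16Cpu-123GB']      = 'r5.4xlarge',
    ['M-48Cpu-371GB']      = 'r5.12xlarge',
    ['M-96Cpu-742GB']      = 'r5.24xlarge',
    ['G-1GPU-8Cpu-58GB']   = 'p3.2xlarge',
    ['G-4GPU-32Cpu-235GB'] = 'p3.8xlarge',
    ['G-8GPU-64Cpu-471GB'] = 'p3.16xlarge',
}

-- Hypothetical helper; returns nil for unmapped partitions.
function lookup_wckey(partition)
    return wckey_for_partition[partition]
end

-- Inside slurm_job_submit, the chain would then reduce to:
--     local key = lookup_wckey(job_desc.partition)
--     if key then job_desc.wckey = key end
```

Adding a new partition then only requires one new table entry rather than another elseif branch.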
Thanks for all the help Ben. Much appreciated. Feel free to close this request. I will open a new one if we have further questions/issues. Regards, -Simran
I'm glad to hear you have a solution that works for you. Closing now.