13170 – sreport or sacct for user cpu time usage

Summary: sreport or sacct for user cpu time usage

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Accounting
Version:	20.02.5
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Ben Roberts
QA Contact:

URL:

Duplicates (1):	13168
Depends on:
Blocks:

Reported:	2022-01-11 10:08 MST by jhs43
Modified:	2022-01-13 14:01 MST (History)
CC List:	0 users

See Also:
Site:	Cornell ITSG

Attachments
13170_sacct.out (545.46 KB, application/octet-stream) 2022-01-11 14:15 MST, jhs43	Details

Description jhs43 2022-01-11 10:08:54 MST

Hello. 
We are struggling to gather a user's CPU time on a cluster. I've followed as much documentation as I can get my hands on, and the output still seems to be an impossible number. We've played with a mixed bag of options. A few examples:

$ sacct -P -u $USER -n --starttime 12/01/21 --allocations --format=ncpus | awk '{ sum+=$1} END {print sum}'
209648
[jhs3001@scu-login01 ~]$ sacct -P -u $USER -n --starttime 12/09/21 --allocations --format=cputimeraw | awk '{ sum+=$1} END {print sum}'
1650516444

sreport -t hour cluster AccountUtilizationByUser cluster=$CLUSTER user=$USER start=01/01/22 end=01/10/22 format=Accounts,Cluster,TresCount,Login,Proper,Used

Used=177008

 sreport cluster -t hourper --tres=cpu,gpu AccountUtilizationByUser user=$USER start=12/09/21 format=Accounts,Cluster,Login,Proper%30,TresName,Used tree
Used=450447(14.68%)

I have tried adding account names and end times. It is a cluster with a mix of hyperthreaded and non-hyperthreaded nodes. I've tried doing the math per CPU, and it still does not make sense.

Can you please point me in the right direction?
Thank you so much!
Jodie

Comment 1 Jason Booth 2022-01-11 10:44:57 MST

*** Ticket 13168 has been marked as a duplicate of this ticket. ***

Comment 2 Ben Roberts 2022-01-11 13:38:07 MST

Hi Jodie,

It is strange that the numbers aren't lining up between sacct and sreport.  I ran a few tests on my system to make sure there isn't a recent bug introduced.  I started with the awk statement you used, totaling up the number of CPU seconds for the first 15 days of December.

$ sacct -P -u $USER -n --starttime=12/01/21 --endtime=12/15/21 --allocations --format=cputimeraw | awk '{ sum+=$1} END {print sum}'
313030



Then I used sreport to get data for the same 15 day period.  The usage was split between two accounts, but they add up to the 313,030 CPU seconds that I got from sacct.

$ sreport -t seconds cluster AccountUtilizationByUser user=$USER start=12/01/21 end=12/15/21 format=Accounts,Cluster,TresName,Used tree
 Unknown field 'login/proper'
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2021-12-01T00:00:00 - 2021-12-14T23:59:59 (1209600 secs)
Usage reported in CPU Seconds
--------------------------------------------------------------------------------
Account                Cluster      TRES Name     Used 
-------------------- --------- -------------- -------- 
sub4                    knight            cpu     1992 
sub1                    knight            cpu   311038 




Looking at the output you sent I don't see that there is a sreport that covers the exact same data as your sacct output that gets a sum of the cputimeraw.  That sacct output covers from 12/09/21 to current, including all jobs run by the current user.  The first sreport command you ran includes a cluster constraint (which may not have any impact if you only have a single cluster) and it covers the time period from 01/01/22 to 01/10/22.  The second sreport command does start at 12/09/21 (like the sacct command), but it also includes a constraint of '--tres=cpu,gpu', which could be affecting the report.  
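
One way to rule out mismatched windows is to run both tools with the same user, start, and end and compare the totals. The following is a minimal sketch, not a tested site script: `compare_cpu_seconds` is a hypothetical helper name, and adding `--truncate` to sacct (so that jobs overlapping the window edges are counted only for the part inside the window) is an assumption about what should be compared. Small differences from sreport's rolled-up data may remain.

```shell
# Print the sacct and sreport CPU-second totals for one user and one window.
# Usage: compare_cpu_seconds USER START END   (dates as Slurm accepts them, e.g. 12/01/21)
# Hypothetical helper; --truncate is an assumption, drop it to count whole jobs.
compare_cpu_seconds() {
    user=$1
    start=$2
    end=$3

    # Sum CPUTimeRAW over the user's job allocations in the window.
    sacct_total=$(sacct -P -n -u "$user" --allocations --truncate \
        --starttime="$start" --endtime="$end" --format=cputimeraw |
        awk '{ sum += $1 } END { printf "%.0f\n", sum }')

    # Sum the "Used" column (second field) across the user's accounts.
    sreport_total=$(sreport -n -P -t seconds cluster AccountUtilizationByUser \
        user="$user" start="$start" end="$end" format=Accounts,Used |
        awk -F'|' '{ sum += $2 } END { printf "%.0f\n", sum }')

    echo "sacct=$sacct_total sreport=$sreport_total"
}
```

For example, `compare_cpu_seconds "$USER" 12/01/21 12/15/21` prints both totals on one line so they can be compared directly.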

I would like to see the raw output of the two commands.  Can I have you send the output of the following:
sacct -P -u $USER -n --starttime 12/09/21 --allocations --format=cluster,jobid,account,cputimeraw > 13170_sacct.out

sreport -t seconds cluster AccountUtilizationByUser user=$USER start=12/09/21 format=Accounts,Cluster,Login,Proper%30,TresName,Used tree


Since the sacct command will probably generate quite a bit of data I redirected it to a file and you can just attach the file to the ticket.  I'll take a look at the raw data and see if it helps clarify why there is a difference in the numbers.
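
Once the file exists, the per-account totals can be checked against the sreport rows with a short awk pass. This is only a sketch: `sum_cpu_seconds_by_account` is a hypothetical helper name, and it assumes the four pipe-delimited columns requested above (cluster, jobid, account, cputimeraw).

```shell
# Sum CPUTimeRAW (CPU seconds) per account from pipe-delimited sacct output.
# Expected columns: cluster|jobid|account|cputimeraw
# Reads the named file(s), or standard input when no file is given.
sum_cpu_seconds_by_account() {
    awk -F'|' '
        NF >= 4 { total[$3] += $4; grand += $4 }
        END {
            # Print the per-account lines through sort, then the grand total last.
            for (acct in total)
                printf "%s %.0f\n", acct, total[acct] | "LC_ALL=C sort"
            close("LC_ALL=C sort")
            printf "TOTAL %.0f\n", grand
        }
    ' "$@"
}
```

Running `sum_cpu_seconds_by_account 13170_sacct.out` prints one line per account followed by a `TOTAL` line, which should be comparable to the "Used" column of the matching sreport.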

One more thing that might come into play is the possibility that there is a runaway job.  Can you run the following command to see if there is a runaway job on your system?
sacctmgr show runawayjobs

Thanks,
Ben

Comment 3 jhs43 2022-01-11 14:15:52 MST

Created attachment 22937 [details]
13170_sacct.out

Hi Ben,
Yes, I realized I sent you different dates; we have used a mixed bag, thinking the date range could be part of the issue. We can do the math and it does give matching output. The issue is more that it gives output that is years beyond what it possibly could be (or hours that cannot be possible…).
If I set the date to start=01/01/22:
-t seconds: 712488448
-t hours: 197913
This user runs on 2 partitions:
PartitionName=scu-cpu Nodes=scu-node0[20-39],scu-node049 State=UP MaxTime=7-0
PartitionName=scu-gpu Nodes=scu-node0[51-53] State=UP MaxTime=2-0
NodeName=scu-node0[20-47] RealMemory=762430 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 State=UNKNOWN
NodeName=scu-node0[50-53] RealMemory=762430 Sockets=2 CoresPerSocket=28 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx6000:2

Even if the hours are going by per cpu, it still is not adding up.

$ sudo sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster scu

sreport -t seconds cluster AccountUtilizationByUser user=$USER start=12/09/21 format=Accounts,Cluster,Login,Proper%30,TresName,Used tree
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2021-12-09T00:00:00 - 2022-01-10T23:59:59 (2851200 secs)
Usage reported in CPU Seconds
--------------------------------------------------------------------------------
             Account   Cluster     Login                    Proper Name      TRES Name       Used
-------------------- --------- --------- ------------------------------ -------------- ----------
listonlab                  scu         XXXXX                      XXXXXXXXX          cpu 1621608212
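
A quick sanity check on figures like these is to convert CPU seconds to CPU hours, and to divide CPU seconds by the wall-clock seconds in the report window, which gives the average number of CPUs kept busy. The sketch below uses two hypothetical helper names and plain integer arithmetic, so remainders are dropped.

```shell
# Convert CPU seconds to whole CPU hours (integer division, remainder dropped).
cpu_seconds_to_hours() {
    echo $(( $1 / 3600 ))
}

# Average number of CPUs kept busy over a window:
# CPU seconds used divided by wall-clock seconds in the window.
average_busy_cpus() {
    echo $(( $1 / $2 ))
}

cpu_seconds_to_hours 712488448          # the "-t seconds" figure for start=01/01/22
average_busy_cpus 1621608212 2851200    # "Used" and window length from the report above
```

The first call gives 197913, matching the `-t hours` figure. The second gives 568, meaning the reported usage corresponds to roughly 568 CPUs busy around the clock for the whole window.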


Attached, you will find ‘13170_sacct.out’.

We may very well just be misunderstanding the calculations.
Thanks for the help!
Jodie



Comment 4 Ben Roberts 2022-01-12 10:28:31 MST

Hi Jodie,

I'm sorry I didn't quite get the problem you were trying to show initially. I think that these numbers do make sense if this user is a primary user on these two partitions in the last month or so. Let me try to break it down.

You show that the user has access to two partitions. scu-cpu has 20 nodes with 2 sockets, 28 cores per socket and a single thread per core. So the total number of available CPUs for this partition is:
num_nodes * (num_sockets * cores_per_socket * threads_per_core)
20 * ( 2 * 28 * 1 )
1120 available CPUs for the scu-cpu partition

scu-gpu has 3 nodes with 2 sockets, 28 cores per socket and 2 threads per core. The calculation for this partition looks like this:
num_nodes * (num_sockets * cores_per_socket * threads_per_core)
3 * ( 2 * 28 * 2 )
336 available CPUs for the scu-gpu partition

The total number of CPUs available between the two partitions is:
1456 available CPUs

The most recent sreport output you sent had a start time of 12/09/21, and you sent that yesterday afternoon. That leaves 33 full days in that period of time (12/09/21 through 01/10/22). The total number of hours available in that period of time is:
33 days * 24 hours = 792 hours

Which also means that the number of seconds is:
792 hours * 60 minutes * 60 seconds = 2,851,200 seconds

So, the total number of CPU hours available to the two partitions is:
1456 CPUs * 792 hours = 1,153,152 CPU hours

And the total number of CPU seconds available to the two partitions is:
1456 CPUs * 2,851,200 seconds = 4,151,347,200 CPU seconds

The number of seconds from the most recent sreport output you sent was 1,621,608,212 CPU seconds, which does fall within the number of available seconds.
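
The same arithmetic can be reproduced in a few lines of shell. This is a sketch only: the function names are hypothetical, and the node counts, socket/core/thread values, and the 33-day window are simply the figures used in the calculation above.

```shell
# CPUs in a partition: nodes * sockets * cores per socket * threads per core.
partition_cpus() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# CPU seconds available: CPUs * days * 86400 seconds per day.
available_cpu_seconds() {
    echo $(( $1 * $2 * 86400 ))
}

cpu_part=$(partition_cpus 20 2 28 1)    # scu-cpu
gpu_part=$(partition_cpus 3 2 28 2)     # scu-gpu
total_cpus=$(( cpu_part + gpu_part ))
capacity=$(available_cpu_seconds "$total_cpus" 33)
used=1621608212                         # "Used" from the sreport output

echo "total CPUs:            $total_cpus"
echo "available CPU seconds: $capacity"
# awk handles the floating-point percentage; shell arithmetic is integer only.
awk -v u="$used" -v c="$capacity" \
    'BEGIN { printf "share of capacity used: %.1f%%\n", 100 * u / c }'
```

With these inputs the script reports 1456 CPUs and 4151347200 available CPU seconds, so the user's reported usage is roughly 39% of what the two partitions could deliver in that window.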

I hope this helps clarify how these numbers are calculated. Let me know if you have any questions or see any errors in my calculations.

Thanks,
Ben

Comment 5 jhs43 2022-01-12 15:25:20 MST

Hi Ben,
Thank you very much for this. We’ve been going through many scenarios and it is making sense to us. When we put in 10 years to cover from the first job submitted, the 96 years finally makes sense 😉
Our goal is to send a weekly report to each user containing the CPU and GPU hours used by that user in each partition.
We’ll have to explain to users how the output is calculated… many nodes cross partitions.
Any further tips are greatly appreciated!
Thx again.
Jodie



Comment 6 Ben Roberts 2022-01-13 09:14:36 MST

I'm glad to hear that helped make sense of the reports.  It's true that the cluster reports don't give you the ability to filter the jobs by partition.  If you wanted to provide the users with an overview of how much time was spent in each partition then you could put together data with sacct, similar to how you did in this ticket.  You should be able to filter on whatever you want with sacct to get just the data you are interested in.  Maybe that was already your plan.  
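
A per-partition summary along those lines could be sketched as follows. The helper name `usage_by_partition` is hypothetical. It assumes sacct is asked for `partition,cputimeraw,elapsedraw,alloctres` in pipe-delimited form, and that GPU allocations are recorded in AllocTRES as `gres/gpu=N` (which depends on the site tracking that TRES in its accounting configuration). GPU hours are estimated as allocated GPUs times elapsed time.

```shell
# Summarise CPU hours and GPU hours per partition.
# Input: pipe-delimited lines of partition|cputimeraw|elapsedraw|alloctres
# Assumption: GPUs appear in AllocTRES as "gres/gpu=N"; jobs without it count as 0 GPUs.
usage_by_partition() {
    awk -F'|' '
        NF >= 4 {
            gpus = 0
            n = split($4, tres, ",")
            for (i = 1; i <= n; i++)
                if (index(tres[i], "gres/gpu=") == 1)
                    gpus = substr(tres[i], length("gres/gpu=") + 1)
            cpu[$1] += $2            # CPU seconds charged to the job
            gpu[$1] += gpus * $3     # GPUs allocated * elapsed seconds
        }
        END {
            for (p in cpu)
                printf "%s cpu_hours=%.1f gpu_hours=%.1f\n", p, cpu[p] / 3600, gpu[p] / 3600
        }
    ' "$@" | LC_ALL=C sort
}
```

A weekly report could then pipe something like `sacct -P -n -u "$USER" --allocations --starttime=<start> --endtime=<end> --format=partition,cputimeraw,elapsedraw,alloctres` into `usage_by_partition`, with the dates filled in for the reporting week.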

Let me know if you have any additional questions about this or if this ticket is ok to close.

Thanks,
Ben

Comment 7 jhs43 2022-01-13 09:36:39 MST

Hi Ben,
Go ahead and close the ticket. I’ll open a new one if I have further questions on this topic.
My only suggestion would be to have further examples in the documentation. Your explanation did help.
Have a nice day.
Jodie


Comment 8 Ben Roberts 2022-01-13 14:01:52 MST

I'm glad that helped clarify how this is calculated.  I've opened an internal ticket to consider adding some additional examples about how the CPU seconds are calculated.

Thanks,
Ben