Hello.

We are struggling to gather a user's CPU time on a cluster. I've followed as much documentation as I can get my hands on, and the output still seems to be an impossible number. We've played with a mixed bag of options. A few examples:

$ sacct -P -u $USER -n --starttime 12/01/21 --allocations --format=ncpus | awk '{ sum+=$1} END {print sum}'
209648

[jhs3001@scu-login01 ~]$ sacct -P -u $USER -n --starttime 12/09/21 --allocations --format=cputimeraw | awk '{ sum+=$1} END {print sum}'
1650516444

sreport -t hour cluster AccountUtilizationByUser cluster=$CLUSTER user=$USER start=01/01/22 end=01/10/22 format=Accounts,Cluster,TresCount,Login,Proper,Used
Used=177008

sreport cluster -t hourper --tres=cpu,gpu AccountUtilizationByUser user=$USER start=12/09/21 format=Accounts,Cluster,Login,Proper%30,TresName,Used tree
Used=450447(14.68%)

I have tried adding account names and end times. It is a cluster with a mix of hyperthreaded and non-hyperthreaded nodes. I've tried doing the math per CPU, and it still does not make sense.

Can you please point me in the right direction?

Thank you so much!
Jodie
*** Ticket 13168 has been marked as a duplicate of this ticket. ***
Hi Jodie,

It is strange that the numbers aren't lining up between sacct and sreport. I ran a few tests on my system to make sure a bug hasn't been introduced recently. I started with the awk statement you used, totaling up the number of CPU seconds for the first 15 days of December.

$ sacct -P -u $USER -n --starttime=12/01/21 --endtime=12/15/21 --allocations --format=cputimeraw | awk '{ sum+=$1} END {print sum}'
313030

Then I used sreport to get data for the same 15-day period. The usage was split between two accounts, but they add up to the 313,030 CPU seconds that I got from sacct.

$ sreport -t seconds cluster AccountUtilizationByUser user=$USER start=12/01/21 end=12/15/21 format=Accounts,Cluster,TresName,Used tree
Unknown field 'login/proper'
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2021-12-01T00:00:00 - 2021-12-14T23:59:59 (1209600 secs)
Usage reported in CPU Seconds
--------------------------------------------------------------------------------
Account              Cluster   TRES Name      Used
-------------------- --------- -------------- --------
sub4                 knight    cpu                1992
sub1                 knight    cpu              311038

Looking at the output you sent, I don't see an sreport command that covers exactly the same data as the sacct command that sums cputimeraw. That sacct output covers from 12/09/21 to the present, including all jobs run by the current user. The first sreport command you ran includes a cluster constraint (which may not have any impact if you only have a single cluster) and covers the period from 01/01/22 to 01/10/22. The second sreport command does start at 12/09/21 (like the sacct command), but it also includes a constraint of '--tres=cpu,gpu', which could be affecting the report.

I would like to see the raw output of the two commands. Can I have you send the output of the following:

sacct -P -u $USER -n --starttime 12/09/21 --allocations --format=cluster,jobid,account,cputimeraw > 13170_sacct.out
sreport -t seconds cluster AccountUtilizationByUser user=$USER start=12/09/21 format=Accounts,Cluster,Login,Proper%30,TresName,Used tree

Since the sacct command will probably generate quite a bit of data, I redirected it to a file so you can just attach the file to the ticket. I'll take a look at the raw data and see if it helps clarify why there is a difference in the numbers.

One more thing that might come into play is the possibility of a runaway job. Can you run the following command to see if there is a runaway job on your system?

sacctmgr show runawayjobs

Thanks,
Ben
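Once the sacct file requested above exists, its total can be compared against the sreport figure without querying the database again. The following is a minimal sketch, assuming the file holds the four pipe-delimited columns from the command above (cluster|jobid|account|cputimeraw); the function name sum_cputime_by_account is hypothetical and not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm tool): sums cputimeraw per account.
# Assumes each input line looks like: cluster|jobid|account|cputimeraw
sum_cputime_by_account() {
    awk -F'|' '
        # Skip malformed lines; accumulate CPU seconds per account (3rd field).
        NF >= 4 { total[$3] += $4; grand += $4 }
        END {
            # %.0f avoids integer overflow in awk builds with 32-bit %d.
            for (acct in total) printf "%s %.0f\n", acct, total[acct]
            printf "TOTAL %.0f\n", grand
        }
    ' "$1"
}
```

Running `sum_cputime_by_account 13170_sacct.out` prints one line per account followed by a TOTAL line, in CPU seconds, which should line up with the per-account Used column of `sreport -t seconds` for the same time window. The order of the account lines is not guaranteed.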
Created attachment 22937 [details] 13170_sacct.out

Hi Ben,

Yes, I realized I sent you different dates; we have used a mixed bag, thinking maybe part of the date could be the issue. We can do the math and it does give matching output. The issue is more that it gives output that is years beyond what it possibly could be (or hours that cannot be possible).

If I set the date to start=01/01/22:
-t seconds: 712488448
-t hours: 197913

This user runs on 2 partitions:

PartitionName=scu-cpu Nodes=scu-node0[20-39],scu-node049 State=UP MaxTime=7-0
PartitionName=scu-gpu Nodes=scu-node0[51-53] State=UP MaxTime=2-0
NodeName=scu-node0[20-47] RealMemory=762430 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 State=UNKNOWN
NodeName=scu-node0[50-53] RealMemory=762430 Sockets=2 CoresPerSocket=28 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx6000:2

Even if the hours are counted per CPU, it still is not adding up.

$ sudo sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster scu

sreport -t seconds cluster AccountUtilizationByUser user=$USER start=12/09/21 format=Accounts,Cluster,Login,Proper%30,TresName,Used tree
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2021-12-09T00:00:00 - 2022-01-10T23:59:59 (2851200 secs)
Usage reported in CPU Seconds
--------------------------------------------------------------------------------
Account              Cluster   Login     Proper Name                    TRES Name      Used
-------------------- --------- --------- ------------------------------ -------------- ----------
listonlab            scu       XXXXX     XXXXXXXXX                      cpu            1621608212

Attached, you will find '13170_sacct.out'. We may very well just be misunderstanding the calculations.

Thanks for the help!
Jodie
Hi Jodie,

I'm sorry I didn't quite get the problem you were trying to show initially. I think that these numbers do make sense if this user has been a primary user on these two partitions over the last month or so. Let me try to break it down.

You show that the user has access to two partitions. scu-cpu has 20 nodes with 2 sockets, 28 cores per socket and a single thread per core. So the total number of available CPUs for this partition is:

num_nodes * (num_sockets * cores_per_socket * threads_per_core)
20 * ( 2 * 28 * 1 )
1120 available CPUs for the scu-cpu partition

scu-gpu has 3 nodes with 2 sockets, 28 cores per socket and 2 threads per core. The calculation for this partition looks like this:

num_nodes * (num_sockets * cores_per_socket * threads_per_core)
3 * ( 2 * 28 * 2 )
336 available CPUs for the scu-gpu partition

The total number of CPUs available between the two partitions is:

1456 available CPUs

The most recent sreport output you sent had a start time of 12/09/21, and you sent that yesterday afternoon. That leaves 33 full days in that period of time. The total number of hours available in that period of time is:

33 days * 24 hours = 792 hours

Which also means that the number of seconds is:

792 hours * 60 minutes * 60 seconds = 2,851,200 seconds

So, the total number of CPU hours available to the two partitions is:

1456 CPUs * 792 hours = 1,153,152 CPU hours

And the total number of CPU seconds available to the two partitions is:

1456 CPUs * 2,851,200 seconds = 4,151,347,200 CPU seconds

The number of seconds from the most recent sreport output you sent was 1,621,608,212 CPU seconds, which does fall within the number of available seconds.

I hope this helps clarify how these numbers are calculated. Let me know if you have any questions or see any errors in my calculations.

Thanks,
Ben
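The arithmetic above can be reproduced in a few lines of shell, which makes it easy to rerun for a different window or node count. This is only a sketch under stated assumptions: the node counts, the 33-day window, and the used figure are copied from this ticket, the function names are hypothetical, and the shell is assumed to have 64-bit arithmetic (the result exceeds 32 bits).

```shell
#!/bin/sh
# Hypothetical helpers (not Slurm tools) mirroring the calculation in this ticket.

# partition_cpus NODES SOCKETS CORES_PER_SOCKET THREADS_PER_CORE
partition_cpus() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# available_cpu_seconds TOTAL_CPUS DAYS
available_cpu_seconds() {
    echo $(( $1 * $2 * 24 * 60 * 60 ))
}

cpu_part=$(partition_cpus 20 2 28 1)    # scu-cpu figures as quoted in this ticket
gpu_part=$(partition_cpus 3 2 28 2)     # scu-gpu figures as quoted in this ticket
total_cpus=$(( cpu_part + gpu_part ))
capacity=$(available_cpu_seconds "$total_cpus" 33)
used=1621608212                         # Used value from the sreport output in this ticket

echo "total CPUs: $total_cpus"
echo "capacity:   $capacity CPU seconds"
# awk handles the division as floating point; the shell only does integers.
awk -v u="$used" -v c="$capacity" \
    'BEGIN { printf "used:       %.1f%% of capacity\n", 100 * u / c }'
```

With these inputs the used figure comes out at roughly 39% of the theoretical capacity, which is why a value that looks impossibly large at first is still plausible for a heavy user.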
Hi Ben,

Thank you very much for this. We've been going through many scenarios and they are making sense to us. When we put in 10 years to cover from the first job submitted, the 96 years finally makes sense 😉

Our goal is to send a weekly report to each user containing the CPU and GPU hours used by the user in each partition. We'll have to guide the user on how the output is calculated; many nodes cross partitions.

Any further tips are greatly appreciated!

Thx again.
Jodie
I'm glad to hear that helped make sense of the reports. It's true that the cluster reports don't give you the ability to filter the jobs by partition. If you wanted to provide the users with an overview of how much time was spent in each partition then you could put together data with sacct, similar to how you did in this ticket. You should be able to filter on whatever you want with sacct to get just the data you are interested in. Maybe that was already your plan. Let me know if you have any additional questions about this or if this ticket is ok to close. Thanks, Ben
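A per-partition breakdown along those lines could be built by asking sacct for the partition alongside cputimeraw and summing in awk. The sketch below assumes pipe-delimited input of the form partition|cputimeraw on stdin; the function name cpu_hours_by_partition is hypothetical, and GPU hours are not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm tool): CPU hours per partition.
# Reads lines of the form partition|cputimeraw from stdin.
cpu_hours_by_partition() {
    awk -F'|' '
        # Ignore lines with no partition name; sum CPU seconds per partition.
        NF >= 2 && $1 != "" { secs[$1] += $2 }
        # Convert seconds to hours when printing.
        END { for (p in secs) printf "%s %.1f\n", p, secs[p] / 3600 }
    '
}

# Possible usage on a cluster with accounting enabled (dates are examples):
# sacct -P -n -u "$USER" --starttime 01/01/22 --allocations \
#       --format=partition,cputimeraw | cpu_hours_by_partition
```

Keeping the sacct call outside the function means the same summing step can be reused for a weekly window per user just by changing the sacct filters.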
Hi Ben,

Go ahead and close the ticket. I'll open a new one if I have further questions on this topic. My only suggestion would be to have further examples in the documentation. Your explanation did help.

Have a nice day.
Jodie
I'm glad that helped clarify how this is calculated. I've opened an internal ticket to consider adding some additional examples about how the CPU seconds are calculated. Thanks, Ben