Ticket 11159

Summary: Understanding sreport
Product: Slurm
Component: Accounting
Version: 21.08.x
Hardware: Linux
OS: Linux
Site: Roche/PHCIX
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Reporter: Jimmy Hui <jhui>
Assignee: Ben Roberts <ben>
CC: scott, tripiana

Description Jimmy Hui 2021-03-19 16:57:17 MDT
Hi,

We have enabled WCKeys to track usage, and when we run the following report it shows that user xxx has used 680712 minutes (472.716667 days) in 19 days. Are we reading this report correctly? When I execute sacct on the same user for the same time frame, the sum of the elapsed time is about 39472 minutes. Can someone please take a look and let me know if I am reading the sreport output wrong?


sreport -t minutes cluster WCKeyUtilizationByUser start=3/1/21 end=3/30/21
--------------------------------------------------------------------------------
Cluster/WCKey/User Utilization 2021-03-01T00:00:00 - 2021-03-19T22:59:59 (1638000 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster           WCKey     Login     Proper Name     Used
--------- --------------- --------- --------------- --------
slurm-ma+     p3.16xlarge     xxxx          xxxx       680712



sacct -X -u xxxx --starttime=2021-03-01 --endtime=2021-03-31 -o start,end,elapsed
              Start                 End    Elapsed
------------------- ------------------- ----------
2021-02-28T11:05:39 2021-03-02T00:50:11 1-13:44:32
2021-03-03T04:58:18 2021-03-03T04:59:26   00:01:08
2021-03-03T05:38:18 2021-03-03T05:39:20   00:01:02
2021-03-04T08:24:50 2021-03-04T16:55:44   08:30:54
2021-03-05T04:13:22 2021-03-05T04:14:20   00:00:58
2021-03-05T04:21:11 2021-03-05T04:22:52   00:01:41
2021-03-05T17:28:22 2021-03-06T17:24:38   23:56:16
2021-03-06T17:24:49 2021-03-07T18:13:46 1-00:48:57
2021-03-06T17:41:56 2021-03-07T18:13:50 1-00:31:54
2021-03-06T17:41:56 2021-03-07T18:13:53 1-00:31:57
2021-03-07T18:14:02 2021-03-08T15:14:54   21:00:52
2021-03-07T18:14:13 2021-03-08T15:14:57   21:00:44
2021-03-07T18:14:38 2021-03-08T15:15:03   21:00:25
2021-03-08T15:15:09 2021-03-10T01:45:04 1-10:29:55
2021-03-08T15:15:21 2021-03-10T02:04:32 1-10:49:11
2021-03-08T15:15:27 2021-03-10T02:03:46 1-10:48:19
2021-03-11T05:48:57 2021-03-11T05:49:15   00:00:18
2021-03-11T05:49:54 2021-03-11T05:56:38   00:06:44
2021-03-11T06:14:44 2021-03-11T06:19:01   00:04:17
2021-03-11T06:25:23 2021-03-11T06:26:36   00:01:13
2021-03-11T06:28:27 2021-03-11T06:29:41   00:01:14
2021-03-11T06:37:23 2021-03-11T06:38:36   00:01:13
2021-03-11T06:40:39 2021-03-11T06:44:22   00:03:43
2021-03-11T06:52:09 2021-03-11T06:56:05   00:03:56
2021-03-11T07:23:38 2021-03-11T07:25:34   00:01:56
2021-03-11T18:04:08 2021-03-11T18:05:32   00:01:24
2021-03-11T18:13:46 2021-03-11T18:14:28   00:00:42
2021-03-11T18:23:27 2021-03-11T18:43:24   00:19:57
2021-03-11T20:23:38 2021-03-11T20:44:31   00:20:53
2021-03-12T02:45:39 2021-03-12T04:36:59   01:51:20
2021-03-12T04:37:20 2021-03-12T04:39:16   00:01:56
2021-03-12T08:34:41 2021-03-14T15:40:42 2-07:06:01
2021-03-12T20:38:14 2021-03-13T19:08:05   22:29:51
2021-03-13T02:09:45 2021-03-13T18:34:15   16:24:30
2021-03-13T19:09:27 2021-03-16T03:01:16 2-07:51:49
2021-03-16T06:38:59 2021-03-18T15:04:56 2-08:25:57
2021-03-16T06:44:59 2021-03-17T07:48:55 1-01:03:56
2021-03-16T06:46:59 2021-03-16T15:34:01   08:47:02
2021-03-17T00:59:32 2021-03-18T01:43:23 1-00:43:51
2021-03-17T07:49:19 2021-03-18T09:14:36 1-01:25:17
2021-03-18T06:36:06 2021-03-19T01:34:20   18:58:14
2021-03-18T18:46:07 2021-03-19T01:34:11   06:48:04
2021-03-18T21:00:07             Unknown 1-01:51:58
2021-03-19T04:39:38 2021-03-19T04:40:50   00:01:12
2021-03-19T05:30:08 2021-03-19T06:14:01   00:43:53
2021-03-19T21:56:09             Unknown   00:55:56
2021-03-19T21:56:09             Unknown   00:55:56
Comment 1 Ben Roberts 2021-03-22 14:21:11 MDT
Hi Jimmy,

I think the most likely reason the numbers aren't lining up is that you're looking at the Elapsed time in sacct rather than the amount of CPU time. For example, if a job ran for 1 minute on 10 CPUs, it would have an elapsed time of 1 minute, but it would have used 10 CPU minutes.

Here's a job I found from a few days ago on my system that demonstrates this.  The job ran for 2 minutes and 4 seconds, but it was on 360 CPUs, so the CPUTime for the job is 12 hours and 24 minutes.
$ sacct -X -j 25843 -ojobid,start,end,elapsed,cputime
       JobID               Start                 End    Elapsed    CPUTime 
------------ ------------------- ------------------- ---------- ---------- 
25843        2021-03-18T09:48:59 2021-03-18T09:51:03   00:02:04   12:24:00 
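
As a quick sanity check on that arithmetic (plain shell, nothing Slurm-specific): 2 minutes and 4 seconds is 124 seconds, and 124 seconds on 360 CPUs works out to the 12:24:00 above.

$ echo $(( 124 * 360 ))   # elapsed seconds * CPUs = CPU seconds
44640
$ echo $(( 44640 / 3600 ))h $(( 44640 % 3600 / 60 ))m   # 44640 s back to hours and minutes
12h 24m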


Let me know if looking at the CPUTime in sacct doesn't make the numbers line up better with what you're seeing in sreport.

Thanks,
Ben
Comment 2 Scott Hilton 2021-03-22 14:29:15 MDT
Jimmy,

First of all, sacct will by default only show one record per job ID. When a job has been requeued there may be more than one record with the same ID; using -D will show all of the duplicates.
Also, jobs don't neatly start at the beginning of the day and end at the end of the day. sacct gives the full accounting of every job that ran even partially within the specified time frame; to truncate jobs to just their portion inside the time frame, add -T.
>sacct -DTX -u xxxx --starttime=2021-03-01 --endtime=2021-03-31 -o start,end,elapsed

Now for the calculation: the sreport output is in CPU minutes. For sacct, you will have to multiply the number of CPUs by the raw elapsed time (seconds) and then convert to minutes to get a comparable answer. So let's add AllocTRES to your sacct query and change elapsed to elapsedRaw.
>sacct -X -u xxxx --starttime=2021-03-01 --endtime=2021-03-31 -o start,end,AllocTRES,elapsedRaw

To calculate it automatically we can add -P and some standard shell tools:
>sacct -DTXP -u xxxx --starttime=2021-03-01 --endtime=2021-03-31 -o start,end,AllocTRES,elapsedRaw | sed 's/.*cpu=//' | sed 's/,.*|/,/' | awk -F, 'NF==2 {sum+=$1 * $2} END {printf "%.2f\n", sum/60}'
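
As a possibly simpler sketch (assuming it's acceptable to sum the CPUTimeRAW field, which is AllocCPUS multiplied by ElapsedRaw, and that -T truncation carries through to it), you can skip the sed parsing entirely:
>sacct -DTXPn -u xxxx --starttime=2021-03-01 --endtime=2021-03-31 -o cputimeraw | awk '{sum+=$1} END {printf "%.2f\n", sum/60}'   # assumes CPUTimeRAW = AllocCPUS * ElapsedRaw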

Let me know if this works for you, or if you have any questions.

-Scott
Comment 3 Jimmy Hui 2021-03-22 15:12:55 MDT
Hi Ben & Scott,

Thanks for the clarification. Is there a way to generate a report based on elapsed time with sreport? We are trying to calculate elapsed-time usage based on the instance type. We have set up WCKeys, and each time a user requests an instance we associate the instance type with the WCKey.
Comment 4 Ben Roberts 2021-03-22 15:51:33 MDT
Hi Jimmy,

I'm afraid there isn't a way to have sreport just use the Elapsed time rather than the total CPU time for a report. If your nodes are all the same size and jobs always get a node exclusively, then you could derive that number by dividing the value from the report by the number of CPUs per node. Otherwise the best way to get that data is by looking at the job records with sacct and computing it from there. Can I ask if there's a reason you want a report with just the elapsed time instead of the CPU time?

Thanks,
Ben
Comment 5 Jimmy Hui 2021-03-22 16:03:17 MDT
Hi Ben,

My colleague submitted ticket https://bugs.schedmd.com/show_bug.cgi?id=10857 and was told that WCKeys would work for us. We have a business need to track and charge back AWS usage. These partitions each map to a specific instance type, and each has an AWS cost associated with it. So we were trying to use sreport WCKeyUtilizationByUser to get that information.
Comment 6 Ben Roberts 2021-03-23 11:33:54 MDT
Hi Jimmy,

Thanks for the explanation, that makes sense. It's possible that sreport could still work, since these WCKeys are unique to the node type. If the jobs consume all the resources on the nodes they run on, then you can take the number of CPU minutes for a certain WCKey and divide by the number of CPUs for that instance type, which should give you the uptime of the nodes.
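
For example, taking the 680712 CPU minutes from your first report and the 64 vCPUs of the p3.16xlarge (a quick awk sketch of that division):

$ awk 'BEGIN { m = 680712 / 64; printf "%.3f node-minutes = %.2f node-hours\n", m, m/60 }'   # 64 vCPUs assumed for p3.16xlarge
10636.125 node-minutes = 177.27 node-hours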

If your jobs don't have exclusive access to the nodes they run on, then getting the uptime of the nodes will be a lot trickier. So, do jobs typically use a full node when they cause one to be spun up? Or are you interested in enforcing a rule that jobs be charged for the full use of a node?

It was me who suggested using WCKeys to track usage of different instance types, but I wasn't thinking about reporting on the uptime of the nodes.  I apologize for that oversight, but if jobs do have exclusive access to the nodes they use then we should be able to get the information you want.

Thanks,
Ben
Comment 8 Jimmy Hui 2021-03-24 13:18:41 MDT
Hi Ben,

The numbers don't really add up here. Here is the sreport output for this user:

1965495 CPU minutes on an instance type that has 64 vCPUs.

CPU minutes / vCPUs = instance minutes, / 60 = instance hours, * hourly price = cost:
1965495 / 64 = 30710.86 instance minutes
30710.86 / 60 = 511.85 instance hours
511.85 * $24.48 = $12,530.03

When I check AWS Cost Explorer, this user used ~$23K.

Instance Size 	vCPUs  On-Demand Price/hr*
p3.16xlarge 	64     $24.48
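
(For reference, the calculation above as an awk one-liner:)
awk -v cpumin=1965495 -v cores=64 -v price=24.48 'BEGIN { printf "$%.2f\n", cpumin / cores / 60 * price }'   # assumed: 64 vCPUs at $24.48/hr
$12530.03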

--------------------------------------------------------------------------------
Cluster/User/WCKey Utilization 2021-03-01T00:00:00 - 2021-03-24T18:59:59 (2055600 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name           WCKey      Used
--------- --------- --------------- --------------- ---------
slurm-ma+      xxxx          xxxx     p3.16xlarge   1965495
Comment 9 Ben Roberts 2021-03-24 15:15:01 MDT
Is the number of CPU minutes coming from the sacct query you used before? If so, can I have you run it again with the following columns in the output?

sacct -X -u xxxx --starttime=2021-03-01 --endtime=2021-03-31 -ojobid,elapsed,cputime,alloccpus,allocnodes

Or if it's a different set of jobs then please adjust the start and end times accordingly.  I would like to see if all the CPUs on the nodes are being charged for these jobs.

Thanks,
Ben
Comment 10 Jimmy Hui 2021-03-24 15:27:28 MDT
Hi Ben,

The CPU minutes figure is from sreport. Here is the report from sacct; I have added ElapsedRaw and CPUTimeRaw.

sacct -X -u petrovy1 --starttime=2021-03-01 --endtime=2021-03-31 -ojobid,elapsed,elapsedRaw,cputime,cputimeRAW,alloccpus,allocnodes
       JobID    Elapsed ElapsedRaw    CPUTime CPUTimeRAW  AllocCPUS AllocNodes
------------ ---------- ---------- ---------- ---------- ---------- ----------
3841         1-13:44:32     135872 100-15:30:08    8695808         64          1
4508           00:01:08         68   01:12:32       4352         64          1
4510           00:01:02         62   01:06:08       3968         64          1
4511           08:30:54      30654 22-16:57:36    1961856         64          1
4531           00:00:58         58   01:01:52       3712         64          1
4532           00:01:41        101   01:47:44       6464         64          1
4535           23:56:16      86176 63-20:01:04    5515264         64          1
4538         1-00:48:57      89337 66-04:12:48    5717568         64          1
4539         1-00:31:54      88314 65-10:01:36    5652096         64          1
4540         1-00:31:57      88317 65-10:04:48    5652288         64          1
4545           21:00:52      75652 56-00:55:28    4841728         64          1
4546           21:00:44      75644 56-00:46:56    4841216         64          1
4547           21:00:25      75625 56-00:26:40    4840000         64          1
4791         1-10:29:55     124195 91-23:54:40    7948480         64          1
4792         1-10:49:11     125351 92-20:27:44    8022464         64          1
4793         1-10:48:19     125299 92-19:32:16    8019136         64          1
6096           00:00:18         18   00:07:12        432         24          1
6097           00:06:44        404   02:41:36       9696         24          1
6098           00:04:17        257   01:42:48       6168         24          1
6099           00:01:13         73   00:29:12       1752         24          1
6100           00:01:14         74   00:29:36       1776         24          1
6101           00:01:13         73   00:29:12       1752         24          1
6102           00:03:43        223   01:29:12       5352         24          1
6103           00:03:56        236   01:34:24       5664         24          1
6104           00:01:56        116   00:46:24       2784         24          1
6106           00:01:24         84   00:33:36       2016         24          1
6107           00:00:42         42   00:16:48       1008         24          1
6108           00:19:57       1197   07:58:48      28728         24          1
6109           00:20:53       1253   08:21:12      30072         24          1
6110           01:51:20       6680 1-20:32:00     160320         24          1
6111           00:01:56        116   00:46:24       2784         24          1
6112         2-07:06:01     198361 55-02:24:24    4760664         24          1
6114           22:29:51      80991 22-11:56:24    1943784         24          1
6120           16:24:30      59070 43-18:08:00    3780480         64          1
6121         2-07:51:49     201109 55-20:43:36    4826616         24          1
10588        2-08:25:57     203157 56-10:22:48    4875768         24          1
6123         1-01:03:56      90236 25-01:34:24    2165664         24          1
6122           08:47:02      31622 8-18:48:48     758928         24          1
11316        1-00:43:51      89031 65-22:46:24    5697984         64          1
11983        1-01:25:17      91517 67-18:58:08    5857088         64          1
12712          18:58:14      68294 50-14:06:56    4370816         64          1
12728          06:48:04      24484 18-03:16:16    1566976         64          1
12733        1-19:41:46     157306 116-12:33:04   10067584         64          1
13097          00:01:12         72   01:16:48       4608         64          1
13098          00:43:53       2633 1-22:48:32     168512         64          1
13096          18:45:55      67555 50-00:58:40    4323520         64          1
13191        1-08:02:11     115331 85-10:19:44    7381184         64          1
14210          00:09:51        591   10:30:24      37824         64          1
14221        3-13:22:02     307322 227-15:30:08   19668608         64          1
16809        3-00:42:29     261749 193-21:18:56   16751936         64          1
14347        1-12:41:10     132070 97-19:54:40    8452480         64          1
19571          00:07:54        474   08:25:36      30336         64          1
19572          00:00:03          3   00:03:12        192         64          1
19618          00:04:03        243   04:19:12      15552         64          1
19620          00:03:05        185   03:17:20      11840         64          1
19637          00:00:00          0   00:00:00          0         24          0
19645          14:12:20      51140 37-21:09:20    3272960         64          1
19646          05:49:23      20963 15-12:40:32    1341632         64          1
Comment 11 Ben Roberts 2021-03-25 09:44:45 MDT
Hi Jimmy,

Thanks for sending that information. I can see that there are several jobs that were only billed for 24 CPUs rather than the full 64. If you use the CPUTimeRaw field for this calculation it doesn't work out, because that value doesn't take into account the fact that a full node is up and being charged even when the job doesn't use all of the available CPUs.

I calculated the full number of billable CPU Seconds by multiplying the ElapsedRaw value for each job by 64 and then summing those values.  With that I got:
216773120 Billable CPU Seconds = 3612885.333 Billable CPU Minutes

Then I plugged this into the equation you sent earlier and it comes out to ~$23k, as you said you expected.
3612885.333 / 64 = 56451.333 instance minutes
56451.333 / 60 = 940.856 instance hours
940.856 * $24.48 = $23,032.14
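
As a sketch, the same calculation done directly from the sacct output (assuming parsable output and a fixed 64-vCPU instance size; adjust the user, dates, and price as needed):

$ sacct -XPn -u xxxx --starttime=2021-03-01 --endtime=2021-03-31 -o elapsedraw | awk -v cores=64 -v price=24.48 '{sec += $1 * cores} END {printf "%d CPU seconds = %.2f CPU minutes = $%.2f\n", sec, sec/60, sec/cores/3600*price}'   # cores and price assumed for p3.16xlarge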


There are a couple of options to make this behave as you are expecting. You can make a similar adjustment to your calculation, multiplying the ElapsedRaw value by the number of CPUs corresponding to the AWS instance type. Or you can force jobs to get exclusive access to the nodes, so that their calculated CPUTimeRaw matches what AWS charges. You can do that by setting the OverSubscribe parameter to EXCLUSIVE for those partitions. You can read more about that parameter here:
https://slurm.schedmd.com/slurm.conf.html#OPT_OverSubscribe
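
For example, a partition definition along these lines in slurm.conf (partition and node names hypothetical):

PartitionName=p3-16xlarge Nodes=spcd-usw2-[00001-00100] OverSubscribe=EXCLUSIVE   # hypothetical names; EXCLUSIVE charges jobs for whole nodes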



I set up an example of how this would look: I configured two partitions, 'debug' and 'high', and set the 'high' partition to OverSubscribe=EXCLUSIVE.

$ scontrol show partition | egrep 'PartitionName|OverSubscribe'
PartitionName=debug
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
PartitionName=high
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE



I submitted a short job to each of the partitions, just requesting a single processor on each of my 24-processor nodes.

$ sbatch -n1 -pdebug --wrap='srun sleep 10'
Submitted batch job 25902
$ sbatch -n1 -phigh --wrap='srun sleep 10'
Submitted batch job 25903


In the report for these jobs you can see that the job that went to the 'debug' partition (25902) has a CPUTimeRaw of 10.  The job that went to the 'high' partition (25903) was charged for the full node.

$ sacct -j 25902,25903 -X -ojobid,elapsedRaw,cputimeRAW,alloccpus
       JobID ElapsedRaw CPUTimeRAW  AllocCPUS 
------------ ---------- ---------- ---------- 
25902                10         10          1 
25903                10        240         24 



Hopefully this helps clarify where these numbers are coming from.  Please let me know if you have any questions about this.  

Thanks,
Ben
Comment 12 Jimmy Hui 2021-03-25 12:37:24 MDT
Hi Ben,

Thanks for the information; this seems to work. Should I be using -X or -DTPX? They return different numbers.

J


sacct -DTPX -u petrovy1 --starttime=2021-03-01 --endtime=2021-03-31 -o elapsedRaw |awk '{total = total + $1}END{print total}'
4417426

sacct -X -u petrovy1 --starttime=2021-03-01 --endtime=2021-03-31 -o elapsedRaw |awk '{total = total + $1}END{print total}'
3121358
Comment 13 Ben Roberts 2021-03-25 16:08:42 MDT
The '-D' flag tells sacct to show duplicates.
The '-T' flag tells sacct to truncate the time a job ran that might be before the start time or after the end time.
The '-P' flag is to make the output parsable.
The '-X' flag shows just the job allocation and doesn't include the steps.

The truncate flag can only reduce the time, but you're seeing an increase with these flags, so the only flag that could account for more time is the duplicates flag. A requeued job can cause duplicate entries, but that doesn't sound likely with your workflow. Do you happen to have jobs that are being requeued? I would be interested in seeing the output of that query before it's processed by awk:
sacct -DTPX -u petrovy1 --starttime=2021-03-01 --endtime=2021-03-31 -o jobid,elapsedRaw 
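
And if it helps, a quick sketch (standard shell tools) for spotting which job IDs have multiple records:
sacct -DTPXn -u petrovy1 --starttime=2021-03-01 --endtime=2021-03-31 -o jobid | sort | uniq -c | sort -rn | head   # counts records per JobID, most duplicated first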

Thanks,
Ben
Comment 14 Jimmy Hui 2021-03-25 16:18:20 MDT
Hi Ben,

Here is output.

sacct -DTPX -u xxx --starttime=2021-03-01 --endtime=2021-03-31 -o jobid,elapsedRaw
JobID|ElapsedRaw
3841|89411
4508|68
4510|62
4511|95454
4511|30654
4531|58
4532|101
4535|24364
4535|604
4535|11601
4535|22
4535|21
4535|23
4535|17
4535|86176
4538|89337
4539|23
4540|18
4539|23
4540|22
4540|24
4539|23
4539|18
4540|18
4539|88314
4540|88317
4545|75652
4546|75644
4547|75625
4791|124195
4792|125351
4793|125299
6096|18
6097|404
6098|257
6099|73
6100|74
6101|73
6102|223
6103|236
6104|116
6106|84
6107|42
6108|1197
6109|1253
6110|4296
6110|20
6110|604
6110|23
6110|13684
6110|23
6110|603
6110|6680
6111|116
6112|4982
6112|602
6112|22
6112|22
6112|21
6112|17
6112|198361
6114|22
6114|1254
6114|21
6114|603
6114|24
6114|21
6114|22
6114|18
6114|80991
6120|23
6120|24
6120|22
6120|18
6120|59070
6121|201109
6122|23
6122|90921
6123|421
6123|22
6123|30889
6122|25277
6123|29056
6122|24477
6123|30971
6122|603
6122|14276
6123|601
6123|21065
6122|11922
6122|3986
6122|5579
6123|5453
10588|22
10588|23
10588|23
6123|23
10588|17
6122|23
6123|22
10588|203157
6123|22
6122|21
6122|23
6123|17
6123|90236
6122|17
6122|31622
11316|21
11316|19218
11316|23
11316|603
11316|2639
11316|21
11316|22
11316|23
11316|18
11316|89031
11983|91517
12712|24
12712|6826
12712|604
12712|20
12712|22
12712|23
12712|18
12712|68294
12728|22
12728|23
12728|22
12728|17
12728|24484
12733|21
12733|22
12733|21
12733|17
12733|157306
13096|22
13097|17
13096|31856
13097|72
13098|23
13098|2633
13191|25853
13191|21
13191|21411
13096|19478
13096|21
13096|601
13096|6839
13191|602
13191|5219
13096|21
13191|22
13096|603
13191|603
13096|22
13191|22
13096|22
13191|22
13096|22
13191|22
13096|17
13191|17
13096|67555
13191|115331
14210|591
14221|307322
14347|206342
16809|17
16809|296791
14347|17
14347|132070
19571|22
19571|474
19572|21
19572|3
19618|243
19620|185
19637|0
19645|84364
19646|53520
19646|22776
19645|17
19645|19
19645|19
19645|17
19645|55146
21447|17
16809|33398
21447|20286
19646|13187
16809|0
22214|0
22219|0
22261|359
22277|1421
22361|22
22364|18
22371|17
19646|17
19646|6186
22517|22
Comment 15 Ben Roberts 2021-03-26 13:32:08 MDT
It does look like you're getting duplicates, which would explain the difference in values, but I'm not sure exactly why that would be. Do you have jobs that get requeued? I would like to investigate one of these jobs, like 19646. Can you send the output of 'sacct -l -j 19646' along with the slurmctld logs that cover the time this job was on the system?

Thanks,
Ben
Comment 16 Ben Roberts 2021-04-08 08:16:31 MDT
I wanted to follow up and see if you're able to provide some additional detail about one of these jobs that has multiple entries.  Can you send the output of 'sacct -l -j 19646' along with the slurmctld logs that cover the time this job was on the system?

Thanks,
Ben
Comment 17 Jimmy Hui 2021-04-08 10:37:18 MDT
Hi Ben,

Is this enough information?

sacct -l -j 19646
       JobID     JobIDRaw    JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask   AveDiskWrite    AllocGRES      ReqGRES    ReqTRES  AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ------------ ------------ ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- ---------------
19646        19646        spleen_mx+ G-8GPU-64+                                                                                                                                                                                                              64 1-03:15:39 CANCELLED+      0:0                  Unknown       Unknown       Unknown     7368Mc                                                                                                                                                 gpu:8 PER_NODE:gp+ billing=2+ billing=6+
19646.batch  19646.batch       batch            204714644K spcd-usw2-007+              0 204714644K  16584876K spcd-usw2+          0  16584876K    51682 spcd-usw2-0+              0      51682 2-18:14:06 spcd-usw2+          0 2-18:14:06        1         64 1-03:15:42  CANCELLED     0:15         1K             0             0             0     7368Mc              0  2086067.52M spcd-usw2-00756               0    2086067.52M     1267.40M  spcd-usw2-00756                0       1267.40M        gpu:8 PER_NODE:gp+            cpu=64,me+ cpu=2-18:14:0+ cpu=2-18:14:0+ cpu=spcd-usw2-007+ cpu=0,fs/disk=0,m+ cpu=2-18:14:0+ cpu=spcd-usw2-007+ cpu=0,fs/disk=0,m+ cpu=2-18:14:0+ energy=0,fs/di+ energy=spcd-usw2-0+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+




grep 19646 slurmctld.log
[2021-03-24T07:14:38.927] _slurm_rpc_submit_batch_job: JobId=19646 InitPrio=4294900443 usec=393
[2021-03-24T15:37:21.935] backfill: Started JobId=19646 in G-8GPU-64Cpu-471GB on spcd-usw2-00758
[2021-03-24T15:37:35.973] prolog_running_decr: Configuration for JobId=19646 is complete
[2021-03-25T06:29:21.209] requeue job JobId=19646 due to failure of node spcd-usw2-00758
[2021-03-25T06:29:21.212] Requeuing JobId=19646
[2021-03-25T06:31:25.285] backfill: Started JobId=19646 in G-8GPU-64Cpu-471GB on spcd-usw2-01277
[2021-03-25T06:38:20.770] job_time_limit: Configuration for JobId=19646 complete
[2021-03-25T06:38:20.770] Resetting JobId=19646 start time for node power up
[2021-03-25T12:57:56.792] requeue job JobId=19646 due to failure of node spcd-usw2-01277
[2021-03-25T12:57:56.794] Requeuing JobId=19646
[2021-03-25T16:36:27.696] backfill: Started JobId=19646 in G-8GPU-64Cpu-471GB on spcd-usw2-01277
[2021-03-25T16:42:50.174] job_time_limit: Configuration for JobId=19646 complete
[2021-03-25T16:42:50.174] Resetting JobId=19646 start time for node power up
[2021-03-25T20:22:37.346] requeue job JobId=19646 due to failure of node spcd-usw2-01277
[2021-03-25T20:22:37.348] Requeuing JobId=19646
[2021-03-25T20:24:42.464] sched: Allocate JobId=19646 NodeList=spcd-usw2-01780 #CPUs=64 Partition=G-8GPU-64Cpu-471GB
[2021-03-25T20:24:59.747] requeue job JobId=19646 due to failure of node spcd-usw2-01780
[2021-03-25T20:24:59.750] Requeuing JobId=19646
[2021-03-25T20:25:00.569] _slurm_rpc_requeue: 19646: Job is pending execution
[2021-03-25T20:25:00.593] sched: _update_job: setting features to us-west-2a for JobId=19646
[2021-03-25T20:25:00.594] _slurm_rpc_update_job: complete JobId=19646 uid=0 usec=12045
[2021-03-25T20:27:28.342] backfill: Started JobId=19646 in G-8GPU-64Cpu-471GB on spcd-usw2-00756
[2021-03-25T20:34:20.950] job_time_limit: Configuration for JobId=19646 complete
[2021-03-25T20:34:20.950] Resetting JobId=19646 start time for node power up
[2021-03-26T04:36:16.439] requeue job JobId=19646 due to failure of node spcd-usw2-00756
[2021-03-26T04:36:16.442] Requeuing JobId=19646
[2021-03-26T04:38:18.862] backfill: Started JobId=19646 in G-8GPU-64Cpu-471GB on spcd-usw2-00760
[2021-03-26T04:44:51.970] job_time_limit: Configuration for JobId=19646 complete
[2021-03-26T04:44:51.970] Resetting JobId=19646 start time for node power up
[2021-03-26T15:02:48.368] requeue job JobId=19646 due to failure of node spcd-usw2-00760
[2021-03-26T15:02:48.371] Requeuing JobId=19646
[2021-03-26T15:04:49.874] backfill: Started JobId=19646 in G-8GPU-64Cpu-471GB on spcd-usw2-00756
[2021-03-26T15:11:21.952] job_time_limit: Configuration for JobId=19646 complete
[2021-03-26T15:11:21.952] Resetting JobId=19646 start time for node power up
[2021-03-27T18:27:00.784] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=19646 uid 93422
Comment 18 Ben Roberts 2021-04-08 13:15:56 MDT
Thanks for sending that information. It looks like this job tried to start several times; after approximately 6 or 7 minutes, if the node hadn't come up correctly, the job was requeued and tried to start again later. Then at 2021-03-26T15:11:21.952 the job successfully started and ran until it was cancelled. The earlier start attempts resulted in usage time being recorded in the database, but since the nodes never came up they didn't result in anything being billed by AWS. The record that actually had usage is the one that shows up when duplicates aren't displayed.

Please let me know if you have questions about this.

Thanks,
Ben
Comment 19 Ben Roberts 2021-04-15 15:50:18 MDT
Hi Jimmy,

I wanted to follow up and make sure the information about the duplicate jobs I sent made sense.  Let me know if you have any additional questions about this.

Thanks,
Ben
Comment 20 Jimmy Hui 2021-04-15 16:31:10 MDT
Hi Ben,

I think we have a path forward with sacct. Thanks for the clarification.
Comment 21 Ben Roberts 2021-04-16 08:33:29 MDT
I'm glad to hear that this helped explain what was happening.  I'll go ahead and close this ticket.  Let us know if there's anything else we can do to help.

Thanks,
Ben