Ticket 8952

Summary: sreport returns incorrect percentages
Product: Slurm Reporter: Greg Wickham <greg.wickham>
Component: Accounting    Assignee: Scott Hilton <scott>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: albert.gil
Version: 19.05.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8958
Site: KAUST
Version Fixed: 19.05.7 20.02.3 20.11.0pre1

Description Greg Wickham 2020-04-28 06:19:28 MDT
Dear Team,

Can you please help us understand this sreport query:

$ sreport  -t percent -T gres/gpu,cpu,gres/gpu:v100,gres/gpu:gtx1080ti,gres/gpu:rtx2080ti   cluster Utilization start="2020-04-19T00:00:00" end="2020-04-26T00:00:00"
--------------------------------------------------------------------------------
Cluster Utilization 2020-04-19T00:00:00 - 2020-04-25T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name  Allocated       Down PLND Down      Idle   Reserved    Reported 
--------- -------------- ---------- ---------- --------- --------- ---------- ----------- 
   dragon            cpu     47.82%     28.19%     4.45%     3.63%     15.91%      99.99% 
   dragon       gres/gpu     68.34%      2.17%     0.00%    29.49%      0.00%     100.00% 
   dragon gres/gpu:gtx1+     99.94%      0.06%     0.00%     0.00%      0.00% 410170400.0 
   dragon gres/gpu:rtx2+     25.89%     74.11%     0.00%     0.00%      0.00%     138.02% 
   dragon  gres/gpu:v100     99.69%      0.31%     0.00%     0.00%      0.00%    1431.29% 


Issues:

 - the headings of this report don't match the description in the sreport manual (cluster Utilization)

 - the gres/gpu:rtx2+ is showing 74% down, but this is nowhere near correct
 - the 'Reported' column shows values over 100%

thanks,

   -greg
Comment 1 Scott Hilton 2020-04-28 15:43:14 MDT
Hi Greg, 

The first issue you mentioned looks like it matches the documentation to me. In https://slurm.schedmd.com/sreport.html under cluster utilization the second note says:
"The default view for the "Cluster Utilization" report includes the following fields: Cluster, Allocated, Down, PlannedDown, Idle, Reserved, Reported."

If this isn't what you mean please let me know.

The other two issues look like bugs to me but I will have to do more research. I'll get back to you on them once I know the answer.

-Scott
Comment 3 Greg Wickham 2020-04-29 01:56:41 MDT
Regarding the headings - not sure why I got that wrong.

This may help too:

$ sreport   -T gres/gpu,cpu,gres/gpu:v100,gres/gpu:gtx1080ti,gres/gpu:rtx2080ti   cluster Utilization start="2020-04-19T00:00:00" end="2020-04-26T00:00:00"
--------------------------------------------------------------------------------
Cluster Utilization 2020-04-19T00:00:00 - 2020-04-25T23:59:59
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
  Cluster      TRES Name  Allocated       Down PLND Down      Idle   Reserved    Reported 
--------- -------------- ---------- ---------- --------- --------- ---------- ----------- 
   dragon            cpu   93153566   54902197   8669504   7077713   30981166   194784146 
   dragon       gres/gpu    2734732      86964         0   1180063          0     4001760 
   dragon gres/gpu:gtx1+      68318         44         0         0          0       68362 
   dragon gres/gpu:rtx2+      28817      82486         0         0          0      111303 
   dragon  gres/gpu:v100    1006816       3104         0         0          0     1009920
Comment 4 Scott Hilton 2020-04-29 10:38:55 MDT
Hi Greg, 

Thanks for the other report; that helped. To fully identify the issue, could you share a database dump? I'm not sure whether the issue is in the database values or in the sreport function.

For reference this is how you make a database dump:
mysqldump --databases <your_database> > <file>

Please zip the file as well:
gzip <file>

-Scott
Comment 5 Greg Wickham 2020-04-29 22:41:57 MDT
I've taken a dump of the db. Compressed, it's 1.1 GB.

Can it be uploaded to somewhere? (Or do you want it attached to this ticket?)

  -greg
Comment 6 Scott Hilton 2020-04-30 09:44:30 MDT
Greg,

That is bigger than the maximum file size for this site. So it won't work here.

You could share it via google drive (or a similar service) with scott@schedmd.com.

-Scott
Comment 7 Scott Hilton 2020-05-06 09:54:22 MDT
Greg,

How do you know that 74% down is not correct?

Could you send me your slurm.conf and any included .conf files? I want to see how the nodes are set up, to help me identify patterns in what is reported as down.

Also, I got your database working and have reproduced the same issues on my machine now.

I'm still not sure what is causing Reported to be greater than 100%, but I am looking into it.

Thanks, Scott
Comment 9 Scott Hilton 2020-05-07 10:40:21 MDT
Greg,

Also send your gres.conf if you can.

Thanks,

Scott
Comment 12 Scott Hilton 2020-05-11 15:51:55 MDT
(In reply to Scott Hilton from comment #7)
> Greg,
> 
> How do you know that 74% down is not correct?
> 
> Could you send me your slurm.conf and any included .conf files. I want to
> see how the nodes are set up to help me identify patterns in what is
> reported as down.
> 
> Also, I got your database working and have reproduced same issues on my
> machine now.
> 
> I'm still not sure what is causing reported to be greater than 100% but I am
> looking into it.
> 
> Thanks, Scott

I think I forgot to enable notifications on this comment.

Can you send me your slurm.conf and included files?

Thanks, Scott
Comment 14 Scott Hilton 2020-05-19 14:38:36 MDT
Greg, 

I believe part of the fix for bug 8958 will also fix the issue where you were getting percentages much larger than 100%. This is the commit:

https://github.com/SchedMD/slurm/commit/010d752b275a1ccb1c9537238233c938f2412ec2

Could you explain how you know that 74% down for the rtx is not correct?
Also, do you have any other info about the rtx setup that may be relevant?

Take care,

Scott
Comment 15 Greg Wickham 2020-05-19 22:27:50 MDT
Hi Scott,

The evidence that the 2080s were up is stored in our slurm database (we use slurm to report when nodes are down etc).

We have 32 x RTX2080ti in 4 nodes (8 GPUs per node).

During the week in question one node was down (all week) due to a GPU fault.

The remaining three nodes were processing jobs (all week).

For slurm to report that the 2080Ti was down 74.11% would have meant that three of the four nodes were not available for virtually the entire week (just shy of 75% down).

We are confident this was not the case.
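As a rough sanity check (back-of-the-envelope arithmetic only, assuming exactly one 8-GPU node out of the four was unavailable for the whole week):

# expected "down" fraction if one of four 8-GPU nodes is down all week
gpus_total = 4 * 8        # 32 x RTX2080ti
gpus_down  = 8            # the one faulty node
print(f"{gpus_down / gpus_total:.0%}")   # -> 25%, nowhere near 74.11%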

(BTW - I sent all out configs as a dropbox link - did you get it?)

   -greg
Comment 16 Scott Hilton 2020-05-20 09:14:33 MDT
Greg, 

Yes, I did get your configs. They were quite helpful in diagnosing the problems for this bug and bug 8958.

Thanks for the info. I'll keep you updated on that last issue.

-Scott
Comment 30 Scott Hilton 2020-05-29 11:52:21 MDT
Greg,

It looks like the total time for the one down graphics card is correct. The percentage was off because we had an incorrect TRES count (8 instead of 24). This should also be fixed by the change in 8958.

If you had had the fix from 8958, I would guess you would get something like:
>Cluster      TRES Name   Allocated       Down  PLND Down      Idle   Reserved    Reported 
>dragon  gres/gpu:rtx2+         ~9%       ~26%      0.00%      ~65%      0.00%        100%    

That means, though, that only ~9% of the time was allocated. This could mean that the rest of the time the GPUs were actually idle, or that there is a bug in counting the allocated time (which I am looking into).
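For reference, the rough arithmetic behind those estimates (a sketch only, using the TRES-minute figures from comment 3 and assuming the full 32-GPU rtx2080ti count you described in comment 15):

# 7-day window = 10080 minutes; 32 GPUs -> 322560 GPU-minutes available
available = 32 * 10080            # 322560
allocated = 28817 / available     # ~0.089 -> ~9%
down      = 82486 / available     # ~0.256 -> ~26%
idle      = 1 - allocated - down  # ~0.65  -> ~65%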

If it was idle time, this could result from jobs running on just the CPUs of those nodes, or from jobs reserving the nodes until they have all the resources they need to run.

Have you encountered any similar bugs to this? 

Also, have you updated to 19.05.7 or 20.02.3?

-Scott
Comment 32 Greg Wickham 2020-05-30 22:53:24 MDT
Hi Scott,

I ran a simple script over the output of sacct -X for the week in question. The results are:

gpu504-37 0 GPU seconds (0% of period)
gpu510-02 4224835 GPU seconds (87% of period)
gpu510-12 3199137 GPU seconds (66% of period)
gpu510-17 2431046 GPU seconds (50% of period)

(GPU seconds is the number of seconds during the week that a GPU on the node was allocated to a job. The percentage is the allocated GPU seconds divided by the GPU seconds available, i.e. 8 GPUs multiplied by the number of seconds in the week.)
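(For illustration only, this is not the actual script, just a minimal sketch of the tally described above; it assumes single-node jobs that start and end inside the window:)

import re
import subprocess
from collections import defaultdict
from datetime import datetime

def ts(s):
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S")

START, END = ts("2020-04-19T00:00:00"), ts("2020-04-26T00:00:00")

# allocation-level records only (-X), parsable (-P), no header (-n)
out = subprocess.run(
    ["sacct", "-X", "-P", "-n",
     "--format=JobIDRaw,Start,End,NodeList,AllocTRES",
     "-S", "2020-04-19T00:00:00", "-E", "2020-04-26T00:00:00"],
    capture_output=True, text=True, check=True).stdout

gpu_seconds = defaultdict(int)
for line in out.splitlines():
    jobid, start, end, nodelist, alloctres = line.split("|")
    m = re.search(r"gres/gpu=(\d+)", alloctres)
    if not m or "Unknown" in (start, end) or "None" in (start, end):
        continue
    gpu_seconds[nodelist] += int(m.group(1)) * int((ts(end) - ts(start)).total_seconds())

period = (END - START).total_seconds() * 8   # 8 GPUs per node
for node, secs in sorted(gpu_seconds.items()):
    print(f"{node} {secs} GPU seconds ({secs / period:.0%} of period)")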

Our slurm version is currently:

$ scontrol -V
slurm 19.05.5

We plan on upgrading to 20.02 in the next maintenance session (in three weeks).
In the interim we can upgrade to a later 19.05 release if that is recommended.
Comment 33 Scott Hilton 2020-06-02 12:43:18 MDT
Greg, 

Could you send me that script?

Thanks,

Scott
Comment 34 Greg Wickham 2020-06-02 23:06:56 MDT
Hi Scott,

Script is at https://gitlab.com/greg.wickham/python-slurm

When the script runs, it identifies a job that spans more than one node. Looking at this job further (# 10342997), I'm not sure how to update the script to keep track of the correct allocations.

Maybe the following questions should be a separate ticket, but they are related to the data for this ticket. (Let me know if I should open one.)

For job 10342997

$ sacct -P -j 10342997 --format jobidraw,start,end,nodelist,reqtres,reqgres,alloctres,allocgres
JobIDRaw|Start|End|NodeList|ReqTRES|ReqGRES|AllocTRES|AllocGRES
10342997|2020-04-21T18:33:34|2020-04-21T18:33:50|gpu504-37,gpu510-17|billing=20,cpu=20,gres/gpu=4,mem=120G,node=1|gpu:8|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.batch|2020-04-21T18:33:34|2020-04-21T18:33:50|gpu504-37||gpu:8|cpu=10,mem=60G,node=1|gpu:8
10342997.extern|2020-04-21T18:33:34|2020-04-21T18:33:51|gpu504-37,gpu510-17||gpu:8|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.0|2020-04-21T18:33:36|2020-04-21T18:33:52|gpu510-17||gpu:8|cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=6G,node=1|gpu:8

$ sinfo -n gpu510-17,gpu504-37 -o "%n,%G"
HOSTNAMES,GRES
gpu504-37,gpu:gtx1080ti:8(S:0-1)
gpu510-17,gpu:rtx2080ti:8(S:0-1)

Q1: The job "10342997.0" indicates using "gres/gpu:gtx1080ti=4" on node gpu510-17, but the node has "rtx2080ti"

Q2: 10342997.0 shows AllocTRES is 4 GPUs while AllocGRES is 8 GPUs

Q3: 10342997.batch and 10342997.0 are running at the same time on different nodes - what is the total of GPUs allocated?
Comment 35 Scott Hilton 2020-06-03 10:08:56 MDT
Greg,

Could you post those questions in a new ticket? We will just focus on the original question involving missing allocation time in this thread. 

This will help us keep track of the issues.

Thanks, 

Scott
Comment 36 Greg Wickham 2020-06-03 20:04:35 MDT
created https://bugs.schedmd.com/show_bug.cgi?id=9159
Comment 37 Scott Hilton 2020-06-05 12:42:49 MDT
Greg,

The numbers from your script look like they are in the right ballpark to me. This tells me that the direct accounting records are probably correct. The issue is probably in adding up the time. 

I'm not sure but I think there is a good chance that the previous fix (see 8958) could also fix this discrepancy where we are missing allocation time, Leading to 74% down. 

Once you update to 20.02.3 or 19.05.7, if you see any new issues like this let me know. 

I'm going to close this ticket for now, but if the same issue appears again please reopen it so we can address the bug.

Good luck,

Scott