Ticket 5954

Summary: sreport -t percent -T node user topusers doesn't give percentages
Product: Slurm
Reporter: Ryan Day <day36>
Component: User Commands
Assignee: Broderick Gardner <broderick>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---
CC: sts
Version: 17.11.9
Hardware: Linux
OS: Linux
Site: LLNL
Version Fixed: 18.08.5

Description Ryan Day 2018-10-30 17:33:06 MDT
In the process of trying to track down some usage from runaway jobs (bug 5924), we realized that 'sreport -T node -t percent user topusers' doesn't report usage as a percent.

[day36@quartz1916:~]$ sreport -t percent user topusers start=10/1
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-29T23:59:59 (2505600 secs)
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account        Used   Energy 
--------- --------- --------------- --------------- ----------- -------- 
   quartz      dank    Dan Kirshner          baasic       8.69%    0.00% 
   quartz     lyang     Lin H. Yang         wbronze       8.02%    0.00% 
   quartz  bassenne Maxime Bassenne        stanford       6.36%    0.00% 
   quartz     jmilo John Camilo Pa+            utah       6.19%    0.00% 
   quartz    isaac4 Benjamin John +            utah       5.05%    0.00% 
   quartz  schaich2 David Alexande+        latticgc       3.35%    0.00% 
   quartz       kgb Kamron Groves +            utah       2.68%    0.00% 
   quartz     mandy Mandy Bethkenh+            pls2       2.62%    0.00% 
   quartz    mehta8 Yash Ajit Mehta         florida       2.54%    0.00% 
   quartz    chan52 Wai Hong Ronal+        stanford       2.54%    0.00% 
[day36@quartz1916:~]$ sreport -t percent -T node user topusers start=10/1
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-29T23:59:59 (2505600 secs)
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account      TRES Name      Used 
--------- --------- --------------- --------------- -------------- --------- 
   quartz      dank    Dan Kirshner          baasic           node 567262106 
   quartz     lyang     Lin H. Yang         wbronze           node 523362440 
   quartz  bassenne Maxime Bassenne        stanford           node 414989903 
   quartz     jmilo John Camilo Pa+            utah           node 403563246 
   quartz    isaac4 Benjamin John +            utah           node 329629776 
   quartz  schaich2 David Alexande+        latticgc           node 218605693 
   quartz       kgb Kamron Groves +            utah           node 174650590 
   quartz     mandy Mandy Bethkenh+            pls2           node 170659143 
   quartz    mehta8 Yash Ajit Mehta         florida           node 166034510 
   quartz    chan52 Wai Hong Ronal+        stanford           node 165702467 
[day36@quartz1916:~]$ 

It's also not reporting the TRES minutes:
[day36@quartz1916:~]$ sreport -T node user topusers start=10/1
--------------------------------------------------------------------------------
Top 10 Users 2018-10-01T00:00:00 - 2018-10-29T23:59:59 (2505600 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account      TRES Name      Used 
--------- --------- --------------- --------------- -------------- --------- 
   quartz      dank    Dan Kirshner          baasic           node   9454368 
   quartz     lyang     Lin H. Yang         wbronze           node   8722707 
   quartz  bassenne Maxime Bassenne        stanford           node   6916498 
   quartz     jmilo John Camilo Pa+            utah           node   6726054 
   quartz    isaac4 Benjamin John +            utah           node   5493830 
   quartz  schaich2 David Alexande+        latticgc           node   3643428 
   quartz       kgb Kamron Groves +            utah           node   2910843 
   quartz     mandy Mandy Bethkenh+            pls2           node   2844319 
   quartz    mehta8 Yash Ajit Mehta         florida           node   2767242 
   quartz    chan52 Wai Hong Ronal+        stanford           node   2761708 
[day36@quartz1916:~]$
Comment 1 Broderick Gardner 2018-10-31 16:21:17 MDT
I'm investigating this for you. I see the problem with node TRES usage, and I'll figure out why it is not correct. What do you expect to see for TRES Minutes?
Comment 2 Ryan Day 2018-10-31 16:26:25 MDT
Sorry, I guess I wasn't clear on that. I don't think there's anything wrong with the TRES minutes report (-T node without -t percent). My first thought on the -t percent -T node output was that it might be reporting the TRES minutes instead of the percent, so I included that output to show that it wasn't.
Comment 3 Broderick Gardner 2018-11-07 10:43:37 MST
Thanks for the clarification, that makes more sense. It turns out that it is showing a percent, but the percent is so large that the decimals and % are truncated. So the total time it is divided by is wrong somehow. I have replicated the issue, so I'll let you know when I have a fix.
Comment 4 Broderick Gardner 2018-12-05 13:38:16 MST
The percent doesn't work because cluster node usage is no longer tracked, so there is no total usage against which to normalize the user node usage. The value shown is actually the number of node seconds * 100.
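
The arithmetic can be checked against the output in the description. The following is a minimal Python sketch, not sreport's actual code: the function name `shown_field`, the `%.2f%%` formatting, and the assumption that an over-wide value is cut to the first 9 characters of the "Used" column are all illustrative guesses. Only the "node seconds * 100" relationship and the observation that the decimals and % are truncated come from this ticket.

```python
def shown_field(node_seconds: int, width: int = 9) -> str:
    """Hypothetical model of the 'Used' column for -t percent -T node."""
    # Per this comment, the printed value is node seconds * 100
    # (there is no cluster total to divide by).
    value = node_seconds * 100
    # Assumption: the value is formatted like a percentage and then
    # cut to the column width, losing the decimals and the '%' sign.
    text = f"{value:.2f}%"
    return text[:width]


# First row of the report: 567262106 was printed in percent mode,
# 9454368 in TRES minutes mode.
node_seconds = 567262106
print(shown_field(node_seconds))   # digits as they would appear in the column
print(node_seconds // 60)          # whole node minutes for the same user
```

Under these assumptions the truncated field contains the same digits as the node seconds, which is why the column looked like a raw usage count rather than a percentage.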

Cluster node usage isn't tracked because nodes are no longer the smallest unit of trackable computing resources; that is the CPU. User node usage is also not particularly useful because nodes are double counted if the user has multiple jobs on a single node. TRES node in general will likely be fully reworked in the future to fix this, though it is not yet clear what node usage accounting should mean. CPU usage is the proper TRES for accounting purposes. 
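
To illustrate the double counting, here is a small Python sketch under stated assumptions. It is not how Slurm stores or rolls up usage; the `Job` record, the node name, and both helper functions are hypothetical. It compares summing node time per job with counting each node's busy time once.

```python
from typing import NamedTuple


class Job(NamedTuple):
    node: str   # hypothetical node name
    start: int  # seconds since some epoch
    end: int


def per_job_node_seconds(jobs: list[Job]) -> int:
    """Sum node time job by job; shared nodes are counted once per job."""
    return sum(j.end - j.start for j in jobs)


def unique_node_seconds(jobs: list[Job]) -> int:
    """Count each node's busy time once by merging overlapping intervals."""
    by_node: dict[str, list[tuple[int, int]]] = {}
    for j in jobs:
        by_node.setdefault(j.node, []).append((j.start, j.end))
    total = 0
    for intervals in by_node.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:          # overlaps the current interval
                cur_end = max(cur_end, end)
            else:                         # gap: close the current interval
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Two one-hour jobs from the same user sharing one node for the same hour.
jobs = [Job("node1", 0, 3600), Job("node1", 0, 3600)]
print(per_job_node_seconds(jobs))  # 7200
print(unique_node_seconds(jobs))   # 3600
```

The per-job sum reports two node hours for a single node that was busy for one hour, which is the kind of inflation described above.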

Because of this, I will likely add a warning to sreport output indicating that TRES node usage is invalid.
Comment 5 Broderick Gardner 2019-01-14 12:20:48 MST
Were you able to find the runaway job usage? Was it missing usage due to jobs fixed by `sacctmgr show runawayjobs`?
Comment 6 Ryan Day 2019-01-14 12:35:39 MST
We did sort out the runaway job usage by completely re-rolling the various assoc_usage tables. The cleanup from 'sacctmgr show runaways' wasn't enough by itself. See bug 5924.
Comment 7 Broderick Gardner 2019-01-14 13:03:08 MST
Okay. There have been some fixes to `sacctmgr show runawayjobs` in version 18.08, both to identify some even rarer runaway job cases and to fix rollup issues. Hopefully that will be enough in the future.
Comment 9 Ryan Day 2019-01-14 14:28:15 MST
Sounds good. Thank you.
Comment 11 Broderick Gardner 2019-01-17 14:13:24 MST
This has now been patched; sreport now gives an error message when a requested report includes cluster node TRES utilization or percentages.

This will be included in version 18.08.5 (commit 58fb2de1e4d6).

Thanks