Ticket 1748 - Improve cpu profiling accuracy
Summary: Improve cpu profiling accuracy
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Profiling
Version: 15.08.x
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-06-16 02:46 MDT by Carlos Fenoy
Modified: 2015-08-18 03:55 MDT

See Also:
Site: Roche
Version Fixed: 15.08.0-0rc1


Attachments
tot_cpu from int to double patch (6.64 KB, patch)
2015-08-12 04:41 MDT, Carlos Fenoy

Description Carlos Fenoy 2015-06-16 02:46:57 MDT
Following the discussion on pull request #113, the goal of this request is to make the CPU time and CPU utilization fields floating-point so that usage can be reported more accurately.

The problem is that with high-rate profiling the CPU usage effectively becomes a binary field (see the output below). I would like a more accurate value for the CPU usage. I'm thinking about writing another profiling plugin that may be very useful in our environment, and probably also in other environments where compute nodes are shared.

As you can see in the output below, the cpu utilization is 100 or 0, and the cpu time is 1 or 0. Converting the cputime field to a float/double value would allow for both counters to be more accurate and provide better information to the users.

Regards,
Carlos 

$ cat extract_10948.csv
Job,Step,Node,Series,Date Time,ElapsedTime,CPU Frequency,CPU Time,CPU Utilization,rss,VM Size,Pages,Read_bytes,Write_bytes
10948,0,compute1,Task_0,2015-06-03 17:28:23,0,2400000,0,0.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:24,1,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:25,2,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:26,3,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:27,4,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:28,5,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:29,6,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:30,7,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:31,8,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:32,9,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:33,10,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:34,11,2399999,0,0.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:35,12,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:36,13,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:37,14,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:38,15,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:39,16,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:40,17,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:41,18,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:42,19,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:43,20,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:44,21,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:45,22,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:46,23,2399999,0,0.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:47,24,2399999,1,100.000,1276,9340,0,0.000,0.000
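
The 0/100 pattern above is consistent with taking the difference of cumulative CPU totals that have been truncated to whole seconds. The following is only a minimal sketch of that effect, not Slurm's actual code: the function names and the idea of passing cumulative totals as arguments are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical model: cumulative CPU time is stored as whole seconds,
 * so each total is truncated before the per-sample delta is taken. */
double util_from_int_totals(double prev_total, double cur_total,
                            double interval_seconds)
{
    uint32_t prev = (uint32_t)prev_total;  /* fractional seconds are lost */
    uint32_t cur = (uint32_t)cur_total;
    return 100.0 * (double)(cur - prev) / interval_seconds;
}

/* Same calculation with the cumulative totals kept as doubles. */
double util_from_double_totals(double prev_total, double cur_total,
                               double interval_seconds)
{
    return 100.0 * (cur_total - prev_total) / interval_seconds;
}
```

With a task that uses about 0.9 CPU seconds per one-second sample, the integer version reports 100 for most samples and 0 whenever two consecutive totals truncate to the same whole second (for example 9.0 and 9.9), while the double version reports roughly 90 each time.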
Comment 1 David Bigagli 2015-06-17 04:50:58 MDT
Enhancement request to use double type to represent cpu time instead of uint32_t.

David
Comment 2 Carlos Fenoy 2015-07-28 05:00:22 MDT
Any news on this request?
Should I implement it?
Comment 3 Danny Auble 2015-07-28 05:02:55 MDT
If you would like to go for it please do, I don't think we will have time before 15.08 to look at it.
Comment 4 Carlos Fenoy 2015-08-12 04:41:16 MDT
Created attachment 2112 [details]
tot_cpu from int to double patch

I've patched the code to change the tot_cpu field from int to double. I compiled with the --enable-developer flag and everything seems to work fine.
Please have a look at it.
Comment 5 Danny Auble 2015-08-14 07:19:35 MDT
Carlos, it appears this patch is a reverse patch, but I was able to figure it out ;).

In any case, it is committed in 78c0bf9a58036.

Thanks!
Comment 6 Moe Jette 2015-08-18 03:55:39 MDT
Important fix to the original code here; it prevents a divide by zero:
https://github.com/SchedMD/slurm/commit/94b11ac40bc569002b7376150883784ee57b2423
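
The linked commit is not reproduced here. As a general illustration of the kind of guard involved, a utilization calculation can check the elapsed interval before dividing. This is a sketch under assumptions: the function name, its parameters, and the choice to return 0 for an empty interval are hypothetical and not taken from the Slurm source.

```c
/* Hypothetical helper: CPU utilization in percent for one sample.
 * Returns 0.0 when no time has elapsed, instead of dividing by zero. */
double cpu_utilization(double cpu_delta_seconds, double elapsed_seconds)
{
    if (elapsed_seconds <= 0.0)
        return 0.0;  /* assumed policy: report 0 for an empty interval */
    return 100.0 * cpu_delta_seconds / elapsed_seconds;
}
```

Two samples can carry the same timestamp at high sampling rates, which is one way the elapsed interval could be zero.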