Ticket 1748

Summary: Improve cpu profiling accuracy
Product: Slurm
Reporter: Carlos Fenoy <carlos.fenoy>
Component: Profiling
Assignee: David Bigagli <david>
Status: RESOLVED FIXED
QA Contact:
Severity: 5 - Enhancement
Priority: ---
CC: brian, da
Version: 15.08.x   
Hardware: Linux   
OS: Linux   
Site: Roche
Version Fixed: 15.08.0-0rc1
Attachments: tot_cpu from int to double patch

Description Carlos Fenoy 2015-06-16 02:46:57 MDT
Following the discussion on pull request #113, this request asks that the CPU time and CPU utilization fields be stored as floating-point values so that the reported usage is more accurate.

The problem is that with high-rate profiling, the CPU usage effectively becomes a binary field (see the output below). I would like a more accurate value for the CPU usage. I'm thinking about writing another profiling plugin that may be very useful in our environment, and probably also in other environments where compute nodes are shared.

As you can see in the output below, the CPU utilization is either 100 or 0, and the CPU time is either 1 or 0. Converting the CPU time field to a float/double value would make both counters more accurate and give users better information.

Regards,
Carlos 

$ cat extract_10948.csv
Job,Step,Node,Series,Date Time,ElapsedTime,CPU Frequency,CPU Time,CPU Utilization,rss,VM Size,Pages,Read_bytes,Write_bytes
10948,0,compute1,Task_0,2015-06-03 17:28:23,0,2400000,0,0.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:24,1,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:25,2,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:26,3,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:27,4,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:28,5,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:29,6,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:30,7,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:31,8,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:32,9,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:33,10,2400000,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:34,11,2399999,0,0.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:35,12,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:36,13,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:37,14,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:38,15,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:39,16,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:40,17,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:41,18,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:42,19,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:43,20,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:44,21,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:45,22,2399999,1,100.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:46,23,2399999,0,0.000,1276,9340,0,0.000,0.000
10948,0,compute1,Task_0,2015-06-03 17:28:47,24,2399999,1,100.000,1276,9340,0,0.000,0.000
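
The 0/1 pattern above is what integer truncation tends to produce when the sampling interval is about one second. The following is a minimal sketch of that effect, not code from Slurm: the helper names are hypothetical, and it assumes CPU time is available in milliseconds and that utilization is computed as the CPU-time delta divided by the sampling interval.

```c
#include <stdint.h>

/* Hypothetical helper: CPU time as whole seconds, the way an integer
 * counter records it (the fractional part is discarded). */
uint32_t cpu_seconds_truncated(uint64_t cpu_ms)
{
    return (uint32_t)(cpu_ms / 1000);
}

/* Hypothetical helper: CPU time as fractional seconds. */
double cpu_seconds_exact(uint64_t cpu_ms)
{
    return (double)cpu_ms / 1000.0;
}

/* Utilization in percent over one sampling interval.
 * The caller is assumed to pass interval_sec > 0. */
double utilization_pct(double cpu_delta_sec, double interval_sec)
{
    return 100.0 * cpu_delta_sec / interval_sec;
}
```

For a task that consumes 900 ms of CPU in every one-second sample, the truncated counter advances by either 0 or 1 between samples, so the derived utilization is either 0% or 100%. With the fractional counter, the same samples give about 90%.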
Comment 1 David Bigagli 2015-06-17 04:50:58 MDT
Enhancement request to use double type to represent cpu time instead of uint32_t.

David
Comment 2 Carlos Fenoy 2015-07-28 05:00:22 MDT
Any news on this request?
Should I implement it?
Comment 3 Danny Auble 2015-07-28 05:02:55 MDT
If you would like to go for it, please do. I don't think we will have time to look at it before 15.08.
Comment 4 Carlos Fenoy 2015-08-12 04:41:16 MDT
Created attachment 2112 [details]
tot_cpu from int to double patch

I've patched the code to change the tot_cpu field from int to double. I compiled with the --enable-developer flag and everything seems to work fine.
Please have a look at it.
Comment 5 Danny Auble 2015-08-14 07:19:35 MDT
Carlos, it appears this patch is a reverse patch, but I was able to figure it out ;).

In any case, it is committed in 78c0bf9a58036.

Thanks!
Comment 6 Moe Jette 2015-08-18 03:55:39 MDT
Important fix to the original code here; it prevents a divide by zero:
https://github.com/SchedMD/slurm/commit/94b11ac40bc569002b7376150883784ee57b2423
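
The commit itself is not reproduced here. As a generic sketch of the kind of guard involved, the following assumes the divisor is the elapsed time between two samples, which can be zero (the first sample in the output above has an ElapsedTime of 0). The function name and the choice to report 0 for an empty interval are hypothetical and may differ from what the commit does.

```c
/* Hypothetical helper, not taken from the Slurm source:
 * CPU utilization in percent, guarded against a zero-length interval. */
double safe_utilization_pct(double cpu_delta_sec, double elapsed_sec)
{
    /* No time has passed, so a ratio is undefined; report 0 instead
     * of dividing by zero. */
    if (elapsed_sec <= 0.0)
        return 0.0;

    return 100.0 * cpu_delta_sec / elapsed_sec;
}
```

Returning 0 for an empty interval keeps the profile series well-defined. A caller that needs to distinguish "no data" from "idle" might prefer a separate validity flag.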