Ticket 9956 - RAPL plugin: incorrect *Watts and ConsumedEnergy values
Summary: RAPL plugin: incorrect *Watts and ConsumedEnergy values
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 21.08.x
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Oriol Vilarrubi
QA Contact: Felip Moll
Reported: 2020-10-07 15:57 MDT by Alexey Kozlov
Modified: 2024-11-22 13:20 MST
Site: -Other-

Attachments
proposed patch (7.71 KB, patch)
2020-10-12 12:55 MDT, Alexey Kozlov
Slurm patch for fixing the energy gathering in the DRAM modules (4.83 KB, patch)
2024-10-24 07:39 MDT, Oriol Vilarrubi
Description Alexey Kozlov 2020-10-07 15:57:18 MDT
The AcctGatherEnergy RAPL plugin uses the same energy unit for all CPU packages and DRAM domains:

https://github.com/SchedMD/slurm/blob/master/src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c#L326

However, on many modern server architectures (Haswell, Skylake X/SP, CascadeLake SP), the DRAM energy unit is distinct from the package energy unit stored in the MSR_RAPL_POWER_UNIT register. Instead, it has a fixed value of 15.3 µJ (the constant 15300 in the Linux powercap driver).

The (gloomy) situation becomes clear when looking at the Linux powercap driver code, which gives correct measurements:    

https://github.com/torvalds/linux/blob/master/drivers/powercap/intel_rapl_common.c#L964

https://github.com/torvalds/linux/blob/master/drivers/powercap/intel_rapl_common.c#L1017

So apparently, the only viable solution is to check the CPU model and set the DRAM energy unit accordingly.

As a result of this bug, AcctGatherEnergy reports incorrect power and energy values; in my experiments they were typically inflated by 30%-50%.
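
The unit mismatch described above can be illustrated with a minimal C sketch. It is not taken from the Slurm source: the function names and the `fixed_dram_unit` flag are hypothetical, and the decision of which CPU models need the fixed unit is left to the caller. It relies only on the documented layout of MSR_RAPL_POWER_UNIT (energy status unit in bits 12:8, one tick = 1/2^ESU joules) and on the fixed 15.3 µJ DRAM unit that the Linux powercap driver applies to the affected server parts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; names are not from Slurm. */

/* Bits 12:8 of MSR_RAPL_POWER_UNIT hold the Energy Status Unit (ESU).
 * One energy counter tick equals 1 / 2^ESU joules. */
static double pkg_energy_unit_joules(uint64_t power_unit_msr)
{
    unsigned esu = (unsigned)((power_unit_msr >> 8) & 0x1f);
    return 1.0 / (double)(1ULL << esu);
}

/* Assumption: the caller has already determined from the CPU model
 * whether the DRAM domain uses the fixed 15.3 uJ unit (fixed_dram_unit
 * != 0) or shares the package unit from the MSR. */
static double dram_energy_unit_joules(uint64_t power_unit_msr,
                                      int fixed_dram_unit)
{
    if (fixed_dram_unit)
        return 15.3e-6;
    return pkg_energy_unit_joules(power_unit_msr);
}

/* Convert a raw energy counter delta to joules with a per-domain unit. */
static double ticks_to_joules(uint64_t ticks, double unit_j)
{
    assert(unit_j > 0.0);
    return (double)ticks * unit_j;
}
```

As an illustrative input, a register value of 0x000a0e03 decodes to ESU = 14, so the package unit is 1/16384 J (about 61 µJ). Applying that unit to a DRAM counter that actually ticks in 15.3 µJ steps would overstate the DRAM energy by roughly a factor of four, which is consistent in direction with the inflated totals reported here.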
Comment 3 Alexey Kozlov 2020-10-12 12:55:22 MDT
Created attachment 16196
proposed patch

This patch fixes several issues in the power computation:

- CurrentWatts: using the CPU energy unit for the DRAM domain resulted in wrong values on many systems (Intel Haswell/Skylake/CascadeLake)

- CurrentWatts: the same energy unit was used for all packages; this might work for now, but could break at any time

- AveWatts: incorrect value due to missing normalization by the polling interval

- AveWatts: inaccurate value due to using an integer type to compute the running average (at some point the contribution of the current measurement becomes <1.0 and AveWatts is frozen)
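
The two AveWatts points above can be sketched in C as follows. This is an illustration under stated assumptions, not the code from the attached patch: `power_avg_t`, `interval_watts`, `power_avg_update` and `int_avg_update` are hypothetical names, and the integer variant is only one plausible formulation of a truncating running mean, included for contrast.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator; field names are illustrative, not Slurm's. */
typedef struct {
    double ave_watts;   /* running mean kept in floating point */
    uint64_t samples;   /* number of samples folded in so far */
} power_avg_t;

/* Average power over one polling interval: joules divided by seconds.
 * Skipping this division yields energy per poll rather than watts. */
static double interval_watts(double joules_delta, double interval_s)
{
    assert(interval_s > 0.0);
    return joules_delta / interval_s;
}

/* Incremental mean in floating point: ave += (x - ave) / n. */
static void power_avg_update(power_avg_t *a, double watts)
{
    a->samples++;
    a->ave_watts += (watts - a->ave_watts) / (double)a->samples;
}

/* Integer running mean for contrast; n is the sample count including
 * the new sample. The division truncates, so once n exceeds the
 * difference between the new sample and the mean, an increase no
 * longer changes the result. */
static uint32_t int_avg_update(uint32_t ave, uint32_t watts, uint32_t n)
{
    assert(n > 0);
    return (uint32_t)(((uint64_t)ave * (n - 1) + watts) / n);
}
```

For example, after 99 samples of 100 W, a 100th sample of 150 W moves the floating-point mean to 100.5 W, while the integer version computes 10050 / 100 and stays at 100 W.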
Comment 4 Mahendra Paipuri 2024-07-15 04:04:48 MDT
Hello,

Is there any reason why this issue never got attention? The bug still exists in the RAPL plugin, and because of it the energy consumption reported by SLURM is significantly higher than the actual values. Here is a short [report](https://gist.github.com/mahendrapaipuri/bcd357747d32073e3cb4622940db408b) on the bug.
Comment 6 Oriol Vilarrubi 2024-07-29 07:11:02 MDT
Hello Mahendra,

I am looking at how best to integrate this patch into the current Slurm version. Your report is very useful, many thanks.
Comment 15 Oriol Vilarrubi 2024-10-24 07:39:28 MDT
Created attachment 39397
Slurm patch for fixing the energy gathering in the DRAM modules

Hello Mahendra,

We have taken Alexey's patch and adapted it to the current Slurm codebase. Since you have already tested rapl-read on some of the affected CPUs, would you be so kind as to test this patch too? Please test it on a reduced set of nodes if possible. A segfault should not happen, but if one does we don't want it to impact production.

Many thanks in advance.
Comment 18 Mahendra Paipuri 2024-11-22 13:20:20 MST
Hello Oriol,

Sorry for the late response; I have been caught up with a lot of work leading up to SC24.

Unfortunately, I am not the one who manages the SLURM cluster at our center, so I cannot really test it on our production machines. I will see what I can do with my sysadmin team. I also have access to hardware where I can quickly spin up a SLURM cluster with the patch and check whether the issue has been fixed.

Thanks for the patch.

Regards
Mahendra