Summary: | RAPL plugin: incorrect *Watts and ConsumedEnergy values | ||
---|---|---|---|
Product: | Slurm | Reporter: | Alexey Kozlov <alexey.kozlov> |
Component: | Accounting | Assignee: | Oriol Vilarrubi <jvilarru> |
Status: | OPEN --- | QA Contact: | Tim Wickberg <tim> |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | mahendra.paipuri, markus.hilger, sts, tim, uemit.seren |
Version: | 21.08.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | -Other- | Alineos Sites: | --- |
Attachments: | proposed patch |
Description
Alexey Kozlov
2020-10-07 15:57:18 MDT
Created attachment 16196
proposed patch
This patch fixes several bugs in the power computation:
- CurrentWatts: the CPU energy unit was applied to the DRAM domain, which produced wrong values on many systems (Intel Haswell/Skylake/CascadeLake).
- CurrentWatts: the same energy unit was used for all packages. This might work for now, but could break at any time.
- AveWatts: incorrect value due to missing normalization by the polling interval.
- AveWatts: inaccurate value because the running average was computed with an integer type. Once the contribution of the current measurement drops below 1.0, AveWatts freezes.
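
The issues listed above can be illustrated with a small sketch. This is not the plugin's code and not the attached patch. All function names are hypothetical, and the unit exponents used in the usage note (14 and 16) are example values only; real units have to be read from the hardware for each package and domain. The sketch assumes a RAPL-style counter whose energy unit is 1/2^ESU joules, and it compares a truncating integer running average with a floating-point one.

```python
def energy_unit_joules(esu_bits: int) -> float:
    """Energy unit in joules for a unit exponent: 1 / 2**esu_bits.
    The exponent is assumed to be read separately per package and per domain."""
    return 1.0 / (1 << esu_bits)


def watts(raw_prev: int, raw_now: int, unit_j: float,
          interval_s: float, counter_bits: int = 32) -> float:
    """Average power over one polling interval.
    The modulo handles a single wraparound of the raw counter."""
    delta = (raw_now - raw_prev) % (1 << counter_bits)
    # counts -> joules -> watts (normalized by the polling interval)
    return delta * unit_j / interval_s


def running_avg_int(avg: int, sample: int, n: int) -> int:
    """Running average using integer arithmetic (truncating division)."""
    return (avg * (n - 1) + sample) // n


def running_avg_float(avg: float, sample: float, n: int) -> float:
    """Running average using floating point (incremental mean)."""
    return avg + (sample - avg) / n


def simulate(samples):
    """Feed the same samples to both averages and return (int_avg, float_avg)."""
    avg_i, avg_f = 0, 0.0
    for n, s in enumerate(samples, start=1):
        avg_i = running_avg_int(avg_i, s, n)
        avg_f = running_avg_float(avg_f, float(s), n)
    return avg_i, avg_f
```

With 100 samples at 100 W followed by 10000 samples at 200 W, `simulate` leaves the integer average stuck at 100, while the floating-point average tracks the true mean of about 199 W. Likewise, converting a DRAM counter with an example exponent of 14 instead of 16 inflates the result by a factor of 4.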
Hello, is there any reason why this issue never got attention? The bug still exists in the RAPL plugin, and because of it the energy consumption reported by Slurm is significantly higher than the actual values. Here is a short [report](https://gist.github.com/mahendrapaipuri/bcd357747d32073e3cb4622940db408b) on the bug.

Hello Mahendra, I am looking at how best to integrate this patch into the current Slurm version. Your report is very useful, many thanks.