Summary: | RAPL plugin: incorrect *Watts and ConsumedEnergy values | ||
---|---|---|---|
Product: | Slurm | Reporter: | Alexey Kozlov <alexey.kozlov> |
Component: | Accounting | Assignee: | Oriol Vilarrubi <jvilarru> |
Status: | OPEN --- | QA Contact: | Felip Moll <felip.moll> |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | mahendra.paipuri, markus.hilger, sts, tim, uemit.seren |
Version: | 21.08.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | -Other- | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: |
proposed patch
Slurm patch for fixing the energy gathering in the DRAM modules |
Description
Alexey Kozlov
2020-10-07 15:57:18 MDT
Created attachment 16196 [details]
proposed patch
This patch fixes multiple bugs/issues in power computation:
- CurrentWatts: using CPU energy unit for DRAM domain resulted in wrong values on many systems (Intel Haswell/Skylake/CascadeLake)
- CurrentWatts: same energy unit was used for all packages -> might work for now, but could break anytime
- AveWatts: incorrect value due to missing normalization by the polling interval
- AveWatts: inaccurate value due to using integer type to compute running average (at some point contribution of the current measurement becomes <1.0 -> AveWatts is frozen)
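The issues listed above can be illustrated with a minimal sketch. It is not the code from the attached patch, and every helper name in it is hypothetical. It assumes that a raw RAPL count is worth 1 / 2^ESU joules, that the package domain reports ESU = 14 and that the DRAM domain uses a fixed ESU = 16 (about 15.3 µJ per count, which is how the DRAM unit is commonly described for Haswell-class servers). The exponents actually in effect should be read per package and per domain on the target hardware.

```python
# Illustrative sketch only: helper names are hypothetical, not Slurm functions.

def counter_delta(prev: int, curr: int, width_bits: int = 32) -> int:
    """Difference between two raw counter readings, tolerating one wraparound."""
    return (curr - prev) % (1 << width_bits)

def energy_joules(raw_delta: int, esu_exponent: int) -> float:
    """Convert raw counts to joules; one count is 1 / 2**esu_exponent joules."""
    return raw_delta / (1 << esu_exponent)

def watts(joules: float, interval_s: float) -> float:
    """Power is energy divided by the polling interval, not the raw energy."""
    return joules / interval_s

def int_running_average(samples):
    """Running mean with integer arithmetic, as an integer-typed field would do."""
    ave, n = 0, 0
    for s in samples:
        ave = (ave * n + s) // (n + 1)  # the fractional contribution is dropped
        n += 1
    return ave

def float_running_average(samples):
    """Same running mean, kept in floating point."""
    ave, n = 0.0, 0
    for s in samples:
        ave += (s - ave) / (n + 1)
        n += 1
    return ave

PKG_ESU = 14   # assumed package energy-unit exponent
DRAM_ESU = 16  # assumed fixed DRAM energy-unit exponent

raw = counter_delta(0xFFFFFFF0, 0x0000FFF0)         # 65536 counts across a wrap
dram_wrong = energy_joules(raw, PKG_ESU)            # DRAM scaled with the CPU unit
dram_right = energy_joules(raw, DRAM_ESU)           # DRAM scaled with its own unit
dram_watts = watts(dram_right, interval_s=2.0)      # normalized by a 2 s interval

samples = [100] * 200 + [200] * 100                 # load doubles after 200 polls
ave_int = int_running_average(samples)
ave_float = float_running_average(samples)
```

Under these assumptions the DRAM energy is overestimated by a factor of four when the package unit is applied (4.0 J instead of 1.0 J). The integer running average stays at 100 after the load doubles, because each new sample contributes less than 1 once the sample count is large, while the floating-point average moves to about 133.3.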
Hello,
Is there any reason why this issue never got attention? The bug still exists in the RAPL plugin, and because of it the energy consumption reported by SLURM is significantly overestimated compared to the actual values. Here is a short [report](https://gist.github.com/mahendrapaipuri/bcd357747d32073e3cb4622940db408b) on the bug.
Hello Mahendra,
I am looking at how best to integrate this patch into the current Slurm version. Your report is very useful, many thanks.
Created attachment 39397 [details]
Slurm patch for fixing the energy gathering in the DRAM modules
Hello Mahendra,
We have taken Alexey's patch and adapted it to the current Slurm codebase. Since you have already tested rapl-read on some of the affected CPUs, would you be so kind as to test this patch too? Please test it on a reduced set of nodes if possible. A segfault should not happen, but if one does occur we don't want it to impact production.
Many thanks in advance.
Hello Oriol,
Sorry for the late response, I have been caught up with a lot of work leading up to SC24. Unfortunately, I am not the one who manages the SLURM cluster at our center, so I cannot really test it on our prod machines. I will see what I can do with my sysadmin team. I also have access to hardware where I can quickly spin up a SLURM cluster with the patch and check whether the issue is fixed.
Thanks for the patch.
Regards
Mahendra