Summary: | RAPL plugin: incorrect *Watts and ConsumedEnergy values | ||
---|---|---|---|
Product: | Slurm | Reporter: | Alexey Kozlov <alexey.kozlov> |
Component: | Accounting | Assignee: | Oriol Vilarrubi <jvilarru> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | mahendra.paipuri, markus.hilger, sts, tim, uemit.seren |
Version: | 21.08.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | -Other- | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 25.05 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: |
proposed patch
Slurm patch fox fixing the energy gathering in the DRAM modules |
Description
Alexey Kozlov
2020-10-07 15:57:18 MDT
Created attachment 16196 [details]
proposed patch
This patch fixes multiple bugs/issues in power computation:
- CurrentWatts: using CPU energy unit for DRAM domain resulted in wrong values on many systems (Intel Haswell/Skylake/CascadeLake)
- CurrentWatts: same energy unit was used for all packages -> might work for now, but could break anytime
- AveWatts: incorrect value due to missing normalization by the polling interval
- AveWatts: inaccurate value due to using integer type to compute running average (at some point contribution of the current measurement becomes <1.0 -> AveWatts is frozen)
Hello, Any reason why this issue never got attention. The bug exists still in the RAPL plugin due to which the energy consumption reported by SLURM is significantly over-estimated than the actual values. Here is a little [report](https://gist.github.com/mahendrapaipuri/bcd357747d32073e3cb4622940db408b) on the bug. Hello Mahendra, I am looking at how to best integrate this patch to current slurm version, your report is being very useful, many thanks Created attachment 39397 [details]
Slurm patch fox fixing the energy gathering in the DRAM modules
Hello Mahendra,
We have taken Alexey patch and adapted to the current slurm codebase, seeing that you have already tested the rapl-read in some of the affected CPUs, would you be so kind to test this patch too? Please test it on a reduced set of nodes if possible. It should not happen but if a segfault occurs we don't want to impact production.
Many thanks in advance.
Hello Oriol, Sorry for the late response, I have been caught up with a lot of stuff leading to SC24. Unfortunately I am not the one that manages SLURM cluster on our center and I cannot really test it on our prod machines. I will see what I can do with my sysadmin team. I have also access to hardware where I will be able to quickly spin up SLURM cluster with the patch and see if it has been fixed. Thanks for the patch. Regards Mahendra Hello Mahendra, Did you had the chance to test this? Regards. Hello Oriol, Sorry for the late reply. Yes, I managed to test it by spinning up a SLURM cluster on a Broadwell machine. I used SLURM 24.11.0 and applied your patch before compiling the binaries. And I confirm that the power reported by `scontrol show node <node-name>` is consistent with the one reported by Linux's powercap. I used the script from https://web.eece.maine.edu/~vweaver/projects/rapl/rapl-read.c to get RAPL counters from Powercap. Results from Linux powercap: ``` # ./rapl -s RAPL read -- use -s for sysfs, -p for perf_event, -m for msr Found Broadwell-EP Processor type 0 (0), 1 (1), 2 (0), 3 (1), 4 (0), 5 (1), 6 (0), 7 (1) 8 (0), 9 (1), 10 (0), 11 (1), 12 (0), 13 (1), 14 (0), 15 (1) 16 (0), 17 (1), 18 (0), 19 (1), 20 (0), 21 (1), 22 (0), 23 (1) 24 (0), 25 (1), 26 (0), 27 (1), 28 (0), 29 (1), 30 (0), 31 (1) Detected 32 cores in 2 packages Trying sysfs powercap interface to gather results Sleeping 1 second Package 0 package-0 : 31.193768J dram : 0.836083J Package 1 package-1 : 23.376588J dram : 1.715023J ``` And output from `scontrol show node <node-name>` ``` # scontrol show node <node-name> NodeName=nova-18 Arch=x86_64 CoresPerSocket=8 CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.99 AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null) NodeAddr=nova-18 NodeHostName=<node-name> Version=24.11.0 OS=Linux 6.8.0-31-generic #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024 RealMemory=7937 AllocMem=0 FreeMem=60396 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=cpu BootTime=2024-12-09T12:00:57 SlurmdStartTime=2024-12-09T13:26:27 LastBusyTime=2024-12-09T13:26:28 ResumeAfterTime=None CfgTRES=cpu=16,mem=7937M,billing=16 AllocTRES= CurrentWatts=57 AveWatts=48 ``` The system is in idle state without any activity and hence the power usage values are fairly stable. So, we can conclude that SLURM's RAPL plugin is taking correct scaling factor for DRAM counters now. - Mahendra Dear Oriol and Mahendra, great to see progress on this one, thanks for your efforts! My original patch also included the fix to "AveWatts" value calculation. It has been partially applied back in 2021: https://github.com/SchedMD/slurm/commit/3bd1c3037f5b35da30ca977132c698a4da30c74d but the rounding issue still persists in the master branch. Here is the line in question: https://github.com/SchedMD/slurm/blob/master/src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c#L367 You should be able to reproduce the problem by keeping a node idle for a while, and then starting a job thereby increasing power consumption . You will likely notice that AveWatts is "stuck" at the idle power level. This happens because "energy->ave_watts" is an integer variable, and hence any increment <1.0 is ignored. After a long period of idle (i.e. the value of "readings" is high), it is quite likely that increment will be <1.0. Therefore, AveWatts value will be frozen at the idle power level. In the proposed patch, I introduced a static double variable to deal with this rounding issue. Admittedly this is not the most elegant solution, so maybe you can find a better one. https://support.schedmd.com/attachment.cgi?id=16196&action=diff#a/src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c_sec5 Best, Alexey Hi Mahendra an Alexey, Thanks for testing the patch. We will be pushing it to slurm code base soon. Alexey, about the rounding issue, I'll look deeper into it and if it is something reasonable I will open a new ticket with you on copy, just to keep the topics separated. What you did with the temporal variable seems reasonable, but as said I would like to look deeper into it before agreeing or disagreeing on it. Regards. I am on holiday until 13.01.2025 with limited mail access. For urgent requests please contact my manager peter.grossoehme@megware.com Hello Alexey, We have pushed the part of taking into account the different energy units for the intel processors in commit c8a694e3, that will be on the next slurm major release(25.05). About the avewatts precision, we have been studying that part, and taking into account that other energy plugins are also affected by this same issue we will be working on that in a separate ticket, I like your approach, but in order to better use it in other plugins we might want to change ave_watts in the energy struct itself rather than having a local variable, but as said we will be studying that in a separate ticket. Would you say that this is an issue that is important for your case? Do you use the energy of the node for some statistics? Regards. Hello Oriol, > We have pushed the part of taking into account the different energy units > for the intel processors in commit c8a694e3, that will be on the next slurm > major release(25.05). That's great news, thank you! > About the avewatts precision, we have been studying that part, and taking > into account that other energy plugins are also affected by this same issue > we will be working on that in a separate ticket, I like your approach, but > in order to better use it in other plugins we might want to change ave_watts > in the energy struct itself rather than having a local variable, but as said > we will be studying that in a separate ticket. Would you say that this is an > issue that is important for your case? Do you use the energy of the node for > some statistics? Currently, I'm using a custom solution based on IPMI+telegraf+influxdb for node energy monitoring. It seems like many HPC systems adopted a similar approach (see e.g. http://export.arxiv.org/pdf/2411.16204#section.5), in part due to lack of standardized solutions. Hence, I think support for basic node-level energy monitoring in SLURM would be extremely helpful for the community. In its current form, the usefulness of AveWatts is somewhat limited because there is no explicit period specification: it reports average power since SLURM daemon has been started. So if you are planning to refactor this part, maybe you could implement averages for fixed periods similar to /proc/loadavg. e.g. last 5min, 24h, 30days etc. Best. Alexey Hi Alexey, I have opened an internal issue regarding the average watts as well as your comment about it not having a defined timeframe. I will be closing this issue as fixed as it was more focused on the fixing the energy values for those intel processors and that code is already in slurm. Best regards. |