Ticket 9956

Summary:	RAPL plugin: incorrect *Watts and ConsumedEnergy values
Product:	Slurm	Reporter:	Alexey Kozlov <alexey.kozlov>
Component:	Accounting	Assignee:	Oriol Vilarrubi <jvilarru>
Status:	RESOLVED FIXED	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	mahendra.paipuri, markus.hilger, sts, tim, uemit.seren
Version:	21.08.x
Hardware:	Linux
OS:	Linux
Site:	-Other-	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	25.05
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	proposed patch Slurm patch fox fixing the energy gathering in the DRAM modules

Description Alexey Kozlov 2020-10-07 15:57:18 MDT

AcctGatherEnergy RAPL plugin is using the same energy unit for all CPU and DRAM packages:

https://github.com/SchedMD/slurm/blob/master/src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c#L326

However, on many modern server architectures (Haswell, Skylake X/SP, CascadeLake SP), DRAM energy unit is distinct from the package energy unit stored in the MSR_RAPL_POWER_UNIT register. Instead, it has a fixed value of 1/15300.

The (gloomy) situation becomes clear when looking at the Linux powercap driver code, which gives correct measurements:    

https://github.com/torvalds/linux/blob/master/drivers/powercap/intel_rapl_common.c#L964

https://github.com/torvalds/linux/blob/master/drivers/powercap/intel_rapl_common.c#L1017

So apparently, the only viable solution would be to check CPU model and set DRAM energy unit accordingly.

As a result of this bug, AcctGatherEnergy reports power and energy values which are incorrect, and in my experiments they were usually inflated by as much as 30%-50%.

Comment 3 Alexey Kozlov 2020-10-12 12:55:22 MDT

Created attachment 16196 [details]
proposed patch

This patch fixes multiple bugs/issues in power computation:

- CurrentWatts: using CPU energy unit for DRAM domain resulted in wrong values on many systems (Intel Haswell/Skylake/CascadeLake)

- CurrentWatts: same energy unit was used for all packages -> might work for now, but could break anytime 

- AveWatts: incorrect value due to missing normalization by the polling interval

- AveWatts: inaccurate value due to using integer type to compute running average (at some point contribution of the current measurement becomes <1.0 -> AveWatts is frozen)

Comment 4 Mahendra Paipuri 2024-07-15 04:04:48 MDT

Hello,

Any reason why this issue never got attention. The bug exists still in the RAPL plugin due to which the energy consumption reported by SLURM is significantly over-estimated than the actual values. Here is a little [report](https://gist.github.com/mahendrapaipuri/bcd357747d32073e3cb4622940db408b) on the bug.

Comment 6 Oriol Vilarrubi 2024-07-29 07:11:02 MDT

Hello Mahendra,

I am looking at how to best integrate this patch to current slurm version, your report is being very useful, many thanks

Comment 15 Oriol Vilarrubi 2024-10-24 07:39:28 MDT

Created attachment 39397 [details]
Slurm patch fox fixing the energy gathering in the DRAM modules

Hello Mahendra,

We have taken Alexey patch and adapted to the current slurm codebase, seeing that you have already tested the rapl-read in some of the affected CPUs, would you be so kind to test this patch too? Please test it on a reduced set of nodes if possible. It should not happen but if a segfault occurs we don't want to impact production.

Many thanks in advance.

Comment 18 Mahendra Paipuri 2024-11-22 13:20:20 MST

Hello Oriol,

Sorry for the late response, I have been caught up with a lot of stuff leading to SC24.

Unfortunately I am not the one that manages SLURM cluster on our center and I cannot really test it on our prod machines. I will see what I can do with my sysadmin team. I have also access to hardware where I will be able to quickly spin up SLURM cluster with the patch and see if it has been fixed. 

Thanks for the patch.

Regards
Mahendra

Comment 19 Oriol Vilarrubi 2024-12-06 03:09:46 MST

Hello Mahendra,

Did you had the chance to test this?

Regards.

Comment 20 Mahendra Paipuri 2024-12-09 05:34:17 MST

Hello Oriol,

Sorry for the late reply.

Yes, I managed to test it by spinning up a SLURM cluster on a Broadwell machine. I used SLURM 24.11.0 and applied your patch before compiling the binaries. And I confirm that the power reported by `scontrol show node <node-name>` is consistent with the one reported by Linux's powercap. 

I used the script from https://web.eece.maine.edu/~vweaver/projects/rapl/rapl-read.c to get RAPL counters from Powercap.

Results from Linux powercap:

```
# ./rapl -s

RAPL read -- use -s for sysfs, -p for perf_event, -m for msr

Found Broadwell-EP Processor type
        0 (0), 1 (1), 2 (0), 3 (1), 4 (0), 5 (1), 6 (0), 7 (1)
        8 (0), 9 (1), 10 (0), 11 (1), 12 (0), 13 (1), 14 (0), 15 (1)
        16 (0), 17 (1), 18 (0), 19 (1), 20 (0), 21 (1), 22 (0), 23 (1)
        24 (0), 25 (1), 26 (0), 27 (1), 28 (0), 29 (1), 30 (0), 31 (1)

        Detected 32 cores in 2 packages


Trying sysfs powercap interface to gather results

        Sleeping 1 second

        Package 0
                package-0       : 31.193768J
                dram    : 0.836083J
        Package 1
                package-1       : 23.376588J
                dram    : 1.715023J
```

And output from `scontrol show node <node-name>`

```
# scontrol show node <node-name>
NodeName=nova-18 Arch=x86_64 CoresPerSocket=8 
   CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.99
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=nova-18 NodeHostName=<node-name> Version=24.11.0
   OS=Linux 6.8.0-31-generic #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024 
   RealMemory=7937 AllocMem=0 FreeMem=60396 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cpu 
   BootTime=2024-12-09T12:00:57 SlurmdStartTime=2024-12-09T13:26:27
   LastBusyTime=2024-12-09T13:26:28 ResumeAfterTime=None
   CfgTRES=cpu=16,mem=7937M,billing=16
   AllocTRES=
   CurrentWatts=57 AveWatts=48
   
```

The system is in idle state without any activity and hence the power usage values are fairly stable. So, we can conclude that SLURM's RAPL plugin is taking correct scaling factor for DRAM counters now.

-
Mahendra

Comment 21 Alexey Kozlov 2024-12-09 07:11:22 MST

Dear Oriol and Mahendra,

great to see progress on this one, thanks for your efforts!

My original patch also included the fix to "AveWatts" value calculation. It has been partially applied back in 2021:

https://github.com/SchedMD/slurm/commit/3bd1c3037f5b35da30ca977132c698a4da30c74d

but the rounding issue still persists in the master branch. Here is the line in question: 

https://github.com/SchedMD/slurm/blob/master/src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c#L367

You should be able to reproduce the problem by keeping a node idle for a while, and then starting a job thereby increasing power consumption . You will likely notice that AveWatts is "stuck" at the idle power level. This happens because "energy->ave_watts" is an integer variable, and hence any increment <1.0 is ignored. After a long period of idle (i.e. the value of "readings" is high), it is quite likely that increment will be <1.0. Therefore, AveWatts value will be frozen at the idle power level.

In the proposed patch, I introduced a static double variable to deal with this rounding issue. Admittedly this is not the most elegant solution, so maybe you can find a better one.

https://support.schedmd.com/attachment.cgi?id=16196&action=diff#a/src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c_sec5

Best,
Alexey

Comment 24 Oriol Vilarrubi 2024-12-20 04:57:07 MST

Hi Mahendra an Alexey,

Thanks for testing the patch. We will be pushing it to slurm code base soon.

Alexey, about the rounding issue, I'll look deeper into it and if it is something reasonable I will open a new ticket with you on copy, just to keep the topics separated. What you did with the temporal variable seems reasonable, but as said I would like to look deeper into it before agreeing or disagreeing on it.

Regards.

Comment 25 markus.hilger 2024-12-20 04:57:33 MST

I am on holiday until 13.01.2025 with limited mail access.
For urgent requests please contact my manager peter.grossoehme@megware.com

Comment 27 Oriol Vilarrubi 2025-01-09 13:18:09 MST

Hello Alexey,

We have pushed the part of taking into account the different energy units for the intel processors in commit c8a694e3, that will be on the next slurm major release(25.05).

About the avewatts precision, we have been studying that part, and taking into account that other energy plugins are also affected by this same issue we will be working on that in a separate ticket, I like your approach, but in order to better use it in other plugins we might want to change ave_watts in the energy struct itself rather than having a local variable, but as said we will be studying that in a separate ticket. Would you say that this is an issue that is important for your case? Do you use the energy of the node for some statistics?

Regards.

Comment 28 Alexey Kozlov 2025-01-10 05:48:35 MST

Hello Oriol,

> We have pushed the part of taking into account the different energy units
> for the intel processors in commit c8a694e3, that will be on the next slurm
> major release(25.05).

That's great news, thank you!

> About the avewatts precision, we have been studying that part, and taking
> into account that other energy plugins are also affected by this same issue
> we will be working on that in a separate ticket, I like your approach, but
> in order to better use it in other plugins we might want to change ave_watts
> in the energy struct itself rather than having a local variable, but as said
> we will be studying that in a separate ticket. Would you say that this is an
> issue that is important for your case? Do you use the energy of the node for
> some statistics?

Currently, I'm using a custom solution based on IPMI+telegraf+influxdb for node energy monitoring. It seems like many HPC systems adopted a similar approach
(see e.g. http://export.arxiv.org/pdf/2411.16204#section.5), in part due to lack of standardized solutions. Hence, I think support for basic node-level energy monitoring in SLURM would be extremely helpful for the community.

In its current form, the usefulness of AveWatts is somewhat limited because there is no explicit period specification:  it reports average power since SLURM daemon has been started. 
So if you are planning to refactor this part, maybe you could implement averages for fixed periods similar to /proc/loadavg. e.g. last 5min, 24h, 30days etc.


Best.
Alexey

Comment 29 Oriol Vilarrubi 2025-01-10 09:49:03 MST

Hi Alexey,

I have opened an internal issue regarding the average watts as well as your comment about it not having a defined timeframe. I will be closing this issue as fixed as it was more focused on the fixing the energy values for those intel processors and that code is already in slurm.

Best regards.