Ticket 11245 - scontrol show node reports very high AveWatts values
Summary: scontrol show node reports very high AveWatts values
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.11.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Albert Gil
QA Contact: Alejandro Sanchez
 
Reported: 2021-03-29 15:52 MDT by Kilian Cavalotti
Modified: 2021-05-24 12:39 MDT

Site: Stanford
Version Fixed: 20.11.6


Attachments
slurmctld/sh02-01n04-slurmd logs (3.84 MB, application/gzip)
2021-04-02 08:55 MDT, Kilian Cavalotti

Description Kilian Cavalotti 2021-03-29 15:52:48 MDT
Hi,

Using the RAPL power monitoring plugin with:
AcctGatherEnergyType=acct_gather_energy/rapl

we noted pretty high AveWatts values reported in scontrol show node.

For instance:

# scontrol show node sh02-01n04
NodeName=sh02-01n04 Arch=x86_64 CpuBind=cores CoresPerSocket=10
   CPUAlloc=0 CPUTot=20 CPULoad=0.61
   AvailableFeatures=IB:EDR,CPU_MNF:INTEL,CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,NO_GPU
   ActiveFeatures=IB:EDR,CPU_MNF:INTEL,CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,NO_GPU
   Gres=(null)
   NodeAddr=sh02-01n04 NodeHostName=sh02-01n04 Version=20.11.5
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
   RealMemory=128000 AllocMem=0 FreeMem=120562 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=102161 Owner=N/A MCS_label=N/A
   Partitions=normal
   BootTime=2020-12-19T14:50:25 SlurmdStartTime=2021-03-26T16:55:01
   CfgTRES=cpu=20,mem=125G,billing=51
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=23 AveWatts=35829
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

The real power draw of that 20-core Broadwell node at rest (the node is IDLE right now) is more like 50 watts, and its maximum consumption is probably around 500 W, so seeing an average value of over 35 kW is a bit surprising.

This seems to be consistent across all our nodes that support RAPL (i.e. Intel-based nodes).

We didn't find much information in the docs, and the AveWatts field is not referenced anywhere we looked, so I'm wondering if it's actually reporting Watts or something else.

Thanks,
--
Kilian
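For reference, one way to sanity-check RAPL readings independently of Slurm is the Linux powercap sysfs interface, which exposes the same counters in microjoules (e.g. /sys/class/powercap/intel-rapl:0/energy_uj). The helper below is only an illustrative sketch; the function name and sample values are made up for this note, and the wrap-around range varies by CPU, so it should be read from max_energy_range_uj on the node in question.

```python
# Sketch: derive average Watts from two samples of the powercap energy counter.
# The counter (microjoules) wraps at max_energy_range_uj, so the caller
# supplies that range and we handle a single wrap between samples.

def watts_between(e0_uj, e1_uj, seconds, max_range_uj):
    """Average power in Watts between two energy_uj samples taken `seconds` apart."""
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:            # the counter wrapped between the two samples
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds

# Illustrative numbers: 25 J consumed over 0.5 s is a 50 W draw,
# roughly the idle consumption mentioned above.
print(watts_between(1_000_000, 26_000_000, 0.5, 262_143_328_850))  # -> 50.0
```

Two samples a few seconds apart on an idle node should land near the expected ~50 W, which makes a large Slurm-reported figure easy to rule in or out as a measurement problem.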
Comment 2 Albert Gil 2021-04-01 09:31:51 MDT
Hi Kilian,

> We didn't find much information in the docs, and the AveWatts field is not
> referenced anywhere we looked, so I'm wondering if it's actually reporting
> Watts or something else.

You are right that there is not much information about that value, but I'm pretty sure that's because it is supposed to be self-explanatory, i.e. it is meant to report Watts.

Let me look deeper into the code.
I'll come back to you with more info.

Regards,
Albert
Comment 4 Albert Gil 2021-04-01 10:48:03 MDT
Kilian,

Could you do the following:
1) Add DebugFlags=Energy to slurm.conf and "scontrol reconfigure", or better just run:

$ scontrol setdebugflags +Energy

2) Restart one slurmd (eg sh02-01n04)
3) Gather one day's worth of the slurmd log and attach it here, along with the slurmctld log.
4) Remove DebugFlags=Energy and reconfigure, or just:

$ scontrol setdebugflags -Energy

Thanks,
Albert
Comment 5 Kilian Cavalotti 2021-04-01 14:45:48 MDT
Hi Albert, 

(In reply to Albert Gil from comment #4)
> Could you do the following:
> 1) Add DebugFlags=Energy to slurm.conf and "scontrol reconfigure", or better
> just run:
> 
> $ scontrol setdebugflags +Energy
> 
> 2) Restart one slurmd (eg sh02-01n04)
> 3) Gather the slurmd log of 1 day and attach it here, with also slurmctld
> log.
> 4) Remove DebugFlags=Energy and reconfigure, or just:
> 
> $ scontrol setdebugflags -Energy


Thanks for the instructions, I'll get that for you.
One quick clarification: do I need to raise the logging level as well? We're running with:
SlurmctldSyslogDebug=info
SlurmdSyslogDebug=info
Will that be enough to capture logs with DebugFlags=Energy?

Thanks!
--
Kilian
Comment 6 Albert Gil 2021-04-02 01:51:37 MDT
Hi Kilian,

> Just a quick precision: do I need to raise the logging level as well? We're
> running with: 
> SlurmctldSyslogDebug=info
> SlurmdSyslogDebug=info
> Will that be enough to capture logs with DebugFlags=Energy?

Yes.
DebugFlags are independent of the general debug/log level.
Actually, we always lean toward using DebugFlags because we usually need a high level of detail to troubleshoot issues, and increasing the general level makes the logs too verbose. DebugFlags can provide a high level of detail for just some subsystems, which is typically what we need for troubleshooting.

In this case I've seen in the code that most of the logs that I want to see are already under DebugFlags=Energy (in 20.11), so it will be enough.
At least for now.

Thanks,
Albert
Comment 7 Kilian Cavalotti 2021-04-02 08:55:45 MDT
Created attachment 18797 [details]
slurmctld/sh02-01n04-slurmd logs

Hi Albert,

Thanks for the confirmation! Unfortunately, I don't think setting DebugFlags to Energy has logged much additional information, but maybe I'm not looking for the right thing.

Here are the slurmctld and sh02-01n04's slurmd logs for the last ~18h after setting DebugFlags:

# date
Fri Apr  2 07:54:09 PDT 2021

# grep -i energy slurmctld.log
2021-04-01T13:42:00-07:00 sh03-sl01 slurmctld[19096]: Set DebugFlags to Energy

# grep -iE '20\.11' sh02-01n04.log
2021-03-31T15:30:29-07:00 sh02-01n04 slurmd[39327]: slurmd version 20.11.5 started
2021-04-01T13:42:36-07:00 sh02-01n04 slurmd[168883]: slurmd version 20.11.5 started

Flags set at 13:42:00 yesterday, and slurmd restarted on sh02-01n04 at 13:42:36


Cheers,
--
Kilian
Comment 8 Albert Gil 2021-04-06 04:20:52 MDT
Hi Kilian,

> Thanks for the confirmation! Unfortunately, I don't think setting DebugFlags
> to Energy has been logging much additional information. but maybe I'm not
> looking for the right thing.

My mistake.
My comment 6 was not accurate; the DebugFlags output actually does require these levels (or higher):

SlurmctldDebug=verbose
SlurmdDebug=verbose

I guess that I'm so used to higher values that I just forgot... sorry about that.

Could you repeat that gathering of logs using the verbose level?

Thanks,
Albert
Comment 9 Kilian Cavalotti 2021-04-06 09:32:46 MDT
Hi Albert, 

(In reply to Albert Gil from comment #8)
> Could you repeat that gathering of logs using the verbose level?

No worries. I tried "scontrol setdebug verbose" but ran into
bug #10922. So I modified slurm.conf to change the log levels:

SlurmctldSyslogDebug=verbose
SlurmdSyslogDebug=verbose

# scontrol show config | grep -i debug
DebugFlags              = Energy
SlurmctldDebug          = info
SlurmctldSyslogDebug    = verbose
SlurmdDebug             = info
SlurmdSyslogDebug       = verbose

Hopefully the Slurm{ctld,d}Debug values will not be used, as we log directly to syslog.

So far, though, I couldn't find anything related to Energy in the slurmctld or slurmd logs, besides this:

# grep -i energy /var/log/slurm/ctld/sh03-sl01.log
2021-04-06T08:09:25-07:00 sh03-sl01 slurmctld[19096]: Set DebugFlags to Energy
2021-04-06T08:12:02-07:00 sh03-sl01 slurmctld[19096]: ENERGY: Updating acct_gather data for sh01-07n[02-08,10-20],sh01-08n[17-26,30-33],sh01-13n[25-29],sh01-17n[14-19,23-26],sh01-18n[11-35],sh01-19n[01-06,08],sh01-20n[01-35],sh01-21n[01-20,22-36],sh01-22n[01-36],sh01-23n[01-11,13,15-36],sh01-24n[01-23,25-36],sh01-25n[01-16,18-33],sh01-26n[01-03,05-34,36],sh01-27n[01-25,28-33,35],sh01-28n[01-12],sh01-29n[02-08],sh01-30n[01-36],sh01-31n[01-10],sh02-01n[01-72],sh02-02n[01-72],sh02-03n[01-72],sh02-04n[01-72],sh02-05n[01-24,29-72],sh02-06n[01-72],sh02-0
2021-04-06T08:17:33-07:00 sh03-sl01 slurmctld[19096]: ENERGY: Updating acct_gather data for sh01-07n[02-08,10-20],sh01-08n[17-26,30-33],sh01-13n[25-29],sh01-17n[14-19,23-26],sh01-18n[11-35],sh01-19n[01-06,08],sh01-20n[01-35],sh01-21n[01-20,22-36],sh01-22n[01-36],sh01-23n[01-11,13,15-36],sh01-24n[01-23,25-36],sh01-25n[01-16,18-33],sh01-26n[01-03,05-34,36],sh01-27n[01-25,28-33,35],sh01-28n[01-12],sh01-29n[02-08],sh01-30n[01-36],sh01-31n[01-10],sh02-01n[01-72],sh02-02n[01-72],sh02-03n[01-72],sh02-04n[01-72],sh02-05n[01-24,29-72],sh02-06n[01-72],sh02-0

The "ENERGY: Updating acct_gather data for" message is repeated every 5 minutes, but without any additional details. Note how the node list is truncated, too.

Nothing in the slurmd logs so far, even after a restart.

What messages should I look for?

Thanks!
--
Kilian
Comment 10 Albert Gil 2021-04-06 11:34:15 MDT
Hi Kilian,

> The "ENERGY: Updating acct_gather data for" is repeated every 5 minutes, but
> without any additional details. Not how the node list is truncated too.

Yes, this is the only message expected on slurmctld.

> Nothing in the slurmd logs so far, even after a restart.
> What messages should I look for?

Something is wrong.
It depends on the plugin, but for RAPL these specific log statements exist in different functions:

src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|273| log_flag(ENERGY, "RAPL Found: %d packages", nb_pkg);
src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|354| log_flag(ENERGY, "RAPL Result %"PRIu64" = %.6f Joules", result, ret);
src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|378| log_flag(ENERGY, "%s: current %.6f Joules, consumed %"PRIu64"",
src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|408| log_flag(ENERGY, "%s: consumed %u watts",
src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|414| log_flag(ENERGY, "Energy: dataset created (id = %d)",

Note that every time "ENERGY: Updating acct_gather data for" is logged on the slurmctld, the slurmds are being asked to gather that information, and some of the above messages should be printed.

> I tried "scontrol setdebug verbose" but fell back into 
> bug #10922. So I modified slurm.conf to change the log levels:

I see.
It seems that it is going to be fixed soon.

> # scontrol show config | grep -i debug
> DebugFlags              = Energy
> SlurmctldDebug          = info
> SlurmctldSyslogDebug    = verbose
> SlurmdDebug             = info
> SlurmdSyslogDebug       = verbose
> 
> Hopefully the Slurm{ctld,d}Debug values will not be used, as we log to
> syslog directly.

This is strange.
Please set SlurmdDebug=verbose and try again.

Thanks,
Albert
Comment 11 Kilian Cavalotti 2021-04-08 11:04:04 MDT
(In reply to Albert Gil from comment #10)
> This is strange.
> Please set SlurmdDebug=verbose and try again.

OK, it only worked when I put SlurmdSyslogDebug=verbose and DebugFlags=Energy in slurm.conf, distributed the file everywhere, and did a "scontrol reconfig". Using scontrol to set debug flags, or changing SlurmdSyslogDebug on the compute node only, apparently doesn't work.

Anyway, here's what sh02-01n04 has started logging:


Apr 08 09:53:18 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _hardware: ENERGY: RAPL Found: 2 packages
Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL powercapture_debug Energy units = 0.000061, Power Units = 0.125000
Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL Max power = 180 w
Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: RAPL Result 9586735962 = 585127.927368 Joules
Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: _get_joules_task: current 585127.927368 Joules, consumed 1
Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL powercapture_debug Energy units = 0.000061, Power Units = 0.125000
Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL Max power = 180 w
Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: RAPL Result 10472853537 = 639212.252014 Joules
Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: _get_joules_task: current 639212.252014 Joules, consumed 54085

Max power = 180W seems reasonable for that type of node.

After the controller restart, scontrol show node initially showed {Current,Ave}Watts=0 for all the nodes, but after a few minutes, values came back:

# scontrol show node sh02-01n04 | grep Watts
   CapWatts=n/a
   CurrentWatts=180 AveWatts=27042
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

With the first value being 180W, it's weird that the average is over 27,000 already.
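The arithmetic can be cross-checked from the figures already quoted in this comment. All values below are copied from the logs above; the only added assumption is that the raw RAPL energy unit is 2^-14 J per count, which matches the rounded 0.000061 the log prints.

```python
# Re-derive the logged Joule values from the raw RAPL counters quoted above.
ENERGY_UNIT = 2 ** -14          # Joules per raw count; logged rounded as 0.000061

raw_0955, raw_1000 = 9_586_735_962, 10_472_853_537   # "RAPL Result" at 09:55:17 / 10:00:17
j_0955 = raw_0955 * ENERGY_UNIT                      # ~585127.93 J, matching the log
j_1000 = raw_1000 * ENERGY_UNIT                      # ~639212.25 J, matching the log

consumed = j_1000 - j_0955       # ~54084 J, matching "consumed 54085" up to rounding
avg_watts = consumed / 300       # readings are 5 minutes apart -> ~180 W
```

So the true average draw over that interval is ~180 W, while the reported AveWatts=27042 is roughly half of the 54085 J figure, which looks consistent with Joules, not Watts, being folded into the average.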

Hope this helps!

Cheers,
--
Kilian
Comment 13 Albert Gil 2021-04-12 09:28:37 MDT
Hi Kilian,

> Ok, it only worked when I put SlurmdSyslogDebug=verbose and
> DebugFlags=Energy in slurm.conf, distributed the file everywhere, and did a
> "scontrol reconfig". Using scontrol to set debugflags or changing
> SlurmdSyslogDebug on the compute node only apparently doesn't work.

Yes, slurm.conf should always be distributed.
 
> Apr 08 09:53:18 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl:
> _hardware: ENERGY: RAPL Found: 2 packages
> Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl:
> _get_joules_task: RAPL powercapture_debug Energy units = 0.000061, Power
> Units = 0.125000
> Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl:
> _get_joules_task: RAPL Max power = 180 w
> Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl:
> _get_joules_task: ENERGY: RAPL Result 9586735962 = 585127.927368 Joules
> Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl:
> _get_joules_task: ENERGY: _get_joules_task: current 585127.927368 Joules,
> consumed 1
> Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl:
> _get_joules_task: RAPL powercapture_debug Energy units = 0.000061, Power
> Units = 0.125000
> Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl:
> _get_joules_task: RAPL Max power = 180 w
> Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl:
> _get_joules_task: ENERGY: RAPL Result 10472853537 = 639212.252014 Joules
> Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl:
> _get_joules_task: ENERGY: _get_joules_task: current 639212.252014 Joules,
> consumed 54085
> 
> Max power = 180W seems reasonable for that type of node.
> 
> After the controller restart, scontrol show node initially showed
> {Current,Ave}Watts=0 for all the nodes, but after a few minutes, values came
> back::
> 
> # scontrol show node sh02-01n04 | grep Watts
>    CapWatts=n/a
>    CurrentWatts=180 AveWatts=27042
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> With the first value being 180W, it's weird that the average is over 27,000
> already.

You are right.
I've been able to reproduce the issue, and it seems there is a bug in how AveWatts is computed: instead of averaging Watts, it's reporting something like the average of the Joules consumed since the last reading.
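The difference can be sketched as follows. The function names are hypothetical and this is only an illustration of the behavior described above, not the actual Slurm code:

```python
def ave_watts_correct(prev_ave, n_readings, consumed_joules, interval_s):
    """Fold the current reading's power (Watts) into the running average."""
    current_watts = consumed_joules / interval_s
    return (prev_ave * n_readings + current_watts) / (n_readings + 1)

def ave_watts_buggy(prev_ave, n_readings, consumed_joules, interval_s):
    """The described bug: raw Joules folded in as if they were Watts."""
    return (prev_ave * n_readings + consumed_joules) / (n_readings + 1)

# With the numbers from comment 11 (a first reading of ~1 J, then 54085 J
# consumed over a 300 s interval): the correct running average lands around
# 90 W after two readings, while the buggy version lands near 27000 --
# the same range as the reported AveWatts=27042.
```

This also explains why the error scales with the polling interval: the longer the interval, the more Joules accumulate per reading, and the further the "average" drifts from any plausible Wattage.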

I've seen other things that can be improved in that block of code.
I'm preparing a patch.

I'll keep you updated,
Albert
Comment 23 Albert Gil 2021-04-19 09:52:45 MDT
Hi Kilian,

> I'm preparing a patch.

I'm glad to let you know that the main fix for this issue has been pushed to 20.11 and will be released as part of the next minor release, 20.11.6:

- https://github.com/SchedMD/slurm/commit/3bd1c3037f5b35da30ca977132c698a4da30c74d


> I've seen other things that can be improved in that block of code.

We are still working on some related improvements, but they will most probably land in the next major release, 21.08.

I'll keep you posted,
Albert
Comment 24 Kilian Cavalotti 2021-04-19 09:58:51 MDT
Hi Albert,

On Mon, Apr 19, 2021 at 8:52 AM <bugs@schedmd.com> wrote:
> I'm glad to let you know that the main fix of the issue has been pushed into
> 20.11 and will be released as part of the next minor release 20.11.6:
>
> -
> https://github.com/SchedMD/slurm/commit/3bd1c3037f5b35da30ca977132c698a4da30c74d
>
>
> > I've seen other things that can be improved in that block of code.
>
> We are still doing some related improvements, but most probably they will land
> on next major release 21.08.


Thanks for the updates, that's all excellent news!

Cheers,
--
Kilian
Comment 25 Albert Gil 2021-04-21 13:02:37 MDT
Hi Kilian,

We are still discussing internally whether and how we'll improve this part of the code.
If this is ok for you, I'm closing this bug as the main issue is fixed.

Thanks,
Albert
Comment 27 Kilian Cavalotti 2021-04-21 13:54:19 MDT
(In reply to Albert Gil from comment #25)
> We are still discussing internally how/if we'll improve this part of the
> code.
> If this is ok for you, I'm closing this bug as the main issue is fixed.

Yes, that's good. Thanks!

Cheers,
--
Kilian
Comment 28 Alexey Kozlov 2021-05-24 12:39:47 MDT
Dear Albert,

Maybe you're already aware, but just in case: the "CurrentWatts" value also seems to be computed/reported incorrectly on Haswell and later:

https://bugs.schedmd.com/show_bug.cgi?id=9956

https://github.com/amkozlov/slurm/commit/46094c35126f71f1f558d3153ef4fc1f8f01d052

It would be great to have this fixed in the upcoming releases!

Best,
Alexey