Hi,

Using the RAPL power monitoring plugin with:

  AcctGatherEnergyType=acct_gather_energy/rapl

we noticed pretty high AveWatts values reported by "scontrol show node". For instance:

# scontrol show node sh02-01n04
NodeName=sh02-01n04 Arch=x86_64 CpuBind=cores CoresPerSocket=10
   CPUAlloc=0 CPUTot=20 CPULoad=0.61
   AvailableFeatures=IB:EDR,CPU_MNF:INTEL,CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,NO_GPU
   ActiveFeatures=IB:EDR,CPU_MNF:INTEL,CPU_GEN:BDW,CPU_SKU:E5-2640v4,CPU_FRQ:2.40GHz,NO_GPU
   Gres=(null)
   NodeAddr=sh02-01n04 NodeHostName=sh02-01n04 Version=20.11.5
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
   RealMemory=128000 AllocMem=0 FreeMem=120562 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=102161 Owner=N/A MCS_label=N/A
   Partitions=normal
   BootTime=2020-12-19T14:50:25 SlurmdStartTime=2021-03-26T16:55:01
   CfgTRES=cpu=20,mem=125G,billing=51
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=23 AveWatts=35829
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

The real power draw of that 20-core Broadwell node at rest (the node is IDLE right now) is more like 50 watts, and its maximum consumption is probably around 500 W, so seeing an average value of over 35 kW is a bit surprising. This seems to be consistent across all of our nodes that support RAPL (i.e. Intel-based nodes).

We didn't find much information in the docs, and the AveWatts field is not referenced anywhere we looked, so I'm wondering whether it's actually reporting watts or something else.

Thanks,
--
Kilian
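As a sanity check independent of Slurm, a node's RAPL draw can be read directly from the kernel's powercap sysfs interface. Here's a minimal sketch: the helper names are ours (not Slurm's or the kernel's), reading energy_uj may require root, and counter wraparound is ignored.

```python
import time
from pathlib import Path

RAPL = Path("/sys/class/powercap")

def watts(delta_uj: int, seconds: float) -> float:
    """Convert a microjoule delta over `seconds` seconds into average watts."""
    return delta_uj / (seconds * 1e6)

def package_watts(interval: float = 1.0) -> float:
    """Average draw over `interval` seconds, summed across all RAPL packages.

    The energy_uj counters wrap at max_energy_range_uj; this sketch
    ignores wraparound for brevity.
    """
    domains = sorted(RAPL.glob("intel-rapl:[0-9]"))
    before = [int((d / "energy_uj").read_text()) for d in domains]
    time.sleep(interval)
    after = [int((d / "energy_uj").read_text()) for d in domains]
    return sum(watts(a - b, interval) for a, b in zip(after, before))

if __name__ == "__main__" and (RAPL / "intel-rapl:0").exists():
    print(f"{package_watts():.1f} W")
```

On an idle node like the one above, this should print a figure in the same ballpark as the ~50 W expected, nowhere near 35 kW.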
Hi Kilian,

> We didn't find much information in the docs, and the AveWatts field is not
> referenced anywhere we looked, so I'm wondering if it's actually reporting
> Watts or something else.

You are right that there is not much information about that value, but I'm pretty sure that's because it should be self-explanatory, i.e. it is indeed supposed to report Watts.

Let me look deeper into the code. I'll come back to you with more info.

Regards,
Albert
Kilian,

Could you do the following:

1) Add DebugFlags=Energy to slurm.conf and run "scontrol reconfigure", or better, just run:

   $ scontrol setdebugflags +Energy

2) Restart one slurmd (e.g. on sh02-01n04).
3) Gather one day's worth of the slurmd log and attach it here, along with the slurmctld log.
4) Remove DebugFlags=Energy and reconfigure, or just run:

   $ scontrol setdebugflags -Energy

Thanks,
Albert
Hi Albert,

(In reply to Albert Gil from comment #4)
> Could you do the following:
> 1) Add DebugFlags=Energy to slurm.conf and "scontrol reconfigure", or better
> just run:
>
> $ scontrol setdebugflags +Energy
>
> 2) Restart one slurmd (eg sh02-01n04)
> 3) Gather the slurmd log of 1 day and attach it here, with also slurmctld
> log.
> 4) Remove DebugFlags=Energy and reconfigure, or just:
>
> $ scontrol setdebugflags -Energy

Thanks for the instructions, I'll get that for you. Just one quick clarification: do I need to raise the logging level as well? We're running with:

  SlurmctldSyslogDebug=info
  SlurmdSyslogDebug=info

Will that be enough to capture logs with DebugFlags=Energy?

Thanks!
--
Kilian
Hi Kilian,

> Just a quick precision: do I need to raise the logging level as well? We're
> running with:
> SlurmctldSyslogDebug=info
> SlurmdSyslogDebug=info
> Will that be enough to capture logs with DebugFlags=Energy?

Yes. DebugFlags are independent of the general debug/log level. Actually, we always lean toward using DebugFlags, because we usually need a high level of detail to troubleshoot issues, and raising the general level makes the logs too verbose. DebugFlags can provide that level of detail for just the relevant subsystems, which is typically all we need for troubleshooting.

In this case I've checked the code, and most of the messages I want to see are already under DebugFlags=Energy (in 20.11), so it will be enough. At least for now.

Thanks,
Albert
Created attachment 18797 [details]
slurmctld/sh02-01n04-slurmd logs

Hi Albert,

Thanks for the confirmation! Unfortunately, I don't think setting DebugFlags to Energy has logged much additional information, but maybe I'm not looking for the right thing. Here are the slurmctld and sh02-01n04's slurmd logs for the last ~18h after setting DebugFlags:

# date
Fri Apr  2 07:54:09 PDT 2021

# grep -i energy slurmctld.log
2021-04-01T13:42:00-07:00 sh03-sl01 slurmctld[19096]: Set DebugFlags to Energy

# grep -iE '20\.11' sh02-01n04.log
2021-03-31T15:30:29-07:00 sh02-01n04 slurmd[39327]: slurmd version 20.11.5 started
2021-04-01T13:42:36-07:00 sh02-01n04 slurmd[168883]: slurmd version 20.11.5 started

The flags were set at 13:42:00 yesterday, and slurmd was restarted on sh02-01n04 at 13:42:36.

Cheers,
--
Kilian
Hi Kilian,

> Thanks for the confirmation! Unfortunately, I don't think setting DebugFlags
> to Energy has been logging much additional information. but maybe I'm not
> looking for the right thing.

My mistake. My comment 6 was not accurate: DebugFlags does require these levels (or higher):

  SlurmctldDebug=verbose
  SlurmdDebug=verbose

I guess I'm so used to higher values that I just forgot... sorry about that.

Could you repeat the log gathering using the verbose level?

Thanks,
Albert
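For reference, the combined settings needed for the Energy flag to actually produce output would look something like this (a sketch against 20.11; the SyslogDebug variants apply when logging via syslog, as this site does):

```ini
# slurm.conf -- a sketch of the settings needed for Energy debug output (20.11)
AcctGatherEnergyType=acct_gather_energy/rapl
DebugFlags=Energy
SlurmctldDebug=verbose     # or SlurmctldSyslogDebug=verbose when logging to syslog
SlurmdDebug=verbose        # or SlurmdSyslogDebug=verbose
```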
Hi Albert,

(In reply to Albert Gil from comment #8)
> Could you repeat that gathering of logs using the verbose level?

No worries. I tried "scontrol setdebug verbose" but ran into bug #10922, so I modified slurm.conf to change the log levels instead:

  SlurmctldSyslogDebug=verbose
  SlurmdSyslogDebug=verbose

# scontrol show config | grep -i debug
DebugFlags              = Energy
SlurmctldDebug          = info
SlurmctldSyslogDebug    = verbose
SlurmdDebug             = info
SlurmdSyslogDebug       = verbose

Hopefully the Slurm{ctld,d}Debug values will not be used, as we log to syslog directly. So far, though, I couldn't find anything related to Energy in the slurmctld or slurmd logs, besides this:

# grep -i energy /var/log/slurm/ctld/sh03-sl01.log
2021-04-06T08:09:25-07:00 sh03-sl01 slurmctld[19096]: Set DebugFlags to Energy
2021-04-06T08:12:02-07:00 sh03-sl01 slurmctld[19096]: ENERGY: Updating acct_gather data for sh01-07n[02-08,10-20],sh01-08n[17-26,30-33],sh01-13n[25-29],sh01-17n[14-19,23-26],sh01-18n[11-35],sh01-19n[01-06,08],sh01-20n[01-35],sh01-21n[01-20,22-36],sh01-22n[01-36],sh01-23n[01-11,13,15-36],sh01-24n[01-23,25-36],sh01-25n[01-16,18-33],sh01-26n[01-03,05-34,36],sh01-27n[01-25,28-33,35],sh01-28n[01-12],sh01-29n[02-08],sh01-30n[01-36],sh01-31n[01-10],sh02-01n[01-72],sh02-02n[01-72],sh02-03n[01-72],sh02-04n[01-72],sh02-05n[01-24,29-72],sh02-06n[01-72],sh02-0
2021-04-06T08:17:33-07:00 sh03-sl01 slurmctld[19096]: ENERGY: Updating acct_gather data for sh01-07n[02-08,10-20],sh01-08n[17-26,30-33],sh01-13n[25-29],sh01-17n[14-19,23-26],sh01-18n[11-35],sh01-19n[01-06,08],sh01-20n[01-35],sh01-21n[01-20,22-36],sh01-22n[01-36],sh01-23n[01-11,13,15-36],sh01-24n[01-23,25-36],sh01-25n[01-16,18-33],sh01-26n[01-03,05-34,36],sh01-27n[01-25,28-33,35],sh01-28n[01-12],sh01-29n[02-08],sh01-30n[01-36],sh01-31n[01-10],sh02-01n[01-72],sh02-02n[01-72],sh02-03n[01-72],sh02-04n[01-72],sh02-05n[01-24,29-72],sh02-06n[01-72],sh02-0

The "ENERGY: Updating acct_gather data for" message is repeated every 5 minutes, but without any additional details. Note how the node list is truncated, too.

Nothing in the slurmd logs so far, even after a restart. What messages should I look for?

Thanks!
--
Kilian
Hi Kilian,

> The "ENERGY: Updating acct_gather data for" is repeated every 5 minutes, but
> without any additional details. Note how the node list is truncated too.

Yes, this is the only message expected on slurmctld.

> Nothing in the slurmd logs so far, even after a restart.
> What messages should I look for?

Something is wrong. It depends on the plugin, but for RAPL there are these specific messages in different functions:

src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|273| log_flag(ENERGY, "RAPL Found: %d packages", nb_pkg);
src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|354| log_flag(ENERGY, "RAPL Result %"PRIu64" = %.6f Joules", result, ret);
src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|378| log_flag(ENERGY, "%s: current %.6f Joules, consumed %"PRIu64"",
src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|408| log_flag(ENERGY, "%s: consumed %u watts",
src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c|414| log_flag(ENERGY, "Energy: dataset created (id = %d)",

Note that every time "ENERGY: Updating acct_gather data for" is logged on slurmctld, the slurmds are being asked to gather that information, so some of the above messages should be printed.

> I tried "scontrol setdebug verbose" but fell back into
> bug #10922. So I modified slurm.conf to change the log levels:

I see. It seems that is going to be fixed soon.

> # scontrol show config | grep -i debug
> DebugFlags = Energy
> SlurmctldDebug = info
> SlurmctldSyslogDebug = verbose
> SlurmdDebug = info
> SlurmdSyslogDebug = verbose
>
> Hopefully the Slurm{ctld,d}Debug values will not be used, as we log to
> syslog directly.

This is strange. Please set SlurmdDebug=verbose and try again.

Thanks,
Albert
(In reply to Albert Gil from comment #10)
> This is strange.
> Please set SlurmdDebug=verbose and try again.

OK, it only worked when I put SlurmdSyslogDebug=verbose and DebugFlags=Energy in slurm.conf, distributed the file everywhere, and ran "scontrol reconfig". Using scontrol to set DebugFlags, or changing SlurmdSyslogDebug on the compute node only, apparently doesn't work.

Anyway, here's what sh02-01n04 has started logging:

Apr 08 09:53:18 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _hardware: ENERGY: RAPL Found: 2 packages
Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL powercapture_debug Energy units = 0.000061, Power Units = 0.125000
Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL Max power = 180 w
Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: RAPL Result 9586735962 = 585127.927368 Joules
Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: _get_joules_task: current 585127.927368 Joules, consumed 1
Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL powercapture_debug Energy units = 0.000061, Power Units = 0.125000
Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL Max power = 180 w
Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: RAPL Result 10472853537 = 639212.252014 Joules
Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: _get_joules_task: current 639212.252014 Joules, consumed 54085

Max power = 180 W seems reasonable for that type of node.
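A quick check of the two samples above shows the raw RAPL readings are self-consistent: the joule delta over the 5-minute sampling interval corresponds to about 180 W. Values are copied from the log; this is plain arithmetic, not plugin code.

```python
# Two consecutive energy readings from the slurmd log above,
# taken 5 minutes apart (09:55:17 and 10:00:17).
e0 = 585127.927368           # joules
e1 = 639212.252014           # joules
interval = 5 * 60            # seconds between samples

consumed = e1 - e0           # the logged "consumed 54085", up to rounding
power = consumed / interval  # average draw between the two samples

print(round(consumed), round(power, 1))  # -> 54084 180.3
```

So the underlying joule counters look sane; the ~180 W figure matches the CurrentWatts=180 reported by scontrol shortly afterwards.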
After the controller restart, "scontrol show node" initially showed {Current,Ave}Watts=0 for all the nodes, but after a few minutes, values came back:

# scontrol show node sh02-01n04 | grep Watts
   CapWatts=n/a
   CurrentWatts=180 AveWatts=27042
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

With the first value being 180 W, it's weird that the average is already over 27,000.

Hope this helps!

Cheers,
--
Kilian
Hi Kilian,

> Ok, it only worked when I put SlurmdSyslogDebug=verbose and
> DebugFlags=Energy in slurm.conf, distributed the file everywhere, and did a
> "scontrol reconfig". Using scontrol to set debugflags or changing
> SlurmdSyslogDebug on the compute node only apparently doesn't work.

Yes, slurm.conf should always be distributed.

> Apr 08 09:53:18 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _hardware: ENERGY: RAPL Found: 2 packages
> Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL powercapture_debug Energy units = 0.000061, Power Units = 0.125000
> Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL Max power = 180 w
> Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: RAPL Result 9586735962 = 585127.927368 Joules
> Apr 08 09:55:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: _get_joules_task: current 585127.927368 Joules, consumed 1
> Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL powercapture_debug Energy units = 0.000061, Power Units = 0.125000
> Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: RAPL Max power = 180 w
> Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: RAPL Result 10472853537 = 639212.252014 Joules
> Apr 08 10:00:17 sh02-01n04.int slurmd[141547]: acct_gather_energy/rapl: _get_joules_task: ENERGY: _get_joules_task: current 639212.252014 Joules, consumed 54085
>
> Max power = 180W seems reasonable for that type of node.
> After the controller restart, scontrol show node initially showed
> {Current,Ave}Watts=0 for all the nodes, but after a few minutes, values came
> back:
>
> # scontrol show node sh02-01n04 | grep Watts
> CapWatts=n/a
> CurrentWatts=180 AveWatts=27042
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> With the first value being 180W, it's weird that the average is over 27,000
> already.

You are right. I've been able to reproduce the issue, and it seems there is a bug in how AveWatts is computed: it's not averaging watts at all, but something like the joules consumed since the last reading. I've also seen other things that can be improved in that block of code.

I'm preparing a patch. I'll keep you updated,

Albert
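That diagnosis matches the numbers reported in this thread: averaging the raw joule deltas instead of joules-per-second reproduces the bogus figure almost exactly. A sketch of the two computations, using the sample values from the slurmd log above (this is illustrative arithmetic, not the actual plugin code):

```python
interval = 5 * 60          # energy sampling period, in seconds
joule_deltas = [1, 54085]  # the two "consumed" values from the slurmd log

# Averaging the raw joule deltas -- reproduces the bogus reading:
buggy_ave = sum(joule_deltas) / len(joule_deltas)

# Averaging joules-per-second -- what AveWatts should report:
correct_ave = sum(d / interval for d in joule_deltas) / len(joule_deltas)

print(round(buggy_ave), round(correct_ave, 1))  # -> 27043 90.1
```

27043 is within rounding of the AveWatts=27042 observed on the node, while a correct average over those two samples would be on the order of 90 W (the first delta of 1 J comes from the post-restart startup sample, which drags the true average down).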
Hi Kilian,

> I'm preparing a patch.

I'm glad to let you know that the main fix for the issue has been pushed to 20.11 and will be released as part of the next minor release, 20.11.6:

- https://github.com/SchedMD/slurm/commit/3bd1c3037f5b35da30ca977132c698a4da30c74d

> I've seen other things that can be improved in that block of code.

We are still working on some related improvements, but they will most probably land in the next major release, 21.08.

I'll keep you posted,
Albert
Hi Albert,

On Mon, Apr 19, 2021 at 8:52 AM <bugs@schedmd.com> wrote:
> I'm glad to let you know that the main fix of the issue has been pushed into
> 20.11 and will be released as part of the next minor release 20.11.6:
>
> - https://github.com/SchedMD/slurm/commit/3bd1c3037f5b35da30ca977132c698a4da30c74d
>
> > I've seen other things that can be improved in that block of code.
>
> We are still doing some related improvements, but most probably they will land
> on next major release 21.08.

Thanks for the updates, that's all excellent news!

Cheers,
--
Kilian
Hi Kilian,

We are still discussing internally how (and whether) we'll improve this part of the code. If this is OK with you, I'm closing this bug, as the main issue is fixed.

Thanks,
Albert
(In reply to Albert Gil from comment #25)
> We are still discussing internally how/if we'll improve this part of the
> code.
> If this is ok for you, I'm closing this bug as the main issue is fixed.

Yes, that's good. Thanks!

Cheers,
--
Kilian
Dear Albert,

Maybe you're already aware, but just in case: the "CurrentWatts" value also seems to be computed/reported incorrectly on Haswell and later:

https://bugs.schedmd.com/show_bug.cgi?id=9956
https://github.com/amkozlov/slurm/commit/46094c35126f71f1f558d3153ef4fc1f8f01d052

It would be great to have this fixed in the next releases!

Best,
Alexey