Ticket 17759

Summary: scontrol show node shows CurrentWatts and CPULoad greater than zero for nodes that are powered off
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: slurmctld
Assignee: Tyler Connel <tyler>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: tyler
Version: 23.02.5
Hardware: Linux
OS: Linux
Site: DTU Physics
Version Fixed: 23.11

Description Ole.H.Nielsen@fysik.dtu.dk 2023-09-24 05:55:32 MDT
Since we enabled the Slurm power_save module for powering down idle on-premise nodes, we have noticed that "scontrol show node" shows CurrentWatts power and CPULoad greater than zero for nodes that are actually powered off, for example:

$ scontrol show node s007
NodeName=s007 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUEfctv=80 CPUTot=80 CPULoad=0.02
   AvailableFeatures=xeon5218r,GPU_RTX3090,power_ipmi
   ActiveFeatures=xeon5218r,GPU_RTX3090,power_ipmi
   Gres=gpu:RTX3090:10
   NodeAddr=s007 NodeHostName=s007 Version=23.02.4
   OS=Linux 3.10.0-1160.99.1.el7.x86_64 #1 SMP Wed Sep 13 14:19:20 UTC 2023
   RealMemory=768000 AllocMem=0 FreeMem=763076 Sockets=4 Boards=1
   State=IDLE+POWERED_DOWN ThreadsPerCore=2 TmpDisk=800000 Weight=19336 Owner=N/A MCS_label=N/A
   Partitions=sm3090,sm3090_768
   BootTime=2023-09-23T23:38:43 SlurmdStartTime=2023-09-23T23:39:14
   LastBusyTime=Unknown ResumeAfterTime=None
   CfgTRES=cpu=80,mem=750G,billing=160,gres/gpu=10
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=37 AveWatts=36
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Here the expected values would be N/A because the node is powered off:

CPULoad=N/A
CurrentWatts=N/A
AveWatts=N/A

The IPMI DCMI remote command shows that the node's current power is indeed zero:

$ ipmi-dcmi -D LAN_2_0 --username=root --password=$IPMI_PASSWORD --hostname=s007b --get-system-power-statistics
Current Power                        : 0 Watts
Minimum Power over sampling duration : 350 watts
Maximum Power over sampling duration : 4687 watts
Average Power over sampling duration : 2356 watts
Time Stamp                           : 09/24/2023 - 11:42:26
Statistics reporting time period     : 2672412000 milliseconds
Power Measurement                    : Not Available
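
For scripts that cross-check Slurm's numbers against the BMC, the "Current Power" line can be pulled out of the ipmi-dcmi output above. This is only a minimal sketch: the function name parse_current_power is made up for illustration, and it assumes the output layout is exactly as shown in this ticket.

```python
import re
from typing import Optional

# Hypothetical helper: the name and the regex are assumptions based on the
# ipmi-dcmi output pasted in this ticket, not on any documented format.
_CURRENT_POWER = re.compile(
    r"^\s*Current Power\s*:\s*(\d+)\s*Watts", re.MULTILINE | re.IGNORECASE
)


def parse_current_power(dcmi_output: str) -> Optional[int]:
    """Return the 'Current Power' reading in watts, or None if it is missing."""
    match = _CURRENT_POWER.search(dcmi_output)
    return int(match.group(1)) if match else None


# Output copied from the ticket (node s007b, powered off).
SAMPLE = """\
Current Power                        : 0 Watts
Minimum Power over sampling duration : 350 watts
Maximum Power over sampling duration : 4687 watts
Average Power over sampling duration : 2356 watts
Power Measurement                    : Not Available
"""
```

A reading of 0 here, next to a non-zero CurrentWatts from scontrol, is the mismatch this ticket is about.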

It appears that scontrol displays the last recorded values from slurmd instead of what should be the actual current values.

IMHO, when slurmctld registers the State=IDLE+POWERED_DOWN, or if slurmd hasn't been reachable for SlurmdTimeout seconds, slurmctld should set the 3 N/A values above.

Could you kindly update slurmctld to display this behavior?

FYI, we have this slurmd timeout value:
$ scontrol show config | grep SlurmdTimeout
SlurmdTimeout           = 300 sec
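
Until slurmctld does this itself, the requested behavior can be approximated on the client side. The sketch below is a workaround under assumptions, not how slurmctld works: it parses the Key=Value tokens of "scontrol show node" output and replaces the three metrics with N/A when the State field carries POWERED_DOWN or NOT_RESPONDING. The helper names and the set of state flags are illustrative choices taken from the output in this ticket.

```python
import re

# State flags for which the last values reported by slurmd are treated as stale.
# Assumption: the flags are spelled as they appear in the scontrol output here.
STALE_FLAGS = {"POWERED_DOWN", "NOT_RESPONDING"}
METRICS = ("CPULoad", "CurrentWatts", "AveWatts")


def parse_node(text):
    """Collect Key=Value tokens into a dict.

    Values that contain spaces (for example OS= or Reason=) are cut at the
    first space, which is acceptable here because only State and the three
    metrics are used.
    """
    return dict(re.findall(r"(\w+)=(\S*)", text))


def mask_stale_metrics(fields):
    """Return a copy of fields with the metrics set to N/A for stale nodes."""
    flags = set(fields.get("State", "").split("+"))
    if flags & STALE_FLAGS:
        masked = dict(fields)
        masked.update({metric: "N/A" for metric in METRICS})
        return masked
    return dict(fields)


# Trimmed copy of the s007 output from the description.
SAMPLE = """\
NodeName=s007 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUEfctv=80 CPUTot=80 CPULoad=0.02
   State=IDLE+POWERED_DOWN ThreadsPerCore=2 TmpDisk=800000
   CurrentWatts=37 AveWatts=36
"""
```

Calling mask_stale_metrics(parse_node(SAMPLE)) yields N/A for all three metrics while leaving the other fields untouched.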

Thanks,
Ole
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2023-09-24 06:26:43 MDT
Note added: We measure node power using RAPL:

$ scontrol show config | grep AcctGatherEnergyType
AcctGatherEnergyType    = acct_gather_energy/rapl
Comment 2 Tyler Connel 2023-09-26 09:09:58 MDT
Hello Ole,

Fortunately, this issue is easily reproducible. I have a patch that I'll be uploading now that will cause the power save module to reset the three metrics: CPULoad, CurrentWatts, AveWatts when a node is determined to be suspended.

Best Regards,
Tyler Connel
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2023-09-26 09:12:06 MDT
Hi Tyler,

(In reply to Tyler Connel from comment #2)
> Fortunately, this issue is easily reproducible. I have a patch that I'll be
> uploading now that will cause the power save module to reset the three
> metrics: CPULoad, CurrentWatts, AveWatts when a node is determined to be
> suspended.

Wonderful, I'm glad this is an easy fix :-)  Will the patch be applied to 23.02, or do we have to wait until 23.11?

Thanks,
Ole
Comment 5 Tyler Connel 2023-09-26 12:21:28 MDT
Hopefully it's a *good* fix per the reviewer :)

Since the change mostly involves correcting a behavior to an expected behavior, my intuition is to target master for the change. Of course, if you have a strong preference that the change apply to 23.02 I would be willing to target that branch.
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2023-09-26 12:34:09 MDT
(In reply to Tyler Connel from comment #5)
> Since the change mostly involves correcting a behavior to an expected
> behavior, my intuition is to target master for the change. Of course, if you
> have a strong preference that the change apply to 23.02 I would be willing
> to target that branch.

Yes, I'd like to ask for the change in 23.02.  We don't plan to upgrade to 23.11 until several minor releases have come out, so I'd be really happy if the change could be applied to 23.02 as well!  I'd like to get my power monitoring scripts perfected and tested with acct_gather_energy/ipmi quite soon.

Thanks,
Ole
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2023-09-27 05:28:17 MDT
Hi Tyler,

(In reply to Tyler Connel from comment #2)
> Fortunately, this issue is easily reproducible. I have a patch that I'll be
> uploading now that will cause the power save module to reset the three
> metrics: CPULoad, CurrentWatts, AveWatts when a node is determined to be
> suspended.

It's not only the power save module that may turn off a node; there are other reasons for node malfunction.  Today we have a node with a dead motherboard which crashed and won't power up.  Its status is DOWN+DRAIN+NOT_RESPONDING:

$ scontrol show node c060
NodeName=c060 Arch=x86_64 CoresPerSocket=10 
   CPUAlloc=0 CPUEfctv=40 CPUTot=40 CPULoad=39.98
   AvailableFeatures=xeon6148v5,opa,xeon40,power_ipmi
   ActiveFeatures=xeon6148v5,opa,xeon40,power_ipmi
   Gres=(null)
   NodeAddr=c060 NodeHostName=c060 Version=23.02.5
   OS=Linux 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Jul 24 13:59:37 UTC 2023 
   RealMemory=384000 AllocMem=0 FreeMem=371528 Sockets=4 Boards=1
   State=DOWN+DRAIN+NOT_RESPONDING ThreadsPerCore=1 TmpDisk=140000 Weight=10535 Owner=N/A MCS_label=N/A
   Partitions=xeon40 
   BootTime=2023-08-30T08:37:02 SlurmdStartTime=2023-09-26T20:24:43
   LastBusyTime=2023-09-27T07:22:11 ResumeAfterTime=None
   CfgTRES=cpu=40,mem=375G,billing=66
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=435 AveWatts=395
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Motherboard defective [root@2023-09-27T07:56:40]

So it would be good if nodes with a state of NOT_RESPONDING also get their CPULoad, CurrentWatts, AveWatts metrics reset to N/A.  Is this possible?

Thanks,
Ole
Comment 8 Tyler Connel 2023-09-27 11:39:59 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> So it would be good if nodes with a state of NOT_RESPONDING also get their
> CPULoad, CurrentWatts, AveWatts metrics reset to N/A.  Is this possible?

This is an excellent point. I'll pull this down from review and reconsider my approach against unexpectedly downed nodes.
Comment 9 Tyler Connel 2023-10-10 10:25:25 MDT
Just wanted to update as it's been a while since my last comment.

I found a good solution for displaying CPU load as N/A for node states that include any of DOWN, POWERED_DOWN, or NO_RESPOND. For the power metrics (e.g. AveWatts), I'll need to spend some time finding a good place in acct_gather_energy to effect the change.
Comment 11 Tyler Connel 2023-10-23 12:12:37 MDT
@Ole,

There has been some discussion on this issue. How would you feel about the expected values for a DOWN/POWERED_DOWN node being zeroed instead of N/A? E.g.:

CPULoad=0
CurrentWatts=0
AveWatts=0

We feel that this would be a better set of values for the interface to display. Would you have any reservations?

Best,
Tyler Connel
Comment 12 Ole.H.Nielsen@fysik.dtu.dk 2023-10-23 12:14:23 MDT
Hi Tyler,

(In reply to Tyler Connel from comment #11)
> There's been some discussions on this issue. How would you feel about the
> expected values for a DOWN/POWERED_DOWN node being zeroed instead of N/A?
> E.g.:
> 
> CPULoad=0
> CurrentWatts=0
> AveWatts=0
> 
> We feel that this would be a better set of values for the interface to
> display. Would you have any reservations?

I'm fine with zero values, since that reflects the node state as well.

Thanks,
Ole
Comment 22 Tyler Connel 2023-11-06 11:11:30 MST
Hello @Ole,

The patch has been accepted to reset the values mentioned (CPU load, current watts and average watts) to 0 when a node goes down unexpectedly or is powered down. This patch was accepted for 23.11, and I recall that you had also wanted the change applied to 23.02. I will inquire as to whether the change can be applied to 23.02 before resolving the ticket.
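
One way to check the new behavior on a site's own nodes is sketched below. It is an illustration under assumptions rather than part of the patch: it scans one-line-per-node output (for example from "scontrol -o show node") and lists any node whose State includes DOWN, POWERED_DOWN, or NOT_RESPONDING while CPULoad, CurrentWatts, or AveWatts is still non-zero. The function name and flag spellings are hypothetical and follow the output shown earlier in this ticket.

```python
import re

# Assumption: state flags are spelled as in the scontrol output in this ticket.
DOWN_FLAGS = {"DOWN", "POWERED_DOWN", "NOT_RESPONDING"}
METRICS = ("CPULoad", "CurrentWatts", "AveWatts")


def stale_metrics(oneliner_output):
    """Return (node, metric, value) tuples for down nodes with non-zero metrics."""
    findings = []
    for line in oneliner_output.splitlines():
        fields = dict(re.findall(r"(\w+)=(\S*)", line))
        flags = set(fields.get("State", "").split("+"))
        if not flags & DOWN_FLAGS:
            continue  # node is up, its metrics are expected to be live
        for metric in METRICS:
            value = fields.get(metric)
            try:
                nonzero = float(value) != 0.0
            except (TypeError, ValueError):
                continue  # missing or non-numeric value such as N/A
            if nonzero:
                findings.append((fields.get("NodeName"), metric, value))
    return findings


# Values for c060 are taken from comment 7; the other two lines are made up.
SAMPLE = "\n".join([
    "NodeName=c060 CPULoad=39.98 State=DOWN+DRAIN+NOT_RESPONDING CurrentWatts=435 AveWatts=395",
    "NodeName=s001 CPULoad=0.50 State=IDLE CurrentWatts=120 AveWatts=118",
    "NodeName=s007 CPULoad=0.00 State=IDLE+POWERED_DOWN CurrentWatts=0 AveWatts=0",
])
```

With the fix in place, the list should come back empty for nodes that are down or powered down; on the pre-fix output quoted in comment 7, c060 would be reported for all three metrics.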

Best,
Tyler Connel
Comment 23 Ole.H.Nielsen@fysik.dtu.dk 2023-11-06 11:11:39 MST
I'm out of office, back on Thursday, November 9.

Best regards,
Ole Holm Nielsen
Comment 24 Tyler Connel 2023-11-09 14:31:18 MST
Hello @Ole,

As this involves a change in behavior, the fix will only apply to 23.11. I'll resolve this ticket, but feel free to reach out if you have further questions.

Best Regards,
Tyler Connel
Comment 25 Ole.H.Nielsen@fysik.dtu.dk 2023-11-10 00:23:24 MST
Hi Tyler,

(In reply to Tyler Connel from comment #24)
> As this involves a change in behavior, the fix will only apply to 23.11.
> I'll resolve this ticket, but feel free to reach out if you have further
> questions.

Thanks a lot for the fix!  I'm sorry it can't apply to 23.02, since I consider the current behavior incorrect.  But so be it.

Greetings,
Ole