Summary: | scontrol show node shows CurrentWatts and CPULoad greater than zero for nodes that are powered off | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | slurmctld | Assignee: | Tyler Connel <tyler> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | tyler |
Version: | 23.02.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 23.11 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Ole.H.Nielsen@fysik.dtu.dk
2023-09-24 05:55:32 MDT
Note added: We measure node power using RAPL:

    $ scontrol show config | grep AcctGatherEnergyType
    AcctGatherEnergyType = acct_gather_energy/rapl

Hello Ole,

Fortunately, this issue is easily reproducible. I have a patch that I'll be uploading now that will cause the power save module to reset the three metrics: CPULoad, CurrentWatts, AveWatts when a node is determined to be suspended.

Best Regards,
Tyler Connel

Hi Tyler,

(In reply to Tyler Connel from comment #2)
> Fortunately, this issue is easily reproducible. I have a patch that I'll be
> uploading now that will cause the power save module to reset the three
> metrics: CPULoad, CurrentWatts, AveWatts when a node is determined to be
> suspended.

Wonderful, I'm glad this is an easy fix :-) Will the patch be applied to 23.02, or do we have to wait until 23.11?

Thanks,
Ole

Hopefully it's a *good* fix per the reviewer :)

Since the change mostly involves correcting a behavior to an expected behavior, my intuition is to target master for the change. Of course, if you have a strong preference that the change apply to 23.02 I would be willing to target that branch.

(In reply to Tyler Connel from comment #5)
> Since the change mostly involves correcting a behavior to an expected
> behavior, my intuition is to target master for the change. Of course, if you
> have a strong preference that the change apply to 23.02 I would be willing
> to target that branch.

Yes, I'd like to ask for the change in 23.02. We don't plan to upgrade to 23.11 until several minor releases have come out, so I'd be really happy if the change could be applied to 23.02 also! I'd like to get my power monitoring scripts perfected and tested with acct_gather_energy/ipmi quite soon.

Thanks,
Ole

Hi Tyler,

(In reply to Tyler Connel from comment #2)
> Fortunately, this issue is easily reproducible. I have a patch that I'll be
> uploading now that will cause the power save module to reset the three
> metrics: CPULoad, CurrentWatts, AveWatts when a node is determined to be
> suspended.

It's not only the power save module which may turn off a node; other reasons for node malfunction exist. Today we have a node with a dead motherboard which crashed and won't power up. Its status is DOWN+DRAIN+NOT_RESPONDING:

    $ scontrol show node c060
    NodeName=c060 Arch=x86_64 CoresPerSocket=10
    CPUAlloc=0 CPUEfctv=40 CPUTot=40 CPULoad=39.98
    AvailableFeatures=xeon6148v5,opa,xeon40,power_ipmi
    ActiveFeatures=xeon6148v5,opa,xeon40,power_ipmi
    Gres=(null)
    NodeAddr=c060 NodeHostName=c060 Version=23.02.5
    OS=Linux 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Jul 24 13:59:37 UTC 2023
    RealMemory=384000 AllocMem=0 FreeMem=371528 Sockets=4 Boards=1
    State=DOWN+DRAIN+NOT_RESPONDING ThreadsPerCore=1 TmpDisk=140000
    Weight=10535 Owner=N/A MCS_label=N/A
    Partitions=xeon40
    BootTime=2023-08-30T08:37:02 SlurmdStartTime=2023-09-26T20:24:43
    LastBusyTime=2023-09-27T07:22:11 ResumeAfterTime=None
    CfgTRES=cpu=40,mem=375G,billing=66
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=435 AveWatts=395
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    Reason=Motherboard defective [root@2023-09-27T07:56:40]

So it would be good if nodes with a state of NOT_RESPONDING also get their CPULoad, CurrentWatts, AveWatts metrics reset to N/A. Is this possible?

Thanks,
Ole

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> Hi Tyler,
>
> (In reply to Tyler Connel from comment #2)
> > Fortunately, this issue is easily reproducible. I have a patch that I'll be
> > uploading now that will cause the power save module to reset the three
> > metrics: CPULoad, CurrentWatts, AveWatts when a node is determined to be
> > suspended.
>
> It's not only the power save module which may turn off a node; other reasons
> for node malfunction exist. Today we have a node with a dead motherboard
> which crashed and won't power up. Its status is DOWN+DRAIN+NOT_RESPONDING:
>
> $ scontrol show node c060
> NodeName=c060 Arch=x86_64 CoresPerSocket=10
> CPUAlloc=0 CPUEfctv=40 CPUTot=40 CPULoad=39.98
> AvailableFeatures=xeon6148v5,opa,xeon40,power_ipmi
> ActiveFeatures=xeon6148v5,opa,xeon40,power_ipmi
> Gres=(null)
> NodeAddr=c060 NodeHostName=c060 Version=23.02.5
> OS=Linux 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Jul 24 13:59:37 UTC 2023
> RealMemory=384000 AllocMem=0 FreeMem=371528 Sockets=4 Boards=1
> State=DOWN+DRAIN+NOT_RESPONDING ThreadsPerCore=1 TmpDisk=140000
> Weight=10535 Owner=N/A MCS_label=N/A
> Partitions=xeon40
> BootTime=2023-08-30T08:37:02 SlurmdStartTime=2023-09-26T20:24:43
> LastBusyTime=2023-09-27T07:22:11 ResumeAfterTime=None
> CfgTRES=cpu=40,mem=375G,billing=66
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=435 AveWatts=395
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Motherboard defective [root@2023-09-27T07:56:40]
>
> So it would be good if nodes with a state of NOT_RESPONDING also get their
> CPULoad, CurrentWatts, AveWatts metrics reset to N/A. Is this possible?
>
> Thanks,
> Ole

This is an excellent point. I'll pull this down from review and reconsider my approach against unexpectedly downed nodes.

Just wanted to update as it's been a while since my last comment. I found a good solution for CPU load to display as N/A for node states which include any of: DOWN, POWERED_DOWN, and NO_RESPOND. For the power metrics (e.g. AveWatts) I'll have to spend some time to find a good place in acct_gather_energy to effect a change.

@Ole,

There have been some discussions on this issue. How would you feel about the expected values for a DOWN/POWERED_DOWN node being zeroed instead of N/A? E.g.:

    CPULoad=0
    CurrentWatts=0
    AveWatts=0

We feel that this would be a better set of values for the interface to display. Would you have any reservations?

Best,
Tyler Connel

Hi Tyler,

(In reply to Tyler Connel from comment #11)
> There have been some discussions on this issue. How would you feel about the
> expected values for a DOWN/POWERED_DOWN node being zeroed instead of N/A?
> E.g.:
>
> CPULoad=0
> CurrentWatts=0
> AveWatts=0
>
> We feel that this would be a better set of values for the interface to
> display. Would you have any reservations?

I'm fine with zero values, since that reflects the node state as well.

Thanks,
Ole

Hello @Ole,

The patch has been accepted to reset the values mentioned (CPU load, current watts and average watts) to 0 when a node goes down unexpectedly or is powered down. This patch was accepted for 23.11, and I recall that you had also wanted the change applied to 23.02. I will inquire as to whether the change can be applied to 23.02 before resolving the ticket.

Best,
Tyler Connel

I'm out of office, back on Thursday, November 9.

Best regards,
Ole Holm Nielsen

Hello @Ole,

As this involves a change in behavior, the fix will only apply to 23.11. I'll resolve this ticket, but feel free to reach out if you have further questions.

Best Regards,
Tyler Connel

Hi Tyler,

(In reply to Tyler Connel from comment #24)
> As this involves a change in behavior, the fix will only apply to 23.11.
> I'll resolve this ticket, but feel free to reach out if you have further
> questions.

Thanks a lot for the fix! I'm sorry it can't apply to 23.02, since I consider the current behavior incorrect. But so be it.

Greetings,
Ole
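
Since the fix only lands in 23.11, a site still on 23.02 could approximate the zeroing discussed in this thread on the reporting side of its own monitoring scripts. The following is a minimal sketch of that idea, not the accepted patch: the function names (`parse_node_record`, `is_down`, `effective_metrics`) are hypothetical, and the state tokens are assumptions taken from the `scontrol show node` output and the state names quoted in this ticket (`DOWN`, `POWERED_DOWN`, `NOT_RESPONDING`).

```python
import re

# Assumption: state tokens as they appear in `scontrol show node` output in
# this ticket. A node whose State contains any of them is treated as down.
DOWN_TOKENS = {"DOWN", "POWERED_DOWN", "NOT_RESPONDING"}

# The three metrics this ticket is about.
STALE_KEYS = ("CPULoad", "CurrentWatts", "AveWatts")


def parse_node_record(text):
    """Parse one `scontrol show node` record into a dict of key -> string."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # OS= and Reason= values contain spaces, so keep the rest of the line.
        if line.startswith(("OS=", "Reason=")):
            key, _, value = line.partition("=")
            fields[key] = value
            continue
        for token in line.split():
            # Split on the first '=' only, so CfgTRES=cpu=40,... stays intact.
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
    return fields


def is_down(state):
    """Return True if any '+'-separated state flag marks the node as down."""
    # Drop decoration characters such as a trailing '*' before comparing.
    flags = {re.sub(r"[^A-Z_]", "", flag) for flag in state.upper().split("+")}
    return bool(flags & DOWN_TOKENS)


def effective_metrics(fields):
    """Return the three metrics as floats, zeroed when the node is down."""
    down = is_down(fields.get("State", ""))
    result = {}
    for key in STALE_KEYS:
        if down:
            result[key] = 0.0
            continue
        try:
            result[key] = float(fields.get(key, ""))
        except ValueError:
            result[key] = None  # value such as "N/A" or a missing field
    return result


# Shortened copy of the c060 record quoted in this ticket.
SAMPLE_C060 = """\
NodeName=c060 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUEfctv=40 CPUTot=40 CPULoad=39.98
   OS=Linux 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Jul 24 13:59:37 UTC 2023
   State=DOWN+DRAIN+NOT_RESPONDING ThreadsPerCore=1 TmpDisk=140000
   CfgTRES=cpu=40,mem=375G,billing=66
   CurrentWatts=435 AveWatts=395
   Reason=Motherboard defective [root@2023-09-27T07:56:40]
"""

if __name__ == "__main__":
    print(effective_metrics(parse_node_record(SAMPLE_C060)))
```

In a real script the record text would come from running `scontrol show node <name>`; the sketch works on a string so it can be tried without a cluster. Reporting zero rather than N/A follows the choice agreed on in the thread.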