Ticket 17689

Summary: sreport reports inconsistent powered_down (suspended) nodes
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: User Commands
Assignee: Megan Dahl <megan>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: megan
Version: 23.02.4
Hardware: Linux
OS: Linux
Site: DTU Physics
Version Fixed: 23.11.x

Description Ole.H.Nielsen@fysik.dtu.dk 2023-09-13 19:51:57 MDT
At SLUG'23 I discussed with Brian the sreport output for powered_down (suspended) nodes which appears to be inconsistent, at least for on-premise nodes that have been suspended by the Slurm power save module.  We would hope for sreport to display the number of nodes that have been suspended so that we can estimate the power savings.

Our cluster with 675 on-premise nodes currently has 50 nodes powered off and 1 broken node:

Batch job status for cluster niflheim at Thu Sep 14 03:40:52 CEST 2023
Node states summary:
allocated    584 nodes ( 86.52%)  22320 CPUs ( 90.26%)
drained~       1 nodes (  0.15%)     40 CPUs (  0.16%) Powered off
idle          34 nodes (  5.04%)    688 CPUs (  2.78%)
idle~         50 nodes (  7.41%)   1200 CPUs (  4.85%) Powered off
mixed          6 nodes (  0.89%)    480 CPUs (  1.94%)
Total        675 nodes (100.00%)  24728 CPUs (100.00%)

but sreport gives this result for a 3 hour period:

$ sreport cluster utilization Start=0914 End=03:00 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2023-09-14T00:00:00 - 2023-09-14T02:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster Allocated     Down PLND Dow     Idle  Planned  Reported
--------- --------- -------- -------- -------- -------- ---------
 niflheim    92.95%    3.74%    0.00%    0.00%    3.31%   100.00%

The sinfo command gives a full status:

$  sinfo -O nodes:6,cpus:5,statecomplete,nodelist:150
NODES CPUS STATECOMPLETE       NODELIST
584   16+  allocated           a[001-128],b[001-012],c[001-019,021-196],d[001-096],i[007-013,017-020,029,031-038,041-050],s[001,003],x[004,006,008,010-012,017,023-024,033-051,053-05
34    16+  idle                i[005-006,014-016,021-028,030,039-040],x[052,055,065,098,103-107,109,149,151,157,159-160,162-164]
50    24   idle+powered_down   x[005,007,009,013-016,018-022,025-032,057,073,078,086-088,102,111-112,121,123-129,133-141,165,168-170]
1     40   down+drain+powered_dc020
6     80   mixed               s[002,004-008]

So it would seem that sreport reports inconsistent powered_down (suspended) nodes.  The numbers for Down, PLND Down and Planned do not appear to reflect the cluster status.

Question: How may we use sreport to obtain statistics on the number of powered down nodes during a given period of time?

Thanks,
Ole
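
For reference, the powered-off share implied by the node state summary in the description can be worked out directly. The following is a minimal Python sketch and not part of the original report: the names SNAPSHOT and powered_down_share are illustrative, and it assumes that the states marked with a trailing "~" are the powered-off ones, as the summary labels them. It also assumes that the sreport percentages are computed over CPUs. The snapshot was taken at 03:40 while the sreport window ends at 03:00, so the two are only roughly comparable.

```python
# Node-state snapshot copied from the description: state -> (nodes, CPUs).
# Assumption: a trailing "~" marks a powered-off state, as labelled there.
SNAPSHOT = {
    "allocated": (584, 22320),
    "drained~": (1, 40),
    "idle": (34, 688),
    "idle~": (50, 1200),
    "mixed": (6, 480),
}


def powered_down_share(snapshot):
    """Return node/CPU counts and percentages for powered-off states."""
    total_nodes = sum(nodes for nodes, _ in snapshot.values())
    total_cpus = sum(cpus for _, cpus in snapshot.values())
    off_nodes = sum(n for state, (n, _) in snapshot.items() if state.endswith("~"))
    off_cpus = sum(c for state, (_, c) in snapshot.items() if state.endswith("~"))
    return {
        "total_nodes": total_nodes,
        "total_cpus": total_cpus,
        "off_nodes": off_nodes,
        "off_cpus": off_cpus,
        "off_nodes_pct": 100.0 * off_nodes / total_nodes,
        "off_cpus_pct": 100.0 * off_cpus / total_cpus,
    }


share = powered_down_share(SNAPSHOT)
print(f"{share['off_nodes']} of {share['total_nodes']} nodes powered off "
      f"({share['off_nodes_pct']:.2f}%)")
print(f"{share['off_cpus']} of {share['total_cpus']} CPUs powered off "
      f"({share['off_cpus_pct']:.2f}%)")
```

Under these assumptions, roughly 5% of the CPUs were powered off at the time of the snapshot, whereas the PLND Down column in the sreport output shows 0.00%.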
Comment 1 Megan Dahl 2023-09-18 15:24:43 MDT
Hello Ole,

I will need some time to look into this, but at first glance sreport’s PLND Down calculation does not include the time that non-CLOUD nodes spend in the POWERED_DOWN state. I will report back when I have more information.
https://slurm.schedmd.com/sreport.html#OPT_cluster-Utilization

Thanks,
--Megan
Comment 2 Megan Dahl 2023-09-19 09:56:55 MDT
Hello Ole,

After discussing this with Brian, I’ll go ahead and work on including non-cloud nodes in the calculation. I’ll keep you updated on the patch’s progress. 

Regards,
--Megan
Comment 5 Megan Dahl 2023-09-29 14:57:53 MDT
Hello Ole,

The time that all nodes are in the powered_down state will now be included in sreport’s PLND Down field instead of it only applying to cloud nodes. The change can be found in the following commit:
commit d021731cbf366859bb98bf3232463545916d6d40
Author: Megan Dahl <megan@schedmd.com>
Date:   Mon Sep 25 10:52:16 2023 -0600

	sreport PlannedDown field includes the time all nodes were POWERED_DOWN
    
	PlannedDown used to only include the time nodes were POWERED_DOWN if
	they were cloud nodes. However, it is desirable to see statistics on all
	POWERED_DOWN nodes.
    
	Bug 17689

This will be available in 23.11.

Regards,
--Megan
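
Once PLND Down covers all powered-down nodes, the percentage can be turned into CPU-hours for a rough power-savings estimate. The following is a minimal Python sketch under stated assumptions, not an official tool: the helper names are hypothetical, the column names are supplied by hand (the fixed-width header truncates "PLND Down"), and the row is the data line from the sreport output in the description. It further assumes the percentages are taken over CPUs and that the cluster size is constant over the reporting window.

```python
# Column order as shown in the fixed-width sreport output in the description.
# Names are supplied by hand because the printed header truncates "PLND Down".
COLUMNS = ("Cluster", "Allocated", "Down", "PLND Down", "Idle", "Planned", "Reported")


def parse_utilization_row(line):
    """Parse one whitespace-separated data row of '-t percent' output."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = {"Cluster": fields[0]}
    for name, value in zip(COLUMNS[1:], fields[1:]):
        row[name] = float(value.rstrip("%"))
    return row


def planned_down_cpu_hours(row, total_cpus, hours):
    """Convert the PLND Down percentage into CPU-hours for the window."""
    return row["PLND Down"] / 100.0 * total_cpus * hours


# Data row copied from the description (3-hour window, 24728 CPUs).
ROW = " niflheim    92.95%    3.74%    0.00%    0.00%    3.31%   100.00%"
row = parse_utilization_row(ROW)
print(row["PLND Down"])
print(planned_down_cpu_hours(row, total_cpus=24728, hours=3))
```

With the 23.02.4 output from the description, this yields zero planned-down CPU-hours, which is the inconsistency the ticket reports.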
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2023-10-03 00:57:54 MDT
Hi Megan,

Thanks a lot for creating this patch:

(In reply to Megan Dahl from comment #5)
> The time that all nodes are in the powered_down state will now be included
> in sreport’s PLND Down field instead of it only applying to cloud nodes. The
> change can be found in the following commit:
> commit d021731cbf366859bb98bf3232463545916d6d40
> Author: Megan Dahl <megan@schedmd.com>
> Date:   Mon Sep 25 10:52:16 2023 -0600
> 
> 	sreport PlannedDown field includes the time all nodes were POWERED_DOWN
>     
> 	PlannedDown used to only include the time nodes were POWERED_DOWN if
> 	they were cloud nodes. However, it is desirable to see statistics on all
> 	POWERED_DOWN nodes.
>     
> 	Bug 17689
> 
> This will be available in 23.11.

We probably won't be upgrading to 23.11 until after a few minor releases of 23.11.  Is there any chance that the patch can make it into 23.02?

Thanks,
Ole
Comment 7 Megan Dahl 2023-10-03 14:36:08 MDT
Hi Ole,

Unfortunately, since this is a user-visible functional change, it cannot be added to 23.02. The purpose of this policy is to avoid breaking maintenance releases.

Regards,
--Megan