At SLUG'23 I discussed with Brian the sreport output for powered_down (suspended) nodes, which appears to be inconsistent, at least for on-premise nodes that have been suspended by the Slurm power save module. We would hope for sreport to display the number of nodes that have been suspended so that we can estimate the power savings.

Our cluster with 675 on-premise nodes currently has 50 nodes powered off and 1 broken node:

Batch job status for cluster niflheim at Thu Sep 14 03:40:52 CEST 2023

Node states summary:
allocated   584 nodes ( 86.52%)  22320 CPUs ( 90.26%)
drained~      1 nodes (  0.15%)     40 CPUs (  0.16%)  Powered off
idle         34 nodes (  5.04%)    688 CPUs (  2.78%)
idle~        50 nodes (  7.41%)   1200 CPUs (  4.85%)  Powered off
mixed         6 nodes (  0.89%)    480 CPUs (  1.94%)
Total       675 nodes (100.00%)  24728 CPUs (100.00%)

but sreport gives this result for a 3 hour period:

$ sreport cluster utilization Start=0914 End=03:00 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2023-09-14T00:00:00 - 2023-09-14T02:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster Allocated     Down PLND Dow     Idle  Planned  Reported
--------- --------- -------- -------- -------- -------- ---------
 niflheim    92.95%    3.74%    0.00%    0.00%    3.31%   100.00%

The sinfo command gives a full status:

$ sinfo -O nodes:6,cpus:5,statecomplete,nodelist:150
NODES CPUS STATECOMPLETE       NODELIST
584   16+  allocated           a[001-128],b[001-012],c[001-019,021-196],d[001-096],i[007-013,017-020,029,031-038,041-050],s[001,003],x[004,006,008,010-012,017,023-024,033-051,053-05
34    16+  idle                i[005-006,014-016,021-028,030,039-040],x[052,055,065,098,103-107,109,149,151,157,159-160,162-164]
50    24   idle+powered_down   x[005,007,009,013-016,018-022,025-032,057,073,078,086-088,102,111-112,121,123-129,133-141,165,168-170]
1     40   down+drain+powered_dc020
6     80   mixed               s[002,004-008]

So it would seem that sreport reports powered_down (suspended) nodes inconsistently. The numbers for Down, PLND Down and Planned do not appear to reflect the cluster status.

Question: How may we use sreport to obtain statistics on the number of powered down nodes during a given period of time?

Thanks,
Ole
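
For reference, the snapshot share of powered-down nodes can be recomputed from the sinfo rows quoted above. This is only a rough sketch in Python: the rows are copied by hand into a hypothetical SINFO_ROWS list, and it assumes that the last state shown is "down+drain+powered_d" cut off at the state column width, with c020 being the node name. It describes a single point in time, not the time-integrated figures that sreport works with.

```python
# (node count, STATECOMPLETE) pairs copied from the sinfo output above.
# Assumption: the drained node's state is truncated to "...powered_d",
# so flags are matched on that prefix rather than on "powered_down".
SINFO_ROWS = [
    (584, "allocated"),
    (34, "idle"),
    (50, "idle+powered_down"),
    (1, "down+drain+powered_d"),
    (6, "mixed"),
]


def is_powered_down(state):
    """True if any '+'-separated flag starts with 'powered_d'."""
    return any(flag.startswith("powered_d") for flag in state.split("+"))


def powered_down_nodes(rows):
    """Sum the node counts of all rows whose state is powered down."""
    return sum(count for count, state in rows if is_powered_down(state))


total_nodes = sum(count for count, _ in SINFO_ROWS)
down_nodes = powered_down_nodes(SINFO_ROWS)
print(f"{down_nodes} of {total_nodes} nodes powered down "
      f"({100.0 * down_nodes / total_nodes:.2f}%)")
```

The count includes the drained node as well as the 50 suspended idle nodes, because both carry the powered-down flag in this snapshot.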
Hello Ole, I will need some time to look into this, but at first glance sreport’s PLND Down field calculation does not include the time that non-CLOUD nodes are in the POWERED_DOWN state. I will report back when I have more information. https://slurm.schedmd.com/sreport.html#OPT_cluster-Utilization Thanks, --Megan
Hello Ole, After discussing this with Brian, I’ll go ahead and work on including non-cloud nodes in the calculation. I’ll keep you updated on the patch’s progress. Regards, --Megan
Hello Ole,

The time that all nodes are in the powered_down state will now be included in sreport’s PLND Down field instead of it only applying to cloud nodes. The change can be found in the following commit:

commit d021731cbf366859bb98bf3232463545916d6d40
Author: Megan Dahl <megan@schedmd.com>
Date:   Mon Sep 25 10:52:16 2023 -0600

    sreport PlannedDown field includes the time all nodes were POWERED_DOWN

    PlannedDown used to only include the time nodes were POWERED_DOWN if
    they were cloud nodes. However, it is desirable to see statistics on all
    POWERED_DOWN nodes.

    Bug 17689

This will be available in 23.11.

Regards,
--Megan
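
Once the PLND Down field covers all powered-down nodes, its share of the reported time could be extracted from pipe-delimited sreport output. The Python sketch below is illustrative only: it assumes the report was produced in parsable form (for example with sreport's -P option and -t hours), that the column header reads "PLND Down" in that form, and it uses made-up numbers for a hypothetical cluster named "example" rather than real niflheim data.

```python
# Hypothetical pipe-delimited report; the values are invented for the sketch.
SAMPLE = """\
Cluster|Allocated|Down|PLND Down|Idle|Planned|Reported
example|900|10|50|30|10|1000
"""


def parse_utilization(text):
    """Parse pipe-delimited report text into a list of dicts keyed by header."""
    lines = [ln for ln in text.splitlines() if "|" in ln]
    header = [name.strip() for name in lines[0].split("|")]
    return [
        dict(zip(header, (value.strip() for value in ln.split("|"))))
        for ln in lines[1:]
    ]


def planned_down_share(row, column="PLND Down"):
    """Fraction of the reported time spent in the planned-down column."""
    return float(row[column]) / float(row["Reported"])


rows = parse_utilization(SAMPLE)
for row in rows:
    print(f"{row['Cluster']}: {100.0 * planned_down_share(row):.2f}% planned down")
```

The column name is passed as a parameter so that it can be adjusted if the actual header differs from the one assumed here.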
Hi Megan,

Thanks a lot for creating this patch:

(In reply to Megan Dahl from comment #5)
> The time that all nodes are in the powered_down state will now be included
> in sreport’s PLND Down field instead of it only applying to cloud nodes. The
> change can be found in the following commit:
> commit d021731cbf366859bb98bf3232463545916d6d40
> Author: Megan Dahl <megan@schedmd.com>
> Date: Mon Sep 25 10:52:16 2023 -0600
>
> sreport PlannedDown field includes the time all nodes were POWERED_DOWN
>
> PlannedDown used to only include the time nodes were POWERED_DOWN if
> they were cloud nodes. However, it is desirable to see statistics on all
> POWERED_DOWN nodes.
>
> Bug 17689
>
> This will be available in 23.11.

We probably won't be upgrading to 23.11 until after a few minor releases of 23.11. Is there any chance that the patch can make it into 23.02?

Thanks,
Ole
Hi Ole, Unfortunately, since this is a user-visible functional change, it cannot be added to 23.02. The purpose of this policy is to avoid breaking maintenance releases. Regards, --Megan