Ticket 12170

Summary:	_get_joules_task returns bad consumed energy values after reconfigure or slurmd restart
Product:	Slurm	Reporter:	Regine Gaudin <regine.gaudin>
Component:	Accounting	Assignee:	Carlos Tripiana Montes <tripiana>
Status:	RESOLVED FIXED	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	alex, felip.moll, martin.perry, sts
Version:	21.08.1
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=5518
Site:	CEA	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	21.08.7, 22.05.0pre1
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Ticket Depends on:
Ticket Blocks:	13778

Description Regine Gaudin 2021-07-30 07:06:56 MDT

Hi

We are making frequent reconfigure as our clusters have hundreds of nodes leaving and rejoigning production.

When a  reconfigure is requested, ipmi-thread on each node is restarted.
consumed value of the node is then set to NOVAL64, 0 or 0<value < job_already_energy

So a job already running with an old->consumed_energy value positive on that node 
will have an incorrect new energy... something like
 new->base_consumed_energy = 0 + NOVAL64 - old->consumed_energy

as calculation is in 
src/plugins/acct_gather_energy/ipmi/acct_gather_energy.c

new->base_consumed_energy = adjustment +
    new->base_consumed_energy = adjustment +
                (new->consumed_energy - old->consumed_energy);
           

See slurmd debug energy log:

2021-07-30T14:49:35.830] ipmi-thread: sensor 13 current_watts: 200, consumed 112345 Joules 3000 new, ave watts 192
[2021-07-30T14:49:50.137] ipmi-thread: sensor 13 current_watts: 200, consumed 115345 Joules 3000 new, ave watts 192
[2021-07-30T14:49:50.176] [82561.3] _get_joules_task: consumed 83995 Joules (received 6000(200 watts) from slurmd)

(step 82561.3] has consumed 83995 J83995 Joules by now)

[2021-07-30T14:50:03.196] got reconfigure request
[2021-07-30T14:50:03.198] all threads complete
[2021-07-30T14:50:03.414] Considering each NUMA node as a socket
[2021-07-30T14:50:03.617] Node reconfigured socket/core boundaries SocketsPerBoard=2:8(hw) CoresPerSocket=64:16(hw)
[2021-07-30T14:50:03.622] Message aggregation disabled
[2021-07-30T14:50:03.852] Warning: Note very large processing time from slurm_topo_build_config: usec=226658 began=14:50:03.625
[2021-07-30T14:50:04.805] Can not stat gres.conf file (/etc/slurm/gres.conf), using slurm.conf data
[2021-07-30T14:50:04.853] ipmi-thread: ended
[2021-07-30T14:50:05.018] AcctGatherEnergy IPMI plugin thread launched
[2021-07-30T14:50:05.019] AcctGatherEnergy IPMI plugin loaded
[2021-07-30T14:50:05.021] ipmi-thread: launched
[2021-07-30T14:50:05.572] AcctGatherEnergy IPMI plugin thread init


[2021-07-30T14:50:05.651] ipmi-thread: sensor 13 current_watts: 200, consumed 0 Joules 0 new, ave watts 200
[2021-07-30T14:50:20.064] ipmi-thread: sensor 13 current_watts: 200, consumed 3000 Joules 3000 new, ave watts 200
[2021-07-30T14:50:20.075] [82561.3] _get_joules_task: consumed 18446744073709523266 Joules (received 18446744073709439271(200 watts) from slurmd)
2021-07-30T14:52:35.657] ipmi-thread: sensor 13 current_watts: 200, consumed 30000 Joules 3200 new, ave watts 200
[2021-07-30T14:52:50.063] ipmi-thread: sensor 13 current_watts: 200, consumed 33000 Joules 3000 new, ave watts 200

Energy of the step 82561.3] is now reset to 1650 Joules !


[2021-07-30T14:52:50.080] [82561.3] _get_joules_task: consumed 1650 Joules (received 6000(200 watts) from slurmd)
[2021-07-30T14:53:05.766] ipmi-thread: sensor 13 current_watts: 200, consumed 36000 Joules 3000 new, ave watts 200
[2021-07-30T14:53:20.072] ipmi-thread: sensor 13 current_watts: 200, consumed 39000 Joules 3000 new, ave watts 200


So the case where the energy of the node is becoming smaller than the energy of a running job is missing. I have tried the following patch but not sure it's clean but it seems to solve the problem:

2021-07-29T11:10:03.432] [80756.0] _get_joules_task: consumed 113305 Joules (received 0(570 watts) from slurmd)

[2021-07-29T11:10:03.463] got reconfigure request
[2021-07-29T11:10:03.465] waiting on 1 active threads

[2021-07-29T11:10:48.253] [80756.0] _get_joules_task: consumed 113305 Joules (
[2021-07-29T11:10:49.034] [80756.0] _get_joules_task: consumed 113875 Joules (received 570(570 watts) from slurmd)
[2021-07-29T11:10:50.063] [80756.0] _get_joules_task: consumed 114445 Joules (received 570(570 watts) from slurmd)


--- slurm-19.05.5.orig/src/plugins/acct_gather_energy/ipmi/acct_gather_energy_ipmi.c    2019-12-20 21:44:51.000000000 +0100
+++ slurm-19.05.5.work/src/plugins/acct_gather_energy/ipmi/acct_gather_energy_ipmi.c    2021-07-28 15:00:14.374241513 +0200
@@ -84,7 +84,8 @@
 #define _DEBUG_ENERGY 1
 #define IPMI_VERSION 2         /* Data structure version number */
 #define MAX_LOG_ERRORS 5       /* Max sensor reading errors log messages */
-
+#define FUZZY_ENERGY_MAX 1000000000
+#define fuzzy_max_energy(v1) ((v1) > FUZZY_ENERGY_MAX)
 /*
  * These variables are required by the generic plugin interface.  If they
  * are not found in the plugin, the plugin loader will ignore it.
@@ -822,6 +823,7 @@
 {
        time_t now = time(NULL);
        static bool first = true;
+       static bool impi_thread_restart = false;
        uint64_t adjustment = 0;
        uint16_t i;
-               if (!first) {
-                       new->consumed_energy -= start_current_energies[i];
-                       new->base_consumed_energy = adjustment +

+                if  (  fuzzy_max_energy(adjustment) ) {
+                    info("_get_joules_task: invalid  adjustment %"PRIu64" reset to 0 ",adjustment);
+                    adjustment = 0;
+                }
+               if  (!first) {
+                   if (new->consumed_energy >= old->consumed_energy )  {
+                       new->consumed_energy -= start_current_energies[i];
+                       new->base_consumed_energy = adjustment +
                                (new->consumed_energy - old->consumed_energy);
+                   }else{
+                        if (impi_thread_restart == false) {
+                                start_current_energies[i] =0;
+                                        info("_get_joules_task: start_current_energies reset ");
+                                                          }
+                        new->base_consumed_energy  = adjustment +  new->consumed_energy - start_current_energies[i] ;
+                        start_current_energies[i] = new->consumed_energy + adjustment ;
+                       impi_thread_restart = true;
+
+                 }

                } else {
                        /* This is just for the step, so take all the pervious
                           consumption out of the mix.
@@ -873,10 +900,31 @@
                        new->base_consumed_energy = 0;
                }


+
+/* Restart slurmd or reconfigure  might give new consumed energy to NO_VAL64 reset new base consumed energy to 0 */
+                if  (  fuzzy_max_energy(new->base_consumed_energy) ) {
+                if (debug_flags & DEBUG_FLAG_ENERGY)
+                    info("_get_joules_task: invalid new->base_consumed_energy %"PRIu64" reset to 0",new->base_consumed_energy);
+                    info("_get_joules_task: invalid  adjustment %"PRIu64" ",adjustment);
+                    info("_get_joules_task: invalid start_current_energies[i] %"PRIu64" ",start_current_energies[i]);
+                    info("_get_joules_task: invalid new->consumed_energy %"PRIu64" ",new->consumed_energy );
+                    info("_get_joules_task: invalid old->consumed_energy %"PRIu64" ",old->consumed_energy );
+                    new->base_consumed_energy = 0;
+                }
+
                } else {
                        /* This is just for the step, so take all the pervious
                           consumption out of the mix.
@@ -873,10 +900,31 @@
                        new->base_consumed_energy = 0;
                }

Comment 1 Felip Moll 2021-08-24 10:57:20 MDT

(In reply to Regine Gaudin from comment #0)
> Hi
> 
> We are making frequent reconfigure as our clusters have hundreds of nodes
> leaving and rejoigning production.
> 
> When a  reconfigure is requested, ipmi-thread on each node is restarted.
> consumed value of the node is then set to NOVAL64, 0 or 0<value <
> job_already_energy
> 
> So a job already running with an old->consumed_energy value positive on that
> node 
> will have an incorrect new energy... something like
>  new->base_consumed_energy = 0 + NOVAL64 - old->consumed_energy
> 
> as calculation is in 
> src/plugins/acct_gather_energy/ipmi/acct_gather_energy.c
> 
> new->base_consumed_energy = adjustment +
>     new->base_consumed_energy = adjustment +
>                 (new->consumed_energy - old->consumed_energy);
>            
> 
> See slurmd debug energy log:
> 
> 2021-07-30T14:49:35.830] ipmi-thread: sensor 13 current_watts: 200, consumed
> 112345 Joules 3000 new, ave watts 192
> [2021-07-30T14:49:50.137] ipmi-thread: sensor 13 current_watts: 200,
> consumed 115345 Joules 3000 new, ave watts 192
> [2021-07-30T14:49:50.176] [82561.3] _get_joules_task: consumed 83995 Joules
> (received 6000(200 watts) from slurmd)
> 
> (step 82561.3] has consumed 83995 J83995 Joules by now)
> 
> [2021-07-30T14:50:03.196] got reconfigure request
> [2021-07-30T14:50:03.198] all threads complete
> [2021-07-30T14:50:03.414] Considering each NUMA node as a socket
> [2021-07-30T14:50:03.617] Node reconfigured socket/core boundaries
> SocketsPerBoard=2:8(hw) CoresPerSocket=64:16(hw)
> [2021-07-30T14:50:03.622] Message aggregation disabled
> [2021-07-30T14:50:03.852] Warning: Note very large processing time from
> slurm_topo_build_config: usec=226658 began=14:50:03.625
> [2021-07-30T14:50:04.805] Can not stat gres.conf file
> (/etc/slurm/gres.conf), using slurm.conf data
> [2021-07-30T14:50:04.853] ipmi-thread: ended
> [2021-07-30T14:50:05.018] AcctGatherEnergy IPMI plugin thread launched
> [2021-07-30T14:50:05.019] AcctGatherEnergy IPMI plugin loaded
> [2021-07-30T14:50:05.021] ipmi-thread: launched
> [2021-07-30T14:50:05.572] AcctGatherEnergy IPMI plugin thread init
> 
> 
> [2021-07-30T14:50:05.651] ipmi-thread: sensor 13 current_watts: 200,
> consumed 0 Joules 0 new, ave watts 200
> [2021-07-30T14:50:20.064] ipmi-thread: sensor 13 current_watts: 200,
> consumed 3000 Joules 3000 new, ave watts 200
> [2021-07-30T14:50:20.075] [82561.3] _get_joules_task: consumed
> 18446744073709523266 Joules (received 18446744073709439271(200 watts) from
> slurmd)
> 2021-07-30T14:52:35.657] ipmi-thread: sensor 13 current_watts: 200, consumed
> 30000 Joules 3200 new, ave watts 200
> [2021-07-30T14:52:50.063] ipmi-thread: sensor 13 current_watts: 200,
> consumed 33000 Joules 3000 new, ave watts 200
> 
> Energy of the step 82561.3] is now reset to 1650 Joules !
> 
> 
> [2021-07-30T14:52:50.080] [82561.3] _get_joules_task: consumed 1650 Joules
> (received 6000(200 watts) from slurmd)
> [2021-07-30T14:53:05.766] ipmi-thread: sensor 13 current_watts: 200,
> consumed 36000 Joules 3000 new, ave watts 200
> [2021-07-30T14:53:20.072] ipmi-thread: sensor 13 current_watts: 200,
> consumed 39000 Joules 3000 new, ave watts 200
> 
> 
> So the case where the energy of the node is becoming smaller than the energy
> of a running job is missing. I have tried the following patch but not sure
> it's clean but it seems to solve the problem:
> 
> 2021-07-29T11:10:03.432] [80756.0] _get_joules_task: consumed 113305 Joules
> (received 0(570 watts) from slurmd)
> 
> [2021-07-29T11:10:03.463] got reconfigure request
> [2021-07-29T11:10:03.465] waiting on 1 active threads
> 
> [2021-07-29T11:10:48.253] [80756.0] _get_joules_task: consumed 113305 Joules
> (
> [2021-07-29T11:10:49.034] [80756.0] _get_joules_task: consumed 113875 Joules
> (received 570(570 watts) from slurmd)
> [2021-07-29T11:10:50.063] [80756.0] _get_joules_task: consumed 114445 Joules
> (received 570(570 watts) from slurmd)
> 
> 
> ---
> slurm-19.05.5.orig/src/plugins/acct_gather_energy/ipmi/
> acct_gather_energy_ipmi.c    2019-12-20 21:44:51.000000000 +0100
> +++
> slurm-19.05.5.work/src/plugins/acct_gather_energy/ipmi/
> acct_gather_energy_ipmi.c    2021-07-28 15:00:14.374241513 +0200
> @@ -84,7 +84,8 @@
>  #define _DEBUG_ENERGY 1
>  #define IPMI_VERSION 2         /* Data structure version number */
>  #define MAX_LOG_ERRORS 5       /* Max sensor reading errors log messages */
> -
> +#define FUZZY_ENERGY_MAX 1000000000
> +#define fuzzy_max_energy(v1) ((v1) > FUZZY_ENERGY_MAX)
>  /*
>   * These variables are required by the generic plugin interface.  If they
>   * are not found in the plugin, the plugin loader will ignore it.
> @@ -822,6 +823,7 @@
>  {
>         time_t now = time(NULL);
>         static bool first = true;
> +       static bool impi_thread_restart = false;
>         uint64_t adjustment = 0;
>         uint16_t i;
> -               if (!first) {
> -                       new->consumed_energy -= start_current_energies[i];
> -                       new->base_consumed_energy = adjustment +
> 
> +                if  (  fuzzy_max_energy(adjustment) ) {
> +                    info("_get_joules_task: invalid  adjustment %"PRIu64"
> reset to 0 ",adjustment);
> +                    adjustment = 0;
> +                }
> +               if  (!first) {
> +                   if (new->consumed_energy >= old->consumed_energy )  {
> +                       new->consumed_energy -= start_current_energies[i];
> +                       new->base_consumed_energy = adjustment +
>                                 (new->consumed_energy -
> old->consumed_energy);
> +                   }else{
> +                        if (impi_thread_restart == false) {
> +                                start_current_energies[i] =0;
> +                                        info("_get_joules_task:
> start_current_energies reset ");
> +                                                          }
> +                        new->base_consumed_energy  = adjustment + 
> new->consumed_energy - start_current_energies[i] ;
> +                        start_current_energies[i] = new->consumed_energy +
> adjustment ;
> +                       impi_thread_restart = true;
> +
> +                 }
> 
>                 } else {
>                         /* This is just for the step, so take all the
> pervious
>                            consumption out of the mix.
> @@ -873,10 +900,31 @@
>                         new->base_consumed_energy = 0;
>                 }
> 
> 
> +
> +/* Restart slurmd or reconfigure  might give new consumed energy to
> NO_VAL64 reset new base consumed energy to 0 */
> +                if  (  fuzzy_max_energy(new->base_consumed_energy) ) {
> +                if (debug_flags & DEBUG_FLAG_ENERGY)
> +                    info("_get_joules_task: invalid
> new->base_consumed_energy %"PRIu64" reset to 0",new->base_consumed_energy);
> +                    info("_get_joules_task: invalid  adjustment %"PRIu64"
> ",adjustment);
> +                    info("_get_joules_task: invalid
> start_current_energies[i] %"PRIu64" ",start_current_energies[i]);
> +                    info("_get_joules_task: invalid new->consumed_energy
> %"PRIu64" ",new->consumed_energy );
> +                    info("_get_joules_task: invalid old->consumed_energy
> %"PRIu64" ",old->consumed_energy );
> +                    new->base_consumed_energy = 0;
> +                }
> +
>                 } else {
>                         /* This is just for the step, so take all the
> pervious
>                            consumption out of the mix.
> @@ -873,10 +900,31 @@
>                         new->base_consumed_energy = 0;
>                 }

Hi Regine.

First of all I apologize for not having responded earlier to this issue.
This situation is not new to me and I am currently investigating what's the best way to tackle this. The problem is that the energy value is taken from slurmd, so I think we must fix it in the slurmd side instead of trying a workaround like the one you did.

Also this FUZZY variable is not accurate, and the value from slurmd is not always exactly to NO_VAL64.

I will keep you informed of the progression.

Comment 19 Carlos Tripiana Montes 2022-01-25 09:55:47 MST

Hello Regine,

1st of all sorry that we haven't replied for a while.

We are working on this issue, but it seems quite more complex to come up with an orthodox solution, specially when we are looking after fixing this reconfig issue but also see if we can support a daemon reload.

Regards,
Carlos.

Comment 20 Carlos Tripiana Montes 2022-02-01 06:49:23 MST

Hi Regine,

Well, that's interesting. I've successfully set an out-of-band communication between an emulated IPMI (using ipmi_sim), and acct_gather_energy/ipmi (slurmd).

It was properly getting power metrics for awhile, then issued and scontrol reconfig.

In this scenario, I've got a segfault. Which is what I was expecting from you as well. Why I was expecting you to segfault? Well, at least in the git 21.08 branch (HEAD), this is what it's happening to me. And it makes sense, because variables like:

sensors
start_current_energies
descriptions

are freed in the plugin's fini function. And this function is called after issuing a reconfig. Then, it gets inited again.

But I see none of them properly reinitied after a reconfig. They remain pointing to a freed memory. In fact, in my case, I see them NULLed in GDB.

I think the discrepancy here about segfaulting or not is because I have all the debug stuff enabled, plus no optimizations and all the developer things enabled. While, I guess you don't, and slurmd is trying to access to the old pointer data.

In any case, your values are wrong because we're not properly handling reconfig. And that's what it matters in the end, whether segfaulting or not.

So now I'm going to look after a fix for this.

Thanks,
Carlos.

Comment 27 Carlos Tripiana Montes 2022-02-15 09:56:43 MST

Hi Regine,

We have a potential fix for this bug, and we are now testing it.

It looks promising and, if it passes our 1st internal testing round, I want to bring you the opportunity to apply this fix in your cluster.

It would be an early access patch that could still suffer from modifications but it would be very valuable to have the patch tested in your HW.

If you agree, I'll send the patch as soon as we consider it good enough to share.

Regards,
Carlos.

Comment 37 Regine Gaudin 2022-03-21 07:20:19 MDT

Hi, Yes we would be happy to have the patch

Comment 39 Martin Perry 2022-03-25 11:30:06 MDT

FYI: Atos has another customer that may also be experiencing this problem (Meteo France). They are running Slurm 20.02. We would like to provide them with your fix when it is available. We can rework it for 20.02 if necessary.

Comment 40 Regine Gaudin 2022-03-29 02:56:46 MDT

"I'll send the patch as soon as we consider it good enough to share."
Ok for meteo when do you think you could provide us the patch  ?
Thanks

Comment 41 Regine Gaudin 2022-04-01 00:47:51 MDT

This bug should have now higher priority as many users are writing papers on energy consuming based on false datas... So they are complaining about it.

Comment 42 Regine Gaudin 2022-04-01 00:48:56 MDT

Se ware slowly upgarding machien one by one to 20.11.8

Comment 43 Regine Gaudin 2022-04-01 00:49:49 MDT

(In reply to Regine Gaudin from comment #42)
We are slowly upgrading machine one by one to 20.11.8

Comment 45 Carlos Tripiana Montes 2022-04-04 02:24:26 MDT

Hi,

We don't have an ETA for this, even though we have a proposal that is going to be reviewed soon.

In any case, I've upgraded the sev to 3 to reflect your needs.

I'll let you know once I have more information.

Best regards,
Carlos.

Comment 49 Carlos Tripiana Montes 2022-04-06 02:17:55 MDT

Hi Regine,

Commit:

https://github.com/SchedMD/slurm/commit/d52c2e4f76c765872b114b12f3462e6885f27659

merges in 21.08 the commits added in master to fix this issue. You'll see the fixes address both scontrol reconfig and slurmd process restart, which was another issue found while fixing reconfig.

Cheers,
Carlos.