Hi We are making frequent reconfigure as our clusters have hundreds of nodes leaving and rejoigning production. When a reconfigure is requested, ipmi-thread on each node is restarted. consumed value of the node is then set to NOVAL64, 0 or 0<value < job_already_energy So a job already running with an old->consumed_energy value positive on that node will have an incorrect new energy... something like new->base_consumed_energy = 0 + NOVAL64 - old->consumed_energy as calculation is in src/plugins/acct_gather_energy/ipmi/acct_gather_energy.c new->base_consumed_energy = adjustment + new->base_consumed_energy = adjustment + (new->consumed_energy - old->consumed_energy); See slurmd debug energy log: 2021-07-30T14:49:35.830] ipmi-thread: sensor 13 current_watts: 200, consumed 112345 Joules 3000 new, ave watts 192 [2021-07-30T14:49:50.137] ipmi-thread: sensor 13 current_watts: 200, consumed 115345 Joules 3000 new, ave watts 192 [2021-07-30T14:49:50.176] [82561.3] _get_joules_task: consumed 83995 Joules (received 6000(200 watts) from slurmd) (step 82561.3] has consumed 83995 J83995 Joules by now) [2021-07-30T14:50:03.196] got reconfigure request [2021-07-30T14:50:03.198] all threads complete [2021-07-30T14:50:03.414] Considering each NUMA node as a socket [2021-07-30T14:50:03.617] Node reconfigured socket/core boundaries SocketsPerBoard=2:8(hw) CoresPerSocket=64:16(hw) [2021-07-30T14:50:03.622] Message aggregation disabled [2021-07-30T14:50:03.852] Warning: Note very large processing time from slurm_topo_build_config: usec=226658 began=14:50:03.625 [2021-07-30T14:50:04.805] Can not stat gres.conf file (/etc/slurm/gres.conf), using slurm.conf data [2021-07-30T14:50:04.853] ipmi-thread: ended [2021-07-30T14:50:05.018] AcctGatherEnergy IPMI plugin thread launched [2021-07-30T14:50:05.019] AcctGatherEnergy IPMI plugin loaded [2021-07-30T14:50:05.021] ipmi-thread: launched [2021-07-30T14:50:05.572] AcctGatherEnergy IPMI plugin thread init [2021-07-30T14:50:05.651] ipmi-thread: sensor 13 current_watts: 200, consumed 0 Joules 0 new, ave watts 200 [2021-07-30T14:50:20.064] ipmi-thread: sensor 13 current_watts: 200, consumed 3000 Joules 3000 new, ave watts 200 [2021-07-30T14:50:20.075] [82561.3] _get_joules_task: consumed 18446744073709523266 Joules (received 18446744073709439271(200 watts) from slurmd) 2021-07-30T14:52:35.657] ipmi-thread: sensor 13 current_watts: 200, consumed 30000 Joules 3200 new, ave watts 200 [2021-07-30T14:52:50.063] ipmi-thread: sensor 13 current_watts: 200, consumed 33000 Joules 3000 new, ave watts 200 Energy of the step 82561.3] is now reset to 1650 Joules ! [2021-07-30T14:52:50.080] [82561.3] _get_joules_task: consumed 1650 Joules (received 6000(200 watts) from slurmd) [2021-07-30T14:53:05.766] ipmi-thread: sensor 13 current_watts: 200, consumed 36000 Joules 3000 new, ave watts 200 [2021-07-30T14:53:20.072] ipmi-thread: sensor 13 current_watts: 200, consumed 39000 Joules 3000 new, ave watts 200 So the case where the energy of the node is becoming smaller than the energy of a running job is missing. I have tried the following patch but not sure it's clean but it seems to solve the problem: 2021-07-29T11:10:03.432] [80756.0] _get_joules_task: consumed 113305 Joules (received 0(570 watts) from slurmd) [2021-07-29T11:10:03.463] got reconfigure request [2021-07-29T11:10:03.465] waiting on 1 active threads [2021-07-29T11:10:48.253] [80756.0] _get_joules_task: consumed 113305 Joules ( [2021-07-29T11:10:49.034] [80756.0] _get_joules_task: consumed 113875 Joules (received 570(570 watts) from slurmd) [2021-07-29T11:10:50.063] [80756.0] _get_joules_task: consumed 114445 Joules (received 570(570 watts) from slurmd) --- slurm-19.05.5.orig/src/plugins/acct_gather_energy/ipmi/acct_gather_energy_ipmi.c 2019-12-20 21:44:51.000000000 +0100 +++ slurm-19.05.5.work/src/plugins/acct_gather_energy/ipmi/acct_gather_energy_ipmi.c 2021-07-28 15:00:14.374241513 +0200 @@ -84,7 +84,8 @@ #define _DEBUG_ENERGY 1 #define IPMI_VERSION 2 /* Data structure version number */ #define MAX_LOG_ERRORS 5 /* Max sensor reading errors log messages */ - +#define FUZZY_ENERGY_MAX 1000000000 +#define fuzzy_max_energy(v1) ((v1) > FUZZY_ENERGY_MAX) /* * These variables are required by the generic plugin interface. If they * are not found in the plugin, the plugin loader will ignore it. @@ -822,6 +823,7 @@ { time_t now = time(NULL); static bool first = true; + static bool impi_thread_restart = false; uint64_t adjustment = 0; uint16_t i; - if (!first) { - new->consumed_energy -= start_current_energies[i]; - new->base_consumed_energy = adjustment + + if ( fuzzy_max_energy(adjustment) ) { + info("_get_joules_task: invalid adjustment %"PRIu64" reset to 0 ",adjustment); + adjustment = 0; + } + if (!first) { + if (new->consumed_energy >= old->consumed_energy ) { + new->consumed_energy -= start_current_energies[i]; + new->base_consumed_energy = adjustment + (new->consumed_energy - old->consumed_energy); + }else{ + if (impi_thread_restart == false) { + start_current_energies[i] =0; + info("_get_joules_task: start_current_energies reset "); + } + new->base_consumed_energy = adjustment + new->consumed_energy - start_current_energies[i] ; + start_current_energies[i] = new->consumed_energy + adjustment ; + impi_thread_restart = true; + + } } else { /* This is just for the step, so take all the pervious consumption out of the mix. @@ -873,10 +900,31 @@ new->base_consumed_energy = 0; } + +/* Restart slurmd or reconfigure might give new consumed energy to NO_VAL64 reset new base consumed energy to 0 */ + if ( fuzzy_max_energy(new->base_consumed_energy) ) { + if (debug_flags & DEBUG_FLAG_ENERGY) + info("_get_joules_task: invalid new->base_consumed_energy %"PRIu64" reset to 0",new->base_consumed_energy); + info("_get_joules_task: invalid adjustment %"PRIu64" ",adjustment); + info("_get_joules_task: invalid start_current_energies[i] %"PRIu64" ",start_current_energies[i]); + info("_get_joules_task: invalid new->consumed_energy %"PRIu64" ",new->consumed_energy ); + info("_get_joules_task: invalid old->consumed_energy %"PRIu64" ",old->consumed_energy ); + new->base_consumed_energy = 0; + } + } else { /* This is just for the step, so take all the pervious consumption out of the mix. @@ -873,10 +900,31 @@ new->base_consumed_energy = 0; }
(In reply to Regine Gaudin from comment #0) > Hi > > We are making frequent reconfigure as our clusters have hundreds of nodes > leaving and rejoigning production. > > When a reconfigure is requested, ipmi-thread on each node is restarted. > consumed value of the node is then set to NOVAL64, 0 or 0<value < > job_already_energy > > So a job already running with an old->consumed_energy value positive on that > node > will have an incorrect new energy... something like > new->base_consumed_energy = 0 + NOVAL64 - old->consumed_energy > > as calculation is in > src/plugins/acct_gather_energy/ipmi/acct_gather_energy.c > > new->base_consumed_energy = adjustment + > new->base_consumed_energy = adjustment + > (new->consumed_energy - old->consumed_energy); > > > See slurmd debug energy log: > > 2021-07-30T14:49:35.830] ipmi-thread: sensor 13 current_watts: 200, consumed > 112345 Joules 3000 new, ave watts 192 > [2021-07-30T14:49:50.137] ipmi-thread: sensor 13 current_watts: 200, > consumed 115345 Joules 3000 new, ave watts 192 > [2021-07-30T14:49:50.176] [82561.3] _get_joules_task: consumed 83995 Joules > (received 6000(200 watts) from slurmd) > > (step 82561.3] has consumed 83995 J83995 Joules by now) > > [2021-07-30T14:50:03.196] got reconfigure request > [2021-07-30T14:50:03.198] all threads complete > [2021-07-30T14:50:03.414] Considering each NUMA node as a socket > [2021-07-30T14:50:03.617] Node reconfigured socket/core boundaries > SocketsPerBoard=2:8(hw) CoresPerSocket=64:16(hw) > [2021-07-30T14:50:03.622] Message aggregation disabled > [2021-07-30T14:50:03.852] Warning: Note very large processing time from > slurm_topo_build_config: usec=226658 began=14:50:03.625 > [2021-07-30T14:50:04.805] Can not stat gres.conf file > (/etc/slurm/gres.conf), using slurm.conf data > [2021-07-30T14:50:04.853] ipmi-thread: ended > [2021-07-30T14:50:05.018] AcctGatherEnergy IPMI plugin thread launched > [2021-07-30T14:50:05.019] AcctGatherEnergy IPMI plugin loaded > [2021-07-30T14:50:05.021] ipmi-thread: launched > [2021-07-30T14:50:05.572] AcctGatherEnergy IPMI plugin thread init > > > [2021-07-30T14:50:05.651] ipmi-thread: sensor 13 current_watts: 200, > consumed 0 Joules 0 new, ave watts 200 > [2021-07-30T14:50:20.064] ipmi-thread: sensor 13 current_watts: 200, > consumed 3000 Joules 3000 new, ave watts 200 > [2021-07-30T14:50:20.075] [82561.3] _get_joules_task: consumed > 18446744073709523266 Joules (received 18446744073709439271(200 watts) from > slurmd) > 2021-07-30T14:52:35.657] ipmi-thread: sensor 13 current_watts: 200, consumed > 30000 Joules 3200 new, ave watts 200 > [2021-07-30T14:52:50.063] ipmi-thread: sensor 13 current_watts: 200, > consumed 33000 Joules 3000 new, ave watts 200 > > Energy of the step 82561.3] is now reset to 1650 Joules ! > > > [2021-07-30T14:52:50.080] [82561.3] _get_joules_task: consumed 1650 Joules > (received 6000(200 watts) from slurmd) > [2021-07-30T14:53:05.766] ipmi-thread: sensor 13 current_watts: 200, > consumed 36000 Joules 3000 new, ave watts 200 > [2021-07-30T14:53:20.072] ipmi-thread: sensor 13 current_watts: 200, > consumed 39000 Joules 3000 new, ave watts 200 > > > So the case where the energy of the node is becoming smaller than the energy > of a running job is missing. I have tried the following patch but not sure > it's clean but it seems to solve the problem: > > 2021-07-29T11:10:03.432] [80756.0] _get_joules_task: consumed 113305 Joules > (received 0(570 watts) from slurmd) > > [2021-07-29T11:10:03.463] got reconfigure request > [2021-07-29T11:10:03.465] waiting on 1 active threads > > [2021-07-29T11:10:48.253] [80756.0] _get_joules_task: consumed 113305 Joules > ( > [2021-07-29T11:10:49.034] [80756.0] _get_joules_task: consumed 113875 Joules > (received 570(570 watts) from slurmd) > [2021-07-29T11:10:50.063] [80756.0] _get_joules_task: consumed 114445 Joules > (received 570(570 watts) from slurmd) > > > --- > slurm-19.05.5.orig/src/plugins/acct_gather_energy/ipmi/ > acct_gather_energy_ipmi.c 2019-12-20 21:44:51.000000000 +0100 > +++ > slurm-19.05.5.work/src/plugins/acct_gather_energy/ipmi/ > acct_gather_energy_ipmi.c 2021-07-28 15:00:14.374241513 +0200 > @@ -84,7 +84,8 @@ > #define _DEBUG_ENERGY 1 > #define IPMI_VERSION 2 /* Data structure version number */ > #define MAX_LOG_ERRORS 5 /* Max sensor reading errors log messages */ > - > +#define FUZZY_ENERGY_MAX 1000000000 > +#define fuzzy_max_energy(v1) ((v1) > FUZZY_ENERGY_MAX) > /* > * These variables are required by the generic plugin interface. If they > * are not found in the plugin, the plugin loader will ignore it. > @@ -822,6 +823,7 @@ > { > time_t now = time(NULL); > static bool first = true; > + static bool impi_thread_restart = false; > uint64_t adjustment = 0; > uint16_t i; > - if (!first) { > - new->consumed_energy -= start_current_energies[i]; > - new->base_consumed_energy = adjustment + > > + if ( fuzzy_max_energy(adjustment) ) { > + info("_get_joules_task: invalid adjustment %"PRIu64" > reset to 0 ",adjustment); > + adjustment = 0; > + } > + if (!first) { > + if (new->consumed_energy >= old->consumed_energy ) { > + new->consumed_energy -= start_current_energies[i]; > + new->base_consumed_energy = adjustment + > (new->consumed_energy - > old->consumed_energy); > + }else{ > + if (impi_thread_restart == false) { > + start_current_energies[i] =0; > + info("_get_joules_task: > start_current_energies reset "); > + } > + new->base_consumed_energy = adjustment + > new->consumed_energy - start_current_energies[i] ; > + start_current_energies[i] = new->consumed_energy + > adjustment ; > + impi_thread_restart = true; > + > + } > > } else { > /* This is just for the step, so take all the > pervious > consumption out of the mix. > @@ -873,10 +900,31 @@ > new->base_consumed_energy = 0; > } > > > + > +/* Restart slurmd or reconfigure might give new consumed energy to > NO_VAL64 reset new base consumed energy to 0 */ > + if ( fuzzy_max_energy(new->base_consumed_energy) ) { > + if (debug_flags & DEBUG_FLAG_ENERGY) > + info("_get_joules_task: invalid > new->base_consumed_energy %"PRIu64" reset to 0",new->base_consumed_energy); > + info("_get_joules_task: invalid adjustment %"PRIu64" > ",adjustment); > + info("_get_joules_task: invalid > start_current_energies[i] %"PRIu64" ",start_current_energies[i]); > + info("_get_joules_task: invalid new->consumed_energy > %"PRIu64" ",new->consumed_energy ); > + info("_get_joules_task: invalid old->consumed_energy > %"PRIu64" ",old->consumed_energy ); > + new->base_consumed_energy = 0; > + } > + > } else { > /* This is just for the step, so take all the > pervious > consumption out of the mix. > @@ -873,10 +900,31 @@ > new->base_consumed_energy = 0; > } Hi Regine. First of all I apologize for not having responded earlier to this issue. This situation is not new to me and I am currently investigating what's the best way to tackle this. The problem is that the energy value is taken from slurmd, so I think we must fix it in the slurmd side instead of trying a workaround like the one you did. Also this FUZZY variable is not accurate, and the value from slurmd is not always exactly to NO_VAL64. I will keep you informed of the progression.
Hello Regine, 1st of all sorry that we haven't replied for a while. We are working on this issue, but it seems quite more complex to come up with an orthodox solution, specially when we are looking after fixing this reconfig issue but also see if we can support a daemon reload. Regards, Carlos.
Hi Regine, Well, that's interesting. I've successfully set an out-of-band communication between an emulated IPMI (using ipmi_sim), and acct_gather_energy/ipmi (slurmd). It was properly getting power metrics for awhile, then issued and scontrol reconfig. In this scenario, I've got a segfault. Which is what I was expecting from you as well. Why I was expecting you to segfault? Well, at least in the git 21.08 branch (HEAD), this is what it's happening to me. And it makes sense, because variables like: sensors start_current_energies descriptions are freed in the plugin's fini function. And this function is called after issuing a reconfig. Then, it gets inited again. But I see none of them properly reinitied after a reconfig. They remain pointing to a freed memory. In fact, in my case, I see them NULLed in GDB. I think the discrepancy here about segfaulting or not is because I have all the debug stuff enabled, plus no optimizations and all the developer things enabled. While, I guess you don't, and slurmd is trying to access to the old pointer data. In any case, your values are wrong because we're not properly handling reconfig. And that's what it matters in the end, whether segfaulting or not. So now I'm going to look after a fix for this. Thanks, Carlos.
Hi Regine, We have a potential fix for this bug, and we are now testing it. It looks promising and, if it passes our 1st internal testing round, I want to bring you the opportunity to apply this fix in your cluster. It would be an early access patch that could still suffer from modifications but it would be very valuable to have the patch tested in your HW. If you agree, I'll send the patch as soon as we consider it good enough to share. Regards, Carlos.
Hi, Yes we would be happy to have the patch
FYI: Atos has another customer that may also be experiencing this problem (Meteo France). They are running Slurm 20.02. We would like to provide them with your fix when it is available. We can rework it for 20.02 if necessary.
"I'll send the patch as soon as we consider it good enough to share." Ok for meteo when do you think you could provide us the patch ? Thanks
This bug should have now higher priority as many users are writing papers on energy consuming based on false datas... So they are complaining about it.
Se ware slowly upgarding machien one by one to 20.11.8
(In reply to Regine Gaudin from comment #42) We are slowly upgrading machine one by one to 20.11.8
Hi, We don't have an ETA for this, even though we have a proposal that is going to be reviewed soon. In any case, I've upgraded the sev to 3 to reflect your needs. I'll let you know once I have more information. Best regards, Carlos.
Hi Regine, Commit: https://github.com/SchedMD/slurm/commit/d52c2e4f76c765872b114b12f3462e6885f27659 merges in 21.08 the commits added in master to fix this issue. You'll see the fixes address both scontrol reconfig and slurmd process restart, which was another issue found while fixing reconfig. Cheers, Carlos.