Ticket 20207

Summary: Bogus ConsumedEnergyRaw values from sacct
Product: Slurm Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Accounting    Assignee: Miquel Comas <miquel>
Status: OPEN --- QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: felip.moll, sts
Version: 23.11.8   
Hardware: Linux   
OS: Linux   
See Also: https://support.schedmd.com/show_bug.cgi?id=17040
Site: DTU Physics
Attachments: slurmd.log
syslog
slurm.conf
acct_gather.conf

Description Ole.H.Nielsen@fysik.dtu.dk 2024-06-20 02:40:23 MDT
We have enabled AcctGatherEnergyType=acct_gather_energy/ipmi in slurm.conf and I started to test some accounting with sacct printing the ConsumedEnergyRaw values.  The ConsumedEnergyRaw mostly looks sensible, but we have found some outliers:

An example is shown here:

/usr/bin/sacct --user <omitted> --nodelist=sd652 -np -X -S 061724 -E 061924 -o JobID,Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,CPUTimeRAW,State,Nodelist,ConsumedEnergyRaw
7398674|a100|1|8|17-Jun-2024_11:06|17-Jun-2024_11:06|17-Jun-2024_11:06|17-Jun-2024_11:13|3296|COMPLETED|sd652|337645|
7399015|a100|1|8|17-Jun-2024_11:07|17-Jun-2024_11:07|17-Jun-2024_11:07|17-Jun-2024_11:34|12656|COMPLETED|sd652|1277800|
7402270|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:41|4624|COMPLETED|sd652|18446744073704391422|
7402271|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:50|13392|COMPLETED|sd652|18446744073707747296|
7417128|a100|1|32|18-Jun-2024_16:19|18-Jun-2024_16:19|18-Jun-2024_16:19|18-Jun-2024_16:21|2464|COMPLETED|sd652|62099|
7417173|a100|1|32|18-Jun-2024_16:35|18-Jun-2024_16:35|18-Jun-2024_16:35|18-Jun-2024_16:37|3104|COMPLETED|sd652|69160|

The ConsumedEnergyRaw values in lines 3 and 4 are obviously bogus.  Interestingly, we upgraded slurmd from 23.11.7 to 23.11.8 at about that time:

# rpm -qi slurm-slurmd | grep Install
Install Date: Mon 17 Jun 2024 11:37:04 AM CEST

We also note some errors in slurmd.log (attached) prior to the above job:

[2024-06-17T08:51:57.122] error: _rpc_acct_gather_energy: Too many pending requests
[2024-06-17T08:51:57.122] [7393907.batch] error: _get_joules_task: can't get info from slurmd

We haven't seen these errors after slurmd was upgraded.

The syslog (attached) reports some problems at the time slurmd was upgraded:

Jun 17 11:37:04 sd652 systemd[1]: slurmd.service: Found left-over process 12050 (slurmstepd) in control group while starting unit. Ignoring.
Jun 17 11:37:04 sd652 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

Presumably the cause of the bogus ConsumedEnergyRaw values is the "_rpc_acct_gather_energy: Too many pending requests" error.

Nevertheless, sacct should sanity-check the ConsumedEnergyRaw values in the database.  It would be better to print 0 instead of the above astronomical values.
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2024-06-20 02:40:46 MDT
Created attachment 37305 [details]
slurmd.log
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2024-06-20 02:41:01 MDT
Created attachment 37306 [details]
syslog
Comment 3 Felip Moll 2024-06-21 03:04:43 MDT
*** Ticket 20208 has been marked as a duplicate of this ticket. ***
Comment 4 Felip Moll 2024-06-21 03:28:05 MDT
Ole,
This was your comment on bug 20208:

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #0)
> This issue may be related to bug 20207.
> 
> We have enabled AcctGatherEnergyType=acct_gather_energy/ipmi in slurm.conf
> and I started to test some accounting with sacct printing the
> ConsumedEnergyRaw values.  The ConsumedEnergyRaw mostly looks sensible, but
> we have found some empty values like in this example:
> 
> /usr/bin/sacct --user <omitted> --partition a100 -np -X -S 061424 -E 061924
> -o
> JobID,Group,Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,
> CPUTimeRAW,State,Nodelist,ConsumedEnergyRaw  -s to
> 7393882|catvip|a100|1|32|14-Jun-2024_15:02|14-Jun-2024_15:02|14-Jun-2024_15:
> 02|14-Jun-2024_15:14|23104|TIMEOUT|sd651||
> 7393903|catvip|a100|1|32|14-Jun-2024_15:16|14-Jun-2024_15:16|14-Jun-2024_15:
> 16|14-Jun-2024_19:16|461088|TIMEOUT|sd651||
> 7403327|catvip|a100|1|32|17-Jun-2024_12:47|17-Jun-2024_12:47|17-Jun-2024_13:
> 03|18-Jun-2024_13:04|2765216|TIMEOUT|sd652||
> 
> This was found both with 23.11.7 and after upgrading top 23.11.8 at this
> time:
> 
> # rpm -qi slurm-slurmd | grep Install
> Install Date: Mon 17 Jun 2024 11:37:04 AM CEST
> 
> Other jobs on these nodes report non-empty values for ConsumedEnergyRaw.  I
> can't see any reason offhand for this behavior.

For job 7393882 the start/end times show that the elapsed time for this job was 2 seconds, which means that the plugin might not have gathered a metric yet, or that the consumed energy was effectively 0.

For the other two jobs, that's more concerning. Can you print the TresUsageInTot and TresUsageOutTot of these jobs? Can you also check the steps of both jobs?
Just for the record, can you upload/paste your slurm.conf / acct_gather_energy.conf files? Or if you don't have acct_gather_energy.conf, an "scontrol show config".

About the rest of your points:

> 7402270|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-
> Jun-2024_11:41|4624|COMPLETED|sd652|18446744073704391422|
> 7402271|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-
> Jun-2024_11:50|13392|COMPLETED|sd652|18446744073707747296|
> 
> The ConsumedEnergyRaw in lines 3 and 4 are obviously bogus. 

These values represent NO_VAL. This can happen if the job has not been able to get any metric (for example because of the error you show later), because these times fall outside an acct_gather window and the elapsed time is very short, because slurmd could not get a metric (or got a wrong one) from the IPMI, or because slurmd was restarted and responded with a wrong metric to the stepd. We should investigate it a little more, but for such short jobs we first need to see whether the EnergyIPMIFrequency in acct_gather.conf and the JobAcctGatherFrequency in slurm.conf are appropriate.
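
As a quick sanity check on those two numbers (plain unsigned 64-bit arithmetic, nothing sacct prints itself): both values sit just below 2^64, which is what a small negative result looks like when it ends up in an unsigned 64-bit counter:

2^64                 = 18446744073709551616
18446744073704391422 = 2^64 - 5160194  (roughly -5.16 MJ wrapped around)
18446744073707747296 = 2^64 - 1804320  (roughly -1.80 MJ wrapped around)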

> Interestingly,
> we upgraded slurmd from 23.11.7 to 23.11.8 at about that time:
> 
> # rpm -qi slurm-slurmd | grep Install
> Install Date: Mon 17 Jun 2024 11:37:04 AM CEST

That could explain slurmd being reset, not getting the (correct) metric from the IPMI in time, and responding with a bad or 0 value.
It is also possible that the IPMI plugin was not completely initialized yet when a stepd came and requested a metric.
Every time slurmd is reset, the energy reported is 0 and the counter resets (the counter in slurmd, not in the stepd).

> We also note some errors in slurmd.log (attached) prior to the above job:
> 
> [2024-06-17T08:51:57.122] error: _rpc_acct_gather_energy: Too many pending
> requests
> [2024-06-17T08:51:57.122] [7393907.batch] error: _get_joules_task: can't get
> info from slurmd

We are mitigating this error in ticket 19779 and ticket 19415 (private). It happens when more than 10 steps request the energy from slurmd at once. Slurmd queries the IPMI once, and if the IPMI is slow or non-functional that takes time. While the query is in progress the steps don't get any response, so multiple requests may accumulate. We cap concurrent energy requests at 10 so they don't interfere with normal slurmd operation and other RPCs can still come in and be served. When the cap is hit, the error is logged and the affected step requests are dropped, so those steps get 0 energy.

A non-functional or very slow IPMI, or a large number of concurrent stepd requests to slurmd, could cause the issues you're seeing. My suggestion is to tune the frequencies. Note that a BMC/IPMI typically cannot serve a huge number of queries in a short period of time, so it is recommended to reduce the query rate by increasing EnergyIPMIFrequency or JobAcctGatherFrequency (depending on what you have now).
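
To check what is currently configured, a quick sketch (this assumes the default /etc/slurm location for acct_gather.conf; adjust the path to your installation):

scontrol show config | grep -E 'JobAcctGatherFrequency|AcctGatherEnergyType'
grep -i EnergyIPMIFrequency /etc/slurm/acct_gather.conf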

In addition, we are fixing a TOCTOU condition in 23.11. If multiple stepds request the energy from slurmd, it can happen that all of them trigger a query to the IPMI even if the last poll was recent and within EnergyIPMIFrequency. This fix should reduce that probability to a minimum, which should make the error you saw much less frequent and avoid discarding so many requests.

commit 7020f1f629464711949017d6e9501247c2488ca3
Author:     Miquel Comas Hervada <miquel@schedmd.com>
AuthorDate: Wed Jun 19 15:26:25 2024 +0200
Commit:     Felip Moll <felip.moll@schedmd.com>
CommitDate: Thu Jun 20 12:51:08 2024 +0200

    slurmd - Add mutex to avoid redundant energy gathering
    
    Add a mutex when calling _rpc_acct_gather_energy() to remove a TOCTOU
    with the last_poll value with multiple RPCs from different steps. The
    threads were getting the last_poll and then requesting an energy reading
    without considering last_poll updates. This fix eliminates the unnecessary
    consecutive calls to acct_gather_energy_g_get_data() that trigger new
    readings to the energy plugins (e.g. this will reduce calls to IPMI/BMC).
    
    Ticket 19779.
> The syslog (attached) reports some problems at the time slurmd was upgraded:
> 
> Jun 17 11:37:04 sd652 systemd[1]: slurmd.service: Found left-over process
> 12050 (slurmstepd) in control group while starting unit. Ignoring.
> Jun 17 11:37:04 sd652 systemd[1]: This usually indicates unclean termination
> of a previous run, or service implementation deficiencies.

I don't like this. Why did systemd find a slurmstepd in the slurmd cgroup?
Can you please show us the output of these two commands on a node which actually has jobs on it?
systemctl status slurmstepd.scope
systemctl status slurmd

It could also have been a temporary condition while slurmd was being reset. If it was forking new steps while it was reset, I wouldn't find it strange that the stepds had not yet been moved to the correct cgroup. That's unlikely and I think we are doing it right, but it is something we could check. In any case this belongs in another ticket.

Please send this info and we'll see if we need another ticket for that.


> Nevertheless, sacct should make sanity checks of the ConsumedEnergyRaw
> values in the database.  It would be better to print 0 in stead of the above
> astronomical values.

That's right, printing NO_VAL is ugly.
We can check in which case we print NO_VAL.
In any case, I am interested in the TresUsageInTot/Out values.
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2024-06-22 10:32:28 MDT
Hi Felip,

Thanks a lot for your analysis:

(In reply to Felip Moll from comment #4)
> This was your comment on bug 20208:
> > We have enabled AcctGatherEnergyType=acct_gather_energy/ipmi in slurm.conf
> > and I started to test some accounting with sacct printing the
> > ConsumedEnergyRaw values.  The ConsumedEnergyRaw mostly looks sensible, but
> > we have found some empty values like in this example:
> > 
> > /usr/bin/sacct --user <omitted> --partition a100 -np -X -S 061424 -E 061924
> > -o
> > JobID,Group,Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,
> > CPUTimeRAW,State,Nodelist,ConsumedEnergyRaw  -s to
> > 7393882|catvip|a100|1|32|14-Jun-2024_15:02|14-Jun-2024_15:02|14-Jun-2024_15:
> > 02|14-Jun-2024_15:14|23104|TIMEOUT|sd651||
> > 7393903|catvip|a100|1|32|14-Jun-2024_15:16|14-Jun-2024_15:16|14-Jun-2024_15:
> > 16|14-Jun-2024_19:16|461088|TIMEOUT|sd651||
> > 7403327|catvip|a100|1|32|17-Jun-2024_12:47|17-Jun-2024_12:47|17-Jun-2024_13:
> > 03|18-Jun-2024_13:04|2765216|TIMEOUT|sd652||
...
> For job 7393882 the start/end times show that the elapsed time for this jobs
> was 2 seconds which means that the plugin might not have gathered a metric
> yet, or that the consumed energy was effectively 0.

Actually, the timestamps were printed with minute resolution only; this job actually had an elapsed time of 12 minutes (see below).

> For the other two jobs, that's more concerning. Can you print the
> TresUsageInTot and TresUsageOutTot of these jobs? Can you also check the
> steps of both jobs?

/usr/bin/sacct --user dmapa --partition a100 -np -X -S 061424 -E 061924 -o JobID,Group,Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,Elapsed,CPUTimeRAW,State,Nodelist,TresUsageInTot,TresUsageOutTot,ConsumedEnergyRaw  -s to
7393882|catvip|a100|1|32|2024-06-14T15:02:44|2024-06-14T15:02:44|2024-06-14T15:02:45|2024-06-14T15:14:47|00:12:02|23104|TIMEOUT|sd651||||
7393903|catvip|a100|1|32|2024-06-14T15:16:14|2024-06-14T15:16:14|2024-06-14T15:16:15|2024-06-14T19:16:24|04:00:09|461088|TIMEOUT|sd651||||
7403327|catvip|a100|1|32|2024-06-17T12:47:47|2024-06-17T12:47:47|2024-06-17T13:03:52|2024-06-18T13:04:05|1-00:00:13|2765216|TIMEOUT|sd652||||

> Just for the record, can you upload/paste your slurm.conf /
> acct_gather_energy.conf files? Or if you don't have acct_gather_energy.conf,
> an "scontrol show config".

Files attached.

> About the rest of your points:
> 
> > 7402270|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-
> > Jun-2024_11:41|4624|COMPLETED|sd652|18446744073704391422|
> > 7402271|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-
> > Jun-2024_11:50|13392|COMPLETED|sd652|18446744073707747296|
> > 
> > The ConsumedEnergyRaw in lines 3 and 4 are obviously bogus. 
> 
> These values represent NO_VAL. This can happen if the job have not been able
> to get any metric (e.g. for example because of the error you show later),
> because these times fall out of an acct_gather window and the elapsed time
> is very short, because slurmd could not get or get a wrong metric from the
> IPMI, or it might be because slurmd was restarted and it responded with a
> wrong metric to the stepd. We should investigate it a little bit more but
> for so short jobs we first need to see if the EnergyIPMIFrequency in
> acct_gather.conf and the JobAcctGatherFrequency in slurm.conf are correct
> for such short jobs.

The config file has EnergyIPMIFrequency=60.

> > Interestingly,
> > we upgraded slurmd from 23.11.7 to 23.11.8 at about that time:
> > 
> > # rpm -qi slurm-slurmd | grep Install
> > Install Date: Mon 17 Jun 2024 11:37:04 AM CEST
> 
> Which could explain the slurmd being reset, not getting the (correct) metric
> from ipmi in time, and responding with a bad or 0 value.
> Also it is possible that the ipmi plugin was not completely initialized yet
> and a stepd came and requested a metric.
> Everytime the slurmd is reset the energy reported is 0 and the counter
> resets (the counter of slurmd, not the stepd).

I guess from the timestamps that slurmstepd would have had ample time to request IPMI measurements.

> > We also note some errors in slurmd.log (attached) prior to the above job:
> > 
> > [2024-06-17T08:51:57.122] error: _rpc_acct_gather_energy: Too many pending
> > requests
> > [2024-06-17T08:51:57.122] [7393907.batch] error: _get_joules_task: can't get
> > info from slurmd
> 
> We are mitigating this error in ticket 19779 and ticket 19415 (private).
> This happens when more than 10 steps are requesting the energy to slurmd.
> Slurmd will query the ipmi once and if the ipmi is slow or non-functional it
> will take time. While the operation is being done, the steps don't get any
> response, and multiple requests may accumulate. We have a limit of 10
> concurrent requests for energy to not interfere with normal slurmd operation
> and let other rpcs come in and be served. If this happens then the error is
> logged and the affected steps requests are dropped, so they will get 0
> energy.
> 
> A non-functional / very slow ipmi, or a big amount of concurrent stepd
> requests to slurmd could cause the issues you're seeing. My suggestion is to
> tune the frequencies. Note that an BMC/IPMI won't tipically be able to serve
> a huge amount of queries in a short period of time, so it is recommended to
> reduce the frequency of the queries by increasing EnergyIPMIFrequency or
> JobAcctGatherFrequency (depending on what you have now).

OK, can you suggest better values than what's in our current config?

For the record, the node is a quite new and powerful Lenovo ThinkSystem SD650-N V2 4-GPU server with the latest UEFI/BIOS and the latest BMC (Lenovo XCC) Firmware Revision 4.11.

> Added to that, we are improving a TOCTOU condition in 23.11. If multiple
> stepds request the energy to slurmd it could happen that all of them trigger
> a query to the ipmi even if the last poll time was close and smaller than
> EnergyIPMIFrequency. This fix should reduce this probability to the minimum
> and would probably help not seeing the error you saw so frequently, and to
> not discard so many requests.
> 
> commit 7020f1f629464711949017d6e9501247c2488ca3
> Author:     Miquel Comas Hervada <miquel@schedmd.com>
> AuthorDate: Wed Jun 19 15:26:25 2024 +0200
> Commit:     Felip Moll <felip.moll@schedmd.com>
> CommitDate: Thu Jun 20 12:51:08 2024 +0200
> 
>     slurmd - Add mutex to avoid redundant energy gathering
>     
>     Add a mutex when calling _rpc_acct_gather_energy() to remove a TOCTOU
>     with the last_poll value with multiple RPCs from different steps. The
>     threads were getting the last_poll and then requesting an energy reading
>     without considering last_poll updates. This fix eliminates the
> unnecessary
>     consecutive calls to acct_gather_energy_g_get_data() that trigger new
>     readings to the energy plugins (e.g. this will reduce calls to IPMI/BMC).
>     
>     Ticket 19779.
> > The syslog (attached) reports some problems at the time slurmd was upgraded:
> > 
> > Jun 17 11:37:04 sd652 systemd[1]: slurmd.service: Found left-over process
> > 12050 (slurmstepd) in control group while starting unit. Ignoring.
> > Jun 17 11:37:04 sd652 systemd[1]: This usually indicates unclean termination
> > of a previous run, or service implementation deficiencies.
> 
> I don't like this, why did systemd found a slurmstepd into the slurmd cgroup?
> Can you please show us the output of these two commands in a node which
> actually have jobs in it?
> systemctl status slurmstepd.scope
> systemctl status slurmd

Here's the output from the node sd652 at present:

[root@sd652 ~]# systemctl status slurmstepd.scope
Unit slurmstepd.scope could not be found.

[root@sd652 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Mon 2024-06-17 11:37:04 CEST; 5 days ago
 Main PID: 289498 (slurmd)
    Tasks: 61
   Memory: 35.3M
   CGroup: /system.slice/slurmd.service
           ├─289498 /usr/sbin/slurmd --systemd
           ├─505490 slurmstepd: [7426349.extern]
           ├─505495 sleep 100000000
           ├─505498 slurmstepd: [7426349.batch]
           ├─505503 /bin/bash -le /var/spool/slurmd/job7426349/slurm_script
           ├─505591 python md_input.py
           ├─542637 slurmstepd: [7428453.extern]
           ├─542640 slurmstepd: [7428451.extern]
           ├─542641 slurmstepd: [7428452.extern]
           ├─542654 sleep 100000000
           ├─542655 sleep 100000000
           ├─542656 sleep 100000000
           ├─542661 slurmstepd: [7428453.batch]
           ├─542663 slurmstepd: [7428452.batch]
           ├─542665 slurmstepd: [7428451.batch]
           ├─542678 /bin/bash /var/spool/slurmd/job7428452/slurm_script
           ├─542680 /bin/bash /var/spool/slurmd/job7428451/slurm_script
           ├─542681 /bin/bash /var/spool/slurmd/job7428453/slurm_script
           ├─542746 python /home/energy/ejfr/mace-venv-3113/lib/python3.11/site-packages/mace/cli/run_train.py --name=MACE_model_3 --train_file=/home/energy/ejfr/CuAu_co2_reduction/mace_test/small_and_few_large/train_sl_weight_keys.xyz --valid_file=/home/energy/ejfr/CuAu_co2_reduction/mace_test/small_and_few_large/>
           ├─542748 python /home/energy/ejfr/mace-venv-3113/lib/python3.11/site-packages/mace/cli/run_train.py --name=MACE_model_4 --train_file=/home/energy/ejfr/CuAu_co2_reduction/mace_test/small_and_few_large/train_sl_weight_keys.xyz --valid_file=/home/energy/ejfr/CuAu_co2_reduction/mace_test/small_and_few_large/>
           └─542761 python /home/energy/ejfr/mace-venv-3113/lib/python3.11/site-packages/mace/cli/run_train.py --name=MACE_model_2 --train_file=/home/energy/ejfr/CuAu_co2_reduction/mace_test/small_and_few_large/train_sl_weight_keys.xyz --valid_file=/home/energy/ejfr/CuAu_co2_reduction/mace_test/small_and_few_large/>

Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: task/affinity: batch_bind: job 7428453 CPU final HW mask for node: 0xFF000000FF000000FF000000FF000000
Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 7428451
Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: task/affinity: batch_bind: job 7428451 CPU input mask for node: 0x00000000FFFF000000000000FFFF0000
Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: task/affinity: batch_bind: job 7428451 CPU final HW mask for node: 0x0000FF000000FF000000FF000000FF00
Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 7428452
Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: task/affinity: batch_bind: job 7428452 CPU input mask for node: 0x0000FFFF000000000000FFFF00000000
Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: task/affinity: batch_bind: job 7428452 CPU final HW mask for node: 0x00FF000000FF000000FF000000FF0000
Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: Launching batch job 7428453 for UID 306865
Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: Launching batch job 7428452 for UID 306865
Jun 22 12:16:07 sd652.nifl.fysik.dtu.dk slurmd[289498]: slurmd: Launching batch job 7428451 for UID 306865

> It could also been a temporary condition while the slurmd was reset. If it
> was forking new steps while it was reset I wouldn't find strange that the
> stepds could not be moved yet to the correct cgroup. That's unlikely and I
> think that we are doing it right, but it is something we could check. In any
> case this corresponds to another ticket.
> 
> Please send this info and we'll see if we need another ticket for that.
> 
> 
> > Nevertheless, sacct should make sanity checks of the ConsumedEnergyRaw
> > values in the database.  It would be better to print 0 in stead of the above
> > astronomical values.
> 
> That's right, printing NO_VAL is ugly.
> We can check in which case we print NO_VAL.
> In any case, I am interested in the TresUsageInTot/Out values.

I hope I've given all the info, otherwise please ask me for any missing items.

Best regards,
Ole
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2024-06-22 10:32:52 MDT
Created attachment 37355 [details]
slurm.conf
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2024-06-22 10:33:13 MDT
Created attachment 37356 [details]
acct_gather.conf
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2024-06-23 02:26:57 MDT
Note added: We have installed the latest version of FreeIPMI from RockyLinux 8.10 on all nodes:

# rpm -q freeipmi
freeipmi-1.6.14-2.el8.x86_64
Comment 9 Ole.H.Nielsen@fysik.dtu.dk 2024-06-24 02:25:51 MDT
Hi Felip,

Exploring the ConsumedEnergy numbers printed by sacct further, I'm seeing other numbers that don't make sense.  For example, job 7352013 ran for 42 hours on 1 node (40-core Intel Xeon Cascade Lake):

/usr/bin/sacct --user <omitted> --node=a076 -np -X -S 060124 -E 060524 -o JobID,Group,Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,ElapsedRaw,CPUTimeRAW,State,Nodelist,TresUsageInTot,TresUsageOutTot,ConsumedEnergyRaw
7352012|catvip|xeon40el8|1|40|2024-06-02T12:54:15|2024-06-02T12:54:15|2024-06-02T14:07:17|2024-06-02T14:07:39|22|880|FAILED|a076|||4576|
7352013|catvip|xeon40el8|1|40|2024-06-02T12:54:47|2024-06-02T12:54:47|2024-06-02T14:07:40|2024-06-04T08:13:01|151521|6060840|COMPLETED|a076|||218813190241|

The ConsumedEnergyRaw is 218.8 gigajoules; dividing by 3,600,000 gives 60781 kWh (conversion at https://www.rapidtables.com/convert/energy/Joule_to_kWh.html). Over the 42-hour run, that corresponds to an average power of about 1447 kW.
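
Equivalently, a quick cross-check using the ElapsedRaw from the same sacct line: 218813190241 J / 151521 s ≈ 1,440,000 W, i.e. an implied average draw of about 1.44 MW.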

In fact, this type of node typically consumes 600-700 watts; at present it reports:

[root@a076 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 616 Watts
Minimum Power over sampling duration : 2 watts
Maximum Power over sampling duration : 1076 watts
Average Power over sampling duration : 569 watts
Time Stamp                           : 06/24/2024 - 08:18:43
Statistics reporting time period     : 1000 milliseconds
Power Measurement                    : Active

which is reported correctly by scontrol:

# scontrol show node a076 | grep CurrentWatts
   CurrentWatts=648 AveWatts=427

Therefore it seems that the ConsumedEnergyRaw printed by sacct is approximately 2000 times the correct number!  There must be an incorrect energy calculation somewhere in sacct or in the Slurm database.

Can you suggest a way to get correct ConsumedEnergyRaw numbers from sacct?

Thanks,
Ole
Comment 11 Felip Moll 2024-06-25 04:36:42 MDT
I see several things going on here:

1. This issue, which will be improved in 24.11 but might still have an impact. I will see if we can improve the number of concurrent RPCs, which can otherwise prevent certain steps from getting the energy accurately:

[2024-06-17T08:51:57.122] error: _rpc_acct_gather_energy: Too many pending requests
[2024-06-17T08:51:57.122] [7393907.batch] error: _get_joules_task: can't get info from slurmd

2. The -X in your request only returns info for allocations; can you try removing -X?

That's the reason why I am not seeing TresUsageInTot/TresUsageOutTot information, which is only available for steps.

Submit two batch jobs, one with a step in it:
[lipi@llit build]$ sbatch -w gamba1  --mem=10M -c 1 --wrap "sleep 10"
Submitted batch job 205
[lipi@llit build]$ sbatch -w gamba1  --mem=10M -c 1 --wrap "srun sleep 10"
Submitted batch job 206

With -X:
[lipi@llit build]$ sacct --user lipi -np -X -S 062524 -E 062624 -o JobID,Start,End,Elapsed,CPUTimeRAW,State,TresUsageInTot,TresUsageOutTot,ConsumedEnergyRaw   -j 205,206
205|2024-06-25T12:10:55|2024-06-25T12:11:06|00:00:11|11|COMPLETED||||
206|2024-06-25T12:11:00|2024-06-25T12:11:11|00:00:11|11|COMPLETED|||924|

Without -X:
[lipi@llit build]$ sacct --user lipi -np  -S 062524 -E 062624 -o JobID,Start,End,Elapsed,CPUTimeRAW,State,TresUsageInTot,TresUsageOutTot,ConsumedEnergyRaw   -j 205,206
205|2024-06-25T12:10:55|2024-06-25T12:11:06|00:00:11|11|COMPLETED||||
205.batch|2024-06-25T12:10:55|2024-06-25T12:11:06|00:00:11|11|COMPLETED|cpu=00:00:00,energy=980,fs/disk=9209,mem=632K,pages=0,vmem=0|energy=82,fs/disk=1913|980|
206|2024-06-25T12:11:00|2024-06-25T12:11:11|00:00:11|11|COMPLETED|||924|
206.batch|2024-06-25T12:11:00|2024-06-25T12:11:11|00:00:11|11|COMPLETED|cpu=00:00:00,energy=1691,fs/disk=38003,mem=4372K,pages=0,vmem=0|energy=82,fs/disk=12433|1691|
206.0|2024-06-25T12:11:01|2024-06-25T12:11:11|00:00:10|10|COMPLETED|cpu=00:00:00,energy=924,fs/disk=4093,mem=188K,pages=0,vmem=0|energy=82,fs/disk=1738|924|

3. So for sbatch jobs, if there is no step in the job and only user commands, the batch step energy is not accumulated into the job and not shown in sacct. I am checking whether this is intentional, but I think I remember there is a reason for it.

4. Still looking into these values that are close to NO_VAL64:

7402270|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:41|4624|COMPLETED|sd652|18446744073704391422|
7402271|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:50|13392|COMPLETED|sd652|18446744073707747296|

I cannot find where this amount comes from; can you please print these without -X and with TresUsageInTot?
Comment 12 Ole.H.Nielsen@fysik.dtu.dk 2024-06-25 05:00:19 MDT
Hi Felip,

(In reply to Felip Moll from comment #11)
> I see several things going on here:
> 
> 1. This issue, that in 24.11 will be improved, but it might still have an
> impact. I will see if we canimprove the number of concurrent RPCS, which can
> make certain steps to not get the energy in an accurate way:
> 
> [2024-06-17T08:51:57.122] error: _rpc_acct_gather_energy: Too many pending
> requests
> [2024-06-17T08:51:57.122] [7393907.batch] error: _get_joules_task: can't get
> info from slurmd
> 
> 2. The -X in your request, is only getting info for allocations, can you try
> removing -X?

I see, here it is without -X:

/usr/bin/sacct --user <omitted> --node=a076 -np -S 060124 -E 060524 -o JobID,Group,Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,ElapsedRaw,CPUTimeRAW,State,Nodelist,TresUsageInTot,TresUsageOutTot,ConsumedEnergyRaw
7352012|catvip|xeon40el8|1|40|2024-06-02T12:54:15|2024-06-02T12:54:15|2024-06-02T14:07:17|2024-06-02T14:07:39|22|880|FAILED|a076|||4576|
7352012.batch|||1|40|2024-06-02T14:07:17|2024-06-02T14:07:17|2024-06-02T14:07:17|2024-06-02T14:07:39|22|880|FAILED|a076|cpu=00:03:32,energy=4576,fs/disk=0,mem=0,pages=65,vmem=186516K|energy=569,fs/disk=1|4576|
7352012.extern|||1|40|2024-06-02T14:07:17|2024-06-02T14:07:17|2024-06-02T14:07:17|2024-06-02T14:07:39|22|880|COMPLETED|a076|cpu=00:00:00,energy=4576,fs/disk=5329,mem=0,pages=0,vmem=0|energy=569,fs/disk=1|4576|
7352013|catvip|xeon40el8|1|40|2024-06-02T12:54:47|2024-06-02T12:54:47|2024-06-02T14:07:40|2024-06-04T08:13:01|151521|6060840|COMPLETED|a076|||218813190241|
7352013.batch|||1|40|2024-06-02T14:07:40|2024-06-02T14:07:40|2024-06-02T14:07:40|2024-06-04T08:13:01|151521|6060840|COMPLETED|a076|cpu=00:00:00,energy=218813190241,fs/disk=1836797,mem=7820K,pages=0,vmem=7824K|energy=537,fs/disk=232872|218813190241|
7352013.extern|||1|40|2024-06-02T14:07:40|2024-06-02T14:07:40|2024-06-02T14:07:40|2024-06-04T08:13:01|151521|6060840|COMPLETED|a076|cpu=00:00:00,energy=218813190241,fs/disk=5329,mem=0,pages=0,vmem=0|energy=537,fs/disk=1|218813190241|
7352013.0|||1|40|2024-06-02T14:07:40|2024-06-02T14:07:40|2024-06-02T14:07:40|2024-06-04T08:13:01|151521|6060840|COMPLETED|a076|cpu=69-17:12:28,energy=218813190241,fs/disk=499109049,mem=57961944K,pages=0,vmem=58599044K|energy=537,fs/disk=59378185|218813190241|

The crazy ConsumedEnergyRaw numbers from comment 9 are still printed; they are 2000 times too large.

> 3. So for sbatch jobs, if there's not any step in the job and only user
> commands, the batch job energy is not accumulated to the job and not shown
> in sacct. I am looking if this is intentional, but I suspect/think I
> remember there is a reason for that.
> 
> 4. Still looking into these values that are close to NO_VAL64:
> 
> 7402270|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-
> Jun-2024_11:41|4624|COMPLETED|sd652|18446744073704391422|
> 7402271|a100|1|16|17-Jun-2024_11:36|17-Jun-2024_11:36|17-Jun-2024_11:36|17-
> Jun-2024_11:50|13392|COMPLETED|sd652|18446744073707747296|
> 
> I am not finding why this amount, can you please print these without -X and
> with TresUsageInTot?

Here is the output now:

/usr/bin/sacct --user <omitted> --nodelist=sd652 -np -S 061724 -E 061924 -o JobID,Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,CPUTimeRAW,State,Nodelist,TresUsageInTot,ConsumedEnergyRaw
7403327|a100|1|32|2024-06-17T12:47:47|2024-06-17T12:47:47|2024-06-17T13:03:52|2024-06-18T13:04:05|2765216|TIMEOUT|sd652|||
7403327.batch||1|32|2024-06-17T13:03:52|2024-06-17T13:03:52|2024-06-17T13:03:52|2024-06-18T13:04:06|2765248|CANCELLED|sd652|cpu=23:47:45,energy=82123795,fs/disk=480003303,mem=5841048K,pages=2492,vmem=8437776K|82123795|
7403327.extern||1|32|2024-06-17T13:03:52|2024-06-17T13:03:52|2024-06-17T13:03:52|2024-06-18T13:04:05|2765216|COMPLETED|sd652|cpu=00:00:00,energy=82123795,fs/disk=5329,mem=0,pages=0,vmem=0|82123795|
7417929|a100|1|32|2024-06-18T17:36:48|2024-06-18T17:36:48|2024-06-18T17:36:49|2024-06-20T17:37:07|5530176|TIMEOUT|sd652|||
7417929.batch||1|32|2024-06-18T17:36:49|2024-06-18T17:36:49|2024-06-18T17:36:49|2024-06-20T17:37:09|5530240|CANCELLED|sd652|cpu=1-23:35:10,energy=137492079,fs/disk=480003865,mem=5865184K,pages=2042,vmem=8406088K|137492079|
7417929.extern||1|32|2024-06-18T17:36:49|2024-06-18T17:36:49|2024-06-18T17:36:49|2024-06-20T17:37:08|5530208|COMPLETED|sd652|cpu=00:00:00,energy=137492079,fs/disk=5329,mem=0,pages=0,vmem=0|137492079|

It's really funny that the NO_VAL64 values are no longer printed, but something more sensible instead.  I wonder whether some database operation has corrected the issue?

Best regards,
Ole
Comment 14 Felip Moll 2024-06-25 05:31:26 MDT
I managed to reproduce the issue of getting NO_VAL64 from slurmd.

[lipi@llit build]$ sacct --user lipi -np  -S 062524 -E 062624 -o JobID,Start,End,Elapsed,CPUTimeRAW,State,TresUsageInTot,TresUsageOutTot,ConsumedEnergyRaw -j227
227|2024-06-25T13:19:04|2024-06-25T13:19:24|00:00:20|20|COMPLETED||||
227.batch|2024-06-25T13:19:04|2024-06-25T13:19:24|00:00:20|20|COMPLETED|cpu=00:00:00,energy=16?,fs/disk=9209,mem=632K,pages=0,vmem=0|energy=168,fs/disk=1913|18446744073709551273|

It definitely happened when slurmd was restarted. I am working on it.

For the NO_VAL64, can you print these two jobs, 7402270 and 7402271, without -X? You can also check the "<cluster>"_job_table and "<cluster>"_step_table in your database manually and print the job/steps and the tres_alloc field for these job ids to see what's stored there.
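
If it helps, a rough sketch of that manual check (assuming the default MySQL accounting database name slurm_acct_db; replace <cluster> with your ClusterName, and note that column names can differ between Slurm versions):

mysql slurm_acct_db -e "SELECT id_job, tres_alloc FROM <cluster>_job_table WHERE id_job IN (7402270, 7402271);"

and similarly for the <cluster>_step_table.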
Comment 15 Ole.H.Nielsen@fysik.dtu.dk 2024-06-25 06:27:31 MDT
(In reply to Felip Moll from comment #14)
> I managed to reproduce the issue of getting NO_VAL64 from slurmd.
> 
> [lipi@llit build]$ sacct --user lipi -np  -S 062524 -E 062624 -o
> JobID,Start,End,Elapsed,CPUTimeRAW,State,TresUsageInTot,TresUsageOutTot,
> ConsumedEnergyRaw -j227
> 227|2024-06-25T13:19:04|2024-06-25T13:19:24|00:00:20|20|COMPLETED||||
> 227.batch|2024-06-25T13:19:04|2024-06-25T13:19:24|00:00:
> 20|20|COMPLETED|cpu=00:00:00,energy=16?,fs/disk=9209,mem=632K,pages=0,
> vmem=0|energy=168,fs/disk=1913|18446744073709551273|
> 
> It definetively happened when slurmd was restarted. I am working on it.

My 2 cents: When slurmd is restarted, it ought to store the current job's ConsumedEnergyRaw number in the job's /var/spool/slurmd/job... folder.  This would allow ConsumedEnergy values to be kept across slurmd restarts.

> For the NO_VAL64, can you print these two jobs: 7402270 and 7402271, without
> -X? You can also check manually the "cluster"_step/job_table in your
> database and print the steps/job and tres_alloc field for this jobid to see
> what's stored there.

I hope this is the correct command you wanted:

sacct -j 7402270,7402271 -np -o JobID,Group,Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,ElapsedRaw,CPUTimeRAW,State,Nodelist,AllocTRES,TresUsageInTot,TresUsageOutTot,ConsumedEnergyRaw
7402270|qwise|a100|1|16|2024-06-17T11:36:20|2024-06-17T11:36:20|2024-06-17T11:36:21|2024-06-17T11:41:10|289|4624|COMPLETED|sd652|billing=64,cpu=16,energy=18446744073704391422,gres/gpu:a100=1,gres/gpu=1,mem=56000M,node=1|||18446744073704391422|
7402270.batch|||1|16|2024-06-17T11:36:21|2024-06-17T11:36:21|2024-06-17T11:36:21|2024-06-17T11:41:10|289|4624|COMPLETED|sd652|cpu=16,gres/gpu:a100=1,gres/gpu=1,mem=56000M,node=1|cpu=00:00:00,energy=18014398509476944K,fs/disk=343006,mem=5144K,pages=9,vmem=6448K|energy=878,fs/disk=31370|18446744073704391422|
7402270.extern|||1|16|2024-06-17T11:36:21|2024-06-17T11:36:21|2024-06-17T11:36:21|2024-06-17T11:41:10|289|4624|COMPLETED|sd652|billing=64,cpu=16,gres/gpu:a100=1,gres/gpu=1,mem=56000M,node=1|cpu=00:00:00,energy=18014398509476944K,fs/disk=5329,mem=0,pages=0,vmem=0|energy=878,fs/disk=1|18446744073704391422|
7402270.0|||1|16|2024-06-17T11:36:21|2024-06-17T11:36:21|2024-06-17T11:36:21|2024-06-17T11:41:10|289|4624|COMPLETED|sd652|billing=64,cpu=16,gres/gpu:a100=1,gres/gpu=1,mem=56000M,node=1|cpu=00:16:32,energy=18014398509476944K,fs/disk=1359186148,mem=5643900K,pages=4160,vmem=6885856K|energy=878,fs/disk=4311798|18446744073704391422|
7402271|qwise|a100|1|16|2024-06-17T11:36:55|2024-06-17T11:36:55|2024-06-17T11:36:57|2024-06-17T11:50:54|837|13392|COMPLETED|sd652|billing=64,cpu=16,energy=18446744073707747296,gres/gpu:a100=1,gres/gpu=1,mem=56000M,node=1|||18446744073707747296|
7402271.batch|||1|16|2024-06-17T11:36:57|2024-06-17T11:36:57|2024-06-17T11:36:57|2024-06-17T11:50:54|837|13392|COMPLETED|sd652|cpu=16,gres/gpu:a100=1,gres/gpu=1,mem=56000M,node=1|cpu=00:00:00,energy=18014398509480222K,fs/disk=342731,mem=5148K,pages=0,vmem=5164K|energy=857,fs/disk=30878|18446744073707747296|
7402271.extern|||1|16|2024-06-17T11:36:57|2024-06-17T11:36:57|2024-06-17T11:36:57|2024-06-17T11:50:54|837|13392|COMPLETED|sd652|billing=64,cpu=16,gres/gpu:a100=1,gres/gpu=1,mem=56000M,node=1|cpu=00:00:00,energy=18014398509480222K,fs/disk=5329,mem=0,pages=0,vmem=0|energy=857,fs/disk=1|18446744073707747296|
7402271.0|||1|16|2024-06-17T11:36:57|2024-06-17T11:36:57|2024-06-17T11:36:57|2024-06-17T11:50:54|837|13392|COMPLETED|sd652|billing=64,cpu=16,gres/gpu:a100=1,gres/gpu=1,mem=56000M,node=1|cpu=03:19:00,energy=18014398509480222K,fs/disk=1359599402,mem=7980376K,pages=263,vmem=8587048K|energy=857,fs/disk=4311284|18446744073707747296|

Thanks,
Ole
Comment 16 Ole.H.Nielsen@fysik.dtu.dk 2024-06-25 06:34:41 MDT
(In reply to Felip Moll from comment #14)
> -X? You can also check manually the "cluster"_step/job_table in your
> database and print the steps/job and tres_alloc field for this jobid to see
> what's stored there.

Could you advise the command to print this?

Thanks,
Ole
Comment 17 Felip Moll 2024-06-25 06:52:23 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #15)
> (In reply to Felip Moll from comment #14)
> > I managed to reproduce the issue of getting NO_VAL64 from slurmd.
> > 
> > [lipi@llit build]$ sacct --user lipi -np  -S 062524 -E 062624 -o
> > JobID,Start,End,Elapsed,CPUTimeRAW,State,TresUsageInTot,TresUsageOutTot,
> > ConsumedEnergyRaw -j227
> > 227|2024-06-25T13:19:04|2024-06-25T13:19:24|00:00:20|20|COMPLETED||||
> > 227.batch|2024-06-25T13:19:04|2024-06-25T13:19:24|00:00:
> > 20|20|COMPLETED|cpu=00:00:00,energy=16?,fs/disk=9209,mem=632K,pages=0,
> > vmem=0|energy=168,fs/disk=1913|18446744073709551273|
> > 
> > It definetively happened when slurmd was restarted. I am working on it.
> 
> My 2 cents: When slurmd is restarted, it ought to store the current jobid
> ConsumedEnergyRaw number in the job's /var/spool/slurmd/job... folder.  This
> would allow to keep ConsumedEnergy values across slurmd restarts. 

Thanks, but it is not failing in that part; it seems it is slurmstepd that miscalculates after getting a new node energy reading that is smaller than the previous one.

> > For the NO_VAL64, can you print these two jobs: 7402270 and 7402271, without
> > -X? You can also check manually the "cluster"_step/job_table in your
> > database and print the steps/job and tres_alloc field for this jobid to see
> > what's stored there.
> 
> I hope this is the correct command you wanted:

Thanks, this shows how the NO_VAL64 values are still there as I expected:

> 7402271.0|||1|16|2024-06-17T11:36:57|2024-06-17T11:36:57|2024-06-17T11:36:
> 57|2024-06-17T11:50:54|837|13392|COMPLETED|sd652|billing=64,cpu=16,gres/gpu:
> a100=1,gres/gpu=1,mem=56000M,node=1|cpu=03:19:00,energy=18014398509480222K,
> fs/disk=1359599402,mem=7980376K,pages=263,vmem=8587048K|energy=857,fs/
> disk=4311284|18446744073707747296|

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #16)
> (In reply to Felip Moll from comment #14)
> > -X? You can also check manually the "cluster"_step/job_table in your
> > database and print the steps/job and tres_alloc field for this jobid to see
> > what's stored there.
> 
> Could you advise the command to print this?

After your previous sacct output I don't need this anymore.

I will keep you posted on the progress!
Comment 20 Felip Moll 2024-06-25 06:54:54 MDT
Ole, I found that these wrong values are probably a variant or even the same issue as ticket 12170.
Comment 26 Felip Moll 2024-06-27 12:28:45 MDT
Ole,

I am looking into this and I've found several issues that could be hit at some point.
For the moment, my recommendations are:

- The correct JobAcctGatherFrequency syntax is:
JobAcctGatherFrequency=<datatype>=<interval>

So you should specify something like:
JobAcctGatherFrequency=task=60,energy=60

- Don't set JobAcctGatherFrequency lower than EnergyIPMIFrequency. It will cause the polling thread in every stepd to spin up and send an RPC to fetch the sensor measurement too often. In the worst case, with your values of 30 for JobAcctGatherFrequency and 60 for EnergyIPMIFrequency, it will get the same measurement twice per interval, wasting half of the RPCs.

JobAcctGatherFrequency should be >= EnergyIPMIFrequency.

- Disable EnergyIPMICalcAdjustment. I've found that, depending on your node consumption and on slurmd restarts/reconfigures, it might overestimate the consumption of the node. It could cause the plugin to think that slurmd was restarted and then update the energy values incorrectly. I am working on it, but basically the idea behind this parameter is that the stepd tries to estimate the consumption between the time the metric was taken by slurmd and "now", using the average watts. That idea is fine if the measurements are not too far apart, but it can be quite wrong otherwise, as in your case, where measurements are at least 30 seconds apart.

- About the NO_VAL values after a restart: I've found the problem, which is that we are not correctly detecting whether slurmd has been restarted. The slurmstepd gets the initial energy counter from slurmd when the step starts and never modifies it. If slurmd is restarted, the "initial energy" is not valid anymore, but we are still using it for the calculation of the next metric. It is possible, then, that we do a subtraction that underflows, which results in a value near NO_VAL (a small worked example follows). I am also working on this part. The difficulty is detecting when slurmd has restarted, as we currently rely solely on the previous energy consumed by the stepd and the new energy reported by slurmd, which in corner cases does not always indicate a restart when the latter is smaller than the former (for example when a job has just started, slurmd is restarted, there is a spike in consumption, and the jobacctgather and energyipmi frequencies are very different).
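
A small worked example of that underflow, with made-up numbers just to illustrate the mechanism: say the stepd recorded an initial energy counter of 5,200,000 J when the step started; slurmd is then restarted and its counter resets, so a later reading comes back as 39,806 J; computing 39,806 - 5,200,000 in unsigned 64-bit arithmetic gives 2^64 - 5,160,194 = 18446744073704391422, which is exactly the shape of the values in the original report.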

I will keep you posted.
Comment 28 Ole.H.Nielsen@fysik.dtu.dk 2024-06-28 01:19:50 MDT
Hi Felip,

Thanks for your impressive analysis of the issue!

(In reply to Felip Moll from comment #26)
> I am looking into that and I've found several issues that could be hit at
> some point.
> For the moment, my recommendation is:
> 
> - The JobAcctGatherFrequency correct syntax is:
> JobAcctGatherFrequency=<datatype>=<interval>
> 
> So you should specify something like:
> JobAcctGatherFrequency=task=60,energy=60

Thanks!  I was misled by the page https://slurm.schedmd.com/big_sys.html which suggests "JobAcctGatherFrequency=300".  Could you correct/update the example in big_sys.html?

> - Don't set a JobAcctGatherFrequency lower than EnergyIPMIFrequency. It will
> cause the polling thread in every stepd to spin-up and send a RPC to get the
> sensor measurement too often. In the worst case, with your values of 30 for
> JobAcctGatherFrequency and 60 for EnergyIPMIFrequency, it will get the same
> measure twice per interval, wasting 1/2 of the RPCs.
> 
> JobAcctGatherFrequency should be >= EnergyIPMIFrequency

This makes sense.  Could you add the recommendations to the acct_gather.conf manual page?

It might be a good idea to print a warning message to the logs if the condition is not fulfilled.

> - Disable EnergyIPMICalcAdjustment. I've found that depending on your node
> consumption and slurmd restart/reconfigs, it might overestimate the
> consumption of the node. It could cause the plugin to think that slurmd was
> restarted and would update incorrectly the energy values. I am working on
> it, but basically the idea behind this parameter is that stepd tries to
> estimate the consumption from the time the metric was taken by slurmd and
> "now", and the average watts. This idea is fine if the distance of the
> measurements are not too long, but it can be quite wrong otherwise, like in
> your case, where your measurements are separated at least 30 seconds between
> them.

Thanks for the recommendation.  EnergyIPMICalcAdjustment sounds nice as documented, but apparently isn't in all cases.

I really wonder about the optimal values of JobAcctGatherFrequency and EnergyIPMIFrequency when we want to obtain useful IPMI energy measurements.  Small values will yield accurate measurements, but as advised in https://slurm.schedmd.com/big_sys.html this may cause too much OS "jitter" (system noise), which is bad for application codes :-(  Can you propose recommendations for the various measurement frequencies?

> - About the NO_VAL values after a restart, I've found the problem and it is
> that we are not detecting correctly if slurmd has been restarted. The
> slurmstepd gets the initial energy counter from slurmd when the step starts,
> and never modifies it. If the slurmd is restarted, the "initial energy" is
> not valid anymore, but we're using it for the calculation of the next
> metric. It is possible then, that we're doing a substract and underflow a
> value, which results in NO_VAL. I am also working on this part. The
> difficulty here is to detect when we have a slurmd restart, as now we rely
> solely on the previous consumed energy by the stepd and the new consumed
> energy by slurmd, which in corner cases could not always indicate a restart
> if the last one is smaller than the first one (for example when a job just
> starts up, the slurmd is restarted, and there's a spike in consumption, plus
> the jobacctgather and energyipmi frequencies are very different).

Oh, this sounds really tricky!  Could the slurmstepd store the initial energy counter from slurmd in a file in the job's /var/spool/slurmd/job... folder and read it later?

In any case we really have to avoid NO_VAL energy values, because they would upset any energy accounting system!

I look forward to your continued work on this issue.  It seems that you're getting close to a good way to perform reliable IPMI energy measurements!

Thanks,
Ole
Comment 29 Ole.H.Nielsen@fysik.dtu.dk 2024-06-28 01:33:45 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #28)
> I really wonder about the optimal values of JobAcctGatherFrequency and
> EnergyIPMIFrequency when we want to obtain useful IPMI energy measurements. 
> Small values will yield accurate measurements, but as advised in
> https://slurm.schedmd.com/big_sys.html this may cause too much OS "jitter"
> system noise which is bad for application codes :-(  Can you propose
> recommendations on the various measurement frequencies?

I've now read this in the slurm.conf manual page:

   The default value for task sampling interval is 30 seconds.  The default  value  for  all other intervals is 0.

   Smaller (non-zero) values have a greater impact upon job performance,  but a value of 30 seconds is not likely to be noticeable for applications having less than 10,000 tasks.

It seems to me that in most cases JobAcctGatherFrequency = EnergyIPMIFrequency = 30 would be a good choice.  Would you agree?
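
With the syntax from comment 26, that combination would look like this (an illustrative sketch, not a tested recommendation):

slurm.conf:        JobAcctGatherFrequency=task=30,energy=30
acct_gather.conf:  EnergyIPMIFrequency=30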

Thanks,
Ole
Comment 30 Ole.H.Nielsen@fysik.dtu.dk 2024-06-28 01:50:30 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #28)
> > So you should specify something like:
> > JobAcctGatherFrequency=task=60,energy=60
> 
> Thanks!  I was misled by the page https://slurm.schedmd.com/big_sys.html
> which suggests "JobAcctGatherFrequency=300".  Could you correct/update the
> example in big_sys.html?

Actually, the semantics of the (previously valid?) JobAcctGatherFrequency=<number> form is currently undocumented.  Does it set all the <datatype> parameters to <number>, or only the task parameter?  Since our JobAcctGatherFrequency was working before, the task, energy and other <datatype> intervals were probably all being set.
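
For what it's worth, one way to see how the running daemons parsed that old value (assuming scontrol reflects the live configuration) is:

scontrol show config | grep JobAcctGatherFrequency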

In any case, using JobAcctGatherFrequency=<number> in slurm.conf ought to generate a warning message in the logs.

Would it make sense to have JobAcctGatherFrequency=<number> generate an error from the next 24.11 release?

Thanks,
Ole
Comment 31 Felip Moll 2024-06-28 06:03:45 MDT
Summary of tasks to do:
1. Correct big_sys.html, which suggests JobAcctGatherFrequency=300 (bad syntax).
2. Add a recommendation to the acct_gather.conf man page that JobAcctGatherFrequency should be >= EnergyIPMIFrequency, or enforce it?
3. Review the EnergyIPMICalcAdjustment logic.
4. Properly detect a slurmd restart from slurmstepd and reset the start energy counters; this should fix the NO_VAL values on slurmd restart/reconfig.
5. Double-check what happens when an incorrect JobAcctGatherFrequency syntax is set. It seems to apply to all plugins, but I am quite sure that is wrong. I also noticed that if you just set JobAcctGatherFrequency=energy=30 without task=30, the energy frequency is ignored. Decide whether we need to extend the syntax or emit a warning/error when misconfigured.
6. Check why the batch step energy is not accumulated into the job accounting. It is probably right, or at least intentional, to avoid double-counting with the external/interactive/other steps launched on the same node. Maybe we should only account the batch consumption when there is no other step on the node? This needs more investigation and can be tricky. In any case the batch step always runs on a single node, and the odds that somebody runs on no other nodes and with no other steps may not be very high (or are they?).

I think this bug would be resolved if these 6 points are resolved.

Responding to your questions:

> I really wonder about the optimal values of JobAcctGatherFrequency and EnergyIPMIFrequency when we want to obtain useful IPMI energy measurements.
> Small values will yield accurate measurements, but as advised in https://slurm.schedmd.com/big_sys.html this may cause too much OS "jitter" system
> noise which is bad for application codes :-(  Can you propose recommendations on the various measurement frequencies?

In a step, if the IPMI thread spins up too often (because of JobAcctGatherFrequency) it can occupy a core on the machine for a while and interrupt the execution of other processes, causing context switches. The same goes for EnergyIPMIFrequency: at every interval slurmd will query the IPMI. I remember a performance analysis of a sensitive compute-bound application where the graphs were not constant and frequent dips were seen. If the IPMI is slow, it can be even worse, as RPCs from slurmstepd to slurmd can accumulate and cause more jitter.

It all depends on your type of workload and the compute power of your machines. I would suggest something on the order of tens of seconds, matching the duration of your jobs and the desired resolution of the measurements. I cannot say for sure, but 30 seconds seems to be a common value in our configs database; 60 seconds is not uncommon either. Only a couple of sites use 10 seconds. For EnergyIPMIFrequency the value should be similar, but keep in mind how performant your BMCs are too; I've seen BMCs which are sluggish or which simply hang if queried too often.


> Oh, this sounds really tricky!  Could the slurmstepd store the initial energy counter from slurmd in a file in the job's /var/spool/slurmd/job... folder and read it later?
No, that's not how this should be fixed. The stepd already stores the initial energy counter from slurmd in its internal data structs, in a variable called start_current_energies[]. The problem is that when slurmd restarts, the energy is reset to 0 in slurmd but not in slurmstepd, and slurmstepd keeps using the previous start_current_energies[] value to calculate the consumption. Even storing the value in a file, as you suggest, won't be correct, because the concept shown in scontrol show node is "how much energy the node has consumed since the last slurmd start-up", not "how much energy the node has consumed since the last boot". We could change that idea, but for now it seems better to simply tell slurmstepd when slurmd has been restarted so that slurmstepd can reset its start_current_energies[]. I still have to look into the best way to do that, because it probably implies changing the RPC protocol, some data structs, and so on.

Does it all sound reasonable?
Comment 32 Ole.H.Nielsen@fysik.dtu.dk 2024-07-01 02:57:48 MDT
(In reply to Felip Moll from comment #31)
> Summary of tasks to do:
> 1. Correct big_sys.html which suggest JobAcctGatherFrequency=300, bad syntax.
> 2. Add recommendation to acct_gather.conf man page that JobAcctGatherFreq
> should be >= EnergyIPMIFreq, or enforce it?
> 3. Review EnergyIPMICalcAdjustment logic
> 4. Properly detect a slurmd restart from slurmstepd to reset the start
> energy counters, this should fix NO_VAL values on slurmd restart/reconfig.
> 5. Double-check what happens when incorrect syntax o JobAcctGatherFrequency
> is set. It seems to apply to all plugins but I am quite sure it is wrong.
> I also noticed if you just set JobAcctGatherFrequency=energy=30 without
> task=30, energy freq is ignored. Decide if we need to extend the syntax or
> emit a warning/error if misconfigured.
> 6. Check why the batch step energy is not accumulated into the job
> accounting. It is probably right or at least intentional to avoid
> duplicating the
> accounting for the external/interactive/other step launched in the same
> node. Maybe we should only account the batch consumption when there's no
> other
> step in the node? Need more investigation and can be tricky. In any case the
> batch step is always run in one single node and the odds that somebody does
> not run in other nodes and with no other steps, could not be very high (or
> yes?).
> 
> I think this bug would be resolved if these 6 points are resolved.

I agree that your 6 points cover all the issues that I have reported.  I'm surprised by the complexity of the IPMI power monitoring, and I'm really happy that you're getting to the bottom of it :-)  I would be surprised if any other site has succeeded in making IPMI power monitoring work correctly until now :-)

> Responding to your questions:
> 
> > I really wonder about the optimal values of JobAcctGatherFrequency and EnergyIPMIFrequency when we want to obtain useful IPMI energy measurements.
> > Small values will yield accurate measurements, but as advised in https://slurm.schedmd.com/big_sys.html this may cause too much OS "jitter" system
> > noise which is bad for application codes :-(  Can you propose recommendations on the various measurement frequencies?
> 
> In a step, if the ipmi thread spins up too often (because of
> JobAcctGatherFrequency) it can occupy a core on the machine for a while and
> interrupt the execution of other processes causing a context switch. The
> same for EnergyIPMIFrequency, which at every interval slurmd will query the
> IPMI. I remember a performance analysis on a sensitive compute bound
> application where the graphs were not constant and frequent dips were saw.
> If the IPMI is slow, the that can be even worse as RPCS from slurmstepd to
> slurmd can accumulate and cause more jittering.
> 
> It will all depend on your type of workflow and compute power of your
> machines. I would suggest something in the order of tenths of seconds which
> matches also with the duration of your jobs, and the desired resolution of
> the measurements. I cannot say for sure, but 30 seconds seems to be a
> general value in our configs database. 60 seconds is not uncommon
> either.Only a couple of sites have 10 seconds. For the EnergyIPMIFrequency,
> the value must be similar, but take in mind how performant are your BMC too,
> I've seen BMCs which are sluggish or which just get hung if the queries are
> done too often.

You're right that some BMCs are worse than others, and that BMCs can get saturated by too many requests.  We have a number of different node generations from several vendors, so we see a spectrum of BMCs.  For example, Huawei and Xfusion server BMCs are simply unable to provide DCMI power readings, as discussed in bug 17704.

> > Oh, this sounds really tricky!  Could the slurmstepd store the initial energy counter from slurmd in a file in the job's /var/spool/slurmd/job... folder and read it later?
> No, that's not how this must be fixed. The stepd already stores the initial
> energy counter from slurmd in its internal data structs, in a variable
> called start_current_energies[]. The problem is that when slurmd restarts,
> the energy is reset to 0 in slurmd but not in slurmstepd, and slurmstepd
> uses the previous start_current_energies[] value to calculate the
> consumption. Even storing the value in a file from slurmd, as you suggest,
> won't be correct, because the actual concept shown in scontrol show node is
> "how much energy has the node consumed since the last slurmd startup" and
> not "how much energy has the node consumed since the last boot". We could
> change that idea, but for now it seems better to just tell slurmstepd when
> slurmd has been restarted, so slurmstepd can set its
> start_current_energies[] again. But I still have to look into the best way
> to do that, because it probably implies changing the RPC protocol, some
> data structs, and so on.
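
A rough sketch, in C, of the "tell slurmstepd about the restart" idea (this
is not Slurm source code; apart from start_current_energies[] mentioned
above, all names are made up for illustration):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* baseline the step subtracts from the node counter; name taken from above */
static uint64_t start_current_energies[1];
/* slurmd start time the step last saw; hypothetical field */
static time_t cached_slurmd_start;

/* hypothetical helper: called for every reading the step gets from slurmd */
static uint64_t step_consumed_energy(uint64_t node_joules, time_t slurmd_start)
{
    if (slurmd_start > cached_slurmd_start) {
        /* slurmd restarted, so its counter was reset: re-take the baseline
         * instead of subtracting a stale value, which would wrap around to
         * a huge 64-bit number */
        start_current_energies[0] = node_joules;
        cached_slurmd_start = slurmd_start;
    }
    return node_joules - start_current_energies[0];
}

int main(void)
{
    printf("%llu\n", (unsigned long long)step_consumed_energy(1000, 100)); /* 0 */
    printf("%llu\n", (unsigned long long)step_consumed_energy(1500, 100)); /* 500 */
    /* slurmd restarted at t=200 and its counter dropped to 40 */
    printf("%llu\n", (unsigned long long)step_consumed_energy(40, 200));   /* 0 */
    return 0;
}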

This does sound tricky :-)  I hope you can find a solution.

> Does it all sound reasonable?

Yes, thank you very much!

Ole
Comment 37 Felip Moll 2024-07-08 05:37:17 MDT
Hello Ole,

While working on this I found yet another thing that must be documented; it concerns point number 6 of my comment 31.
If you run two parallel steps on a node, the energy accounted in the *job* will be the sum of the steps. For example, this job ran two steps at exactly the same time on the same node. You can see how the ConsumedEnergyRaw of the job is the energy of 1464.0 + 1464.1, while the real joules consumed by the node are not 1406 but 703. The energy of the batch step is not accounted for in the job, or that would cause double accounting, since the batch step is always running.

]$ sacct -np  -o JobID,Start,End,Elapsed,CPUTimeRAW,State,TresUsageInTot,TresUsageOutTot,ConsumedEnergyRaw -j 1464
1464|2024-07-08T13:25:31|2024-07-08T13:25:38|00:00:07|1792|COMPLETED|||1406|
1464.batch|2024-07-08T13:25:31|2024-07-08T13:25:38|00:00:07|1792|COMPLETED|cpu=00:00:00,energy=606,fs/disk=84900,mem=8596K,pages=0,vmem=0|energy=0,fs/disk=24549|606|
1464.0|2024-07-08T13:25:32|2024-07-08T13:25:40|00:00:08|8|CANCELLED|cpu=00:00:00,energy=703,fs/disk=4093,mem=196K,pages=0,vmem=0|energy=0,fs/disk=1757|703|
1464.1|2024-07-08T13:25:32|2024-07-08T13:25:40|00:00:08|8|CANCELLED|cpu=00:00:00,energy=703,fs/disk=4093,mem=192K,pages=0,vmem=0|energy=0,fs/disk=1757|703|
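
In numbers: the job-level value is the sum of the two parallel steps, 703 + 703 = 1406 J, while the node itself only consumed about 703 J in that window, and the batch step's 606 J is left out to avoid double accounting.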

Also, I see 606 here when it should be >= 703, since this step should have run for longer than the other steps; what happened is that it ended earlier, and there is a slight desynchronization between when it started gathering metrics and when the step termination code ran and the energy plugins finished in the 3 steps. But that's just a minor corner case that I wouldn't give much importance to.

I am going on PTO for some weeks, so I will pass my work on this to my colleague Miquel, who will continue working on it internally.
Thanks for your report and we'll keep you informed.
Comment 40 Ole.H.Nielsen@fysik.dtu.dk 2024-08-01 01:06:08 MDT
Hi Felip and Miquel,

With the releases 24.05.2 and 23.11.9 made yesterday, I wonder how many of the points in Comment 31 have been fixed now, and which ones are outstanding?

Thanks,
Ole
Comment 41 Miquel Comas 2024-08-01 01:26:25 MDT
Hi Ole,

Right now our main effort is on points 5 and 6; the others are covered by a patch which is about to be reviewed. When we push those changes we will give you the commit IDs so that, even if they are pushed between releases, you can apply the fixes as soon as possible.

Best regards,

Miquel
Comment 51 Miquel Comas 2024-10-03 02:28:41 MDT
Hi Ole,

After some time working on it, we have finally pushed the following commits to the master branch:
* 39636a0215 (HEAD -> master, origin/master, origin/HEAD) Add NEWS entry for the last 9 commits
* 8a158436f3 acct_gather_energy - Add restart and faulty IPMI detection in energy logic
* ca7dbd1a5f acct_gather_energy - Log slurmd start time in the energy logging
* 266c54fde0 acct_gather_energy - Add logic to detect when slurmd has been restarted
* 11a8223cf8 slurmd - make slurmd_start_time symbol available to energy plugins
* 852ad55754 acct_gather_energy - add new variable to protocol pack/unpack logic
* 110ae98677 acct_gather_energy - add new protocol block for pack/unpack energy
* 1be49e8b3f Docs - Clarify different types of JobAcctGatherFrequency
* 0fd7283cc5 Docs - Add recommendation about minimum JobAcctGatherFreq
* 5aa16be265 Docs - Fix incorrect JobAcctGatherFreq parameter format

The documentation changes aim to clarify what some parameters are actually for (these correspond to points 1, 2 and 5 from Felip's comment 31).

The commits which actually change and fix the energy gathering behaviour correspond to Felip's 4th point. Specifically, the changes now enable IPMI energy gathering to appropriately detect a slurmd restart, and also handle faulty IPMIs by detecting whether any non-start sensor reading is 0; when this happens, the steps affected by that IPMI interface will abort energy gathering and display a Consumed Energy of 0 (while also displaying an error to warn of an issue with the IPMI sensors).
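
As a rough sketch of that behaviour in C (not the actual commit code; all
names here are made up for illustration):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool sensor_faulty;    /* set once a bogus reading is seen */

/* hypothetical helper: filter every energy reading taken for a step */
static uint64_t account_reading(uint64_t joules, bool is_first_reading)
{
    if (!sensor_faulty && !is_first_reading && joules == 0) {
        /* a non-start reading of 0 means the IPMI sensor is faulty */
        sensor_faulty = true;
        fprintf(stderr, "error: energy sensor returned 0, aborting energy gathering\n");
    }
    /* affected steps abort gathering and report a ConsumedEnergy of 0 */
    return sensor_faulty ? 0 : joules;
}

int main(void)
{
    printf("%llu\n", (unsigned long long)account_reading(500, true));  /* 500 */
    printf("%llu\n", (unsigned long long)account_reading(0, false));   /* 0, sensor flagged */
    printf("%llu\n", (unsigned long long)account_reading(800, false)); /* still 0 */
    return 0;
}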

Right now, the only remaining issue is the 6th point.

Best regards,
Comment 52 Ole.H.Nielsen@fysik.dtu.dk 2024-10-04 00:07:00 MDT
Hi Miquel,

Thanks very much for these extensive and impressive commits to resolve the present ticket!  The acct_gather_energy plugin should now become more precise and useful. Is this going to be part of 24.11?

The remaining point 6 is perhaps less urgent.

Thanks,
Ole
Comment 53 Miquel Comas 2024-10-07 03:43:44 MDT
Hi Ole,

Yes, these changes will be available in 24.11. We'll keep you updated on the last remaining issue.

Best regards,