Ticket 2084 - No energy data collected and errors printed to stdout with acct_gather_energy/cray
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 15.08.2
Hardware: Cray XC Linux
Severity: 3 - Medium Impact
Assignee: Danny Auble
Reported: 2015-10-29 04:40 MDT by Doug Jacobsen
Modified: 2015-10-29 09:53 MDT
Site: NERSC
Version Fixed: 15.08.3 16.05.0-pre1


Attachments
partch for Cray energy (990 bytes, patch)
2015-10-29 06:06 MDT, Danny Auble
energy2.patch (601 bytes, text/x-diff)
2015-10-29 06:36 MDT, Danny Auble
patch working on cori (1.48 KB, patch)
2015-10-29 07:12 MDT, Doug Jacobsen

Description Doug Jacobsen 2015-10-29 04:40:19 MDT
Created attachment 2357 [details]
energy2.patch

Hello,

I'm trying to get collection of power data working for accounting purposes.

I've configured:

dmj@cori01:~> scontrol show config | grep -i acct
AcctGatherEnergyType    = acct_gather_energy/cray
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)


The errors I'm getting:

+ srun sleep 1
slurmstepd: acct_gather_energy_p_get_data: unknown enum 7
slurmstepd: acct_gather_energy_p_get_data: unknown enum 7
slurmstepd: acct_gather_energy_p_get_data: unknown enum 7


There are similar errors in the slurmd logs.

I couldn't find much in the way of documentation for this plugin, so I appreciate any advice you can give.

Thanks,
Doug
Comment 1 Danny Auble 2015-10-29 06:06:44 MDT
Created attachment 2354 [details]
partch for Cray energy

Doug, does this patch fix the issue?  It looks like this case was missed when the enum was changed.

If this fixes the issue, I'll check it in.  I haven't been able to test it on a real Cray just yet; if you can, that would be great ;).
Comment 2 Doug Jacobsen 2015-10-29 06:24:14 MDT
Hi Danny,

The patch doesn't include a definition for sensor_cnt and thus doesn't compile:

given:

        acct_gather_energy_t *energy = (acct_gather_energy_t *)data;
        time_t *last_poll = (time_t *)data;


and:

        *last_poll = local_energy->poll_time;
...
        *sensor_cnt = 1;

I assume that sensor_cnt should be some cast of data. What type should it be?

Thanks,
Doug
Comment 3 Danny Auble 2015-10-29 06:36:06 MDT
Put this on top of the other patch; sorry for missing it.
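
For readers following the thread, here is a minimal sketch of the pattern under discussion: a get_data-style function that takes a request code plus a void pointer, casts the pointer once per supported type, and reports any code it has no case for. This is not the Slurm source or the attached patch. The names energy_request, REQ_*, energy_t, and energy_get_data are hypothetical, and uint16_t for the sensor count is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical request codes; the real plugin enum uses other names and values. */
enum energy_request {
    REQ_ENERGY_STRUCT,
    REQ_LAST_POLL,
    REQ_SENSOR_CNT
};

/* Hypothetical stand-in for acct_gather_energy_t. */
typedef struct {
    uint64_t consumed_joules;
    time_t poll_time;
} energy_t;

/* Sample state for the sketch; the values are arbitrary. */
static energy_t local_energy = { 801, (time_t)1446117600 };

/* Returns 0 on success, -1 when the request code has no matching case. */
int energy_get_data(enum energy_request req, void *data)
{
    /* The caller decides what data points to; only the cast that
     * matches the request is ever dereferenced. */
    energy_t *energy = (energy_t *)data;
    time_t *last_poll = (time_t *)data;
    uint16_t *sensor_cnt = (uint16_t *)data; /* assumed type */

    assert(data != NULL);

    switch (req) {
    case REQ_ENERGY_STRUCT:
        *energy = local_energy;
        break;
    case REQ_LAST_POLL:
        *last_poll = local_energy.poll_time;
        break;
    case REQ_SENSOR_CNT:
        /* A single node-level source is reported as one sensor. */
        *sensor_cnt = 1;
        break;
    default:
        /* A request code added to the enum without a case here ends up
         * in this branch, which is the kind of message reported above. */
        fprintf(stderr, "energy_get_data: unknown enum %d\n", (int)req);
        return -1;
    }
    return 0;
}
```

A caller would pass the address of a variable of the matching type, for example a uint16_t for REQ_SENSOR_CNT in this sketch, and check the return value.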

Comment 4 Doug Jacobsen 2015-10-29 07:12:32 MDT
Created attachment 2358 [details]
patch working on cori
Comment 5 Doug Jacobsen 2015-10-29 07:16:03 MDT
Thanks for sending it out. I ended up coming to the same solution while the messages were in flight.  We're collecting data now.  What are the units for ConsumedEnergy in the sacct output?

Are all the steps orthogonal in terms of usage?  For example, a trivial job:

dmj@cori03:~> sacct -j 6990 --format=user,job,ConsumedEnergy,ConsumedEnergyRaw
     User        JobID ConsumedEnergy ConsumedEnergyRaw
--------- ------------ -------------- -----------------
      dmj 6990
          6990.0                  801        801.000000
          6990.1                   11         11.000000
dmj@cori03:~>

Thanks,
Doug
Comment 6 Danny Auble 2015-10-29 07:22:09 MDT
I believe joules. 
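
If the values are indeed joules (the thread states this only as a belief), converting them to more familiar units is simple arithmetic: 1 kWh is 3.6e6 J, and average power in watts is joules divided by elapsed seconds. A small sketch follows; the helper names joules_to_kwh and average_watts are hypothetical and are not part of Slurm or sacct.

```c
/* 1 kWh = 1000 W * 3600 s = 3.6e6 J */
#define JOULES_PER_KWH 3.6e6

/* Convert an energy value in joules to kilowatt-hours. */
double joules_to_kwh(double joules)
{
    return joules / JOULES_PER_KWH;
}

/* Average power in watts over an interval; returns -1.0 when the
 * elapsed time is not positive, since the ratio is then undefined. */
double average_watts(double joules, double elapsed_seconds)
{
    if (elapsed_seconds <= 0.0)
        return -1.0;
    return joules / elapsed_seconds;
}
```

The elapsed time would have to come from the job or step record; it is not shown in the sacct output above.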

Comment 7 Doug Jacobsen 2015-10-29 07:56:34 MDT
Now that this is working, I have a follow-up question: if we aren't using the HDF5 profiling, is there any benefit to setting AcctGatherFilesystemType to the lustre plugin?  Will the data be gathered for total job read/write/size?

It's unclear from the documentation.

Thank you,
Doug
Comment 8 Danny Auble 2015-10-29 09:15:29 MDT
Currently it only matters with profiling.  The same holds true for acct_gather_infiniband (just in case you wanted to ask that question as well ;)).

Pretty much what is stored in the struct jobacctinfo defined in src/common/slurm_jobacct_gather.h is stored in the database.  If it isn't there it isn't stored.

Let me know if you have any other questions.  The patch is now in 15.08 commit fe9cc7426c0cb2e.

Do you have anything else on this one?
Comment 9 Doug Jacobsen 2015-10-29 09:53:49 MDT
I think this is great -- thanks again.

-Doug