Ticket 2084

Summary: No energy data collected and errors printed to stdout with acct_gather_energy/cray
Product: Slurm Reporter: Doug Jacobsen <dmjacobsen>
Component: Configuration Assignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: alex, brian, da, tim
Version: 15.08.2   
Hardware: Cray XC   
OS: Linux   
Site: NERSC
Version Fixed: 15.08.3 16.05.0-pre1
Attachments: partch for Cray energy
energy2.patch
patch working on cori

Description Doug Jacobsen 2015-10-29 04:40:19 MDT
Created attachment 2357 [details]
energy2.patch

Hello,

I'm trying to get collection of power data working for accounting purposes.

I've configured:

dmj@cori01:~> scontrol show config | grep -i acct
AcctGatherEnergyType    = acct_gather_energy/cray
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)


The errors I'm getting:

+ srun sleep 1
slurmstepd: acct_gather_energy_p_get_data: unknown enum 7
slurmstepd: acct_gather_energy_p_get_data: unknown enum 7
slurmstepd: acct_gather_energy_p_get_data: unknown enum 7


There are similar errors in the slurmd logs.

I couldn't find much in the way of documentation for this plugin, so I appreciate any advice you can give.

Thanks,
Doug
Comment 1 Danny Auble 2015-10-29 06:06:44 MDT
Created attachment 2354 [details]
partch for Cray energy

Doug, does this patch fix the issue?  It looks like this case was missed when the enum was changed.

If this fixes the issue I'll check it in.  I haven't been able to test it on a real Cray just yet; if you can, that would be great ;).
Comment 2 Doug Jacobsen 2015-10-29 06:24:14 MDT
Hi Danny,

The patch doesn't include a definition for sensor_cnt and thus doesn't compile:

given:

        acct_gather_energy_t *energy = (acct_gather_energy_t *)data;
        time_t *last_poll = (time_t *)data;


and:

        *last_poll = local_energy->poll_time;
...
        *sensor_cnt = 1;

I assume sensor_cnt should be a cast of data, like energy and last_poll. What type should it be?

Thanks,
Doug
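
For readers following along, the pattern under discussion can be sketched as below. This is a minimal, self-contained illustration and not the attached patch: the enum values, the struct, the function name get_data, and the uint16_t type chosen for sensor_cnt are all assumptions made for the example. It shows one untyped data pointer being cast to a different type per request, and a default branch that reports an "unknown enum" error when a request has no matching case.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical request codes; NOT Slurm's real enum names or values. */
enum energy_req {
	REQ_ENERGY_STRUCT,
	REQ_LAST_POLL,
	REQ_SENSOR_CNT
};

/* Hypothetical stand-in for the plugin's per-node energy record. */
struct energy_rec {
	uint64_t consumed_joules;
	time_t poll_time;
};

/* Arbitrary sample values used only for this illustration. */
static struct energy_rec local_energy = { 801, 1446115200 };

/*
 * Fill the caller's buffer according to the request type.
 * Each pointer below is a different view of the same untyped buffer;
 * only the one matching the request is dereferenced.
 * Returns 0 on success, -1 if the request has no matching case.
 */
int get_data(enum energy_req req, void *data)
{
	struct energy_rec *energy = (struct energy_rec *)data;
	time_t *last_poll = (time_t *)data;
	uint16_t *sensor_cnt = (uint16_t *)data; /* type is an assumption */

	switch (req) {
	case REQ_ENERGY_STRUCT:
		*energy = local_energy;
		break;
	case REQ_LAST_POLL:
		*last_poll = local_energy.poll_time;
		break;
	case REQ_SENSOR_CNT:
		*sensor_cnt = 1;
		break;
	default:
		/* A request added to the enum without a case ends up here. */
		fprintf(stderr, "get_data: unknown enum %d\n", (int)req);
		return -1;
	}
	return 0;
}
```

In this sketch, calling get_data with a request value that has no case prints the error and leaves the caller's buffer untouched, which matches the symptom reported above: repeated "unknown enum" messages and no data collected.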
Comment 3 Danny Auble 2015-10-29 06:36:06 MDT
Put this on top of the other patch; sorry for missing it.

Comment 4 Doug Jacobsen 2015-10-29 07:12:32 MDT
Created attachment 2358 [details]
patch working on cori
Comment 5 Doug Jacobsen 2015-10-29 07:16:03 MDT
Thanks for sending it out.  I ended up coming to the same solution while the messages were in flight.  We're collecting data now.  What are the units for ConsumedEnergy in sacct?

Is the usage reported for each step independent of the other steps?  For example, a trivial job:

dmj@cori03:~> sacct -j 6990 --format=user,job,ConsumedEnergy,ConsumedEnergyRaw
     User        JobID ConsumedEnergy ConsumedEnergyRaw
--------- ------------ -------------- -----------------
      dmj 6990
          6990.0                  801        801.000000
          6990.1                   11         11.000000
dmj@cori03:~>

Thanks,
Doug
Comment 6 Danny Auble 2015-10-29 07:22:09 MDT
I believe the units are joules.
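
If the values are in joules, as believed above, they convert to other common units by simple arithmetic: one watt-hour is 3600 joules, and average power in watts is joules divided by elapsed seconds. The helpers below are a small sketch of that arithmetic only; the function names are hypothetical, and the step durations are not given in this ticket, so any elapsed time would have to come from elsewhere (for example, sacct's Elapsed field).

```c
/*
 * Unit helpers for an energy reading.
 * Assumption: the reading is in joules, as suggested in this thread.
 */

/* Convert joules to watt-hours (1 Wh = 3600 J). */
double joules_to_watt_hours(double joules)
{
	return joules / 3600.0;
}

/*
 * Average power in watts over a run of elapsed_seconds.
 * Returns 0 for non-positive durations to avoid dividing by zero.
 */
double average_watts(double joules, double elapsed_seconds)
{
	if (elapsed_seconds <= 0.0)
		return 0.0;
	return joules / elapsed_seconds;
}
```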

Comment 7 Doug Jacobsen 2015-10-29 07:56:34 MDT
Now that this is working, I have a follow-up question: if we aren't using the HDF5 profiling, is there any benefit to setting AcctGatherFilesystemType to the lustre plugin?  Will the data be gathered for total job read/write/size?

It's unclear from the documentation.

Thank you,
Doug
Comment 8 Danny Auble 2015-10-29 09:15:29 MDT
Currently it only matters with profiling.  The same holds true for the acct_gather_infiniband plugin as well (just in case you wanted to ask that question too ;)).

Pretty much only what is in struct jobacctinfo, defined in src/common/slurm_jobacct_gather.h, is stored in the database.  If a field isn't there, it isn't stored.

Let me know if you have any other questions.  The patch is now in 15.08 commit fe9cc7426c0cb2e.

Do you have anything else on this one?
Comment 9 Doug Jacobsen 2015-10-29 09:53:49 MDT
I think this is great -- thanks again.

-Doug