| Summary: | No energy data collected and errors printed to stdout with acct_gather_energy/cray | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | Configuration | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | alex, brian, da, tim |
| Version: | 15.08.2 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 15.08.3 16.05.0-pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: |
partch for Cray energy
energy2.patch patch working on cori |
||
Created attachment 2354 [details]
partch for Cray energy
Doug, does this patch fix the issue? It looks like this plugin was missed in a change to
the enum.
If this fixes the issue I'll check it in. I haven't been able to test it on a real Cray just yet; if you can, that would be great ;).
Hi Danny,
The patch doesn't include a definition for sensor_cnt and thus doesn't compile:
given:
acct_gather_energy_t *energy = (acct_gather_energy_t *)data;
time_t *last_poll = (time_t *)data;
and:
*last_poll = local_energy->poll_time;
...
*sensor_cnt = 1;
I assume that sensor_cnt should be some cast of data, what type?
Thanks,
Doug
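For readers following along, the failure mode and the question about `data` can be sketched as below. This is a minimal illustration, not the attached patch: the `demo_*` names, the struct layout, and the sample values are hypothetical, and the use of `uint16_t` for the sensor count is an assumption (the attachment is authoritative for the real type). It shows a getter that dispatches on a request enum, writes through `void *data` as the type that request expects, and logs "unknown enum" for any value without a case label.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical request types; illustrative names, not Slurm's. */
enum demo_energy_type {
	DEMO_ENERGY_NODE,       /* caller passes a demo_energy_t * */
	DEMO_ENERGY_LAST_POLL,  /* caller passes a time_t * */
	DEMO_ENERGY_SENSOR_CNT, /* caller passes a uint16_t * (assumed type) */
	DEMO_ENERGY_UNHANDLED   /* stands in for a value added to the enum later */
};

/* Hypothetical stand-in for the plugin's energy record. */
typedef struct {
	uint64_t consumed_joules;
	time_t poll_time;
} demo_energy_t;

/* Sample data only. */
static demo_energy_t local_energy = { 801, 1446145200 };

/* Returns 0 on success, -1 for a request type with no case label. */
int demo_get_data(enum demo_energy_type type, void *data)
{
	switch (type) {
	case DEMO_ENERGY_NODE:
		*(demo_energy_t *)data = local_energy;
		break;
	case DEMO_ENERGY_LAST_POLL:
		*(time_t *)data = local_energy.poll_time;
		break;
	case DEMO_ENERGY_SENSOR_CNT:
		/* A single node-level sensor is reported. */
		*(uint16_t *)data = 1;
		break;
	default:
		/* A value added to the enum without a matching case ends up here. */
		fprintf(stderr, "%s: unknown enum %d\n", __func__, (int)type);
		return -1;
	}
	return 0;
}
```

The sketch casts inside each case rather than declaring every typed view of `data` up front, so each pointer conversion is only performed for the type the caller actually passed.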
Put this on top of the other, sorry for missing it.
Created attachment 2358 [details]
patch working on cori
Thanks for sending it out, I ended up coming to the same solution while the messages were in flight. We're collecting data now. What are the units for ConsumedEnergy in sacct?
Are all the steps orthogonal in terms of usage? e.g., a trivial job:
dmj@cori03:~> sacct -j 6990 --format=user,job,ConsumedEnergy,ConsumedEnergyRaw
User JobID ConsumedEnergy ConsumedEnergyRaw
--------- ------------ -------------- -----------------
dmj 6990
6990.0 801 801.000000
6990.1 11 11.000000
dmj@cori03:~>
Thanks,
Doug
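If the raw values turn out to be joules (an assumption at this point in the thread), converting them for reporting is simple arithmetic. The helpers below are a hypothetical sketch, not part of Slurm or sacct; the function names are made up for illustration.

```c
/* Convert an energy reading in joules to watt-hours (1 Wh = 3600 J). */
double joules_to_watt_hours(double joules)
{
	return joules / 3600.0;
}

/* Average power in watts over an interval.
 * Returns 0 for a zero or negative interval to avoid dividing by zero. */
double average_watts(double joules, double elapsed_seconds)
{
	if (elapsed_seconds <= 0.0)
		return 0.0;
	return joules / elapsed_seconds;
}
```

Under that assumption, the 801 reported for step 6990.0 would be about 0.22 Wh. Whether the per-step values can be summed into a job total is the open question above and is not assumed here.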
I believe joules.
Now that this is working, I have a follow-up question: if we aren't using the hdf5 profiling, is there any benefit to setting AcctGatherFilesystemType to the lustre plugin? Will the data be gathered for total job read/write/size? It's unclear from the documentation.
Thank you,
Doug
Currently it only matters with profiling. The same holds true for acct_gather_infiniband as well (just in case you wanted to ask that question too ;)). Pretty much whatever is in the struct jobacctinfo defined in src/common/slurm_jobacct_gather.h is stored in the database. If it isn't there, it isn't stored.
Let me know if you have any other questions.
The patch is now in 15.08 commit fe9cc7426c0cb2e. Do you have anything else on this one?
I think this is great -- thanks again.
-Doug |
Created attachment 2357 [details]
energy2.patch
Hello,
I'm trying to get collection of power data working for accounting purposes. I've configured:
dmj@cori01:~> scontrol show config | grep -i acct
AcctGatherEnergyType = acct_gather_energy/cray
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
The errors I'm getting:
+ srun sleep 1
slurmstepd: acct_gather_energy_p_get_data: unknown enum 7
slurmstepd: acct_gather_energy_p_get_data: unknown enum 7
slurmstepd: acct_gather_energy_p_get_data: unknown enum 7
There are similar errors in the slurmd logs.
I couldn't find much in the way of documentation for this plugin, so I appreciate any advice you can give.
Thanks,
Doug