Ticket 8484 - Documentation on the acct_gather_energy plugin for Lenovo SD650 servers XCC
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Documentation
Version: 19.05.5
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
QA Contact: Ben Roberts
Reported: 2020-02-11 04:03 MST by Karsten Kutzer
Modified: 2020-04-15 00:47 MDT
Site: LRZ
Version Fixed: 20.02.2


Description Karsten Kutzer 2020-02-11 04:03:37 MST
Please provide some documentation on the Slurm website for the acct_gather_energy plugin provided by Felip Moll, which works with Lenovo ThinkSystem SD650 servers to collect power/energy data. Thank you!
Comment 2 Felip Moll 2020-04-14 10:24:01 MDT
Hi Karsten,

I prepared a patch that mentions XCC and the parameters that can be applied on production systems. Once reviewed, this will appear in the acct_gather.conf and slurm.conf man pages and also in the acct_gather energy section of the website.

It may take a little while to land, though.

I also want to mention that specific design documentation is not included, just as it is not for any other plugin. Nevertheless, I refer you to bug 6213 (to which you have access) for further internal details.

For the record, here are the basic points to remember.

The XCC plugin for the Lenovo SD650 is a fork of the current acct_gather_energy/ipmi plugin. It currently works for Lenovo SD650 servers only, but it could easily be changed or extended in the future to support other servers.

The difference from the ipmi plugin is that there is just one sensor, and instead of the IPMI sensor library we use direct raw commands to query the XCC. The raw command and the decoding of its output can be found in the code. Apart from that, everything is identical to the standard IPMI plugin.

The idea is to work as the ipmi plugin does: a thread started in slurmd queries the XClarity Controller periodically, at the interval set by EnergyIPMIFrequency (in seconds) in acct_gather.conf. The measurement is stored in slurmd, and the stepd threads consume it through RPCs to calculate each step's power and energy usage.
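The per-step calculation described above can be sketched as follows. This is a minimal illustration, not the plugin's code: it assumes each stored reading is a cumulative energy counter in joules with a timestamp, and the names Reading and step_usage are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One stored sample (hypothetical layout, not the plugin's struct)."""
    timestamp_s: float  # time of the sample, in seconds
    energy_j: float     # cumulative energy counter, in joules (assumed)


def step_usage(start: Reading, end: Reading) -> tuple[float, float]:
    """Return (consumed energy in joules, average power in watts)
    between two samples taken at step start and step end."""
    elapsed = end.timestamp_s - start.timestamp_s
    if elapsed <= 0:
        raise ValueError("end sample must be later than start sample")
    consumed = end.energy_j - start.energy_j
    return consumed, consumed / elapsed


# Two samples taken 30 seconds apart.
energy, power = step_usage(Reading(0.0, 1000.0), Reading(30.0, 4000.0))
print(energy, power)
```

A shorter polling interval gives finer-grained power figures at the cost of more queries to the controller.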

Slurmctld periodically gathers this information at the interval set by AcctGatherNodeFreq; it is used to show node information, e.g. in 'scontrol show node'. I recommend setting this to 0, because the important information is the total consumption of the job, not of a single node, and this gathering does come with a (small) performance penalty.
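As a rough illustration of the settings mentioned so far, a configuration might look like the fragment below. The plugin name acct_gather_energy/xcc and the 30-second interval are assumptions made for this sketch and are not taken from this ticket; check the man pages of your Slurm version before using them.

```
# slurm.conf (sketch)
# Plugin name is an assumption; verify it against your slurm.conf man page.
AcctGatherEnergyType=acct_gather_energy/xcc
# 0 disables the periodic per-node gathering by slurmctld, as recommended above.
AcctGatherNodeFreq=0

# acct_gather.conf (sketch)
# Polling interval of the slurmd thread, in seconds (example value only).
EnergyIPMIFrequency=30
```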

Besides that, the TRESUsageIN/OUT* fields are filled every gather interval. This is the information shown by sstat and sacct, and it represents the most important data: the job's consumption.

TRESUsageIN* fields are used for energy.
TRESUsageOUT* fields are used for power.
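To pull the energy entry out of such a field, something like the sketch below may help. It assumes the field is a comma-separated list of name=value pairs; the sample string and the parse_tres helper are made up for illustration and are not real sacct output.

```python
def parse_tres(tres: str) -> dict[str, str]:
    """Split a comma-separated 'name=value' list into a dict.

    The input format is an assumption about how TRESUsage* fields
    are printed; values are kept as strings, untouched.
    """
    fields: dict[str, str] = {}
    for item in tres.split(","):
        name, sep, value = item.partition("=")
        if sep:  # skip empty or malformed entries
            fields[name.strip()] = value.strip()
    return fields


# Hypothetical field content, for illustration only.
sample = "cpu=00:01:30,energy=5400,mem=1024K"
print(parse_tres(sample).get("energy"))
```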

There is nothing more to this plugin. Everything else is identical to the other acct_gather plugins.

There are a couple of developer-only options that can be found in the code (mainly EnergyXCCFake and EnergyIPMIDriverType); these are intended for development and testing purposes.

I will keep this bug open until the documentation patch is applied.

If you have specific questions, don't hesitate to ask.
Comment 5 Felip Moll 2020-04-15 00:46:52 MDT
Documentation notes have been added in commit d4c4ad1a064c for Slurm 20.02.2, which will be available in the near future.

Thanks!