Summary: | Extend acct_gather_energy_rapl to support AMD Zen | ||
---|---|---|---|
Product: | Slurm | Reporter: | Jurij Pečar <jurij.pecar> |
Component: | Accounting | Assignee: | Dominik Bartkiewicz <bart> |
Status: | OPEN --- | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | Alan.Sill, alex, bart, misha.ahmadian, osmith, sts |
Version: | 21.08.8 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | EMBL | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Attachments: | amd epyc support for acct_gather_energy_rapl |
Description
Jurij Pečar
2019-07-31 06:00:59 MDT
Hi Jurij, and thank you for pointing this out. We will evaluate this and see what is possible.

I believe you'll need kernel version 5.8 or higher to read AMD Zen power through RAPL. Important fixes were also introduced in kernel 5.11. The patches were originally submitted by Google and eventually merged:
https://lore.kernel.org/lkml/20200515215733.20647-1-eranian@google.com/#r
https://lore.kernel.org/lkml/20200601155437.GA1042527@gmail.com/

Ah - bad news: it looks like support for AMD via RAPL was removed as of version 5.13:
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.13-AMD-Energy-Removed

Yeah, I'm following this saga. Now we need to see how the major distros will decide - I'd hate to maintain external patches for each kernel upgrade just to get the energy info ...

Revisiting this topic. It looks like the amd_energy kernel module has been available in el8 since 8.4, and I can confirm that the "sensors" command shows me per-core energy usage in kJ. However, Slurm (21.08.8) still shows n/a for the watts and joules associated with our AMD nodes. What's missing?

It looks like, due to CVE-2020-12912, something like
chmod 444 /sys/devices/platform/amd_energy.0/hwmon/hwmon*/energy*
is needed. I guess it's up to each admin to determine whether this is an acceptable risk for their systems. Doing this makes 'sensors' output per-core kJ numbers to nonprivileged users as well. Now to see if Slurm can make use of that...

Created attachment 33361
amd epyc support for acct_gather_energy_rapl
Hi,
I've written a patch to add support for AMD Epyc CPUs to acct_gather_energy_rapl in Slurm 23.02. Honestly, I'm not a developer, so I'm sure it's rough in some places, but hopefully it can serve as a starting point for adding support officially.
It's working on our test cluster and seems to report reasonable stats for CPU power use.
Thanks
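
For anyone who wants to check what the amd_energy hwmon files expose independently of Slurm, here is a minimal Python sketch. It assumes the sysfs path from the chmod command quoted earlier in this thread, and that the files follow the generic hwmon convention of `energyN_input` (microjoules) with a matching `energyN_label`. The label prefixes `Ecore` and `Esocket` used in the example are an assumption about how the driver names its counters; check the labels on your own nodes. The function names are made up for illustration and are not part of Slurm or the attached patch.

```python
import glob
import os

# Path pattern taken from the chmod command quoted in this thread.
AMD_ENERGY_GLOB = "/sys/devices/platform/amd_energy.0/hwmon/hwmon*"


def read_energy_uj(hwmon_dir):
    """Return {label: microjoules} for every energy*_input file in hwmon_dir."""
    readings = {}
    pattern = os.path.join(hwmon_dir, "energy*_input")
    for input_path in sorted(glob.glob(pattern)):
        label_path = input_path[: -len("_input")] + "_label"
        try:
            with open(label_path) as f:
                label = f.read().strip()
        except OSError:
            # No label file: fall back to the file name itself.
            label = os.path.basename(input_path)
        try:
            with open(input_path) as f:
                readings[label] = int(f.read().strip())
        except (OSError, ValueError):
            # Unreadable (e.g. permissions after CVE-2020-12912) or malformed.
            continue
    return readings


def total_joules(readings, prefix=""):
    """Sum the counters whose label starts with prefix, converted to joules."""
    return sum(v for k, v in readings.items() if k.startswith(prefix)) / 1e6


if __name__ == "__main__":
    for hwmon_dir in glob.glob(AMD_ENERGY_GLOB):
        readings = read_energy_uj(hwmon_dir)
        # Assumed label prefix; summing cores and sockets together would
        # count the same energy twice.
        print(hwmon_dir, total_joules(readings, "Esocket"), "J (sockets)")
```

If this prints nothing as a nonprivileged user, the files are probably still root-only, which would match the n/a values reported above.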
Patch applies fine to 23.11.1 too and it appears to function correctly. However, I'm not 100% sure about the values it collects. Energy numbers on zen2 seem to be much lower than on zen3 and zen4. Not sure what to make of that ... Do we have a reference job that uses a known amount of energy? Maybe some simple HPL run?
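
One way to sanity-check the numbers without a calibrated reference job might be to sample an energy counter before and after a fixed-length run and convert the difference to average watts, then compare that against what the socket can plausibly draw. The sketch below is only that arithmetic. The function name and the `wrap_uj` parameter are hypothetical, and the counter's wraparound value is deliberately left as an input because it is not stated anywhere in this thread.

```python
def average_watts(start_uj, end_uj, elapsed_s, wrap_uj=None):
    """Average power in watts between two energy samples given in microjoules.

    wrap_uj is the value at which the counter wraps back to zero, if known.
    It is an assumption supplied by the caller, not something read from
    the hardware here.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        if wrap_uj is None:
            raise ValueError("counter went backwards and wrap_uj is unknown")
        # Assume the counter wrapped exactly once during the interval.
        delta_uj += wrap_uj
    return delta_uj / 1e6 / elapsed_s


# Example: 60 J consumed over 60 s is an average of 1 W.
print(average_watts(1_000_000, 61_000_000, 60.0))
```

If the average for a CPU-bound job such as HPL comes out far below what the socket should draw under load on zen2 but not on zen3 or zen4, that would point to a unit or scaling difference between generations rather than a real difference in consumption.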