Ticket 2526 - Better support for intel_pstate driver
Summary: Better support for intel_pstate driver
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 16.05.x
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Moe Jette
 
Reported: 2016-03-07 20:55 MST by Janne Blomqvist
Modified: 2016-04-20 02:11 MDT
Version Fixed: 16.05.0-pre3

Attachments
Support the intel_pstate scaling driver (8.51 KB, patch)
2016-04-19 19:03 MDT, Janne Blomqvist

Description Janne Blomqvist 2016-03-07 20:55:15 MST
Currently the Slurm --cpu-freq= option depends on the ACPI cpufreq driver (acpi-cpufreq). However, on current Linux distros and x86 hardware, the out-of-the-box cpufreq driver is intel_pstate, which offers a number of benefits over the old ACPI cpufreq driver. As a result, CPU frequency policy cannot be set through Slurm on such systems, and slurmd spams the logs with messages that it can't find the scaling_cur_freq file in /sys.
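
To illustrate the distinction drawn above, here is a minimal Python sketch of how one might check which scaling driver a node is using. It reads the standard cpufreq sysfs attribute scaling_driver. The function name and the sysfs_root parameter are hypothetical helpers introduced for this example and are not part of Slurm.

```python
from pathlib import Path
from typing import Optional

SYSFS_CPU = "/sys/devices/system/cpu"


def scaling_driver(cpu: int = 0, sysfs_root: str = SYSFS_CPU) -> Optional[str]:
    """Return the cpufreq scaling driver name for one CPU.

    Returns None when the cpufreq directory or the attribute is missing,
    for example when no cpufreq driver is loaded.
    """
    path = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq" / "scaling_driver"
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

On a node using the driver discussed in this ticket, scaling_driver() would be expected to return "intel_pstate"; the sysfs_root parameter exists only so the helper can be exercised against a fake directory tree.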

While intel_pstate does not support setting an exact frequency like acpi-cpufreq, a subset of the functionality could, I think, be usefully supported. For instance, a useful out-of-the-box behavior could be something like:

- When a job starts, set the governor to "performance" on the CPUs allocated to the job.
- When the job ends, set the governor to "powersave".
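
The two steps above can be sketched in Python as follows. This is only an illustration of the proposed behavior, not how slurmd implements it. It assumes the standard per-CPU sysfs attributes scaling_governor and scaling_available_governors, and that the process has permission to write them. The names set_governor, on_job_start and on_job_end are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List

SYSFS_CPU = "/sys/devices/system/cpu"


def set_governor(cpus: Iterable[int], governor: str,
                 sysfs_root: str = SYSFS_CPU) -> List[int]:
    """Write `governor` to scaling_governor for each CPU in `cpus`.

    CPUs whose driver does not offer the governor, or whose cpufreq
    files cannot be accessed, are skipped. Returns the CPUs changed.
    """
    changed = []
    for cpu in cpus:
        cpufreq = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
        try:
            available = (cpufreq / "scaling_available_governors").read_text().split()
            if governor not in available:
                continue
            (cpufreq / "scaling_governor").write_text(governor)
        except OSError:
            continue
        changed.append(cpu)
    return changed


def on_job_start(cpus: Iterable[int], sysfs_root: str = SYSFS_CPU) -> List[int]:
    # Proposed default: run the job's CPUs with the "performance" governor.
    return set_governor(cpus, "performance", sysfs_root)


def on_job_end(cpus: Iterable[int], sysfs_root: str = SYSFS_CPU) -> List[int]:
    # Proposed default: drop back to "powersave" once the job is done.
    return set_governor(cpus, "powersave", sysfs_root)
```

Checking scaling_available_governors first means a request for a governor the driver does not provide is skipped instead of failing on the write.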

Jobs could then override these defaults with the --cpu-freq= option, although intel_pstate supports only "powersave" and "performance", so this would perhaps be of limited use.

Similarly, while intel_pstate does not support setting the frequency, one can set the maximum and minimum p-states in /sys/devices/system/cpu/intel_pstate/. However, setting the governor to "performance" sets /sys/devices/system/cpu/intel_pstate/min_perf_pct to 100, so again, allowing this to be set is perhaps not that useful.
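
As a rough sketch of inspecting those limits, the following Python helper reads min_perf_pct and max_perf_pct from the intel_pstate sysfs directory. The function name read_pstate_limits and its pstate_dir parameter are hypothetical and exist only for this example.

```python
from pathlib import Path
from typing import Dict, Optional

PSTATE_DIR = "/sys/devices/system/cpu/intel_pstate"


def read_pstate_limits(pstate_dir: str = PSTATE_DIR) -> Dict[str, Optional[int]]:
    """Read the intel_pstate performance limits, as percentages.

    A value of None means the attribute was missing or unreadable,
    e.g. because intel_pstate is not the active driver.
    """
    limits: Dict[str, Optional[int]] = {}
    for name in ("min_perf_pct", "max_perf_pct"):
        try:
            limits[name] = int((Path(pstate_dir) / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits
```

Reading these values before and after switching governors is one way to check the interaction described above on a given kernel.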

For more information about intel_pstate, see:

https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt

https://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
Comment 1 Tim Wickberg 2016-04-07 08:46:49 MDT
Updating this would certainly be useful, although we don't have a plan to handle it just yet.

Marking as a potential feature enhancement request, and updating assignee to match.
Comment 2 Janne Blomqvist 2016-04-19 19:03:53 MDT
Created attachment 3013
Support the intel_pstate scaling driver

Hi,

here's a patch which implements support for intel_pstate. 

I noticed that the CpuFreqDef config option was only partially implemented. The value was parsed, but then never used. So I took the liberty of re-purposing it to mean sort of the opposite, namely the frequency governor to use when running a job step in case the job doesn't explicitly provide any --cpu-freq option.

I also changed the default of the CpuFreqGovernors option to be "ondemand,performance", since ondemand isn't available with the intel_pstate driver and the default list should include a governor that intel_pstate supports.

Otherwise the patch should be relatively straightforward and only changes a few minor things here and there.
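
As a rough illustration of the configuration described in this comment, a slurm.conf fragment might look like the following. The CpuFreqGovernors value is the new default named above; choosing Performance for CpuFreqDef is only an assumption about what a site might want, not something the patch prescribes.

```
# slurm.conf (fragment, illustrative)
# Governor applied to job steps that do not pass --cpu-freq (assumed site choice)
CpuFreqDef=Performance
# Governors that jobs may request via --cpu-freq
CpuFreqGovernors=OnDemand,Performance
```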
Comment 3 Moe Jette 2016-04-20 02:11:32 MDT
Thank you for your contribution.
Your patch is committed here:
https://github.com/SchedMD/slurm/commit/a4f35c45eddf54d9305e5a16352dabdab3ad97b3