Summary: | Cgroup configuration with ConstrainCores,ConstrainRAMSpace and hwloc 1.11.2 | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | Configuration | Assignee: | Tim Wickberg <tim> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | alex |
Version: | 16.05.10 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | ||
Description
Ole.H.Nielsen@fysik.dtu.dk
2017-05-31 10:32:51 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #0)
> In slurm.conf we configure:
> TaskPlugin=task/cgroup
>
> and in cgroup.conf:
> CgroupAutomount=yes
> CgroupReleaseAgentDir="/etc/slurm/cgroup"
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
>
> I've encountered some warnings in the on-line manual pages:
>
> The page https://slurm.schedmd.com/slurm.conf.html says regarding
> task/cgroup:
> ... NOTE: When ContrainRAMSpace is set in the cgroup.conf this plugin
> noticibly slows down performance. It should probably be avoided in an HTC
> environment.
>
> (please spell out the acronym HTC)

That note was recently removed (I believe if you hit refresh on that page you won't see it any more). We do recommend using ConstrainRAMSpace in any environment, and the performance impacts noted were with much older Linux kernels.

> Also, https://slurm.schedmd.com/cgroup.conf.html says:
> ConstrainCores=<yes|no>
> ... Due to a bug fixed in version 1.11.5 of HWLOC, the task/affinity plugin
> may be required in addition to task/cgroup for this to function properly.
>
> Apparently I'm violating both of these warnings (we have yet to identify any
> bad effects though). Our CentOS 7.3 provides hwloc-1.11.2-1.el7.x86_64.
>
> Question: Can you please recommend the correct and optimal configuration of
> TaskPlugin and the cgroup.conf parameters given the hwloc version in CentOS?

We continue to recommend using both the affinity and cgroup plugins, even after installing a newer version of hwloc. My current recommendation would be:

    TaskPlugin=affinity,cgroup

and

    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes

The CgroupReleaseAgentDir is not needed after 16.05.5 was released (the cleanup is handled internally now, and that option will be ignored in a future release).

- Tim

I'm out of the office until June 2.

Best regards,
Ole Holm Nielsen

Thanks a lot for the info! I will reconfigure ASAP.
The slurm.conf web page still contains the outdated text after refreshing the page...

Please close this case.

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #3)
> Thanks a lot for the info! I will reconfigure ASAP.
>
> The slurm.conf web page still contains the outdated text after refreshing
> the page...
>
> Please close this case.

Should be sorted now, sorry about that.

One last note - I should have mentioned that you may still want the ReleaseAgent setting in place until you get a chance to upgrade to 17.02.3 or later. There were some unfortunate complications from the earlier cleanup code, and some edge cases won't be handled properly until that point release.

Marking resolved.

One more clarification, as I'd slightly miscategorized some of this.

There is always a performance impact from enabling ConstrainRAMSpace - in testing, it can result in per-node throughput dropping from ~150 jobs/second to 15 jobs/second launched. In most environments this should not be significant, and in my opinion the potential impact from unconstrained memory use by the job leading to OOM / interference from other jobs sharing the node is a much more pressing concern.

I've revised some of the cgroup.conf man page to better hint at this, rather than just referring to the loosely-defined "HTC systems" as potentially needing to avoid that option. (Commit 2e833147838.)

- Tim
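
For reference, the recommendation in this thread could be written out as the two fragments below. This is only a sketch assembled from the comments above, not an official template. Two things in it are assumptions: the `task/` prefix on the plugin names follows the form used in the reporter's original `TaskPlugin=task/cgroup` line (the reply above writes it as `affinity,cgroup`), and `CgroupAutomount=yes` and the `CgroupReleaseAgentDir` path are carried over unchanged from the reporter's existing cgroup.conf.

```
# slurm.conf (fragment)
# Assumption: full plugin names, matching the reporter's original
# "TaskPlugin=task/cgroup" style; the reply writes "affinity,cgroup".
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf (fragment)
# Carried over from the reporter's existing configuration.
CgroupAutomount=yes

# Recommended in the reply above.
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

# Per the follow-up note: keep this until upgrading to 17.02.3 or later,
# after which the cleanup is handled internally.
CgroupReleaseAgentDir="/etc/slurm/cgroup"
```

Changes to `TaskPlugin` generally need the Slurm daemons restarted to take effect, so a change like this is best rolled out during a maintenance window.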