In slurm.conf we configure:
TaskPlugin=task/cgroup

and in cgroup.conf:
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

I've encountered some warnings in the on-line manual pages:

The page https://slurm.schedmd.com/slurm.conf.html says regarding task/cgroup:
... NOTE: When ContrainRAMSpace is set in the cgroup.conf this plugin noticibly slows down performance. It should probably be avoided in an HTC environment.

(please spell out the acronym HTC)

Also, https://slurm.schedmd.com/cgroup.conf.html says:
ConstrainCores=<yes|no>
... Due to a bug fixed in version 1.11.5 of HWLOC, the task/affinity plugin may be required in addition to task/cgroup for this to function properly.

Apparently I'm violating both of these warnings (we have yet to identify any bad effects though). Our CentOS 7.3 provides hwloc-1.11.2-1.el7.x86_64.

Question: Can you please recommend the correct and optimal configuration of TaskPlugin and the cgroup.conf parameters given the hwloc version in CentOS?
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #0)
> In slurm.conf we configure:
> TaskPlugin=task/cgroup
>
> and in cgroup.conf:
> CgroupAutomount=yes
> CgroupReleaseAgentDir="/etc/slurm/cgroup"
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
>
> I've encountered some warnings in the on-line manual pages:
>
> The page https://slurm.schedmd.com/slurm.conf.html says regarding
> task/cgroup:
> ... NOTE: When ContrainRAMSpace is set in the cgroup.conf this plugin
> noticibly slows down performance. It should probably be avoided in an HTC
> environment.
>
> (please spell out the acronym HTC)

That note was recently removed (I believe if you hit refresh on that page you won't see it any more). We do recommend using ConstrainRAMSpace in all environments; the performance impacts noted were seen with much older Linux kernels.

> Also, https://slurm.schedmd.com/cgroup.conf.html says:
> ConstrainCores=<yes|no>
> ... Due to a bug fixed in version 1.11.5 of HWLOC, the task/affinity plugin
> may be required in addition to task/cgroup for this to function properly.
>
> Apparently I'm violating both of these warnings (we have yet to identify any
> bad effects though). Our CentOS 7.3 provides hwloc-1.11.2-1.el7.x86_64.
>
> Question: Can you please recommend the correct and optimal configuration of
> TaskPlugin and the cgroup.conf parameters given the hwloc version in CentOS?

We continue to recommend using both the affinity and cgroup plugins, even after installing a newer version of hwloc.

My current recommendation would be:

TaskPlugin=affinity,cgroup

and

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

The CgroupReleaseAgentDir is not needed after 16.05.5 was released (the cleanup is handled internally now, and that option will be ignored in a future release).

- Tim
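For anyone who wants to compare an existing setup against the recommendation above, here is a minimal Python sketch. The helper names `parse_conf` and `check_recommendation` are hypothetical (they are not part of Slurm), and the parser assumes one `Key=Value` pair per line with `#` comments, which is simpler than what Slurm itself accepts.

```python
def parse_conf(text):
    """Parse 'Key=Value' lines into a dict with lower-cased keys.

    Assumption: one setting per line; '#' starts a comment.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def check_recommendation(slurm_conf, cgroup_conf):
    """Return a list of deviations from the settings recommended in this thread."""
    problems = []
    slurm = parse_conf(slurm_conf)
    cgroup = parse_conf(cgroup_conf)

    # Treat "affinity,cgroup" and "task/affinity,task/cgroup" the same way.
    plugins = {p.strip().split("/")[-1]
               for p in slurm.get("taskplugin", "").split(",")}
    for name in ("affinity", "cgroup"):
        if name not in plugins:
            problems.append(f"TaskPlugin is missing {name}")

    for key in ("ConstrainCores", "ConstrainRAMSpace", "ConstrainSwapSpace"):
        if cgroup.get(key.lower(), "no").lower() != "yes":
            problems.append(f"{key} is not set to yes")
    return problems


# The configuration from comment #0.
original_slurm = "TaskPlugin=task/cgroup"
original_cgroup = """CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
"""

print(check_recommendation(original_slurm, original_cgroup))
print(check_recommendation("TaskPlugin=affinity,cgroup", original_cgroup))
```

With the configuration from comment #0, the only deviation reported is the missing affinity plugin; the cgroup.conf constraints already match.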
I'm out of the office until June 2.

Best regards,
Ole Holm Nielsen
Thanks a lot for the info! I will reconfigure ASAP. The slurm.conf web page still contains the outdated text after refreshing the page... Please close this case.
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #3)
> Thanks a lot for the info! I will reconfigure ASAP.
>
> The slurm.conf web page still contains the outdated text after refreshing
> the page...
>
> Please close this case.

Should be sorted now, sorry about that.

One last note - I should have mentioned that you may still want the ReleaseAgent setting in place until you get a chance to upgrade to 17.02.3 or later. There were some unfortunate complications from the earlier cleanup code, and some edge cases won't be handled properly until that point release.

Marking resolved.
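The version rule in this note can be written down as a small Python sketch. The function names are hypothetical, and the sketch assumes a plain dotted numeric version string such as "17.02.3" (no suffixes like "-2" or "rc1").

```python
def parse_version(version):
    """Turn a dotted version string such as '17.02.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def keep_release_agent_dir(slurm_version):
    """True if CgroupReleaseAgentDir should stay in cgroup.conf.

    Per the note above: keep the setting until running 17.02.3 or later.
    """
    return parse_version(slurm_version) < (17, 2, 3)


for v in ("16.05.5", "17.02.2", "17.02.3"):
    print(v, keep_release_agent_dir(v))
```

Tuple comparison is used instead of string comparison so that components are compared numerically.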
One more clarification, as I'd slightly miscategorized some of this.

There is always a performance impact from enabling ConstrainRAMSpace - in testing, it can reduce per-node launch throughput from ~150 jobs/second to ~15 jobs/second. In most environments this should not be significant, and in my opinion the potential impact of unconstrained memory use by a job, leading to OOM conditions or interference with other jobs sharing the node, is a much more pressing concern.

I've revised some of the cgroup.conf man page to better hint at this, rather than just referring to the loosely-defined "HTC systems" as potentially needing to avoid that option. (Commit 2e833147838.)

- Tim
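To put those throughput figures in perspective, here is a rough back-of-the-envelope sketch in Python. The two rates are the approximate per-node numbers quoted above; the batch size of 3000 jobs is an arbitrary illustration, not a measurement, and `launch_seconds` is a hypothetical helper.

```python
# Approximate per-node launch rates quoted in the comment above.
UNCONSTRAINED_RATE = 150.0  # jobs/second without ConstrainRAMSpace
CONSTRAINED_RATE = 15.0     # jobs/second with ConstrainRAMSpace=yes


def launch_seconds(n_jobs, jobs_per_second):
    """Time needed to launch n_jobs on one node at the given rate."""
    return n_jobs / jobs_per_second


n_jobs = 3000  # illustrative batch size (assumption)
slowdown = UNCONSTRAINED_RATE / CONSTRAINED_RATE

print(f"slowdown factor: {slowdown:.0f}x")
print(f"unconstrained: {launch_seconds(n_jobs, UNCONSTRAINED_RATE):.0f} s")
print(f"constrained:   {launch_seconds(n_jobs, CONSTRAINED_RATE):.0f} s")
```

The difference only matters when a node has to start many very short jobs per second; for jobs that run for minutes or longer, the launch cost is negligible next to the run time.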