As part of the thread in a previous bug report, https://bugs.schedmd.com/show_bug.cgi?id=2881, there was a note that I wanted to call out and ask more about now that we have some different use cases. The note I wanted to call out was in comment #4 from Tim. What he shared was this:

    NOTE: A memory size specification of zero is treated as a special
    case and grants the job access to all of the memory on each node.
    If the job is allocated multiple nodes in a heterogeneous cluster,
    the memory limit on each node will be that of the node in the
    allocation with the smallest memory size (the same limit will
    apply to every node in the job's allocation).

The key point here is that heterogeneous node allocations result in the memory limit on each node being set to the "smallest memory size", instead of being evaluated and set per node according to that node's capacity. For a traditional MPI-style application this didn't prove to be an issue, especially since one could also use --mem-per-cpu under the usual condition that all MPI processes in a job have the same memory footprint.

The new situation is that we have some genomics users who run both on our systems and on various cloud resources. Their design is instead to fire up MPI processes per node that evaluate that node and adjust the runtime parameters of their tasks based on the available CPUs and memory. So it is safe to say that if they got a mixed set of nodes, where some had 12 cores and 24 GB of RAM and others had 20 cores and 64 GB of RAM, they would want to maximize the usage of those resources.

The issue is back to the note above. When a job is allocated a mixed set of nodes, the cgroup memory limit ends up being set on all nodes (using my example) to 24 GB, and we lose 40 GB of the RAM on the 20-core nodes.

So my question is: has anything changed in Slurm that provides a better way to address this than the feedback Tim was able to offer last time? If not, would asking for this as a feature extension be reasonable?
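For concreteness, a minimal sketch of how the behavior can be observed; the partition name is a made-up example, and the cgroup path assumes cgroup v1 with TaskPlugin=task/cgroup and ConstrainRAMSpace=yes, so it will differ under other configurations:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --exclusive
    #SBATCH --mem=0            # special case: ask for all memory on each node
    #SBATCH --partition=mixed  # hypothetical partition spanning both node types

    # One task per node: print the memory cgroup limit each node was actually given.
    srun --ntasks-per-node=1 bash -c \
      'echo "$(hostname): $(cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.limit_in_bytes) bytes"'

With the behavior described in the note, both node classes report roughly the 24 GB limit of the smaller node, even though the larger nodes have 64 GB installed.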
(In reply to Brian Haymore from comment #0)
> So my question is: has anything changed in Slurm that provides a better
> way to address this than the feedback Tim was able to offer last time?
> If not, would asking for this as a feature extension be reasonable?

No, nothing's changed that covers your specific use case here.

The Heterogeneous Job support in 17.11 would make it possible to submit jobs with separate components landing on different classes of hardware, but my understanding here is that you want a single job to land on a span of node types, and have a method to automatically provide additional memory on the larger-memory nodes to those specific processes.

There's no way to do that at the moment. I can certainly re-tag this bug as an enhancement request if you'd like, but can't promise if/when we'll get to work on it.

- Tim
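For reference, a rough sketch of the 17.11 heterogeneous ("pack") job syntax Tim refers to; the partition names, memory sizes, and application script are hypothetical, and later Slurm releases renamed "pack" to "het":

    #!/bin/bash
    # Two components, each sized to its own node class.
    #SBATCH --nodes=1 --partition=small --mem=24G
    #SBATCH packjob
    #SBATCH --nodes=1 --partition=big --mem=64G

    # Launch a step on each component; --pack-group selects the component
    # (renamed --het-group in later releases).
    srun --pack-group=0 ./analyze_chunk.sh &   # hypothetical application script
    srun --pack-group=1 ./analyze_chunk.sh &
    wait

As Tim notes, this only helps if the work can be partitioned into per-node-class components up front; it does not let a single homogeneous request adapt its memory limit per node.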
Yes, please do tag this as a feature request, and I understand the when/if caveats. You have the gist of it. This genomics group has their software quite well set up for heterogeneous resources, so when the memory cgroup on every node of a job is set to match the smallest node's memory size, we are clearly locking away resources that, at least in this case, would otherwise have been used rather than blocked/wasted.

--
Brian D. Haymore
University of Utah Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, UT 84112
Phone: 801-558-1150, Fax: 801-585-5366
http://bit.ly/1HO1N2C
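To make the workload pattern above concrete, a minimal sketch of the kind of per-node self-sizing described, run as one task per node inside the allocation; the cgroup path, the 10% headroom factor, and the application command are illustrative assumptions:

    #!/bin/bash
    # Each task inspects the node it landed on and sizes itself accordingly.
    CPUS=$(nproc)

    # Prefer the limit the cgroup will actually enforce (cgroup v1 path shown);
    # fall back to total node memory if the file is not readable.
    LIMIT=/sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.limit_in_bytes
    if [ -r "$LIMIT" ]; then
        MEM_BYTES=$(cat "$LIMIT")
    else
        MEM_BYTES=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 ))
    fi

    # Convert to MB and leave ~10% headroom for the application.
    MEM_MB=$(( MEM_BYTES / 1024 / 1024 * 9 / 10 ))

    echo "$(hostname): using $CPUS threads and ${MEM_MB} MB"
    # exec ./genomics_tool --threads "$CPUS" --max-mem-mb "$MEM_MB"   # hypothetical tool

With today's behavior, MEM_BYTES on the 64 GB nodes still reads back as the ~24 GB allocation-wide minimum, which is exactly the capacity being locked away.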
Retagging as an enhancement request (albeit with no commitment that we'll get to this). To summarize: when allocating entire nodes to a job under CR_*_Memory, Slurm should allocate all available memory on each node to the job, rather than setting the limit to the lowest of any of the allocated nodes.
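The moving pieces involved here can be checked with standard commands; node names and the cgroup.conf path are hypothetical examples, so adjust for the local install:

    # What Slurm believes each node class has:
    scontrol show node node012 | grep -E 'CPUTot|RealMemory'
    scontrol show node node120 | grep -E 'CPUTot|RealMemory'

    # Confirm memory is a consumable resource and that cgroups enforce it:
    scontrol show config | grep -E 'SelectTypeParameters|TaskPlugin'
    grep -i ConstrainRAMSpace /etc/slurm/cgroup.conf

    # What a running job was actually granted (the per-node memory limit applied):
    scontrol show job <jobid> | grep -E 'MinMemory|TRES'

Under the requested behavior, the limit applied on each node would track that node's RealMemory rather than the allocation-wide minimum.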