Ticket 4556 - Allow for differing memory allocation on heterogeneous nodes & cgroups
Summary: Allow for differing memory allocation on heterogeneous nodes & cgroups
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 17.02.7
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
 
Reported: 2017-12-22 22:31 MST by Brian Haymore
Modified: 2019-01-04 13:43 MST

Site: University of Utah



Description Brian Haymore 2017-12-22 22:31:36 MST
As part of the thread in a previous bug report (https://bugs.schedmd.com/show_bug.cgi?id=2881), there was a note I wanted to call out and ask more about now that we have some different use cases. The note was in comment #4 from Tim, who shared this:

              NOTE: A memory size specification of zero is treated as a special
              case and grants the job access to all of the memory on each node.
              If the job is allocated multiple nodes in a heterogeneous cluster,
              the memory limit on each node will be that of the node in the
              allocation with the smallest memory size (the same limit will
              apply to every node in the job's allocation).

The key point here is that heterogeneous node allocations result in the memory limit on each node being set to the "smallest memory size" rather than a per-node evaluation, with a per-node limit matching that node's capacity. For a traditional MPI-style application this didn't prove to be an issue, especially since one could also use --mem-per-cpu under the usual assumption that all MPI processes in a job have the same memory footprint.

OK, so the new situation is that we have some genomics users who run both on our systems and on various cloud resources. Their design is instead to fire up MPI processes per node that evaluate that node and adjust the runtime parameters of its tasks based on the available CPUs and memory. So it is safe to say that if they were allocated a mixed set of nodes, say some with 12 cores and 24 GB of RAM and others with 20 cores and 64 GB of RAM, they would want to maximize the usage of those resources.
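
For illustration only (not code from this ticket), here is a minimal Python sketch of the kind of per-node probe such a task might run; it assumes Linux with the cgroup v1 memory controller and Slurm's task/cgroup plugin confining the step:

    import os

    def cgroup_memory_limit_bytes():
        """Return this process's cgroup v1 memory limit in bytes, or None."""
        try:
            with open("/proc/self/cgroup") as f:
                for line in f:
                    _, controllers, path = line.strip().split(":", 2)
                    if "memory" in controllers.split(","):
                        limit_file = os.path.join("/sys/fs/cgroup/memory",
                                                  path.lstrip("/"),
                                                  "memory.limit_in_bytes")
                        with open(limit_file) as lf:
                            return int(lf.read().strip())
        except (OSError, ValueError):
            pass
        return None

    usable_cpus = len(os.sched_getaffinity(0))   # honors any cpuset confinement
    mem_limit = cgroup_memory_limit_bytes()

    if mem_limit is not None:
        # Size per-node runtime parameters from what this node actually allows.
        per_cpu_mem_mib = (mem_limit // max(usable_cpus, 1)) >> 20
        print(f"{usable_cpus} CPUs, {mem_limit >> 30} GiB limit, "
              f"~{per_cpu_mem_mib} MiB per CPU")

The memory.limit_in_bytes value read here is exactly what the min-size behavior quoted in the NOTE above clamps on every node of the allocation.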

The issue is back to the note above. When a job is allocated a mixed set of nodes, the cgroup memory limit ends up being set on all nodes (using my example) to 24 GB, and we lose 40 GB of the RAM on the 20-core nodes.
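
To make the arithmetic concrete, a small illustrative sketch using the numbers from the example above (not a Slurm API):

    # Current --mem=0 behavior on a heterogeneous allocation: every node's
    # cgroup limit is clamped to the smallest node's memory.
    node_mem_gb = {"12-core node": 24, "20-core node": 64}

    applied_limit = min(node_mem_gb.values())            # 24 GB on every node
    unusable = {name: mem - applied_limit                # 0 GB and 40 GB
                for name, mem in node_mem_gb.items()}

    print(f"limit applied on every node: {applied_limit} GB")
    print(f"RAM left unusable: {unusable}")
    # Desired: each node's limit equals its own capacity, so nothing is wasted.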

So my question is: has anything changed in Slurm that provides a better way to address this than the feedback Tim was able to offer last time? If not, would asking for this as a feature enhancement be reasonable?
Comment 1 Tim Wickberg 2017-12-28 14:24:35 MST
(In reply to Brian Haymore from comment #0)

No, nothing's changed that covers your specific use case here.

The Heterogeneous Job support in 17.11 would make it possible to submit jobs with separate components landing on different classes of hardware, but my understanding here is that you want a single job to land on a span of node types, and have a method to automatically provide additional memory on the larger memory nodes to those specific processes.

There's no way to do that at the moment; I can certainly re-tag this bug as an enhancement request if you'd like, but can't promise if/when we'll get to work on it.

- Tim
Comment 2 Brian Haymore 2017-12-29 02:31:53 MST
Yes, please do tag this as a feature request; I understand the if/when caveats. You have the gist of it. This genomics group has their software set up quite well for heterogeneous resources, so when the memory cgroup on every member of a job is set to match the smallest node's memory size, we are clearly locking off resources that, at least in this case, would have been used rather than blocked and wasted.

--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112
Phone: 801-558-1150, Fax: 801-585-5366
http://bit.ly/1HO1N2C
Comment 3 Tim Wickberg 2018-01-18 18:36:06 MST
Retagging as an enhancement request (albeit with no commitment that we'll get to this).

To summarize: when allocating entire nodes to a job under CR_*_Memory, Slurm should allocate all available memory on each node to the job, rather than setting the limit to the lowest of any of the allocated nodes.
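
Expressed as an illustrative sketch of the requested semantics (not Slurm's internal API; the function and parameter names are hypothetical):

    def per_node_limits(node_mem_mb, per_node_capacity=True):
        """node_mem_mb: {hostname: usable memory in MB} for whole-node
        allocations made with --mem=0 under CR_*_Memory."""
        if per_node_capacity:
            # Requested behavior: each node's memory limit is its own capacity.
            return dict(node_mem_mb)
        # Current behavior: all nodes are clamped to the smallest node's memory.
        floor = min(node_mem_mb.values())
        return {host: floor for host in node_mem_mb}

    alloc = {"node-12core": 24 * 1024, "node-20core": 64 * 1024}
    print(per_node_limits(alloc, per_node_capacity=False))  # today: both 24576 MB
    print(per_node_limits(alloc, per_node_capacity=True))   # requested: 24576 / 65536 MB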