Ticket 2881

Summary: Question on effective ways to support node sharing, specifically for memory requests
Product: Slurm Reporter: Brian Haymore <brian.haymore>
Component: slurmctld    Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---    CC: bart
Version: 15.08.10   
Hardware: Linux   
OS: Linux   
Site: University of Utah

Description Brian Haymore 2016-07-07 13:47:51 MDT
First, I see that bug 2552 is a similar request to this. Also, I posted about this on the dev list but have not had a reply; see here: https://groups.google.com/forum/#!topic/slurm-devel/uoZPdxd4WkI

What I wanted to open this ticket for was to follow up on what I sent to the list and what I see in ticket 2552: we are starting to look at node sharing. A side effect of this is that we will enable cgroups to enforce memory limits and also set cons_res to allocate by CORE and MEMORY. Since the cgroup memory limit is (or appears to be) a global setting, I'm seeing a couple of groups impacted by this. Let me explain.

I have a group that runs VisIt to make movies from their data sets. They have a tool, now part of VisIt, that lets them break a whole movie into frames; they then let each node chew on a set of frames. This is an application that wants all the memory it can get, and that is where our issue starts. We have mixed nodes on this cluster: three different core-count and memory-capacity groups. Currently, without the cgroup enforcement and without memory allocation, they just submit to a number of nodes or cores and use everything they can get. So if I turn on cgroups for memory limiting, what I know they would really like is a way to ask for "all" memory on an allocated node. Any thoughts on this? I'm happy to take any feedback, including "this isn't possible," "it's a bad idea," or "step back and relook at things in this new way." We are just new to this and trying to carry forward existing, and convenient, usage. :)  Thanks!
Comment 1 Tim Wickberg 2016-07-07 16:24:07 MDT
(In reply to Brian Haymore from comment #0)
> [...]

Two options that may fit your use case:

- Have the job use --exclusive to claim the whole node, thus getting all memory on that node.

- Set --mem-per-cpu on the job to an appropriate value. My assumption is that there should be a more-or-less-constant memory demand per rendering thread, and this would ensure CPUs and memory are allocated in proportion even on heterogeneous hardware. This would also allow multiple jobs to share the nodes, rather than blocking them off with --exclusive.

(The DefMemPerCPU / MaxMemPerCPU settings on the partition may also help set defaults for some of this.)
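As a sketch, the two approaches as submit lines (the script name and the 4G figure are placeholders, not taken from this site's setup):

```shell
# Option 1: whole-node allocation; the job gets every core and all
# memory on whatever node it lands on.
sbatch --exclusive movie_render.sh

# Option 2: proportional request; multiple jobs can then share a node.
# 4G per CPU is illustrative; size it to the per-thread rendering demand.
sbatch --ntasks=8 --mem-per-cpu=4G movie_render.sh
```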
Comment 3 Brian Haymore 2016-07-11 13:02:22 MDT
OK, so trying this out, what I seem to be finding is that --exclusive is not grabbing all the RAM in the node, but instead is grabbing #Cores * DefMemPerCPU. Am I missing something here? I do not have a MaxMemPerCPU set, since there are different memory sizes in the same partition.

You're right that in the common case memory per process should be uniform. However, in this case there is added value in having more RAM, since the system will cache more aggressively. We also have applications that look at the RAM available and choose different ways to solve the problem based on how much RAM they can get. So these are somewhat corner cases, though not so much that I don't want to ask you all about them.

To distill things down a bit: what we are after is a way to say "give me a whole node, any node (maybe with some minimum requirements), and have cgroups max that out for me." Then the job script or the application can take it from there and make good use of what is allocated. As best as I can tell, this doesn't exist right now, so what I'm looking for is how close we can get to it, given that I have mixed node types in a common partition (not an uncommon practice for condo-style clusters). Depending on how close we can get, I may then want to work with you all to submit this as a feature extension request, if appropriate.
Comment 4 Tim Wickberg 2016-07-11 14:25:17 MDT
--mem=0 may get you a bit closer, but still leaves some shortcomings if you have a mixed set of nodes:

              NOTE: A memory size specification of zero is treated as a
              special case and grants the job access to all of the memory
              on each node. If the job is allocated multiple nodes in a
              heterogeneous cluster, the memory limit on each node will be
              that of the node in the allocation with the smallest memory
              size (the same limit will apply to every node in the job's
              allocation).


Does a single job commonly split across heterogeneous nodes? I don't think there's currently a way to ask for "all the memory in whichever nodes I get, regardless of varying capacities" which it sounds like is what you're after. (And would that only apply to whole-node allocations, or how would this work with partial nodes? Assign all that's free on the node?)
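For a single whole-node job, a sketch combining the two flags (the script name is hypothetical):

```shell
# One whole node and all of its memory, whatever size node we land on.
sbatch --exclusive --mem=0 --nodes=1 movie_render.sh
```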

We can open an enhancement request to look into that (or re-purpose this bug) if desired.
Comment 5 Brian Haymore 2016-07-13 12:16:33 MDT
So after some testing, I think this covers what we're after. --exclusive gives us an easy way to get a whole node's worth of CPUs, and --mem=0 gives us the same for memory. I was also able to test our non-shared partitions: setting the default memory to zero there also works, so users not using shared partitions will still get a whole node's worth of CPUs and RAM even with CR_Core_Memory and cgroup enforcement enabled. This is great. Thanks for the help; I think we can close this.
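For reference, a sketch of the configuration pieces discussed in this thread; the partition line, node names, and Shared setting are illustrative assumptions, not taken from the actual site config:

```
# slurm.conf (fragment)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory   # allocate cores and memory
TaskPlugin=task/cgroup                # enforce limits with cgroups
# Non-shared partition: default memory of 0 grants all memory on the node
PartitionName=owner-nodes Nodes=node[001-032] DefMemPerNode=0 Shared=EXCLUSIVE

# cgroup.conf (fragment)
ConstrainRAMSpace=yes
```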