Ticket 8470

Summary: partition node assignment
Product: Slurm
Reporter: Anthony DelSorbo <anthony.delsorbo>
Component: Configuration
Assignee: Gavin D. Howard <gavin>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
CC: aradeva, jbooth
Version: 19.05.3
Hardware: Linux
OS: Linux
Site: NOAA
NOAA Site: NESCC

Description Anthony DelSorbo 2020-02-07 11:30:23 MST
We're adding new nodes to our production system.  We would like these nodes to be part of several partitions but used only as a "last resort" if the other nodes already assigned to that partition have been exhausted.

I recall seeing a setting to permit this, but now I can't seem to find it again.  Would you help me identify that setting and point me to the relevant documentation?

Thanks,

Tony.
Comment 1 Gavin D. Howard 2020-02-07 15:21:38 MST
Tony,

There is an option that might work. The option is the `Weight` option for nodes. This is what the docs say about it:

> Weight
> 
> The priority of the node for scheduling purposes. All things being equal,
> jobs will be allocated the nodes with the lowest weight which satisfies their
> requirements. For example, a heterogeneous collection of nodes might be
> placed into a single partition for greater system utilization, responsiveness
> and capability. It would be preferable to allocate smaller memory nodes
> rather than larger memory nodes if either will satisfy a job's requirements.
> The units of weight are arbitrary, but larger weights should be assigned to
> nodes with more processors, memory, disk space, higher processor speed, etc.
> Note that if a job allocation request can not be satisfied using the nodes
> with the lowest weight, the set of nodes with the next lowest weight is added
> to the set of nodes under consideration for use (repeat as needed for higher
> weight values). If you absolutely want to minimize the number of higher
> weight nodes allocated to a job (at a cost of higher scheduling overhead),
> give each node a distinct Weight value and they will be added to the pool of
> nodes being considered for scheduling individually. The default value is 1.

The key part is this:

> All things being equal, jobs will be allocated the nodes with the lowest
> weight which satisfies their requirements.

This means that your last resort nodes should have their weight set *higher* than the weight of all other nodes. This will cause them to be considered last.
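
For example, in slurm.conf (a sketch only; the node names and weight values here are illustrative, not taken from your configuration):

```
# Regular nodes keep the default Weight=1 and are considered first.
NodeName=node[01-10] CPUs=40 RealMemory=95000
# Last-resort nodes get a higher weight, so they are considered only
# after the lower-weight nodes cannot satisfy the request.
NodeName=node[11-15] CPUs=40 RealMemory=384000 Weight=10
```

The absolute values do not matter, only the relative ordering: Slurm fills from the lowest weight upward.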

I hope that all makes sense. Please send any further questions you have, and I will do my best to answer them.
Comment 2 Anthony DelSorbo 2020-02-11 08:20:55 MST
Gavin,

Thanks for the reply.  I had written a reply of my own to yours last week, but it turns out I didn't click "Save Changes."  Unfortunately, I'm now in maintenance mode, so I'll forgo those original follow-up questions, as I have an urgent question I need to address today:

The nodes I'm adding have more memory than the current compute nodes.  I am adding those nodes to both the default partition (hera) and the big memory partition (bigmem).  But, today I specify: 

hera: DefMemPerCpu=2300
bigmem: DefMemPerCpu=9500

So, what's the best practice here?  Should I move this setting to the NodeName definition line such as:

NodeName=h[32-36]m[01-52] CPUs=40 CoresPerSocket=20 RealMemory=384000 Weight=2 DefMemPerCpu=9500

and remove the settings from the partitions?  Or is there a better way?

What are the ramifications of the proper solution?

I hope you don't mind, but I've increased this ticket's priority for the time being.  

Best,

Tony.
Comment 3 Gavin D. Howard 2020-02-11 09:19:12 MST
Tony,

It sounds like your increase in the severity is justified. I will get right on looking into your new question.
Comment 4 Gavin D. Howard 2020-02-11 11:01:26 MST
Tony,

Do the nodes you are adding have more memory than any other nodes? Or do they have less memory than, say, the rest of the nodes in bigmem?
Comment 5 Anthony DelSorbo 2020-02-11 11:10:24 MST
(In reply to Gavin D. Howard from comment #4)
> Tony,
> 
> Do the nodes you are adding have more memory than any other nodes? Or do
> they have less memory than, say, the rest of the nodes in bigmem?

They have the same amount as the bigmem nodes and clearly more than the nodes in the hera partition:

NodeName=h1c[01-52],h[2-5]c[01-56],h6c[01-57],h8c[01-56],h9c[01-54],h10c[01-57],h[11-12]c[01-56],h[13,14]c[01-52],h[15-24]c[01-56],h[25]c[01-52] CPUs=40 CoresPerSocket=20 RealMemory=95000 
NodeName=h1m[01-04],h13m[01-04]           CPUs=40 CoresPerSocket=20 RealMemory=384000
NodeName=h[32-36]m[01-52]                 CPUs=40 CoresPerSocket=20 RealMemory=384000 Weight=2


and here are the partition defs:

PartitionName=hera \
  Nodes=h1c[01-52],h[2-5]c[01-56],h6c[01-57],h8c[01-56],h9c[01-54],h10c[01-57],h[11-12]c[01-56],h[13,14]c[01-52],h[15-24]c[01-56],h[25]c[01-52],h[32-36]m[01-52] \  # All compute nodes parallel only
  Default=yes \
  DefMemPerCpu=2300 \
  AllowQos=windfall,batch,debug,novel,urgent,admin

PartitionName=bigmem Nodes=h1m[01-04],h13m[01-04],h[32-36]m[01-52] \
  TRESBillingWeights=cpu=1 \
  DefMemPerCpu=9500 \
  AllowQos=windfall,batch,debug,urgent,admin
Comment 6 Gavin D. Howard 2020-02-11 11:17:31 MST
Tony,

With that info, I don't think you need to do anything special, as long as you set the weight of the nodes as I suggested in Comment #1.

The reason is that DefMemPerCPU is meant to prevent oversubscription: it should be set to the lowest memory-per-CPU of any node in the partition.

(Slurm will, in fact, allow a job to use more memory if it's available, but if some of the memory is swap, that could cause paging.)

However, in this case, since the nodes you are worried about have at least as much memory as any other node in the cluster, you don't have to worry about them being oversubscribed.  Even if a job lands on them with bigmem's DefMemPerCPU, that setting is no higher than the memory those nodes actually have available per CPU.
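
As a quick sanity check, here is the arithmetic using the RealMemory and CPU values from your configs above (a sketch; the node-type labels are mine, not Slurm's):

```python
# Verify that each partition's DefMemPerCpu cannot oversubscribe any node
# in that partition, using the values quoted in Comment 5.

# RealMemory (MB) and CPU count per node type
nodes = {
    "hera_compute": {"real_memory": 95000, "cpus": 40},
    "old_bigmem":   {"real_memory": 384000, "cpus": 40},
    "new_bigmem":   {"real_memory": 384000, "cpus": 40},
}

partitions = {
    "hera":   {"def_mem_per_cpu": 2300, "members": ["hera_compute", "new_bigmem"]},
    "bigmem": {"def_mem_per_cpu": 9500, "members": ["old_bigmem", "new_bigmem"]},
}

for pname, part in partitions.items():
    for member in part["members"]:
        node = nodes[member]
        # Worst case: the default applied across every CPU on the node.
        needed = part["def_mem_per_cpu"] * node["cpus"]
        assert needed <= node["real_memory"], (pname, member)
        print(f"{pname}/{member}: {needed} MB <= {node['real_memory']} MB")
```

Both partitions pass: 2300 * 40 = 92000 <= 95000 for the hera compute nodes, and 9500 * 40 = 380000 <= 384000 for the big-memory nodes.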

Does this all make sense?
Comment 7 Anthony DelSorbo 2020-02-11 12:02:13 MST
(In reply to Gavin D. Howard from comment #6)
...
> 
> Does this all make sense?

Sure - that makes sense - thanks Gavin.  However, I'm trying to anticipate the customer's questions.   So, please bear with me as I pick your collective brains on this issue. 

If the user submits a job to the hera partition and doesn't specify a mem-per-cpu amount, they will get the default of 2300.  So, if the larger nodes get assigned to that job, the memory there will also be 2300 per CPU.  The question is: supposing the user purposely wants a mix of both node types, can they specify one memory amount on one set of nodes and a different amount on the other set?  If so, what is the syntax for making that happen?
Comment 8 Gavin D. Howard 2020-02-11 12:20:42 MST
Tony,

As far as I know, there is no way to specify different amounts of memory for different nodes. I did a bit of research, and it seems that doing so could complicate scheduling.
Comment 9 Gavin D. Howard 2020-02-11 13:21:32 MST
Tony,

I have a bit better information for you.

If a user wants to have two different node types, they can submit a heterogeneous job, as explained in this page: https://slurm.schedmd.com/heterogeneous_jobs.html .

This should allow them to do what they want, but it is complicated.
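
For example, a batch script along these lines (a sketch only, untested; note the separator keyword is version-dependent, `packjob` in 19.05 and `hetjob` in later releases, and the program names are placeholders):

```
#!/bin/bash
#SBATCH --partition=hera
#SBATCH --nodes=2 --mem-per-cpu=2300    # first component: regular nodes
#SBATCH packjob
#SBATCH --partition=bigmem
#SBATCH --nodes=1 --mem-per-cpu=9500    # second component: big-memory nodes

# Launch a step on each component of the heterogeneous allocation.
srun --pack-group=0 ./app_regular &
srun --pack-group=1 ./app_bigmem &
wait
```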
Comment 10 Anthony DelSorbo 2020-02-12 09:01:58 MST
(In reply to Gavin D. Howard from comment #9)
> Tony,
> 
> I have a bit better information for you.
> 
> If a user wants to have two different node types, they can submit a
> heterogeneous job, as explained in this page:
> https://slurm.schedmd.com/heterogeneous_jobs.html .
> 
> This should allow them to do what they want, but it is complicated.

Gavin, thanks.  Our extended experience with heterogeneous jobs was not very positive - we had to abandon it, at least until it matures.  

It just seems it would be much better to be able to say: give me this type of node with these memory requirements, and that type of node with those other memory requirements.  And while the nodes may be heterogeneous, the job shouldn't be treated so differently.
Comment 11 Gavin D. Howard 2020-02-12 13:11:51 MST
Tony,

I am sorry that I could not give you better news about that.

Is there anything else I can do for you?
Comment 12 Anthony DelSorbo 2020-02-13 13:54:08 MST
(In reply to Gavin D. Howard from comment #11)
> Tony,
> 
> I am sorry that I could not give you better news about that.
> 
> Is there anything else I can do for you?

Well, now that you asked, Gavin ... the customer asked if we can set a time limit on the nodes (other than the partition's).  That is, say I add these nodes to an existing partition.  Normally, the partition permits jobs to run up to 8 hours.  But, since these nodes have other requirements, the customer wants to limit these nodes to jobs that can fit in 4 hours.  

So, could I define the nodes as

NodeName=h[32-36]m[01-52] CPUs=40 CoresPerSocket=20 RealMemory=384000 Weight=2 features=bigmem MaxTime=0-04:00

Given that these nodes may be in multiple partitions, will Slurm be "smart enough" to permit jobs shorter than four hours to use these nodes and steer longer jobs to the other nodes in that partition?  Or will the partition MaxTime take precedence over a node MaxTime?

Thanks,

Tony.
Comment 13 Gavin D. Howard 2020-02-13 14:16:51 MST
Tony,

As far as I can tell from the docs and source, there is no `MaxTime` option for
node definitions.
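
If you need a shorter limit on just those nodes, one possible workaround (a sketch; this is my assumption about your setup, not something from the docs, and the partition name is illustrative) is an additional, overlapping partition containing only those nodes:

```
# Overlapping partition with only the new nodes and a shorter limit.
PartitionName=bigmem_short Nodes=h[32-36]m[01-52] MaxTime=04:00:00 \
  DefMemPerCpu=9500 \
  AllowQos=windfall,batch,debug,urgent,admin
```

Jobs longer than four hours submitted to that partition would be rejected; users could submit to multiple partitions (e.g. `-p hera,bigmem_short`) and let Slurm pick one where the job fits.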
Comment 14 Gavin D. Howard 2020-02-19 09:47:06 MST
Tony,

Assuming there is nothing else you need on this bug, I am going to close it. If you have more questions related to this bug, feel free to reopen.
Comment 15 Anthony DelSorbo 2020-02-19 10:03:19 MST
(In reply to Gavin D. Howard from comment #14)
> Tony,
> 
> Assuming there is nothing else you need on this bug, I am going to close it.
> If you have more questions related to this bug, feel free to reopen.

Gavin,

Sorry about leaving you hanging.  It's been a real busy couple of weeks.  No problem here.  Please close and I'll re-open if I need to.

Best,

Tony.