Ticket 733

Summary: Initial setup again: partitions, qos, features
Product: Slurm Reporter: Bill Wichser <bill>
Component: Scheduling Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: da
Version: 14.03.0   
Hardware: Linux   
OS: Linux   
Site: Princeton (PICSciE) Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Bill Wichser 2014-04-21 00:31:47 MDT
We have in production a simple Slurm setup where a single partition exists.  Users are assigned a QOS by the job_submit.lua script according to their walltime requests (test, short, medium, long).  These QOS levels are attached to all ACCOUNT definitions, so any user can use any QOS.
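A minimal sketch of what such a job_submit.lua might look like; the QOS names match the ticket, but the time thresholds and plugin details here are illustrative assumptions, not the site's actual configuration:

```lua
-- Hypothetical sketch: pick a QOS from the requested time limit.
-- Thresholds are assumptions, not the site's real values.
function slurm_job_submit(job_desc, part_list, submit_uid)
    local minutes = job_desc.time_limit   -- requested walltime in minutes
    if minutes <= 60 then
        job_desc.qos = "test"
    elseif minutes <= 360 then
        job_desc.qos = "short"
    elseif minutes <= 1440 then
        job_desc.qos = "medium"
    else
        job_desc.qos = "long"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```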

We have expanded this initial test to now include a heterogeneous mix of nodes, both Westmere and Ivybridge.  Users who have compiled their codes with AVX extensions now need to land on the Ivybridge nodes.

One solution to this problem is a Feature attached to each node.  The second way we might accomplish this is a Partition for the Ivybridge nodes, overlapping the Westmere nodes.

In either case the user will need to make additions to their job submission scripts, so either way we just need to document the change and inform users.

I asked at the beginning of my endeavor with Slurm what the methodology should be concerning partitions.  Almost everyone who responded said that they used a single partition assigned to all nodes.  Maybe that still holds true and this remains the correct thinking; I'm just not sure at the moment.  When I use "qstat" to look at my setup, queues in the PBS sense seem to correspond to partitions.  So maybe I was mistaken from the beginning and what I have assigned as QOS should actually be partitions, even if they all overlap.

This morning I will probably implement the Feature option, ask users to add the "#SBATCH -C ivy" directive, and be done with it.  This would eliminate the need to rewrite the job_submit.lua script to assign partitions instead of QOS.  But if you could comment, that would be great.
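From the user's side, the Feature approach would look something like the batch script below; the feature name "ivy" follows the -C ivy directive mentioned above, while the walltime and binary name are placeholders:

```
#!/bin/bash
#SBATCH -C ivy           # constrain the job to nodes tagged Feature=ivy
#SBATCH -t 04:00:00      # example walltime; job_submit.lua maps it to a QOS
srun ./my_avx_binary     # placeholder for an AVX-compiled application
```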

Bill
Comment 1 Moe Jette 2014-04-21 03:59:11 MDT
The more partitions that you configure, the more that resources tend to get fragmented and difficult to use. The highest system utilization tends to happen when there are one or two partitions (say one for small/short interactive jobs and another for batch jobs).

I would suggest the Feature option that you mentioned as a good solution. You might also want to make use of the "Weight" configuration parameter associated with the nodes (at least that's the configuration used successfully at NASA Goddard). Below is from the slurm.conf man page:

Weight
The priority of the node for scheduling purposes. All things being equal, jobs will be allocated the nodes with the lowest weight which satisfies their requirements. For example, a heterogeneous collection of nodes might be placed into a single partition for greater system utilization, responsiveness and capability. It would be preferable to allocate smaller memory nodes rather than larger memory nodes if either will satisfy a job's requirements. The units of weight are arbitrary, but larger weights should be assigned to nodes with more processors, memory, disk space, higher processor speed, etc. Note that if a job allocation request can not be satisfied using the nodes with the lowest weight, the set of nodes with the next lowest weight is added to the set of nodes under consideration for use (repeat as needed for higher weight values). If you absolutely want to minimize the number of higher weight nodes allocated to a job (at a cost of higher scheduling overhead), give each node a distinct Weight value and they will be added to the pool of nodes being considered for scheduling individually. The default value is 1.
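Combining Feature and Weight might look like the slurm.conf fragment below; the node names, counts, and weight values are illustrative assumptions, not the site's real configuration:

```
# Hypothetical slurm.conf fragment -- names and counts are illustrative.
# Lower Weight nodes are allocated first, so unconstrained jobs fill the
# Westmere nodes and the Ivybridge nodes stay free for jobs that
# request "-C ivy".
NodeName=west[001-064] Feature=westmere Weight=1
NodeName=ivy[001-032]  Feature=ivy      Weight=10
PartitionName=all Nodes=west[001-064],ivy[001-032] Default=YES
```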
Comment 2 Bill Wichser 2014-04-21 10:18:41 MDT
Status: resolved

All right then.  Feature it is, single partition, and face the next issue.
Comment 3 David Bigagli 2014-04-21 10:20:39 MDT
Closing.

David