| Summary: | Basic Scheduling Assistance | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | IT <it-slurm> |
| Component: | Scheduling | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 16.05.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Altius Institute | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Description
IT
2017-01-15 18:08:56 MST
Are you running 16.05.0 as indicated? I'd highly recommend updating to 16.05.8 if possible; there are quite a few bug fixes you may want to have at some point.

(In reply to Bill from comment #0)
> We are fairly new to SLURM, moving from SGE, and need to implement some form
> of priority scheme for scheduling to assure all users get a crack at compute
> nodes. At present we see resources getting swamped by either long-running
> jobs or massive numbers of relatively small short-running jobs. We have not
> yet done the "hard work" of discussing what we as an organization want in
> terms of real priorities, fairshare, preemption..., but we do need something
> in the interim.
>
> We currently have 3 partitions:
> PartitionName=queue0 (snip)
>
> What we are considering is implementing QOS limits in order to assure no one
> user can monopolize the full cluster. It looks to me like this might be
> implemented something like:
> sacctmgr add qos 80pcnt grpcpus=615 grpnodes=13 grpmemory=6599411
> sacctmgr add qos 60pcnt grpcpus=460 grpnodes=10 grpmemory=4949558
> sacctmgr modify user name=user1 set qos=60pcnt defaultqos=60pcnt
> And in slurm.conf:
> AccountingStorageEnforce=associations,limits,qos
>
> As I noted, we are new to SLURM, so maybe this is not the right way to go,
> even as an interim solution. Can you offer comments, suggestions, pointers?

Your QOS limits seem reasonable, although if you're going to apply these throughout the cluster you may want to use a Partition QOS rather than setting the same QOS on every user/account. The QOS is created in the same way; once created, you just set the QOS option on the partition to that QOS name (in slurm.conf) and run 'scontrol reconfigure'.

https://slurm.schedmd.com/qos.html#partition

I'd also suggest looking into GrpRunMins or a similar limit - I assume what matters more is limiting the total pool of resources a user can tie up at any given time, rather than the node counts themselves.
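As a minimal sketch of the Partition QOS setup described above (the limit values mirror the ticket; the QOS name `part_limit` and partition name `queue0` are illustrative, not confirmed configuration):

```
# Create a QOS carrying the group limits (same style as the ticket's commands):
sacctmgr add qos part_limit grpcpus=615 grpnodes=13 grpmemory=6599411

# In slurm.conf, attach it to the partition definition:
#   PartitionName=queue0 ... QOS=part_limit

# Apply the slurm.conf change:
scontrol reconfigure
```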
Thank you Tim,

I looked for documentation on GrpRunMins but found only an enhancement request ticket. Can you point me at anything descriptive?

As I look at applying the QOS to the partition, it is unclear to me if the limits will throttle/limit usage *by* a user or usage *of* a partition. I am fine with the partition being fully utilized, just not by one user. It would be nice, however, to not have to manage this at the user level. Could you please clarify?

Thanks,
--Bill

> I looked for documentation on GrpRunMins but found only an enhancement
> request ticket. Can you point me at anything descriptive?

https://slurm.schedmd.com/resource_limits.html

I should have said to use 'GrpTRESRunMins' - when we extended the resource tracking and allocation system to cover anything ("TRES"), we moved away from the individual elements. GrpTRESRunMins=cpu=1000 would limit a group to 1000 (cpus * minutes) under that QOS from any combination of jobs.

> As I look at applying the QOS to the partition, it is unclear to me if the
> limits will throttle/limit usage *by* a user or usage *of* a partition. I am
> fine with the partition being fully utilized, just not by one user. It would
> be nice however to not have to manage this at the user level. Could you
> please clarify?

The nomenclature is admittedly a bit confusing at times. Any of the "Grp" limits will track and constrain the account that's in use. When used as a partition QOS, each account should be tracked separately. If you want to limit usage of the partition as a whole (this can be used to create "virtual partitions" that don't correspond to specific hardware, but are allowed to use only a subset of the system without regard to which specific nodes), you can use the MaxTRES / MaxTRESMins limits within the QOS.

Hey Bill -

I'm assuming that was enough to get you going, and am marking this as resolved/infogiven. Please reopen if there is anything else I can address.

- Tim
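The cpu-minutes budget behind a limit like GrpTRESRunMins=cpu=1000 can be sketched as follows. This is only an illustration of the accounting arithmetic, not Slurm's implementation; the function names are made up for the example:

```python
# Illustration (not Slurm code): under GrpTRESRunMins=cpu=1000, each running
# job consumes (cpus * remaining minutes), and a new job only starts while
# the combined total stays within the budget.

GRP_TRES_RUN_MINS_CPU = 1000  # the example limit from the comment above

def cpu_run_mins(jobs):
    """Sum of cpus * remaining_minutes over all running jobs."""
    return sum(cpus * minutes for cpus, minutes in jobs)

def fits_under_limit(running, new_job, limit=GRP_TRES_RUN_MINS_CPU):
    """Would admitting new_job keep the group within its cpu*minutes budget?"""
    cpus, minutes = new_job
    return cpu_run_mins(running) + cpus * minutes <= limit

running = [(4, 100), (2, 200)]              # 400 + 400 = 800 cpu-minutes in flight
print(fits_under_limit(running, (1, 300)))  # 800 + 300 > 1000 -> False
print(fits_under_limit(running, (1, 200)))  # 800 + 200 <= 1000 -> True
```

Note that the limit caps the total pool of outstanding cpu-minutes rather than node or CPU counts directly, which is why a flood of many small jobs and a few long-running large jobs are constrained by the same budget.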