Summary: | Best way to implement overridable partition limits | ||
---|---|---|---|
Product: | Slurm | Reporter: | Matt Ezell <ezellma> |
Component: | Limits | Assignee: | Unassigned Developer <dev-unassigned> |
Status: | OPEN | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | sts |
Version: | 19.05.1 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | ORNL-OLCF | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- |
Description
Matt Ezell
2019-08-05 08:55:15 MDT
Hi Matt - I am looking into this. As you have already discovered, the mechanisms to do this are rather limited at this time, but I will see what I can find for you. Your idea with partition associations may be the best solution given the current state of "partition limits" and overriding. Do all of these users have the same QoS, and would you be able to use different QoSes as part of a solution here, without a partition QOS or OverPartQOS?

Today, we aren't using QOS (everyone just uses "normal"). The problem then becomes the QOS creep I mentioned, where you need a QOS for every possible combination of limits (jobs running, jobs accruing, max walltime, etc.). To support allowing 1 or 4 running jobs with a 1-day or 2-day max walltime, you need 4 QOS:

- running1_wall24
- running1_wall48
- running4_wall24
- running4_wall48

Each time a user comes with another request (I need 10 jobs, etc.), you have to add more QOS and more logic to the submit filter to 'steer' the job to the right QOS. Need to also be able to change the accrue limits? That doubles the number of needed QOS.

Matt - What you are asking for is a blanket MaxJobs applied to all jobs submitted into that partition, as a per-partition limit that is overridable by the user association. As you are already aware, the MaxJobs setting is an association or QoS limit, and no such limit exists in the core slurmctld code. Your options are limited since you want to avoid creating QoSes.

Option 1) This seems like the best solution in your case: configure (Cluster, Account, User, Partition) associations as previously mentioned in the second paragraph of the description.

Option 2) Set up a partition QoS to limit batch (4 jobs) and GPU (1 job). Use a job_submit plugin to apply the exception QOS with OverPartQOS, and otherwise use the partition default QOS.

No matter which option you choose, you will get some type of "creep". You could open an NRE to have the association override the QoS, similar to what we do with OverPartQOS.
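The "QOS creep" above grows multiplicatively: one QOS is needed per combination of limit values, so the total is the product of the number of choices for each limit. A minimal sketch (names follow the running*_wall* scheme from the comment):

```shell
# One QOS per combination of limits:
# 2 MaxJobs choices x 2 walltime choices = 4 QOSes.
count=0
for jobs in 1 4; do
  for wall in 24 48; do
    echo "running${jobs}_wall${wall}"
    count=$((count + 1))
  done
done
echo "QOSes needed: $count"
```

Adding a second independent accrue-limit choice would make this 2 x 2 x 2 = 8, which is the doubling described above; each new limit dimension multiplies the QOS count again.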
Matt - After talking this over a bit more internally, we want to put more emphasis on option 2 as the preferred option. We see a partition QoS as the right answer to this problem.

That's non-ideal for several reasons:

- QOSs are global on a slurmdbd
- I'll need a QOS for every possible limit combination

So we are going to have a LOT of QOSs if we are serving multiple clusters from the same SlurmDBD (as recommended). I guess that's the path I'll go down for now, but I'd be interested to chat to scope NRE paths that could simplify this.

> I guess that's the path I'll go down for now, but I'd be interested to chat to scope NRE paths that could simplify this.
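For reference, the partition-QoS approach from option 2 might look roughly like the following. This is only a sketch with hypothetical QOS, partition, and limit names, not something run against a live cluster:

```shell
# Partition QOSes carrying the default per-user limits (hypothetical names/values)
sacctmgr add qos part_batch
sacctmgr modify qos part_batch set MaxJobsPerUser=4
sacctmgr add qos part_gpu
sacctmgr modify qos part_gpu set MaxJobsPerUser=1

# Exception QOS whose limits take precedence over the partition QOS
sacctmgr add qos exception
sacctmgr modify qos exception set MaxJobsPerUser=10 Flags=OverPartQOS

# slurm.conf: attach each QOS to its partition, e.g.
#   PartitionName=batch ... QOS=part_batch
#   PartitionName=gpu   ... QOS=part_gpu
# A job_submit plugin would then assign the "exception" QOS
# only to jobs from the approved users.
```

The design point is that the default limits live in the partition QOS rather than in per-user associations, and OverPartQOS lets the exception QOS's own limits win over the partition QOS for the steered jobs.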
To simplify the NRE process, would you open a new bug with that request?