4283 – QoS Multisubmit

Status:	OPEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Scheduling
Version:	17.02.8
Hardware:	Linux Linux

Severity:	5 - Enhancement
Assignee:	Unassigned Developer
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2017-10-19 13:31 MDT by Paul Edmon
Modified:	2022-11-03 15:21 MDT
CC List:	1 user

See Also:	15349
Site:	Harvard University


Description Paul Edmon 2017-10-19 13:31:55 MDT

Currently QoS's (or reservations, for that matter) do not overflow. Once a QoS's resources are exhausted, the remaining jobs in that QoS pend, even when resources are available that they could have used had the user submitted under the default QoS or another QoS. Similar to multisubmitting for partitions, it would be good to allow overflow for QoS by multisubmitting. The user would submit against multiple QoS's; the system would use the higher priority QoS first and, once that is full, still consider the jobs for execution under the other QoS's the user specified.
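To make the requested behavior concrete, here is a minimal sketch of the selection rule, assuming a toy model where each QoS has a single CPU limit. The `QosPool` class, its fields, and `pick_qos` are hypothetical names for illustration only; none of this reflects how slurmctld actually tracks QoS limits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QosPool:
    """Hypothetical stand-in for a QoS with one simple CPU limit."""
    name: str
    priority: int        # higher value is tried first
    cpu_limit: int       # total CPUs the QoS may use at once
    cpus_in_use: int = 0

    def has_room(self, cpus: int) -> bool:
        return self.cpus_in_use + cpus <= self.cpu_limit


def pick_qos(requested: List[QosPool], cpus: int) -> Optional[QosPool]:
    """Return the highest-priority requested QoS that can fit the job.

    Returns None when every requested QoS is full, in which case the
    job would stay pending.
    """
    for qos in sorted(requested, key=lambda q: q.priority, reverse=True):
        if qos.has_room(cpus):
            return qos
    return None


# Example: the high-priority QoS is nearly full, so an 8-CPU job overflows.
high = QosPool("high", priority=100, cpu_limit=64, cpus_in_use=60)
normal = QosPool("normal", priority=10, cpu_limit=1024, cpus_in_use=200)

chosen = pick_qos([normal, high], cpus=8)
print(chosen.name if chosen else "pending")
```

In this model a 4-CPU job would still land in the high-priority QoS, since it fits under that QoS's limit; only jobs that do not fit fall through to the next QoS in the list.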

The reason for this is that we find that QoS's can create backups at the top of the queue, as the high priority QoS's boost the jobs to the top and they sit there pending. The scheduler considers them first but, finding no room, gives up. Meanwhile lower priority jobs that could run don't get scheduled until the backfill loop, which is slower than the primary loop. In addition, the users of the QoS could have used other resources while their QoS jobs were pending, if only they had assigned an alternate, lower priority QoS to run in. It would be good to give them the option of defining another QoS to submit against, so they only have to submit once rather than submit to the QoS, realize it is full, cancel their jobs, and resubmit against a new QoS.

It would also be good to do the same for reservations, or at least permit jobs that are using a reservation to overflow into the normal queue if there is space. This could be a flag on the reservation. It is a separate problem but in principle the same issue: there are resources the user could use, but they are either not savvy enough or not motivated enough to use them because it is too much of a hassle. It would also alleviate blockage at the top of the primary loop, as the jobs for the QoS/reservation would fall back to the normal priority order, only regaining top priority when the QoS/reservation has space.
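The priority fallback described above could look roughly like the following sketch. It assumes a simplified model where a QoS or reservation contributes a fixed priority boost; the `Job` fields and the `effective_priority` function are hypothetical and do not correspond to Slurm's multifactor priority plugin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    base_priority: int   # priority the job would have in the normal queue
    qos_boost: int       # hypothetical extra priority from its QoS/reservation
    qos_has_space: bool  # whether that QoS/reservation can currently fit the job


def effective_priority(job: Job) -> int:
    """Apply the boost only while the QoS/reservation has room for the job."""
    return job.base_priority + (job.qos_boost if job.qos_has_space else 0)


def schedule_order(jobs: List[Job]) -> List[Job]:
    """Order jobs the way the primary loop would consider them in this model."""
    return sorted(jobs, key=effective_priority, reverse=True)


queue = [
    Job("boosted_blocked", base_priority=50, qos_boost=1000, qos_has_space=False),
    Job("plain", base_priority=200, qos_boost=0, qos_has_space=True),
    Job("boosted_ok", base_priority=10, qos_boost=1000, qos_has_space=True),
]

print([job.name for job in schedule_order(queue)])
```

Here the job whose QoS is full drops behind the ordinary job instead of sitting at the top of the queue, and it would move back up as soon as `qos_has_space` becomes true again.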

Comment 1 Tim Wickberg 2017-10-19 13:37:48 MDT

It's an interesting idea, but will need to be handled as an enhancement request[1]. I'm retagging as such here.

As I'm sure you've heard before, we'd need to spend some time looking into how feasible this is, and our development priorities are always on sponsored work first. (If Harvard is interested in what that process looks like we can have that discussion out of band.)

cheers,
- Tim

[1] You may have submitted this to Sev5 originally, but I have bugzilla set to route to Sev4 to ensure we perform some up front triage first.

Comment 2 Paul Edmon 2017-10-19 13:39:29 MDT

Sure, no rush. Just something we have encountered several times in our environment.

-Paul Edmon-



Comment 3 Paul Edmon 2017-10-19 13:39:43 MDT
