| Summary: | How to Override partition or QoS limits for specific user/account pairs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ben Matthews <matthews> |
| Component: | Documentation | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex, doug.parisek, felip.moll, luca.capello, ssg |
| Version: | 17.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | UCAR | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 17.11.3 17.11.4 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
Description
Ben Matthews
2018-01-24 17:48:46 MST
(In reply to Ben Matthews from comment #0)
> Under the "Hierarchy" section, https://slurm.schedmd.com/resource_limits.html says that QoS limits take precedence over User association limits, which agrees with my testing. It also implies that User associations take precedence over limits set on partitions, which does not seem to be the case.
>
> This is at best confusing and the documentation should be adjusted.

My testing shows Slurm behaves as documented:

Partition 'p1' with MaxTime=2
QOS 'normal' with MaxWall=3 (and no OverPartQOS flag set)
User 'alex' with MaxWall=4

If user 'alex' submits a job to partition 'p1' and qos 'normal', the job is killed after it has run for 2 minutes. That is the partition limit, and it precedes the QOS limit since the OverPartQOS flag isn't set for QOS 'normal'. I think your statement isn't correct: ... 'It also implies that User associations take precedence over limits set on partitions' ...

> Is there a way to override partition limits for users who are special -- that is, I'd like to set a per-partition limit on walltime, except for a specific user running under a specific account (potentially with a specific QoS)

I'd suggest defining a QOS whose flags include 'OverPartQOS' [1], then setting whatever limits you want on either the partition or the QOS. Users allowed to submit jobs under that QOS will be governed by the QOS limits instead of the partition ones. The slurm.conf partition QOS option [2] might also be of interest.

[1] https://slurm.schedmd.com/sacctmgr.html#SECTION_SPECIFICATIONS-FOR-QOS
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_QOS

Please let me know if all this makes sense.

I'm aware that I could do this with QoS, but that would require me to establish a QoS for each exception that our user services people grant, which I'd rather not do. I think the best I can do is to define limits on the cluster and override them with user associations (or implement exactly what I want in a submit filter).
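The OverPartQOS semantics described above can be sketched as a small Python function. This is an illustrative model, not Slurm source; the function name and the minute-based values are assumptions for the example.

```python
# Hypothetical sketch (not Slurm source): which wall-clock limit applies when
# both a Partition MaxTime and a Job QOS MaxWall are set, depending on the
# OverPartQOS flag. Values are minutes; None means the limit is unset.
def effective_time_limit(partition_max, qos_max, over_part_qos=False):
    if qos_max is not None and over_part_qos:
        return qos_max          # QOS trumps the partition limit
    if partition_max is not None:
        return partition_max    # partition limit applies by default
    return qos_max

# The scenario from the comment: Partition MaxTime=2, QOS MaxWall=3, no flag.
print(effective_time_limit(2, 3))                      # 2: partition limit wins
print(effective_time_limit(2, 3, over_part_qos=True))  # 3: QOS wins with OverPartQOS
```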
I had more intended this as a doc bug, regarding this block of text:

***
Slurm's hierarchical limits are enforced in the following order with Job QOS and Partition QOS order being reversible by using the QOS flag 'OverPartQOS':

1. Partition QOS limit
2. Job QOS limit
3. User association
4. Account association(s), ascending the hierarchy
5. Root/Cluster association
6. Partition limit
7. None
***

To me, this means that the Partition QOS is evaluated first, then the Job QOS, then user associations, etc., which would mean that what I define on a user should take precedence over what I set on a partition, and that doesn't seem to be the case (but it would be a convenient behavior in our situation).

After doing some more tests this morning, I think there's something not working properly indeed. Yesterday I thought it was all OK and working as documented, but today I see that's not the case, so I apologize. I'm investigating further and will come back to you, but what I'm seeing now is this precedence:

1. Partition QOS limit
2. Job QOS limit
3. User association <--- never enforced
4. Account association(s), ascending the hierarchy <--- never enforced
5. Root/Cluster association <--- never enforced
6. Partition limit
7. None

So 1, 2 and 6, 7 work in the documented precedence order. Limits set at 3, 4 and 5 are never enforced even with AccountingStorageEnforce=all, at least for the Max[Wall|Time] limit I'm testing, submitting as a regular user (non-root, non-SlurmUser, non-Admin/Coordinator). I'll come back to you.

Just as an update: I see that only the association MaxWall limit is not being enforced. I tested some others and they are enforced with the documented precedence order. I'll continue investigating and come back to you.
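The documented rule amounts to "walk the ordered list of limit sources and apply the first one that is set." A minimal Python sketch of that documented behavior (illustrative names only, not Slurm source):

```python
# Hypothetical model of the *documented* precedence: the first source in the
# ordered list that has a limit set (not None) wins.
ORDER = ["partition_qos", "job_qos", "user_assoc", "account_assoc",
         "cluster_assoc", "partition"]

def documented_limit(limits):
    for source in ORDER:
        value = limits.get(source)
        if value is not None:
            return source, value
    return "none", None

# Per the docs, a user-association MaxWall should beat a plain partition limit:
print(documented_limit({"user_assoc": 4, "partition": 2}))  # ('user_assoc', 4)
```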
I wanted to make sure this wasn't a regression introduced in previous versions, so I went back and tested 17.02 and 16.05.0, and neither enforced the association MaxWall limit. So I'll just focus on the 17.11 code and see if I manage to fix it.

Please find a fix in the following commit, available in Slurm 17.02.3 and up when released. 17.02.3 should be released this week, but feel free to apply the patch at will:

https://github.com/SchedMD/slurm/commit/9143c7c964
https://github.com/SchedMD/slurm/commit/9143c7c964.patch

The precedence order should now be respected as documented. Please reopen if you still find any issues after this patch. Thanks for reporting.

2-5 work like I would expect. I didn't test 1. I'm confused about partition limits -- 3 doesn't seem to beat 6.

Observations:
- A QoS limit that is larger than the partition limit wins (as it should).
- A QoS limit that is larger than a user-association limit wins (again, as expected).
- If the QoS limit is unset and the user-association limit is smaller than the partition limit, then the association limit wins, which I would expect.
- If the QoS limit is unset and the cluster limit is larger than the partition limit, the cluster limit wins (as it should).
- If the user-association limit is smaller than the cluster limit and the cluster limit is larger than the partition limit, then the association limit wins (correct, I think).
- If the QoS limit is unset, the cluster limit is unset, and the user-association limit is larger than the partition limit, then the partition limit wins, which, unless I'm confused about how it is intended to work, is still wrong. At least it's inconsistent with how QoS limits work. Unfortunately, this is the case I really care about.

(In reply to Ben Matthews from comment #9)
> 2-5 work like I would expect. I didn't test 1. I'm confused about partition limits -- 3 doesn't seem to beat 6.
> Observations:
>
> - A QoS limit that is larger than the partition limit wins (as it should)
> - A QoS limit that is larger than a user-association wins (again, as expected)
> - If the QoS limit is unset and the user-association limit is smaller than the partition limit, then the association limit wins, which I would expect.
> - If the QoS limit is unset and the cluster limit is larger than the partition limit, the cluster limit wins (as it should)
> - If the user-association limit is smaller than the cluster limit and the cluster limit is larger than the partition limit, then the association limit wins (correct I think).

These are all expected behavior, correct.

> - If the QoS limit is unset, the cluster limit is unset, and the user-association limit is larger than the partition limit, then the partition limit wins, which, unless I'm confused about how it is intended to work, is still wrong. At least it's inconsistent with how QoS limits work. Unfortunately, this is the case I really care about.

You are right. Actually, between 2-5 (QOS or association limits) and 6 (partition limit), whichever is more restrictive (the smaller MaxWall) wins, instead of 2-5 always beating 6 as documented. I guess there's a MIN() macro somewhere in the code leading to this behavior. Let's see what I find...

Ben, this has been fixed in the following commit:

https://github.com/SchedMD/slurm/commit/2ef56d4b96f93e0854

which will be available in 17.11.4 and up. The commit includes a NOTE in the resource_limits.html page. Please test it if you don't want to wait for .4, and let me know how it goes before I close this.

Thanks.

I think I don't really understand what you changed here. The behavior seems pretty similar to before - the only change is that this lets QoS override the partition limit (larger or smaller), but not anything else? [for MaxWall] Are there plans to restore the old documented behavior in a future release?
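The suspected bug can be illustrated in a couple of lines: the code was effectively taking the minimum of the association/QOS limit and the partition limit, rather than letting the association/QOS limit (steps 2-5) trump the partition limit (step 6). A hypothetical sketch, not Slurm source:

```python
# Illustration of the suspected MIN() behavior vs. the documented precedence.
# Values are MaxWall/MaxTime in minutes.
def buggy_limit(assoc_or_qos_max, partition_max):
    # Suspected behavior: the more restrictive limit always wins.
    return min(assoc_or_qos_max, partition_max)

def documented_limit(assoc_or_qos_max, partition_max):
    # Documented behavior: an assoc/QOS limit, if set, beats the partition limit.
    return assoc_or_qos_max if assoc_or_qos_max is not None else partition_max

# User-association MaxWall=4 vs. Partition MaxTime=2:
print(buggy_limit(4, 2))       # 2: partition wins, contradicting the docs
print(documented_limit(4, 2))  # 4: what the documentation promises
```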
Really, I want to be able to give an exception to individual users. Also, there's a typo in your documentation edit (and it could be clearer overall - I prefer the wording in the commit message).

(In reply to Alejandro Sanchez from comment #21)
> Ben, this has been fixed in the following commit:
>
> https://github.com/SchedMD/slurm/commit/2ef56d4b96f93e0854
>
> which will be available since 17.11.4 and up. The commit includes a NOTE in the resource_limits.html page. Please test it if you don't want to wait till .4 and let me know how it goes before I close this.
>
> Thanks.

Let's see if I can explain it better with examples:
Default partition with MaxTime=1m:
$ scontrol show part | grep MaxTime
MaxNodes=UNLIMITED MaxTime=00:01:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Default QOS with no flags enabled nor MaxWall:
$ sacctmgr show qos normal -p | cut -d'|' -f1,6,18
Name|Flags|MaxWall
normal||
Assoc hierarchy, user 'test' can submit to normal QOS and no MaxWall for now:
$ sacctmgr show assoc tree format=cluster,account,user,qos,maxwall
Cluster Account User QOS MaxWall
---------- -------------------- ---------- -------------------- -----------
ibiza root normal
ibiza root root normal
ibiza acct1 normal
ibiza acct1 alex normal
ibiza acct1 test normal
Test:
test@ibiza:~/t$ salloc
salloc: Granted job allocation 20005
test@ibiza:~/t$ scontrol show job 20005 | grep -i timelimit
RunTime=00:00:08 TimeLimit=00:01:00 TimeMin=N/A
test@ibiza:~/t$
We see the Partition MaxTime is enforced, since it's the only place the limit is set.
Now let's modify the assoc hierarchy so that user 'test' gets MaxWall=2m (higher than the Partition limit). Note that I still haven't enabled any flags for the QOS 'normal':
$ sacctmgr -i modify user test set maxwall=2
Modified user associations...
C = ibiza A = acct1 U = test
$ sacctmgr show assoc tree format=cluster,account,user,qos,maxwall
Cluster Account User QOS MaxWall
---------- -------------------- ---------- -------------------- -----------
ibiza root normal
ibiza root root normal
ibiza acct1 normal
ibiza acct1 alex normal
ibiza acct1 test normal 00:02:00
Let's test again:
test@ibiza:~/t$ salloc
salloc: Requested partition configuration not available now
salloc: Pending job allocation 20006
salloc: job 20006 queued and waiting for resources
...
slurmctld: debug2: _part_access_check: Job time limit (2) exceeds limit of partition p1(1)
As we can see, the job's time limit is taken from the user MaxWall=2, but the job can't be allocated since it exceeds the Partition MaxTime. Here's where this limit is an exception, since the Partition is listed below the user association in the ordered precedence list.
So, what you want is to grant individual users an exception to the limit. That's where the QOS flag PartitionTimeLimit is useful. Let's set it:
$ sacctmgr -i modify qos normal set flags=partitiontimelimit
Modified qos...
normal
$ sacctmgr show qos normal -p | cut -d'|' -f1,6,18
Name|Flags|MaxWall
normal|PartitionTimeLimit|
Let's test again. Remember (User MaxWall=2, Partition MaxTime=1, Job QOS has flag PartitionTimeLimit):
test@ibiza:~/t$ salloc
salloc: Granted job allocation 20007
test@ibiza:~/t$ scontrol show job 20007 | grep -i timelimit
RunTime=00:00:07 TimeLimit=00:02:00 TimeMin=N/A
test@ibiza:~/t$
Here you have the user MaxWall exception working.
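The decision demonstrated in the transcripts above can be summarized in a short sketch. This is an illustrative model under the assumptions of the example (user MaxWall in minutes, Partition MaxTime in minutes), not Slurm source:

```python
# Hypothetical sketch: with the QOS flag PartitionTimeLimit set, a
# user-association MaxWall larger than the Partition MaxTime is honored;
# without it, the job cannot be allocated because its time limit exceeds
# the partition's.
def admit_job(user_maxwall, partition_maxtime, partition_time_limit_flag):
    time_limit = user_maxwall if user_maxwall is not None else partition_maxtime
    if (not partition_time_limit_flag and user_maxwall is not None
            and user_maxwall > partition_maxtime):
        return None  # "Job time limit exceeds limit of partition"
    return time_limit

# User MaxWall=2, Partition MaxTime=1, as in the transcripts:
print(admit_job(2, 1, False))  # None: job stays pending (job 20006)
print(admit_job(2, 1, True))   # 2: the flag lets the user MaxWall win (job 20007)
```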
I've opened an internal bug 4750 targeted for 18.08 so that we revert those QOS flags to only have the partition limit trump the QOS if set (new flags PartitionTrump* to allow backward functionality).
I've tried to better reword the note added to the resource limits here:
https://github.com/SchedMD/slurm/commit/cbe2380a660cb5203e529215
Please let me know if this makes sense to you and if you can accomplish your goal of having a MaxTime at partition level while being able to override it by setting MaxWall at association level together with the PartitionTimeLimit QOS flag.
Thanks.
> I've opened an internal bug 4750 targeted for 18.08 so that we revert those QOS flags to only have the partition limit trump the QOS if set (new flags PartitionTrump* to allow backward functionality).

I'm not sure I understand what you're planning to change in 18.08. Will you reverse the behavior, such that the behavior we get today when PartitionTimeLimit is set becomes the default (revertible by PartitionTrumpTimeLimit), and also do the same for the other limits? Will you therefore remove the PartitionTimeLimit flag? Will we have to do anything special to handle this change when we move to 18.x?

> I've tried to better reword the note added to the resource limits here:
> https://github.com/SchedMD/slurm/commit/cbe2380a660cb5203e529215
>
> Please, let me know if this makes sense to you and if you can accomplish your goal of having a MaxTime at partition level but be able to override it by setting MaxWall at association level together with the PartitionTimeLimit QOS flag.

Yes, this seems to meet our needs for now. I'm a little surprised that it seems to reject non-conforming jobs at submission time rather than queueing and never running them. This is probably the better behavior for us, but not what we've gotten in the past with the partition limit unset (maybe I'm just missing a config option in my test environment?). What happens when slurmdbd is down?

Hi Ben, Alex is out until Monday; I'm responding on his behalf:

(In reply to Ben Matthews from comment #24)
> > I've opened an internal bug 4750 targeted for 18.08 so that we revert those QOS flags to only have the partition limit trump the QOS if set (new flags PartitionTrump* to allow backward functionality).
>
> I'm not sure I understand what you're planning to change in 18.08. Will you reverse the behavior, such that the behavior we get today when PartitionTimeLimit is set becomes the default (revertible by PartitionTrumpTimeLimit), and also do the same for the other limits?
> Will you therefore remove the PartitionTimeLimit flag? Will we have to do anything special to handle this change when we move to 18.x?

Yes, you understood the idea. With this change, the exception shown in comment 23 will not happen by default. The flags intended to be added cover all of the limits: PartitionTrump<whatever>. The flags to be removed are: PartitionMaxNodes, PartitionMinNodes and PartitionTimeLimit. The behavior would be the one described in the old documentation. The change is still not completely designed, but the QoS could probably be modified automatically during the database conversion, depending on your settings.

> > I've tried to better reword the note added to the resource limits here:
> > https://github.com/SchedMD/slurm/commit/cbe2380a660cb5203e529215
> >
> > Please, let me know if this makes sense to you and if you can accomplish your goal of having a MaxTime at partition level but be able to override it by setting MaxWall at association level together with the PartitionTimeLimit QOS flag.
>
> Yes, this seems to meet our needs for now.

Cool.

> I'm a little surprised that it seems to reject non-conforming jobs at submission time rather than queueing and never running them. This is probably the better behavior for us, but not what we've gotten in the past with the partition limit unset (maybe I'm just missing a config option in my test environment?). What happens when slurmdbd is down?

Rejection happens when you set the flag DenyOnLimit on the QoS; otherwise the job is queued but never runs. When slurmdbd is down, limits and so on are taken from state files and in-memory structs. In some edge cases - e.g. slurmdbd up and ctld down, limits modified, then slurmdbd stopped and ctld started - modifying limits could require an 'scontrol reconfig' when slurmdbd comes back up.

Hi Ben. Is there anything else we can answer in this bug? Thanks.

(In reply to Alejandro Sanchez from comment #26)
> Hi Ben. Is there anything else we can answer in this bug? Thanks.
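The DenyOnLimit distinction explained above (reject at submit time vs. accept but never schedule) can be sketched as follows. A hypothetical illustration, not Slurm source:

```python
# Hypothetical sketch: a job requesting more than a QOS limit is rejected at
# submission only when the QOS has the DenyOnLimit flag; otherwise it is
# accepted and queued but never becomes eligible to run.
def submit(requested, qos_max, deny_on_limit):
    if qos_max is not None and requested > qos_max:
        if deny_on_limit:
            return "rejected at submission"
        return "queued, but never runs"
    return "eligible to run"

print(submit(10, 5, deny_on_limit=True))   # rejected at submission
print(submit(10, 5, deny_on_limit=False))  # queued, but never runs
print(submit(3, 5, deny_on_limit=False))   # eligible to run
```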
No, I think we're OK now. Thanks for all the help, and I look forward to the permanent fix in 18.x.

All right, closing as resolved/fixed as per the patches provided.

*** Ticket 5725 has been marked as a duplicate of this ticket. ***