Ticket 6564

Summary: How to enforce user association limit on MaxWall over partition limit without losing MaxWall for all other users
Product: Slurm Reporter: hpc-cs-hd
Component: Limits Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
Site: Cineca
Attachments: slurm.conf of CINECA Marconi cluster

Description hpc-cs-hd 2019-02-22 06:54:23 MST
Created attachment 9254 [details]
slurm.conf of CINECA Marconi cluster

Dear All,

what we report here is closely related to what is reported in bug https://bugs.schedmd.com/show_bug.cgi?id=4681

We have the same need as the one reported by Ben:

"Is there a way to override partition limits for users who are special -- that is, I'd like to set a per-partition limit on walltime, except for a specific user running under a specific account (potentially with a specific QoS)"

We followed Alex's instructions, namely:

1) we set a time limit on a specific partition (no MaxWall limit set on the corresponding partition QOS), say 04:00:00
2) we set the PartitionTimeLimit flag on the QOS "normal"
3) we set a MaxWall limit of 08:00:00 on a specific user's association (say, afederic)
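As a sketch, the three steps above correspond to commands along these lines (the partition name "prod" is an assumption for illustration; the real names are in the attached slurm.conf, and exact sacctmgr syntax may vary by version):

```shell
# 1) Partition time limit of 4 hours in slurm.conf (no MaxWall on its QOS):
#    PartitionName=prod Nodes=ALL MaxTime=04:00:00 State=UP

# 2) Set the PartitionTimeLimit flag on the "normal" QOS:
sacctmgr -i modify qos normal set flags=PartitionTimeLimit

# 3) Set an 8-hour MaxWall on the association of user afederic:
sacctmgr -i modify user where user=afederic set MaxWall=08:00:00
```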

In this way, as Alex explained, we do indeed allow the specific user afederic to submit jobs of up to 8 hours, overriding the partition time limit of 4 hours. The (far from irrelevant) drawback is that for all other users the walltime limit is no longer enforced at all: everyone can submit and run jobs asking for more than the partition time limit!
It therefore seems that, with the QOS "normal" carrying the PartitionTimeLimit flag and no MaxWall, the partition time limit is ignored unless a limit is imposed on a specific association (as in point 3 above). Can you please check this behavior? Clearly, we can't lose the walltime limit imposed on a partition just to allow the enforcement of a user association limit. Maybe we are missing something crucial.

By the way, I can't find any reference in the 18.08 RELEASE NOTES to what is reported in https://bugs.schedmd.com/show_bug.cgi?id=4750 

We configured Slurm accounting with:

AccountingStorageEnforce = associations,limits,qos,safe

Please find in attachment the slurm.conf applied on CINECA Marconi cluster. 

Thank you for your help,
Isabella
Comment 1 Ben Roberts 2019-02-22 12:17:29 MST
Hi Isabella,

I've been looking at the issue you've described and I have a couple of questions to clarify what you're seeing.  When users submit jobs are they requesting the 'normal' qos, which has the PartitionTimeLimit flag?  Can you submit an example job that requests more than the 4 hour limit and send the output of 'scontrol show job <jobid>' for it?  I would also like to see the output of 'sacctmgr show qos'.

This should help me better understand what's happening.

Thanks,
Ben
Comment 2 Ben Roberts 2019-02-22 15:53:13 MST
Hi Isabella,

I continued to look at this and I believe I've found what the issue is.  I've done the following to try and reproduce the configuration you describe.  I set a time limit on partition debug and assigned it partition qos of 'desktopq':

$ grep "PartitionName=debug" slurm.conf 
PartitionName=debug Nodes=ALL Default=YES MaxTime=4:00:00 State=UP QOS=desktopq



For the 'desktopq' qos I didn't assign a MaxWall time:

$ sacctmgr show qos desktopq format=name,maxwall
      Name     MaxWall 
---------- ----------- 
  desktopq             



I put the PartitionTimeLimit flag on the 'normal' QOS:

$ sacctmgr show qos normal format=name,flags,maxwall
      Name                Flags     MaxWall 
---------- -------------------- ----------- 
    normal   PartitionTimeLimit             



I set a maxwall time of 8:00:00 for 'user2':

$ sacctmgr show assoc where user=user2 account=bob format=cluster,account,user,maxwall,qos
   Cluster    Account       User     MaxWall                  QOS 
---------- ---------- ---------- ----------- -------------------- 
   winston        bob      user2    08:00:00         normal,test2 



If I submit a job as 'user1' (who doesn't have a maxwall time defined) I can submit a job with a walltime of 7 hours and it will start:

user1@ben-XPS-15-9570:~$ sbatch -p debug -q normal -t 7:00:00 --wrap="sleep 60"
Submitted batch job 2200
user1@ben-XPS-15-9570:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2200     debug     wrap    user1  R       0:04      1 node01



I then removed the PartitionTimeLimit flag from the normal QOS:

$ sacctmgr modify qos normal set flags=-1
 Modified qos...
  normal
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
$ sacctmgr show qos normal format=name,flags,maxwall
      Name                Flags     MaxWall 
---------- -------------------- ----------- 
    normal                                  



If I then try to submit the same job as user1 I get an error that it exceeds a limit:

user1@ben-XPS-15-9570:~$ sbatch -p debug -q normal -t 7:00:00 --wrap="sleep 60"
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)


This flag can be confusing.  I think you probably intended for this flag to have the QOS enforce the settings defined at the partition level, but instead it allows the QOS to override the partition limit.  Since the qos doesn't have a maxwall time defined, it doesn't enforce any limit.  Here's the description from the documentation:
> PartitionTimeLimit
>     If set jobs using this QOS will be able to override the requested 
> partition's TimeLimit. 



However, removing that flag from the 'normal' QOS causes the limit set on the association for user2 to be ignored, and the partition limit is enforced instead.  

user2@ben-XPS-15-9570:~$ sbatch -A bob -p debug -q normal -t 7:00:00 --wrap="sleep 60"
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)



This is noted in the Resource Limit documentation:

> The precedence order specified above is respected except for the following 
> limits: Max[Time|Wall], [Min|Max]Nodes. For these limits, even if the job is 
> enforced with QOS and/or Association limits, it can't go over the limit 
> imposed at Partition level, even if it listed at the bottom. So the default 
> for these 3 types of limits is that they are upper bound by the Partition one. 
> This Partition level bound can be ignored if the respective QOS 
> PartitionTimeLimit and/or Partition[Max|Min]Nodes flags are set, then the job 
> would be enforced the limits imposed at QOS and/or association level 
> respecting the order above.

Ref: https://slurm.schedmd.com/resource_limits.html



Since this is the case, you would need to create a separate qos carrying the PartitionTimeLimit flag for that user, allowing him to exceed the MaxTime specified on the partition.  
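A minimal sketch of such a setup (the QOS name "longrun" is hypothetical, and the username is taken from your original example; syntax may vary by version):

```shell
# Create a dedicated QOS carrying only the PartitionTimeLimit flag:
sacctmgr -i add qos longrun set flags=PartitionTimeLimit

# Grant it to the special user and make it his default, so the
# association's MaxWall can exceed the partition's MaxTime:
sacctmgr -i modify user where user=afederic set qos+=longrun defaultqos=longrun
```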

Let me know if you have questions about this or if the scenario I set up doesn't align with what you are doing in your environment.

Thanks,
Ben
Comment 3 hpc-cs-hd 2019-02-26 10:38:14 MST
Dear Ben,

thank you for your answer; you reproduced our environment, but that's exactly the problem we were raising. I'll try to be clearer:

1) as you can read by yourself in bug https://bugs.schedmd.com/show_bug.cgi?id=4681:

[start quoting Alejandro Sanchez, Comment 23]

[...skip...]
So, what you want, is to give users limit an exception. That's where the QOS flag PartitionTimeLimit is useful. Let's set it:

$ sacctmgr -i modify qos normal set flags=partitiontimelimit
 Modified qos...
  normal
$ sacctmgr show qos normal -p | cut -d'|' -f1,6,18
Name|Flags|MaxWall
normal|PartitionTimeLimit|
[...skip...]

Please, let me know if this makes sense to you and if you can accomplish your goal of having a MaxTime at partition level but be able to overhaul it by setting MaxWall at association level together with the PartitionTimeLimit QOS flag.

[end quoting Alejandro Sanchez]

Hence, from what Alejandro wrote, it seems that the only way to have an exception at the level of a user association is to set the PartitionTimeLimit flag on the job's QOS ("normal", unless a specific QOS is requested).
Do you agree with that? Or did we misunderstand what Alex wrote?

2) on the other hand, as you say, the PartitionTimeLimit flag on the "normal" QOS causes what is reported at https://slurm.schedmd.com/resource_limits.html, namely that the partition-level bound is IGNORED. 
I guess this is the critical point: we would expect the PartitionTimeLimit flag to enforce the time limit defined elsewhere in place of the partition MaxTime, not to ignore limits altogether. 

In this case, I would say that it is useless to allow a time limit specification at the user association level, because this exception comes at the expense of losing the limit for "normal" jobs. We would be obliged to define a MaxWall at association level for ALL users (equal to the desired MaxTime of the partition, except for the special cases). 

3) I know that we can create a "special" QOS, and this is how we have been operating until now. The problem is that one particular "special" QOS may not be appropriate for all the special cases we decide to accept, and we would like to avoid having to create a (possibly large) set of special QOSs.

Is there a reason for your decision to implement the PartitionTimeLimit flag so that the partition MaxTime is ignored (hence, no limit for all normal cases), rather than simply overridden by QOS/association-level limits whenever they are defined?     

Thank you for your help,
cheers
Isabella
Comment 4 Ben Roberts 2019-02-27 10:02:22 MST
Hi Isabella,

In the example Alejandro put together he shows that the user association limit can override the partition limit by having the PartitionTimeLimit flag on the qos, but he doesn't show what happens to other users who use the same partition/qos combination but don't have a limit set on their association.  

> Hence, from what Alejandro wrote, it seems that the only way to have an 
> exception at the lever of a user association is to set the flag 
> PartitionTimeLimit to the job QOS ("normal", unless a specific QOS is not 
> requested).  Do you agree on that? or did we misunderstand what Alex wrote?

You understand this correctly; it's just that setting this flag on the QOS also allows users who don't have a MaxWall set on their user association to ignore the partition's time limit.  

> I guess this is the critical point: we would expect the PartitionTimeLimit 
> flag to enforce the time limit defined elsewhere with respect to the partition 
> MaxTime, not to ignore it. 

> Is there a reason for your decision to implement the PartitionTimeLimit flag 
> so that the partition MaxTime is ignored (hence, no limit for all normal 
> cases), and not simply overhauled by QOS/association level limits whenever 
> defined?

I'll discuss this with my colleagues to see if I can get some information on why it was done this way and whether there would be any chance the behavior could be changed.  

In the meantime, I do think the workaround would be manageable without creating a large number of QOSs.  You could remove the 'PartitionTimeLimit' flag from the 'normal' qos and create another qos that has the flag:

$ sacctmgr show qos testqos format=name,maxwall,flags,
      Name     MaxWall                Flags 
---------- ----------- -------------------- 
   testqos                                 

$ sacctmgr show qos test2 format=name,maxwall,flags,
      Name     MaxWall                Flags 
---------- ----------- -------------------- 
     test2               PartitionTimeLimit 

$ grep "PartitionName=debug" slurm.conf 
PartitionName=debug Default=YES MaxTime=4:00:00 Nodes=ALL State=UP QOS=desktopq


Then you can have the users without a MaxWall time defined at the association level use the 'normal' QOS and have its limit imposed.  Users who do have a limit defined can be added to the secondary QOS (and have it made their default) and will then be able to exceed the partition limit:

$ sacctmgr show assoc tree where account=testacct format=account,user,maxwall,qos,defaultqos
             Account       User     MaxWall                  QOS   Def QOS 
-------------------- ---------- ----------- -------------------- --------- 
testacct                                                 testqos   testqos 
 testacct                   ben    08:00:00        test2,testqos     test2 
 testacct                 user2                          testqos   testqos 


My user defaults to using the 'test2' QOS and can submit up to 8 hour jobs:

$ whoami
ben
$ sbatch -Atestacct -pdebug -t8:00:00 --wrap="sleep 300"
Submitted batch job 2264


But user2 gets rejected when trying to submit jobs longer than the limit set on the partition:

$ whoami
user2
$ sbatch -Atestacct -pdebug -t8:00:00 --wrap="sleep 300"
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
$ sbatch -Atestacct -pdebug -t4:00:00 --wrap="sleep 300"
Submitted batch job 2265


I hope this helps and I'll let you know what I find about the history of this change.

Thanks,
Ben
Comment 5 hpc-cs-hd 2019-02-27 10:46:10 MST
Dear Ben,

thank you very much. What you suggest (one "ad hoc" QOS with the PartitionTimeLimit flag, associated to and used by the users with special limits) seems an excellent and quite general solution to me, and you can consider my request for info satisfied. 
I would appreciate any additional information you can give me after discussing with your colleagues (including info on any plans to change the present behaviour of this flag), but please consider this optional if you'd prefer to close this ticket. 

Cheers

Isabella
Comment 6 Ben Roberts 2019-02-28 10:10:59 MST
Hi Isabella,

I did discuss this with my colleagues, and we don't know exactly why the walltime and min/max node limits were chosen to be exceptions to the normal hierarchy tree.  The enhancement you mention (4750) is still planned, but didn't make it into 18.08, which is why you didn't see a mention of it in the release notes.  The intended behavior is to follow the normal hierarchy tree by default, with a flag to restore the current behavior.  

Since you said my recommended setup works for you I'll close this ticket and you can watch that ticket for when we are able to modify the behavior.

Thanks,
Ben