| Summary: | How to enforce a per-user association MaxWall limit above the partition limit without losing MaxWall enforcement for all other users | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | hpc-cs-hd |
| Component: | Limits | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 18.08.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Cineca | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf of CINECA Marconi cluster | ||
Description
hpc-cs-hd
2019-02-22 06:54:23 MST
Hi Isabella,

I've been looking at the issue you've described and I have a couple of questions to clarify what you're seeing. When users submit jobs, are they requesting the 'normal' qos, which has the PartitionTimeLimit flag? Can you submit an example job that requests more than the 4 hour limit and send the output of 'scontrol show job <jobid>' for it? I would also like to see the output of 'sacctmgr show qos'. This should help me better understand what's happening.

Thanks,
Ben

Hi Isabella,
I continued to look at this and I believe I've found what the issue is. I've done the following to try to reproduce the configuration you describe. I set a time limit on partition 'debug' and assigned it the partition qos 'desktopq':
$ grep "PartitionName=debug" slurm.conf
PartitionName=debug Nodes=ALL Default=YES MaxTime=4:00:00 State=UP QOS=desktopq
For the 'desktopq' qos I didn't assign a MaxWall time:
$ sacctmgr show qos desktopq format=name,maxwall
Name MaxWall
---------- -----------
desktopq
I put the PartitionTimeLimit flag on the 'normal' QOS:
$ sacctmgr show qos normal format=name,flags,maxwall
Name Flags MaxWall
---------- -------------------- -----------
normal PartitionTimeLimit
I set a maxwall time of 8:00:00 for 'user2':
$ sacctmgr show assoc where user=user2 account=bob format=cluster,account,user,maxwall,qos
Cluster Account User MaxWall QOS
---------- ---------- ---------- ----------- --------------------
winston bob user2 08:00:00 normal,test2
As 'user1' (who doesn't have a MaxWall time defined on the association), I can submit a job with a walltime of 7 hours and it will start:
user1@ben-XPS-15-9570:~$ sbatch -p debug -q normal -t 7:00:00 --wrap="sleep 60"
Submitted batch job 2200
user1@ben-XPS-15-9570:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2200 debug wrap user1 R 0:04 1 node01
I then removed the PartitionTimeLimit flag from the normal QOS:
$ sacctmgr modify qos normal set flags=-1
Modified qos...
normal
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
$ sacctmgr show qos normal format=name,flags,maxwall
Name Flags MaxWall
---------- -------------------- -----------
normal
If I then try to submit the same job as user1 I get an error that it exceeds a limit:
user1@ben-XPS-15-9570:~$ sbatch -p debug -q normal -t 7:00:00 --wrap="sleep 60"
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
This flag can be confusing. I think you probably intended for this flag to make the QOS enforce the settings defined at the partition level, but instead it allows the QOS to override the partition limit. Since the qos doesn't have a MaxWall time defined, it doesn't enforce any limit. Here's the description from the documentation:
> PartitionTimeLimit
> If set jobs using this QOS will be able to override the requested
> partition's TimeLimit.
However, removing that flag from the 'normal' QOS causes the limit set on user2's association to be ignored, and the partition limit is enforced instead:
user2@ben-XPS-15-9570:~$ sbatch -A bob -p debug -q normal -t 7:00:00 --wrap="sleep 60"
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
This is noted in the Resource Limit documentation:
> The precedence order specified above is respected except for the following
> limits: Max[Time|Wall], [Min|Max]Nodes. For these limits, even if the job is
> enforced with QOS and/or Association limits, it can't go over the limit
> imposed at Partition level, even if it listed at the bottom. So the default
> for these 3 types of limits is that they are upper bound by the Partition one.
> This Partition level bound can be ignored if the respective QOS
> PartitionTimeLimit and/or Partition[Max|Min]Nodes flags are set, then the job
> would be enforced the limits imposed at QOS and/or association level
> respecting the order above.
Ref: https://slurm.schedmd.com/resource_limits.html
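To make the documented precedence concrete, here is a short Python sketch. This is not Slurm source code, just a model of the rules quoted above; the function name, its parameters, and the minute-based units are illustrative assumptions:

```python
# Model of the time-limit precedence described in the Slurm
# resource_limits documentation (NOT the actual implementation):
# QOS/association MaxWall limits apply whenever defined, and the
# partition MaxTime is a hard upper bound unless the job's QOS carries
# the PartitionTimeLimit flag, in which case the partition bound is
# skipped entirely.

def time_limit_accepted(requested, partition_max, qos_max=None,
                        assoc_max=None, partition_time_limit_flag=False):
    """Return True if a requested walltime (minutes) would be accepted."""
    # QOS and association limits are enforced whenever they are defined.
    for limit in (qos_max, assoc_max):
        if limit is not None and requested > limit:
            return False
    # Without the flag, the partition MaxTime caps every job, even jobs
    # whose QOS/association limits would allow more.
    if not partition_time_limit_flag and requested > partition_max:
        return False
    # With the flag, the partition bound is ignored -- which is why a
    # user with no association MaxWall ends up with no limit at all.
    return True

# 4-hour partition (240 min), 7-hour request (420 min):
# user1 (no association MaxWall), flag set on the qos -> accepted
print(time_limit_accepted(420, 240, partition_time_limit_flag=True))  # True
# user1 after the flag is removed -> partition cap enforced
print(time_limit_accepted(420, 240))  # False
# user2 (association MaxWall 8h), flag removed -> the association limit
# passes but the partition cap still rejects the job
print(time_limit_accepted(420, 240, assoc_max=480))  # False
```

This reproduces both behaviors from the transcripts above: with the flag set, a user without an association MaxWall is effectively unlimited; without it, the association MaxWall cannot exceed the partition MaxTime.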
Since this is the case, you would need to create a separate qos with the PartitionTimeLimit flag for that user, to allow them to exceed the MaxTime limit specified on the partition.
Let me know if you have questions about this or if the scenario I set up doesn't align with what you are doing in your environment.
Thanks,
Ben
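For reference, the separate-qos setup described above could be sketched with sacctmgr commands along the following lines. This is not a transcript from the ticket: the qos name 'longqos' and the account name 'testacct' are illustrative, and the flags=-1 idiom for clearing all flags is the one used earlier in this ticket.

```shell
# Keep 'normal' without the PartitionTimeLimit flag so the partition
# MaxTime binds users who have no association MaxWall (flags=-1 clears
# all flags, as shown earlier in this ticket):
sacctmgr -i modify qos normal set flags=-1

# Create a second qos carrying the flag for exception users
# ('longqos' is a hypothetical name):
sacctmgr -i add qos longqos
sacctmgr -i modify qos longqos set flags=PartitionTimeLimit

# Grant it to an exception user, make it their default, and set the
# association-level MaxWall they are allowed:
sacctmgr -i modify user where name=user2 account=testacct \
    set qos+=longqos defaultqos=longqos maxwall=08:00:00
```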
Dear Ben,
thank you for your answer; you reproduced our environment, but that's exactly the problem we were raising. Let me try to be clearer:

1) As you can read for yourself in bug https://bugs.schedmd.com/show_bug.cgi?id=4681:

[start quoting Alejandro Sanchez, Comment 23]
[...skip...]
So, what you want is to give the user's limit an exception. That's where the QOS flag PartitionTimeLimit is useful. Let's set it:
$ sacctmgr -i modify qos normal set flags=partitiontimelimit
Modified qos...
normal
$ sacctmgr show qos normal -p | cut -d'|' -f1,6,18
Name|Flags|MaxWall
normal|PartitionTimeLimit|
[...skip...]
Please let me know if this makes sense to you and if you can accomplish your goal of having a MaxTime at partition level but being able to override it by setting MaxWall at association level together with the PartitionTimeLimit QOS flag.
[end quoting Alejandro Sanchez]

Hence, from what Alejandro wrote, it seems that the only way to have an exception at the level of a user association is to set the PartitionTimeLimit flag on the job QOS ("normal", unless a specific QOS is requested). Do you agree with that, or did we misunderstand what Alejandro wrote?

2) On the other hand, as you say, the PartitionTimeLimit flag on the "normal" QOS causes what is reported at https://slurm.schedmd.com/resource_limits.html, namely that the partition-level bound is IGNORED. I guess this is the critical point: we would expect the PartitionTimeLimit flag to enforce the time limit defined elsewhere in place of the partition MaxTime, not to ignore it altogether. In that case, I would say it is useless to allow a time limit specification at the user's association level, because this exception comes at the expense of losing the limit for "normal" jobs. We would be obliged to define a MaxWall at association level for ALL users (equal to the desired MaxTime of the partition, except for the special cases).

3) I know that we can create a "special" QOS, and this is the way we have been acting until now.
The problem is that a particular "special" QOS may not be appropriate for all the special cases we decide to accept, and we would like to avoid having to create a (possibly large) set of special QOSs.
Is there a reason for your decision to implement the PartitionTimeLimit flag so that the partition MaxTime is ignored (hence, no limit for all normal cases), rather than simply overridden by QOS/association-level limits whenever they are defined?
Thank you for your help,
cheers
Isabella

Hi Isabella,
In the example Alejandro put together he shows that the user association limit can override the partition limit when the PartitionTimeLimit flag is on the qos, but he doesn't show what happens to other users who use the same partition/qos combination without a limit set on their association.

> Hence, from what Alejandro wrote, it seems that the only way to have an
> exception at the level of a user association is to set the flag
> PartitionTimeLimit on the job QOS ("normal", unless a specific QOS is
> requested). Do you agree with that? Or did we misunderstand what Alex wrote?

You understand this correctly; it's just that setting this flag on the QOS also allows users who don't have a MaxWall time set on their user association to ignore the partition MaxTime.

> I guess this is the critical point: we would expect the PartitionTimeLimit
> flag to enforce the time limit defined elsewhere with respect to the
> partition MaxTime, not to ignore it.

> Is there a reason for your decision to implement the PartitionTimeLimit flag
> so that the partition MaxTime is ignored (hence, no limit for all normal
> cases), and not simply overridden by QOS/association-level limits whenever
> defined?

I'll discuss this with my colleagues to see if I can get some information on why it was done this way and whether there is any chance the behavior could be changed. In the meantime, I do think the workaround would be manageable without creating a large number of QOSs.
You could remove the 'PartitionTimeLimit' flag from the 'normal' qos and create another qos that has the flag:

$ sacctmgr show qos testqos format=name,maxwall,flags
      Name     MaxWall                Flags
---------- ----------- --------------------
   testqos

$ sacctmgr show qos test2 format=name,maxwall,flags
      Name     MaxWall                Flags
---------- ----------- --------------------
     test2               PartitionTimeLimit

$ grep "PartitionName=debug" slurm.conf
PartitionName=debug Default=YES MaxTime=4:00:00 Nodes=ALL State=UP QOS=desktopq

Then you can have the users without a MaxWall time defined at the association level use the 'normal' QOS and have its limit imposed. Users who do have a limit defined can be added to the secondary QOS (and have that made their default) and will then be able to exceed the partition limit:

$ sacctmgr show assoc tree where account=testacct format=account,user,maxwall,qos,defaultqos
             Account       User     MaxWall                  QOS   Def QOS
-------------------- ---------- ----------- -------------------- ---------
            testacct                                     testqos   testqos
            testacct        ben    08:00:00       test2,testqos      test2
            testacct      user2                          testqos   testqos

My user defaults to using the 'test2' QOS and can submit up to 8 hour jobs:

$ whoami
ben
$ sbatch -Atestacct -pdebug -t8:00:00 --wrap="sleep 300"
Submitted batch job 2264

But user2 gets rejected when trying to submit jobs longer than the limit set on the partition:

$ whoami
user2
$ sbatch -Atestacct -pdebug -t8:00:00 --wrap="sleep 300"
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
$ sbatch -Atestacct -pdebug -t4:00:00 --wrap="sleep 300"
Submitted batch job 2265

I hope this helps, and I'll let you know what I find about the history of this change.

Thanks,
Ben

Dear Ben,
thank you very much. What you suggest (one "ad hoc" QOS with the PartitionTimeLimit flag, associated to and used by the users with special limits) seems an excellent and quite general solution to me, and you can consider my request for info satisfied.
I will appreciate any additional information you can give me after discussing with your colleagues (and info on any plans to change the present behaviour of this flag), but please consider this optional if you wish to close this ticket.
Cheers,
Isabella

Hi Isabella,
I did discuss this with my colleagues, and we don't know exactly why walltime and min/max nodes were chosen to be exceptions to the normal hierarchy tree. The enhancement you mention (bug 4750) is still planned to be worked on, but it didn't make it into 18.08, which is why you didn't see a mention of it in the release notes. The intended behavior is to have these limits follow the normal hierarchy tree by default, with a flag to retain the current behavior. Since you said my recommended setup works for you, I'll close this ticket; you can watch that ticket for when we are able to modify the behavior.

Thanks,
Ben