| Summary: | Partition QOS not taking Precedence | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ivan Kovanda <ivan.kovanda> |
| Component: | Limits | Assignee: | Carlos Tripiana Montes <tripiana> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | jvilarru |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | U Denver | | |
| Attachments: | debug logs reproducing issue | | |

> I thought when you assign a qos to a partition it overrides the default qos
> set in the user's association?

As noted in [1], the order of precedence is (from high to low): Partition QOS, Job QOS, User/Account/Cluster associations, Partition. So your assumption is correct, unless [2] is set in the QOS definition.

> Do you think the user has to explicitly request the qos in the job script?
> #SBATCH --qos=andrei

No, not in this scenario, unless [2] is set in the QOS definition.

> Or is an additional config missing?

If I'm not missing anything, you have the partition QOS set up correctly, so it will override the job QOS, and of course the association QOS.

[1] https://slurm.schedmd.com/resource_limits.html#hierarchy
[2] https://slurm.schedmd.com/sacctmgr.html#OPT_OverPartQOS

Hi Carlos,

Thanks. For some reason it's still not working. Do you think I need to add qos to:

AccountingStorageEnforce=

So it would be:

AccountingStorageEnforce=limits,qos

Thanks,
Ivan

Hi Ivan,

Fairly sure you're right with Comment 2, if you are looking for enforcement. See [1], and [2] as well. Please let us know if Slurm is now behaving as you expect.

Cheers,
Carlos

[1] https://slurm.schedmd.com/accounting.html#limit-enforcement
[2] https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-after-build

Hi Carlos,
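For reference, a minimal slurm.conf sketch of the enforcement setting discussed above (values other than the ticket's own `limits,qos` are not implied; consult the slurm.conf documentation for the full option list):

```
# slurm.conf (fragment)
# "limits" enforces association/QOS limits on jobs; adding "qos" also
# requires jobs to run under a QOS the association is allowed to use.
# slurmctld must be restarted or reconfigured after changing this.
AccountingStorageEnforce=limits,qos
```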
Unfortunately that setting did not fix the issue either:
AccountingStorageEnforce=limits,qos
I made sure to restart slurmctld as well.
Any other thoughts?
Would you be able to test this config to see why?
Thanks,
Ivan
> I thought when you assign a qos to a partition it overrides the default qos set
> in the user's association?
> Do you think the user has to explicitly request the qos in the job script?

Ivan - the job's requested QOS does not change, so you should not expect the name to change on the job. What happens in the code is that only the limits are imposed on jobs that run in that partition.

> Name MaxTRESPU MaxJobsPU MaxSubmitPU
> ---------- ------------- --------- -----------
> normal cpu=128 100 200
> andrei

Another pain point here is that there are no entries for those limits on the QOS "andrei". The code expects values for those limits on "andrei"; otherwise they do not override anything. So if you want to override, say, MaxSubmitPU, you need to set it on "andrei".

Hi Jason,

I don't want any limits set on the 'kutateladze' partition (with qos andrei).

- Is there a way to set no limits in the qos? I thought if it's blank there are none.
- Does this behave differently with a qos attached to the partition?
- So what you're saying is, even though the job won't say it's using the 'andrei' qos (scontrol show job <jobID>), it really is?

I remember before I set these limits I didn't have to put any value, and limits weren't enforced. In that scenario the normal qos was attached to the rdac_acct account, not a partition.

Thanks,
Ivan

Ivan,

> - Is there a way to set no limits in the qos? I thought if its blank there are none.

The documentation could be made clearer in this case. The partition limits override the user's limits. If none are set, the user limits are still enforced.

https://slurm.schedmd.com/slurm.conf.html#OPT_QOS

> QOS
> Used to extend the limits available to a QOS on a partition. Jobs will not be associated to this QOS outside of
> being associated to the partition. They will still be associated to their requested QOS. By default, no QOS is used.
> NOTE: If a limit is set in both the Partition's QOS and the Job's QOS, the Partition QOS will be honored unless the
> Job's QOS has the OverPartQOS flag set, in which case the Job's QOS will have priority.

> - Does this behave differently with qos attached to the partition?

The behavior is as documented. It does take a little explanation, and some trial and error, to understand how it operates. The functionality has been there for some time and has not changed in many years.

acct_policy_set_qos_order
https://github.com/SchedMD/slurm/blob/master/src/slurmctld/acct_policy.c#L4942

> - So what you're saying is, even though the job won't say it's using the 'andrei' qos (scontrol show job <jobID>), it really is?

The accounting limit enforcement will first apply the partition limits, if they are there, overriding the user's limits. Then, if there are no partition limits, the user's limits are enforced.

> I remember before I set these limits I didn't have to put any value and limits weren't enforced. In that scenario the normal qos was attached to the rdac_acct account, not a partition.

The behavior for partition QOSes has been there for some time and is a little different from that of associations in how it is applied (as mentioned above). You will need to set some value (a large number) in the partition QOS if you wish to bypass the user's normal limits.

I will have Carlos follow up with you should you have additional questions.

Ivan,

I guess you've got all your doubts covered here. I'm going to close the bug as info given, but if you need anything else please let us know.

Cheers.

Hi Carlos,
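Jason's advice above ("set some value (a large number) in the partition QOS") could be applied with sacctmgr roughly as follows; the `cpu=100000` value is an arbitrary "effectively unlimited" placeholder, not a value from this ticket:

```
# Give the partition QOS an explicitly large limit so it overrides the
# user's "normal" QOS limit (a blank limit does NOT override anything):
sacctmgr modify qos andrei set MaxTRESPerUser=cpu=100000

# Alternatively, per the slurm.conf NOTE quoted above, a job QOS can be
# allowed to win over the partition QOS via the OverPartQOS flag:
sacctmgr modify qos normal set Flags=OverPartQOS
```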
Sorry for reopening this bug.
We tested this again after adding a higher CPU limit to the andrei QOS, but jobs are still pending with reason QOSMaxCpuPerUserLimit.
Any idea why this could be occurring? The 4 jobs submitted to the kutateladze partition should not count toward the resources used in defq.
See below:
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
274571 defq tetracyc akutatel PD 0:00 1 (QOSMaxCpuPerUserLimit)
274572 defq tetracyc akutatel PD 0:00 1 (QOSMaxCpuPerUserLimit)
274573 defq tetracyc akutatel PD 0:00 1 (QOSMaxCpuPerUserLimit)
274574 defq tetracyc akutatel PD 0:00 1 (QOSMaxCpuPerUserLimit)
274567 kutatelad tetracyc akutatel R 11:18 1 node012
274568 kutatelad tetracyc akutatel R 11:18 1 node013
274569 kutatelad tetracyc akutatel R 11:18 1 node001
274570 kutatelad tetracyc akutatel R 11:18 1 node002
So currently we have the following set:
# sacctmgr show qos format=Name,MaxTRESPU,MaxJobsPU,MaxSubmitJobsPerUser
Name MaxTRESPU MaxJobsPU MaxSubmitPU
---------- ------------- --------- -----------
normal cpu=128 100 200
andrei cpu=248
Users Association:
# sacctmgr show assoc format=Cluster,Account,User,QOS,DefaultQOS
Cluster Account User QOS Def QOS
---------- ---------- ---------- -------------------- ---------
slurm_clu+ rdac_acct normal normal
slurm_clu+ rdac_acct akutatel andrei,normal normal
Partition Config:
PartitionName=kutateladze
AllowGroups=akutatel AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=andrei
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node[001,002,012,013]
PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=0 PreemptMode=OFF
State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
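The partition block above is `scontrol show partition` output; the corresponding slurm.conf definition would look roughly like this (a sketch reconstructed from the values shown, not copied from the site's actual slurm.conf):

```
# slurm.conf (fragment) - partition with an attached QOS
PartitionName=kutateladze Nodes=node[001,002,012,013] QoS=andrei AllowGroups=akutatel PriorityTier=10 Default=NO State=UP
```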
Please cancel this test. Then increase the debug level with:

scontrol setdebug debug3
scontrol setdebugflags +Gres,JobAccountGather,Priority

Reproduce the issue again by submitting the jobs. Next, post the output from:

scontrol -ad show config
scontrol -ad show assoc_mgr
scontrol -ad show partition
scontrol -ad show node
scontrol -ad show job [jobid]   # x2 times

For "show job", run it once for a job running in that partition, like 274567, and once for a job pending with QOSMaxCpuPerUserLimit, like 274571.

Finally, revert back to the default debug level with:

scontrol setdebug 0
scontrol setdebugflags -Gres,JobAccountGather,Priority

Send us back all the information printed by the requested commands, and the slurmctld.log file, so we can track in detail which limit is affecting these jobs. In fact, I hope the requested information will be illustrative enough to be self-explanatory; otherwise, it will help us find the issue.

Cheers,
Carlos.

Created attachment 26183 [details]
debug logs reproducing issue
Hi Carlos,

I've zipped up all the requested logs. The debug flag named JobAccountGather was not valid:

# scontrol setdebugflags +Gres,JobAccountGather,Priority
scontrol: error: Invalid DebugFlag: JobAccountGather
invalid debug flag: +Gres,JobAccountGather,Priority

So I just ran:

# scontrol setdebugflags +Gres,Priority

I hope that was enough information to tell what is happening.

Best,
Ivan

> The debug flag named JobAccountGather was not valid:

Sorry, I'd forgotten:

> Slurm version: 18.08.9

I'm going to have a look at the logs/details. Please remember that we can support you in understanding how things work and how to configure them. But as you are running a very old version, if I hit a bug there are two options: either it's already fixed, or it needs to be fixed in a future release. Either way, you won't get that fix. Please consider upgrading to a supported version, if possible to the latest stable release. Remember that you need to jump from latest minor to latest minor: 18.08.9 -> 19.05.8 -> 20.02.7 -> 20.11.9 -> 21.08.8 -> 22.05.2. As you can see, there's some distance between versions. Don't forget to make a backup copy of the state save folder and the database before starting the upgrade, and between successful upgrades.

I'll keep you posted on my findings.

Regards.

Hi Ivan,

What is happening is the following, and it's expected behaviour:

1. When you sent a job like "JobId=274657 JobName=k32_1", the chosen partition was "Partition=kutateladze". I guess no QOS had been set on that job, so it defaulted to "something". This "something" was "normal", because "UserId=akutatel(1003)" has "DefaultQOS=normal".

2. Yes, the Partition QOS is not necessarily the same as the Job QOS. You ended up with "JobId=274657 JobName=k32_1" having Job QOS "normal" while the "Partition=kutateladze" QOS was "andrei".

3. Then, relating to accounting, there are two separate stages: enforcement (if enabled), and billing (charging allocated resources).

3.1. The enforcement is done as explained earlier in this bug: the "Partition=kutateladze" QOS "andrei" applies with higher precedence than the Job QOS "normal", and whatever other levels are involved.

3.2. Regarding the specific limit: QOS "andrei" has "MaxTRESPU=cpu=248" and "normal" has "MaxTRESPU=cpu=128", each applying to its own QOS. For "JobId=274657 JobName=k32_1", the "andrei" limit was applied: "MaxTRESPU=cpu=248".

3.3. After the job is set to run, the usage counted against the "MaxTRESPU=cpu" limit is increased. And the core point is: it is charged against both the Partition QOS and the Job QOS, even though you didn't specify a Job QOS. The Partition QOS is not automatically assigned as the Job QOS in this situation; it is intended to impose a partition-wide restriction, not to be billed separately from the Job QOS, which is the real one storing the billing for the users/accounts/etc.

3.4. The fact that this billing gets accounted twice in this scenario reflects the need to also keep track of the partition limit that must be imposed. But the real QOS for that job was the Job QOS; the Partition QOS was just another, more global, limit we wanted to impose.

4. As you might guess now, once "UserId=akutatel(1003)" had sent enough jobs this way to reach "MaxTRESPU=cpu=128(128)" in QOS "normal", they had "MaxTRESPU=cpu=248(128)" in QOS "andrei". In that situation they couldn't send more jobs to QOS "normal", which is the QOS of "JobId=274661 JobName=d32_1"; this job didn't have any Partition QOS in action, so the enforced value came from the Job QOS.

5. Probably, if "Partition=kutateladze" had more than 248 CPUs (the partition actually has 4 nodes with 128 CPUs in total), more jobs like "JobId=274657 JobName=k32_1" could have been sent to QOS "andrei".

I hope I have followed your test case clearly enough to fully cover your doubts.

Regards.

Hi Ivan,

Do you think this could be closed as info given? Do you need further assistance?

Thanks.

Closing as info given for now.
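The head-room arithmetic behind Carlos's walk-through can be sketched as follows (the per-job size is inferred from the 4-node, 128-CPU partition in this ticket; the variable names are illustrative):

```shell
# Four jobs run in partition kutateladze, one 32-CPU node each. Their usage
# is charged against BOTH the Partition QOS ("andrei") and the Job QOS
# ("normal"), even though the jobs never requested "andrei".
jobs=4
cpus_per_job=32
used=$((jobs * cpus_per_job))                     # 128 CPUs charged to both QOSes

normal_limit=128                                  # MaxTRESPU=cpu on QOS "normal"
andrei_limit=248                                  # MaxTRESPU=cpu on QOS "andrei"

echo "normal headroom: $((normal_limit - used))"  # 0  -> new defq jobs pend
echo "andrei headroom: $((andrei_limit - used))"  # 120 CPUs still available
```

This matches the squeue output earlier in the ticket: the four defq submissions pend with QOSMaxCpuPerUserLimit because the "normal" QOS is already fully charged by the kutateladze jobs.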
Don't hesitate to reopen it if needed.
Slurm version: 18.08.9

We have a researcher that owns 4 nodes in our cluster, and I'm having difficulty figuring out the correct config for excluding the CPU-core and max-jobs limits when he uses his own partition, while still applying these limits to our other partitions.

So far I've created a new QOS named "andrei" that doesn't have any limits:

sacctmgr show qos format=Name,MaxTRESPU,MaxJobsPU,MaxSubmitJobsPerUser
Name MaxTRESPU MaxJobsPU MaxSubmitPU
---------- ------------- --------- -----------
normal cpu=128 100 200
andrei

Then I added the qos "andrei" to the user's association. (The way we set it up, the normal QOS is attached to the rdac_acct account instead of each individual user.)

sacctmgr show assoc format=Cluster,Account,User,QOS,DefaultQOS
Cluster Account User QOS Def QOS
---------- ---------- ---------- -------------------- ---------
slurm_clu+ rdac_acct normal normal
slurm_clu+ rdac_acct akutatel andrei,normal normal

Then I added the qos "andrei" to the partition's configuration:

PartitionName=kutateladze
AllowGroups=akutatel AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=andrei
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node[001,002,012,013]
PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=0 PreemptMode=OFF
State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

I thought when you assign a qos to a partition it overrides the default qos set in the user's association? Do you think the user has to explicitly request the qos in the job script?

#SBATCH --qos=andrei

Or is an additional config missing?

Thank you!
Ivan
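Given the resolution above (usage is charged to the job's own QOS, which defaults to "normal" unless requested), one sketch of a workaround consistent with that explanation, so the researcher's jobs are billed against "andrei" rather than "normal", is to request the QOS explicitly; the script flags below are illustrative, not taken from this ticket:

```
#SBATCH --partition=kutateladze
#SBATCH --qos=andrei
```

or equivalently on the command line: `sbatch --partition=kutateladze --qos=andrei job.sh`.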