Ticket 14587

Summary: Partition QOS not taking Precedence
Product: Slurm Reporter: Ivan Kovanda <ivan.kovanda>
Component: Limits
Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
CC: jvilarru
Version: Unsupported Older Versions
Hardware: Linux
OS: Linux
Site: U Denver
Attachments: debug logs reproducing issue

Description Ivan Kovanda 2022-07-21 11:01:41 MDT
Slurm version: 18.08.9

We have a researcher who owns 4 nodes in our cluster, and I'm having difficulty figuring out the correct configuration to exempt him from the CPU-core and max-jobs limits when he uses his own partition.
But I still want these limits to apply to our other partitions.
 
So far I’ve created a new QOS named "andrei" that doesn’t have any limits:
 
sacctmgr show qos format=Name,MaxTRESPU,MaxJobsPU,MaxSubmitJobsPerUser
      Name     MaxTRESPU MaxJobsPU MaxSubmitPU
---------- ------------- --------- -----------
    normal       cpu=128       100         200
    andrei
 
 
 
Then I added the QOS named “andrei” to the user’s association.
(The way we set it up, the “normal” QOS is attached to the rdac_acct account rather than to each individual user.)
 
sacctmgr show assoc format=Cluster,Account,User,QOS,DefaultQOS
   Cluster    Account       User                  QOS   Def QOS
---------- ---------- ---------- -------------------- ---------
slurm_clu+  rdac_acct                          normal    normal
slurm_clu+  rdac_acct   akutatel        andrei,normal    normal
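For reference, associations like these can be set up with sacctmgr commands along the following lines (a sketch reconstructed from the output above; the exact invocations aren't shown in the ticket):

```shell
# Sketch (not from the ticket): create the limitless QOS and attach it,
# together with "normal", to the user's association.
sacctmgr add qos andrei
sacctmgr modify user where name=akutatel account=rdac_acct \
    set qos=normal,andrei defaultqos=normal
```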
 
 
 
Then I added the qos “andrei” to the partition’s configuration:
PartitionName=kutateladze
   AllowGroups=akutatel AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=andrei
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[001,002,012,013]
   PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=0 PreemptMode=OFF
   State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
 
 
I thought that when you assign a QOS to a partition, it overrides the default QOS set in the user’s association?
Do you think the user has to explicitly request the QOS in the job script?
#SBATCH --qos=andrei
 
Or is an additional config missing?
 
Thank you!
Ivan
Comment 1 Carlos Tripiana Montes 2022-07-22 01:53:09 MDT
> I thought when you assign a qos to a partition it overrides the default qos
> set in the user’s association?

As noted in [1], the order (from high to low) is: Partition QOS, Job QOS, User/Account/Cluster associations, Partition limits. So your assumption is correct, unless [2] (OverPartQOS) is set in the Job's QOS definition.


> Do you think the user has to explicitly request the qos in the job script?
> #SBATCH --qos=andrei

Not in this scenario, unless [2] is set in the Job's QOS definition.


> Or is an additional config missing?

If I'm not missing anything, the partition QOS is set correctly, so it will override the job QOS and, of course, the association QOS.

[1] https://slurm.schedmd.com/resource_limits.html#hierarchy
[2] https://slurm.schedmd.com/sacctmgr.html#OPT_OverPartQOS
Comment 2 Ivan Kovanda 2022-07-22 10:13:46 MDT
Hi Carlos,

Thanks. For some reason it's still not working.

Do you think I need to add qos to:
AccountingStorageEnforce=

So it would be:
AccountingStorageEnforce=limits,qos


Thanks,
Ivan
Comment 3 Carlos Tripiana Montes 2022-07-25 00:02:32 MDT
Hi Ivan,

Fairly sure you're right in comment 2, if you're looking for enforcement. See [1], and [2] as well.

Please let us know whether Slurm is now behaving as you expect.

Cheers,
Carlos

[1] https://slurm.schedmd.com/accounting.html#limit-enforcement
[2] https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-after-build
Comment 4 Ivan Kovanda 2022-07-25 10:29:32 MDT
Hi Carlos,

Unfortunately, that setting did not fix the issue either:
    AccountingStorageEnforce=limit,qos 

Made sure to restart slurmctld as well.

Any other thoughts?
Would you be able to test this config to see why?

Thanks,
Ivan
Comment 5 Jason Booth 2022-07-25 10:45:40 MDT
> I thought when you assign a qos to a partition it overrides the default qos set 
>  in the user’s association?
> Do you think the user has to explicitly request the qos in the job script?

Ivan - the job's requested QOS does not change, so you should not expect the QOS name to change on the job. What happens in the code is that only the limits are imposed on jobs that run in that partition.


>      Name     MaxTRESPU MaxJobsPU MaxSubmitPU
> ---------- ------------- --------- -----------
>    normal       cpu=128       100         200
>    andrei

Another pain point here is that there are no entries for those limits on the QOS "andrei". The code expects values for those limits on "andrei"; otherwise the limits from the job's requested QOS take effect. So if you want to override, say, MaxSubmitPU, you need to set it on "andrei".
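Concretely, overriding one of those limits on "andrei" would look something like this (a sketch; the value is a placeholder, not one taken from the ticket):

```shell
# Sketch: give "andrei" its own MaxSubmitPU so it takes effect instead of
# the value set on the "normal" QOS.
sacctmgr modify qos andrei set MaxSubmitJobsPerUser=100000
```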
Comment 6 Ivan Kovanda 2022-07-25 10:59:46 MDT
Hi Jason,

I don't want any limits set on the 'kutateladze' partition (with qos andrei)

- Is there a way to set no limits in the QOS? I thought if it's blank there are none.
- Does this behave differently with a QOS attached to the partition?
- So what you're saying is that even though the job won't say it's using the 'andrei' QOS (scontrol show job <jobID>), it really is?

I remember that before I set these limits, I didn't have to put in any value and limits weren't enforced. In that scenario the 'normal' QOS was attached to the rdac_acct account, not to a partition.

Thanks,
Ivan
Comment 7 Jason Booth 2022-07-25 13:05:07 MDT
Ivan,

> - Is there a way to set no limits in the qos? I thought if its blank there are none. 

The documentation could be made clearer in this case. The partition QOS limits override the user's limits; if none are set on the partition QOS, the user's limits are still enforced.

https://slurm.schedmd.com/slurm.conf.html#OPT_QOS

> QOS
> Used to extend the limits available to a QOS on a partition. Jobs will not be associated to this QOS outside of 
> being associated to the partition. They will still be associated to their requested QOS. By default, no QOS is used. 
> NOTE: If a limit is set in both the Partition's QOS and the Job's QOS the Partition QOS will be honored unless the 
> Job's QOS has the OverPartQOS flag set in which the Job's QOS will have priority.


> - Does this behave differently with qos attached to the partition?

The behavior is as documented. It does take a little explanation and trial and error to understand how it operates.
The functionality has been there for some time and has not changed in many years.

acct_policy_set_qos_order

https://github.com/SchedMD/slurm/blob/master/src/slurmctld/acct_policy.c#L4942



> - So what your saying is even though the job won't say its using the 'andrei' qos (scontrol show job <jobID> ,  it really is?

The accounting limit enforcement first applies the partition QOS limits, if they are set, overriding the user's limits. If there are no partition QOS limits, the user's limits are enforced instead.

> I remember before I set these limits I didn't have to put any value and limits weren't enforced. In that scenario the normal qos was attached to the rdac_acct account, not a partition.

The behavior for partition QOSes has been there for some time, and it differs a little from association limits in how it is applied (as mentioned above).

You will need to set some value (a large number) on the partition QOS if you wish to have the limits from the user's 'normal' QOS bypassed.
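Put concretely, lifting the limits effectively means setting them to values large enough never to bind (a sketch with placeholder numbers, not commands from the ticket):

```shell
# Sketch: set effectively-unlimited values on the partition QOS "andrei"
# so they take precedence over the limits on the user's "normal" QOS.
sacctmgr modify qos andrei set MaxTRESPU=cpu=100000 MaxJobsPU=100000
```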

I will have Carlos follow up with you should you have additional questions.
Comment 8 Carlos Tripiana Montes 2022-08-02 04:48:48 MDT
Ivan,

I guess all your doubts are covered here. I'm going to close the ticket as info given, but if you need anything else, please let us know.

Cheers.
Comment 9 Ivan Kovanda 2022-08-04 15:40:12 MDT
Hi Carlos,

Sorry for reopening this bug.
We tested this again after adding a higher CPU limit to the 'andrei' QOS, but it is still not working; jobs are held with reason QOSMaxCpuPerUserLimit.
Any idea why this could be occurring? The 4 jobs submitted to the kutateladze partition should not count toward resources used in defq.

See below:

# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            274571      defq tetracyc akutatel PD       0:00      1 (QOSMaxCpuPerUserLimit)
            274572      defq tetracyc akutatel PD       0:00      1 (QOSMaxCpuPerUserLimit)
            274573      defq tetracyc akutatel PD       0:00      1 (QOSMaxCpuPerUserLimit)
            274574      defq tetracyc akutatel PD       0:00      1 (QOSMaxCpuPerUserLimit)
            274567 kutatelad tetracyc akutatel  R      11:18      1 node012
            274568 kutatelad tetracyc akutatel  R      11:18      1 node013
            274569 kutatelad tetracyc akutatel  R      11:18      1 node001
            274570 kutatelad tetracyc akutatel  R      11:18      1 node002


So currently we have the following set:

# sacctmgr show qos format=Name,MaxTRESPU,MaxJobsPU,MaxSubmitJobsPerUser
      Name     MaxTRESPU MaxJobsPU MaxSubmitPU
---------- ------------- --------- -----------
    normal       cpu=128       100         200
    andrei       cpu=248

Users Association:

# sacctmgr show assoc format=Cluster,Account,User,QOS,DefaultQOS
   Cluster    Account       User                  QOS   Def QOS
---------- ---------- ---------- -------------------- ---------
slurm_clu+  rdac_acct                          normal    normal
slurm_clu+  rdac_acct   akutatel        andrei,normal    normal




Partition Config:

PartitionName=kutateladze
   AllowGroups=akutatel AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=andrei
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[001,002,012,013]
   PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=0 PreemptMode=OFF
   State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Comment 10 Carlos Tripiana Montes 2022-08-05 00:05:53 MDT
Please cancel this test. Then increase the debug level with:

scontrol setdebug debug3
scontrol setdebugflags +Gres,JobAccountGather,Priority

Reproduce the issue again by sending the jobs. Next, post output from:

scontrol -ad show config
scontrol -ad show assoc_mgr
scontrol -ad show partition
scontrol -ad show node
scontrol -ad show job [jobid] #x2 times

For "show job", run it twice: once for a job running in that partition (e.g. 274567), and once for a job pending with QOSMaxCpuPerUserLimit (e.g. 274571).

Finally, revert to the default debug level with:

scontrol setdebug 0
scontrol setdebugflags -Gres,JobAccountGather,Priority

Send us back all the information printed by the requested commands, plus the slurmctld.log file, so we can track in more detail which imposed limit is affecting these jobs. I hope the requested information will be illustrative enough to be self-explanatory; otherwise, it will help us find the issue.

Cheers,
Carlos.
Comment 11 Ivan Kovanda 2022-08-05 11:49:39 MDT
Created attachment 26183
debug logs reproducing issue
Comment 12 Ivan Kovanda 2022-08-05 11:51:56 MDT
Hi Carlos,

I've zipped up all the logs requested.

The debug flag named JobAccountGather was not valid:

# scontrol setdebugflags +Gres,JobAccountGather,Priority
scontrol: error: Invalid DebugFlag: JobAccountGather
invalid debug flag: +Gres,JobAccountGather,Priority


So I just ran:
# scontrol setdebugflags +Gres,Priority

I hope that was enough information to tell what is happening.

Best,
Ivan
Comment 13 Carlos Tripiana Montes 2022-08-08 00:04:11 MDT
> The debug flag named JobAccountGather was not valid:

Sorry, I forgot:

> Slurm version: 18.08.9

I'm going to have a look at the logs/details. Please remember that we can support you in understanding how things work and how things get configured. But as you are running a very old version, if I hit a bug there are two options: either it's already fixed, or it needs to be fixed in a future release. Either way, you won't get that fix in 18.08.

Please consider upgrading to a supported version, if possible the latest stable release. Note that you need to upgrade stepwise, from latest minor to latest minor: 18.08.9 -> 19.05.8 -> 20.02.7 -> 20.11.9 -> 21.08.8 -> 22.05.2, so there's some distance between versions. Don't forget to back up the state save directory and the database before starting the upgrade, and between successful upgrades.

I'll keep you posted on my findings.

Regards.
Comment 16 Carlos Tripiana Montes 2022-08-10 01:27:53 MDT
Hi Ivan,

What is happening is the following, and it's expected behaviour:

1. When you sent a job like "JobId=274657 JobName=k32_1", the chosen partition was "Partition=kutateladze". I guess no QOS had been set to that job, so it defaulted to "something". This "something" was "normal" because "UserId=akutatel(1003)" has "DefaultQOS=normal".

2. Yes, the Partition QOS is not necessarily the same as the Job QOS. You ended up with "JobId=274657 JobName=k32_1" having Job QOS "normal" while the "Partition=kutateladze" QOS was "andrei".

3. Then, relating to accounting, there are two separate stages: enforcement (if enabled) and billing (charging allocated resources).
  3.1. The enforcement is done as explained earlier in this ticket: the "Partition=kutateladze" QOS "andrei" applies with higher precedence than the Job QOS "normal" and any other levels involved.
  3.2. Regarding the specific limit: QOS "andrei" has "MaxTRESPU=cpu=248" and "normal" has "MaxTRESPU=cpu=128", each with its own limit. For "JobId=274657 JobName=k32_1", the "andrei" limit was applied: "MaxTRESPU=cpu=248".

  3.3. After the job starts to run, the usage counted against the "MaxTRESPU=cpu" limit is increased. The core point is: it is charged both to the Partition QOS and to the Job QOS, even though you didn't specify a Job QOS. The Partition QOS is not automatically assigned as the Job QOS in this situation; it is intended to impose a partition-level restriction, not to be billed separately from the Job QOS, which is the real one storing the billing for the users/accounts/...
  3.4. The fact that this usage gets accounted twice in this scenario reflects the need to keep track of the partition limit as well. But the real QOS for that job was the Job QOS; the Partition QOS was just another, more global, limit we wanted to impose.

4. As you might guess now, once "UserId=akutatel(1003)" had sent enough jobs this way to reach "MaxTRESPU=cpu=128(128)" in QOS "normal", they were at "MaxTRESPU=cpu=248(128)" in QOS "andrei". At that point they couldn't send more jobs under QOS "normal", which is the QOS of "JobId=274661 JobName=d32_1"; that job had no Partition QOS in play, so the enforced value came from the Job QOS.

5. Probably, if "Partition=kutateladze" had more than its current CPUs (the actual partition size is 4 nodes), more jobs like "JobId=274657 JobName=k32_1" could have been run under QOS "andrei", up to its 248-CPU limit.
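The precedence and double-charging described above can be summarized with a toy sketch (plain shell, not Slurm code; the function is hypothetical, and the numbers mirror this ticket's QOS values):

```shell
#!/bin/sh
# Toy model of the enforcement order: the Partition QOS limit, when set,
# takes precedence over the Job QOS limit. Usage, however, is charged
# against BOTH QOSes, which is why jobs running in kutateladze still
# count toward "normal"'s cpu=128 for jobs pending in defq.
effective_cpu_limit() {
  part_qos_limit=$1   # MaxTRESPU cpu on the Partition QOS ("" if none)
  job_qos_limit=$2    # MaxTRESPU cpu on the Job QOS
  if [ -n "$part_qos_limit" ]; then
    echo "$part_qos_limit"
  else
    echo "$job_qos_limit"
  fi
}

effective_cpu_limit 248 128   # kutateladze: Partition QOS "andrei" wins
effective_cpu_limit ""  128   # defq: no Partition QOS, Job QOS "normal" applies
```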

I hope this walkthrough of your test case is clear enough to fully cover your doubts.

Regards.
Comment 17 Carlos Tripiana Montes 2022-08-11 01:31:54 MDT
Hi Ivan,

Do you think it could be closed as info given? Do you need further assistance?

Thanks.
Comment 18 Carlos Tripiana Montes 2022-08-12 04:26:47 MDT
Closing as info given for now. Don't hesitate to reopen it if needed.