Ticket 9163

Summary: QOS confusion
Product: Slurm Reporter: Todd Merritt <tmerritt>
Component: Configuration Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 19.05.6   
Hardware: Linux   
OS: Linux   
Site: U of AZ

Description Todd Merritt 2020-06-04 09:25:59 MDT
I imagine this is confusion on my part, but I cannot figure out for the life of me what could be wrong with this configuration. When I try to submit through srun with a QOS that differs from the QOS on the partition, I get:

tmerritt@junonia:~/puma $ srun --nodes=4 --ntasks=4 --ntasks-per-node=1 --mem-per-cpu=1GB --time=01:00:00 --job-name=interactive --account=tmerritt --partition=standard --qos=user_qos_bjoyce3 --pty bash -i
srun: error: Unable to allocate resources: Invalid qos specification

Slurm log shows

[2020-06-04T08:14:54.483] error: This association 37(account='tmerritt', user='tmerritt', partition='standard') does not have access to qos user_qos_bjoyce3
[2020-06-04T08:14:54.483] _slurm_rpc_allocate_resources: Invalid qos specification 

but

root@ericidle:~ # sacctmgr -p list account tmerritt withassoc
Account|Descr|Org|Cluster|Par Name|User|Share|Priority|GrpJobs|GrpNodes|GrpCPUs|GrpMem|GrpSubmit|GrpWall|GrpCPUMins|MaxJobs|MaxNodes|MaxCPUs|MaxSubmit|MaxWall|MaxCPUMins|QOS|Def QOS|
tmerritt|sub group for parent_49|parent_49|puma|parent_49||1|||||||||500|||3000|||normal,user_qos_bjoyce3||
tmerritt|sub group for parent_49|parent_49|puma||tmerritt|1|||||||||500|||3000|||part_qos_standard||
tmerritt|sub group for parent_49|parent_49|puma||ric|1|||||||||500|||3000|||part_qos_standard||
tmerritt|sub group for parent_49|parent_49|puma||chrisreidy|1|||||||||500|||3000|||part_qos_standard||
tmerritt|sub group for parent_49|parent_49|puma||amichel|1|||||||||500|||3000|||part_qos_standard||

I created the QOS after I added all of the user/account associations. Do I really need to go back in and manually touch all of them? Even if that's the case, sacctmgr modify user doesn't list QOS as a settable option in its help.

Thanks!
Todd
Comment 1 Ben Roberts 2020-06-04 13:55:58 MDT
Hi Todd,

The default behavior is for users in an account to inherit the QOS that is defined for the account.  In your case it looks like the QOS has been explicitly set for the users in that account.  When the QOS is explicitly set for a user, that user no longer inherits the QOS information from the parent account.  You should be able to unset the QOS information for the users, at which point they will list the same QOSes that the account itself shows.  Here's an example from my environment.  You can see that user3 has two extra QOSes defined:
$ sacctmgr show assoc tree account=test_lab format=cluster,account,user,qos,defaultqos
   Cluster              Account       User                  QOS   Def QOS 
---------- -------------------- ---------- -------------------- --------- 
   winston test_lab                                     testqos   testqos 
   winston  test_lab                 user1              testqos   testqos 
   winston  test_lab                 user2              testqos   testqos 
   winston  test_lab                 user3    red,test2,testqos   testqos 
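A listing like the one above can also be checked mechanically. The following is only a minimal sketch: find_qos_overrides is a hypothetical helper (not part of Slurm) that reads pipe-delimited account|user|qos lines and prints every user whose QOS list differs from the account row's. It assumes a single account on a single cluster, that the account row (empty user field) comes first, and that the QOS lists can be compared as plain strings.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads "account|user|qos" lines on
# stdin and prints each user whose QOS list differs from the account row's.
# Assumes the account row (empty user field) appears before its user rows.
find_qos_overrides() {
    awk -F'|' '
        $2 == "" { acct_qos[$1] = $3; next }    # remember the account-level QOS list
        ($1 in acct_qos) && $3 != acct_qos[$1] { print $2 ": " $3 }
    '
}

# Sample input mirroring the listing above. On a live system the input could
# come from something like (assumed invocation, adjust to your site):
#   sacctmgr -nP show assoc account=test_lab format=account,user,qos
find_qos_overrides <<'EOF'
test_lab||testqos
test_lab|user1|testqos
test_lab|user2|testqos
test_lab|user3|red,test2,testqos
EOF
```

With the sample data, only user3 is reported, since it is the only row whose QOS list does not match the account's.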



I use this command to unset the explicitly defined QOSes so that the user inherits them again from the parent:
$ sacctmgr modify user user3 account=test_lab set qos=''
 Modified user associations...
  C = winston    A = test_lab             U = user3    
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y


Now user3 has the same QOS as the rest of the account/users:
$ sacctmgr show assoc tree account=test_lab format=cluster,account,user,qos,defaultqos
   Cluster              Account       User                  QOS   Def QOS 
---------- -------------------- ---------- -------------------- --------- 
   winston test_lab                                     testqos   testqos 
   winston  test_lab                 user1              testqos   testqos 
   winston  test_lab                 user2              testqos   testqos 
   winston  test_lab                 user3              testqos   testqos 





To unset it for your user you would run this:
sacctmgr modify user tmerritt account=tmerritt set qos=''


Unfortunately, in this case you would need to unset the QOS for each user association that had the QOS defined manually.  About the sacctmgr command, the reason you don't see QOS as a modifiable option for users is that what you're actually modifying is an 'association'.  An association is an entity that consists of a specific cluster, account and user combination, potentially with a partition included in the mix.  
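
If there are many user associations to clear, the per-user command can be generated rather than typed by hand. This is a rough sketch, not a tested procedure: gen_unset_qos is a hypothetical helper that only prints the commands so they can be reviewed first. The -i (immediate) flag is an addition not used earlier in this ticket; it skips the commit prompt, so drop it if you prefer to confirm each change.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): prints, without running them, one
# "sacctmgr modify user" command per user name read from stdin.
# Usage: gen_unset_qos ACCOUNT < user_list
gen_unset_qos() {
    account=$1
    while IFS= read -r user; do
        [ -n "$user" ] || continue    # skip blank lines
        # -i is assumed here to commit without the interactive prompt
        printf "sacctmgr -i modify user %s account=%s set qos=''\n" "$user" "$account"
    done
}

# The four users shown under the tmerritt account in the description.
gen_unset_qos tmerritt <<'EOF'
tmerritt
ric
chrisreidy
amichel
EOF
```

After reviewing the printed commands, they could be run by piping the output to sh on a host with admin access to sacctmgr.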

Let me know if you have any problems or questions about this.

Thanks,
Ben
Comment 2 Todd Merritt 2020-06-04 15:05:09 MDT
Thanks! That cleared it up and it's working as expected after clearing the QOS from the user.