There are some problems in accounting when a job is submitted to multiple partitions, where one or more partitions have a QOS. In the code we call acct_policy_set_qos_order() with two QOS, which assumes that there are at most two QOS: the job's QOS and the partition's (singular) QOS. But in reality there can be as many QOS as partitions, plus one for the job.

Bug 6659 is one example of this, which reported problems with accrue_cnt. Below is another example of this problem in acct_policy_validate().

Rather than fix each problem individually, we're going to address all the problems at once. I'm making this bug public and marking bug 6659 as a duplicate of this bug. I'm also marking this ticket as an enhancement. We're targeting these fixes/changes for 20.11.

Example of this problem in acct_policy_validate():

    PartitionName=debug State=UP Nodes=snowflake[0-5] qos=test
    PartitionName=debug2 State=UP Nodes=snowflake[6-10] qos=test2

    sacctmgr mod qos test2 set maxsubmit=1

one terminal...

    salloc -pdebug2
    salloc: Granted job allocation 93267

other terminal...

    salloc -pdebug,debug2 -wsnowflake7
    salloc: Granted job allocation 93266

(This is clearly wrong, as both jobs are running in debug2...

    squeue
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    93267    debug2 bash   da  R 1:37     1 snowflake6
    93266    debug2 bash   da  R 4:59     1 snowflake7
)

Here is what we see in the slurmctld...

    scontrol show assoc flag=qos

    QOS=normal(1)
    ...
      User Limits
        7558
          MaxJobsPU=N(2) MaxJobsAccruePU=N(0) MaxSubmitJobsPU=N(2)
    ...
    QOS=test(958)
    ...
      User Limits
        7558
          MaxJobsPU=N(0) MaxJobsAccruePU=N(0) MaxSubmitJobsPU=N(0)
    ...
    QOS=test2(959)
    ...
      User Limits
        7558
          MaxJobsPU=N(2) MaxJobsAccruePU=N(0) MaxSubmitJobsPU=1(1)

So, where is that second job accounted for in test2? MaxJobsPU shows two running jobs, but the submit count stays at 1, so the second job never counted against MaxSubmitJobsPU. We don't account for it in 'test' either.
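To make the "partitions plus one" count concrete, here is a minimal Python sketch, not Slurm code. The Partition class and the qos_for_job() helper are hypothetical names invented for illustration; only the partition and QOS names come from the example above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for a partition record (not a Slurm type)."""
    name: str
    qos: Optional[str] = None  # partition QOS, if one is configured


def qos_for_job(job_qos: str, partitions: List[Partition]) -> List[str]:
    """Collect every distinct QOS that can apply to a job:
    the job's own QOS plus the QOS of each requested partition."""
    seen = [job_qos]
    for part in partitions:
        if part.qos is not None and part.qos not in seen:
            seen.append(part.qos)
    return seen


debug = Partition("debug", qos="test")
debug2 = Partition("debug2", qos="test2")

# A job submitted with -pdebug,debug2 under the "normal" job QOS.
print(qos_for_job("normal", [debug, debug2]))  # ['normal', 'test', 'test2']
```

With two partitions the job already touches three QOS, so any code path that only carries two QOS pointers has to drop one of them.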
*** Ticket 6659 has been marked as a duplicate of this ticket. ***
This is affecting Ohio Supercomputer Center on version 20.02.4. If this could get patched in the 20.02 series that would really help us out.
*** Ticket 9788 has been marked as a duplicate of this ticket. ***
I mentioned this on a duplicate bug but I'll mention it here. Right now I'm targeting a fix for 20.11 (hopefully I'll get it done before 20.11 is released), but I could probably provide a patch for testing on 20.02 when it's ready.
*** Ticket 10745 has been marked as a duplicate of this ticket. ***
*** Ticket 11475 has been marked as a duplicate of this ticket. ***
To all the sites looking at this bug - We pushed two commits to fix this issue.

47e46a45e6 Do not use accrue limits for partition QOS

    There were issues with accrue limits for partition QOS when submitting jobs to multiple partitions. There wasn't a good way to fix this for multiple partitions, so we made it so accrue limits don't work on partition QOS at all. Accrue limits do still work on job QOS. This is a feature change.

9125409e12 Fix acct_policy_validate() to consider all partition QOS

    This ensures that we loop through all partitions when validating a job at job submission time.

These have been pushed to master and will be part of the 21.08 release.
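As a rough illustration of what "loop through all partitions when validating" means, here is a small Python model. It is a sketch under assumptions, not the actual change in 9125409e12: the Qos class, its fields, and validate_submit() are hypothetical, and what slurmctld does with a job once one partition fails validation is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Qos:
    """Hypothetical QOS record holding only a per-user submit limit."""
    name: str
    max_submit_per_user: Optional[int] = None  # None means no limit
    submitted: Dict[str, int] = field(default_factory=dict)  # per-user count


def validate_submit(user: str, partition_qos: List[Qos]) -> Optional[str]:
    """Return the name of the first partition QOS whose submit limit the
    new job would exceed, or None if every QOS in the list accepts it."""
    for qos in partition_qos:
        limit = qos.max_submit_per_user
        if limit is not None and qos.submitted.get(user, 0) >= limit:
            return qos.name
    return None


# State after the first salloc -pdebug2: user 7558 has one job in test2.
test = Qos("test")
test2 = Qos("test2", max_submit_per_user=1, submitted={"7558": 1})

# Looking only at the first partition's QOS misses the limit on test2.
print(validate_submit("7558", [test]))         # None
# Looking at every partition's QOS finds it.
print(validate_submit("7558", [test, test2]))  # test2
```

The first call mirrors the reported behavior for salloc -pdebug,debug2, where only one partition QOS was considered; the second shows the limit on test2 being caught once every partition's QOS is checked.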