Hello SchedMD!

We have just found a quirk that seems to allow users to bypass partition QOS limits.

We have two partitions, each with their own partition QOS:

-- 8< -----------------------------------------------------------------------------------
# for p in pritch owners; do scontrol show partition $p | grep -E "Partition| QoS="; done
PartitionName=pritch
   AllocNodes=ALL Default=NO QoS=owner
PartitionName=owners
   AllocNodes=ALL Default=NO QoS=owners
-- 8< -----------------------------------------------------------------------------------

The partition QOS are as follows:

-- 8< -----------------------------------------------------------------------------------
# sacctmgr list qos format=name,MaxSubmitPU,MaxSubmitPA owner,owners
      Name MaxSubmitPU MaxSubmitPA
---------- ----------- -----------
     owner        3000        5000
    owners        3000        5000
-- 8< -----------------------------------------------------------------------------------

We found a user who was able to submit more than 3,000 jobs to the "owners" partition, by submitting jobs to both partitions at once:

-- 8< -----------------------------------------------------------------------------------
# squeue -u daphna -p owners -h | wc -l
4252
# squeue -u daphna -p pritch -h | wc -l
1125
-- 8< -----------------------------------------------------------------------------------

When submitting jobs to "owners" only, submission is correctly rejected with QOSMaxSubmitJobPerUserLimit:

-- 8< -----------------------------------------------------------------------------------
daphna $ sbatch -p owners --wrap="sleep 1"
sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
-- 8< -----------------------------------------------------------------------------------

But when submitting to both partitions, the job is accepted (probably because the limit in the "pritch" partition has not been reached yet):

-- 8< -----------------------------------------------------------------------------------
daphna $ sbatch -p pritch,owners --wrap="sleep 1"
Submitted batch job 23181981
-- 8< -----------------------------------------------------------------------------------

And ultimately, the job is allowed to run in the "owners" partition and exceed the MaxSubmitJobPerUser QOS limit.
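For what it's worth, once a user is at the "owners" limit the bypass is trivially repeatable: a loop of multi-partition submissions goes straight through. A minimal sketch with throwaway sleep jobs (the partition names are just our setup):

-- 8< -----------------------------------------------------------------------------------
daphna $ for i in $(seq 1 100); do sbatch -p pritch,owners --wrap="sleep 300"; done
-- 8< -----------------------------------------------------------------------------------

Every one of those submissions is accepted, even though the user is already past the owners MaxSubmitPU limit.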
Here are excerpts of "scontrol show assoc" for that user:

-- 8< -----------------------------------------------------------------------------------
UserName=daphna(363649) DefAccount=pritch DefWckey= AdminLevel=None
--
ClusterName=sherlock Account=pritch UserName=daphna(363649) Partition= Priority=0 ID=28144
    SharesRaw/Norm/Level/Factor=100/0.05/2100/0.01
    UsageRaw/Norm/Efctv=978238931.63/0.01/0.47
    ParentAccount= Lft=14656 DefAssoc=Yes
    GrpJobs=N(5319) GrpJobsAccrue=N(0) GrpSubmitJobs=N(5380) GrpWall=N(4623923.57)
    GrpTRES=cpu=N(5319),mem=N(54466560),energy=N(0),node=N(666),billing=N(15957),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
    GrpTRESMins=cpu=N(4679194),mem=N(47616939166),energy=N(0),node=N(4623923),billing=N(13996598),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
    GrpTRESRunMins=cpu=N(12587703),mem=N(128898085546),energy=N(0),node=N(12587703),billing=N(37763111),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
    MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh=
--
QOS=normal(3)
--
363649 MaxJobsPU=N(5319) MaxJobsAccruePU=2(0) MaxSubmitJobsPU=1000(5380) MaxTRESPU=cpu=512(5319),mem=N(54466560),energy=N(0),node=N(666),billing=N(15957),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
--
QOS=owner(53)
--
363649 MaxJobsPU=N(1116) MaxJobsAccruePU=2(0) MaxSubmitJobsPU=3000(1177) MaxTRESPU=cpu=99999(1116),mem=N(11427840),energy=N(0),node=N(57),billing=N(3348),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
--
QOS=owners(58)
--
363649 MaxJobsPU=N(4203) MaxJobsAccruePU=5(20) MaxSubmitJobsPU=3000(4264) MaxTRESPU=cpu=8192(4203),mem=N(43038720),energy=N(0),node=N(609),billing=N(12609),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
-- 8< -----------------------------------------------------------------------------------

Shouldn't multi-partition jobs be prevented from running in a partition whose QOSMaxSubmitJobPerUserLimit has already been reached for that user? Right now, that limit is not enforced, and that user has been able to submit more jobs than the partition QOS's MaxSubmitJobPerUser limit allows.

Thanks!
-- Kilian
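(Side note on reading the output above: the value before each parenthesis is the configured limit, with N meaning no limit, and the value in parentheses is the current usage, so MaxSubmitJobsPU=3000(4264) means 4,264 queued jobs against a limit of 3,000. Those counters can be pulled out directly with something like the following sketch, using scontrol's assoc_mgr filters:)

-- 8< -----------------------------------------------------------------------------------
# scontrol show assoc_mgr users=daphna flags=qos | grep -E "QOS=|MaxSubmitJobsPU"
-- 8< -----------------------------------------------------------------------------------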
Hi Kilian,

I'm pretty sure I know what is causing this. I'm going to do some more research to see if I'm correct and see how I can fix it. I'll let you know when I have more info.
(In reply to Marshall Garey from comment #2)
> I'm pretty sure I know what is causing this. I'm going to do some more
> research to see if I'm correct and see how I can fix it. I'll let you know
> when I have more info.

Excellent, glad to hear! Thanks for letting me know.

Cheers,
-- Kilian
Hi Marshall,

Just a quick ping to see if you had any updates on this?

Right now, a number of users are able to exceed their configured limits, sometimes by a wide margin: one user has over 16,000 jobs in the queue, while our highest MaxSubmitPerUser QOS limit is 3,000. It looks like the more partitions they submit their jobs to, the more jobs they can enqueue.

This is obviously causing a strong usage imbalance on our system, so any way to restore proper limit enforcement would be very much appreciated. :)

Thanks!
-- Kilian
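P.S. In case it helps to size the problem, this is roughly how we count queued jobs per user in a given partition (plain squeue; "owners" is just one of our partition names):

-- 8< -----------------------------------------------------------------------------------
# squeue -h -p owners -o "%u" | sort | uniq -c | sort -rn | head
-- 8< -----------------------------------------------------------------------------------

Comparing those counts against the partition QOS MaxSubmitPU values is how we spotted users well past the limit.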
I've looked into how to fix this and it's really tricky. Unfortunately I don't have any more information than that, but thank you for your update and for checking in.
(In reply to Marshall Garey from comment #5)
> I've looked into how to fix this and it's really tricky. Unfortunately I
> don't have any more information than that

Argh, that's unfortunate. :(

I don't remember seeing this before 20.11, and we always had users heavily relying on multi-partition submissions. This actually reminds me of https://bugs.schedmd.com/show_bug.cgi?id=3849, where we had the opposite problem: user jobs being rejected for a limit they were not actually close to reaching.

So, is this a 20.11 regression? Is there a way to escalate this bug to get it prioritized?

Thanks,
-- Kilian
> I don't remember seeing this before 20.11, and we always had users heavily relying on multi-partition submissions.

You probably got lucky. This is not a 20.11 regression; the bug exists in all Slurm versions. The issue is that Slurm only looks at the job's QOS and the first partition's QOS when enforcing/applying limits, so the QOS limits of the other requested partitions are never checked.

> Is there a way to escalate this bug to get it prioritized?

I'll prioritize this bug for the next couple of weeks (hopefully enough time to get it fixed). I wanted to get this fixed by 21.08 anyway. I'll let you know how it's going later in the week.
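If only the first partition's QOS in the -p list is the one being consulted, the partition order should change the outcome for a user who is already over the owners limit. A quick sanity check along those lines (hypothetical, reusing the commands from the original report; not output from an actual run):

-- 8< -----------------------------------------------------------------------------------
## pritch listed first: presumably accepted, as in the original report
daphna $ sbatch -p pritch,owners --wrap="sleep 1"
## owners listed first: presumably rejected with QOSMaxSubmitJobPerUserLimit
daphna $ sbatch -p owners,pritch --wrap="sleep 1"
-- 8< -----------------------------------------------------------------------------------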
Created attachment 19755 [details]
20.11 - patch for Stanford

Kilian,

This patch checks the job QOS and all partition QOS on job submission for limits. This patch will prevent jobs from being submitted that would violate the MaxSubmitJobs limit for any partition QOS. It will also prevent jobs from being submitted that would violate any other QOS limit if the QOS has the flag DenyOnLimit.

This was actually the easier part to fix. There are still other things that I need to fix that will be harder, but I wanted to get you a patch that would at least help the specific situation you're hitting.

This patch does NOT fix the issue with heterogeneous jobs, since they go through a different function to check for limits and I didn't fix that one just yet.

Also, this patch will not cancel any jobs that have already been submitted that exceed the limit.

Can you try it out and let me know how it works? In the meantime I will continue my work fixing all the other cases. If it works well enough for you, then I will have you continue using this local patch for now and close this bug as a duplicate of bug 7375. (I'll continue making 7375 the top priority for myself.)

- Marshall
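A side note on the DenyOnLimit behavior mentioned above: for the patch to reject submissions over the other QOS limits, that flag has to be present on the partition QOS. Assuming it is not already set on "owner"/"owners", something like the following sketch would add it (check the existing flags first, since this writes the Flags field):

-- 8< -----------------------------------------------------------------------------------
## check which flags are currently set on the partition QOSes
# sacctmgr show qos owner,owners format=name,flags
## set DenyOnLimit so limit violations are rejected at submit time
# sacctmgr modify qos owner,owners set flags=DenyOnLimit
-- 8< -----------------------------------------------------------------------------------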
Hi Marshall, (In reply to Marshall Garey from comment #8) > This patch checks the job QOS and all partition QOS on job submission for > limits. This patch will prevent jobs from being submitted that would violate > the MaxSubmitJobs limit for any partition QOS. It will also prevent jobs > from being submitted that would violate any other QOS limit if the QOS has > the flag DenyOnLimit. > > This was actually the easier part to fix. There are still other things that > I need to fix that will be harder, but I wanted to get you a patch that > would at least help the specific situation you're hitting. Thank you! I will give it a try and let you know how it goes. > This patch does NOT fix the issue with heterogeneous jobs, since they go > through a different function to check for limits and I didn't fix that one > just yet. > > Also, this patch will not cancel any jobs that have already been submitted > that exceed the limit. Both points noted, and that shouldn't be a problem, thanks! > Can you try it out and let me know how it works? In the meantime I will > continue my work fixing all the other cases. If it works well enough for > you, then I will have you continue using this local patch for now and close > this bug as a duplicate of bug 7375. (I'll continue making 7375 the top > priority for myself.) Sounds great. I'll report back shortly. Cheers, -- Kilian
Since I haven't heard back, I'm assuming testing the patch went fine. I'm closing this bug as a duplicate of 7375. In other news, I'm now basically done with bug 7375 - I submitted a set of patches to our review team which includes a slight variant of the patch I uploaded in this bug.

Please let me know if you have any problems with this patch.

*** This ticket has been marked as a duplicate of ticket 7375 ***
(In reply to Marshall Garey from comment #10)
> Since I haven't heard back, I'm assuming testing the patch went fine. I'm
> closing this bug as a duplicate of 7375. In other news, I'm now basically
> done with bug 7375 - I submitted a set of patches to our review team which
> includes a slight variant of the patch I uploaded in this bug.
>
> Please let me know if you have any problems with this patch.

Thanks Marshall, and sorry for the lack of updates. :\

The patch has been in production for about a week now, and it's working great: the limits are now enforced for multi-partition jobs, exactly as expected.

Thanks a lot for the patch! We'll keep running with it until #7375 is merged.

Cheers,
-- Kilian
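(For reference, a simple way to confirm the new behavior is to repeat the multi-partition submission from the original report with a user already over the owners MaxSubmitPU limit; with the patch applied it should now be rejected with QOSMaxSubmitJobPerUserLimit, just like the single-partition case:)

-- 8< -----------------------------------------------------------------------------------
daphna $ sbatch -p pritch,owners --wrap="sleep 1"
-- 8< -----------------------------------------------------------------------------------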