Hello SchedMD!

We have just found a quirk that seems to allow users to bypass partition QOS limits.

We have two partitions, each with their own partition QOS:

-- 8< -----------------------------------------------------------------------------------
# for p in pritch owners; do scontrol show partition $p | grep -E "Partition| QoS="; done
PartitionName=pritch
   AllocNodes=ALL Default=NO QoS=owner
PartitionName=owners
   AllocNodes=ALL Default=NO QoS=owners
-- 8< -----------------------------------------------------------------------------------

The partition QOS are as follows:

-- 8< -----------------------------------------------------------------------------------
# sacctmgr list qos format=name,MaxSubmitPU,MaxSubmitPA owner,owners
      Name MaxSubmitPU MaxSubmitPA
---------- ----------- -----------
     owner        3000        5000
    owners        3000        5000
-- 8< -----------------------------------------------------------------------------------

We found a user who was able to submit more than 3,000 jobs to the "owners" partition, by submitting jobs to both partitions at once:

-- 8< -----------------------------------------------------------------------------------
# squeue -u daphna -p owners -h | wc -l
4252
# squeue -u daphna -p pritch -h | wc -l
1125
-- 8< -----------------------------------------------------------------------------------

When submitting jobs to "owners" only, submission is correctly rejected with QOSMaxSubmitJobPerUserLimit:

-- 8< -----------------------------------------------------------------------------------
daphna $ sbatch -p owners --wrap="sleep 1"
sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
-- 8< -----------------------------------------------------------------------------------

But when submitting to both partitions, the job is accepted (probably because the limit in the "pritch" partition has not been reached yet):

-- 8< -----------------------------------------------------------------------------------
daphna $ sbatch -p pritch,owners --wrap="sleep 1"
Submitted batch job 23181981
-- 8< -----------------------------------------------------------------------------------

And ultimately, the job is allowed to run in the "owners" partition and exceed the MaxSubmitJobPerUser QOS limit.
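For what it's worth, once a user is at the "owners" limit the bypass is trivially repeatable: a loop of multi-partition submissions goes straight through. A minimal sketch with throwaway sleep jobs (the partition names are just our setup):

-- 8< -----------------------------------------------------------------------------------
daphna $ for i in $(seq 1 100); do sbatch -p pritch,owners --wrap="sleep 300"; done
-- 8< -----------------------------------------------------------------------------------

Every one of those submissions is accepted, even though the user is already past the owners MaxSubmitPU limit.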
Here are excerpts of "scontrol show assoc" for that user:

-- 8< -----------------------------------------------------------------------------------
UserName=daphna(363649) DefAccount=pritch DefWckey= AdminLevel=None
--
ClusterName=sherlock Account=pritch UserName=daphna(363649) Partition= Priority=0 ID=28144
    SharesRaw/Norm/Level/Factor=100/0.05/2100/0.01
    UsageRaw/Norm/Efctv=978238931.63/0.01/0.47
    ParentAccount= Lft=14656 DefAssoc=Yes
    GrpJobs=N(5319) GrpJobsAccrue=N(0) GrpSubmitJobs=N(5380) GrpWall=N(4623923.57)
    GrpTRES=cpu=N(5319),mem=N(54466560),energy=N(0),node=N(666),billing=N(15957),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
    GrpTRESMins=cpu=N(4679194),mem=N(47616939166),energy=N(0),node=N(4623923),billing=N(13996598),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
    GrpTRESRunMins=cpu=N(12587703),mem=N(128898085546),energy=N(0),node=N(12587703),billing=N(37763111),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
    MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh=
--
QOS=normal(3)
--
363649 MaxJobsPU=N(5319) MaxJobsAccruePU=2(0) MaxSubmitJobsPU=1000(5380) MaxTRESPU=cpu=512(5319),mem=N(54466560),energy=N(0),node=N(666),billing=N(15957),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
--
QOS=owner(53)
--
363649 MaxJobsPU=N(1116) MaxJobsAccruePU=2(0) MaxSubmitJobsPU=3000(1177) MaxTRESPU=cpu=99999(1116),mem=N(11427840),energy=N(0),node=N(57),billing=N(3348),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
--
QOS=owners(58)
--
363649 MaxJobsPU=N(4203) MaxJobsAccruePU=5(20) MaxSubmitJobsPU=3000(4264) MaxTRESPU=cpu=8192(4203),mem=N(43038720),energy=N(0),node=N(609),billing=N(12609),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(0),ic/ofed=N(0)
-- 8< -----------------------------------------------------------------------------------

Shouldn't multi-partition jobs be prevented from running in a partition whose QOSMaxSubmitJobPerUserLimit has already been reached for that user? Right now, that limit is not enforced, and that user has been able to submit more jobs than the partition QOS's MaxSubmitJobPerUser limit allows.

Thanks!
-- Kilian
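(Side note on reading the output above: the value before each parenthesis is the configured limit, with N meaning no limit, and the value in parentheses is the current usage, so MaxSubmitJobsPU=3000(4264) means 4,264 queued jobs against a limit of 3,000. Those counters can be pulled out directly with something like the following sketch, using scontrol's assoc_mgr filters:)

-- 8< -----------------------------------------------------------------------------------
# scontrol show assoc_mgr users=daphna flags=qos | grep -E "QOS=|MaxSubmitJobsPU"
-- 8< -----------------------------------------------------------------------------------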
Hi Kilian,

I'm pretty sure I know what is causing this. I'm going to do some more research to see if I'm correct and see how I can fix it. I'll let you know when I have more info.
(In reply to Marshall Garey from comment #2)
> I'm pretty sure I know what is causing this. I'm going to do some more
> research to see if I'm correct and see how I can fix it. I'll let you know
> when I have more info.

Excellent, glad to hear! Thanks for letting me know.

Cheers,
-- Kilian
Hi Marshall,

Just a quick ping to see if you had any updates on this?

Right now, a number of users are able to exceed their configured limits, sometimes by a wide margin: one user has over 16,000 jobs in the queue, while our highest MaxSubmitPerUser QOS limit is 3,000. It looks like the more partitions they submit their jobs to, the more jobs they can enqueue.

This is obviously causing a strong usage imbalance on our system, so any way to restore proper limit enforcement would be very much appreciated. :)

Thanks!
-- Kilian
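P.S. In case it helps to size the problem, this is roughly how we count queued jobs per user in a given partition (plain squeue; "owners" is just one of our partition names):

-- 8< -----------------------------------------------------------------------------------
# squeue -h -p owners -o "%u" | sort | uniq -c | sort -rn | head
-- 8< -----------------------------------------------------------------------------------

Comparing those counts against the partition QOS MaxSubmitPU values is how we spotted users well past the limit.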
I've looked into how to fix this and it's really tricky. Unfortunately I don't have any more information than that, but thank you for your update and for checking in.
(In reply to Marshall Garey from comment #5)
> I've looked into how to fix this and it's really tricky. Unfortunately I
> don't have any more information than that

Argh, that's unfortunate. :(

I don't remember seeing this before 20.11, and we always had users heavily relying on multi-partition submissions. This actually reminds me of https://bugs.schedmd.com/show_bug.cgi?id=3849, where we had the opposite problem: user jobs being rejected for a limit they were not actually close to reaching.

So, is this a 20.11 regression? Is there a way to escalate this bug to get it prioritized?

Thanks,
-- Kilian
> I don't remember seeing this before 20.11, and we always had users heavily relying on multi-partition submissions.

You probably got lucky. This is not a 20.11 regression; the bug exists in all Slurm versions. The issue is that Slurm only looks at the job's QOS and the first partition's QOS when enforcing/applying limits, so the QOS limits of the other requested partitions are never checked.

> Is there a way to escalate this bug to get it prioritized?

I'll prioritize this bug for the next couple of weeks (hopefully enough time to get it fixed). I wanted to get this fixed by 21.08 anyway. I'll let you know how it's going later in the week.
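If only the first partition's QOS in the -p list is the one being consulted, the partition order should change the outcome for a user who is already over the owners limit. A quick sanity check along those lines (hypothetical, reusing the commands from the original report; not output from an actual run):

-- 8< -----------------------------------------------------------------------------------
## pritch listed first: presumably accepted, as in the original report
daphna $ sbatch -p pritch,owners --wrap="sleep 1"
## owners listed first: presumably rejected with QOSMaxSubmitJobPerUserLimit
daphna $ sbatch -p owners,pritch --wrap="sleep 1"
-- 8< -----------------------------------------------------------------------------------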
Created attachment 19755 [details]
20.11 - patch for Stanford

Kilian,

This patch checks the job QOS and all partition QOS on job submission for limits. This patch will prevent jobs from being submitted that would violate the MaxSubmitJobs limit for any partition QOS. It will also prevent jobs from being submitted that would violate any other QOS limit if the QOS has the flag DenyOnLimit.

This was actually the easier part to fix. There are still other things that I need to fix that will be harder, but I wanted to get you a patch that would at least help the specific situation you're hitting.

This patch does NOT fix the issue with heterogeneous jobs, since they go through a different function to check for limits and I didn't fix that one just yet.

Also, this patch will not cancel any jobs that have already been submitted that exceed the limit.

Can you try it out and let me know how it works? In the meantime I will continue my work fixing all the other cases. If it works well enough for you, then I will have you continue using this local patch for now and close this bug as a duplicate of bug 7375. (I'll continue making 7375 the top priority for myself.)

- Marshall
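A side note on the DenyOnLimit behavior mentioned above: for the patch to reject submissions over the other QOS limits, that flag has to be present on the partition QOS. Assuming it is not already set on "owner"/"owners", something like the following sketch would add it (check the existing flags first, since this writes the Flags field):

-- 8< -----------------------------------------------------------------------------------
## check which flags are currently set on the partition QOSes
# sacctmgr show qos owner,owners format=name,flags
## set DenyOnLimit so limit violations are rejected at submit time
# sacctmgr modify qos owner,owners set flags=DenyOnLimit
-- 8< -----------------------------------------------------------------------------------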
Hi Marshall, (In reply to Marshall Garey from comment #8) > This patch checks the job QOS and all partition QOS on job submission for > limits. This patch will prevent jobs from being submitted that would violate > the MaxSubmitJobs limit for any partition QOS. It will also prevent jobs > from being submitted that would violate any other QOS limit if the QOS has > the flag DenyOnLimit. > > This was actually the easier part to fix. There are still other things that > I need to fix that will be harder, but I wanted to get you a patch that > would at least help the specific situation you're hitting. Thank you! I will give it a try and let you know how it goes. > This patch does NOT fix the issue with heterogeneous jobs, since they go > through a different function to check for limits and I didn't fix that one > just yet. > > Also, this patch will not cancel any jobs that have already been submitted > that exceed the limit. Both points noted, and that shouldn't be a problem, thanks! > Can you try it out and let me know how it works? In the meantime I will > continue my work fixing all the other cases. If it works well enough for > you, then I will have you continue using this local patch for now and close > this bug as a duplicate of bug 7375. (I'll continue making 7375 the top > priority for myself.) Sounds great. I'll report back shortly. Cheers, -- Kilian
Since I haven't heard back, I'm assuming testing the patch went fine. I'm closing this bug as a duplicate of 7375. In other news, I'm now basically done with bug 7375 - I submitted a set of patches to our review team which includes a slight variant of the patch I uploaded in this bug.

Please let me know if you have any problems with this patch.

*** This ticket has been marked as a duplicate of ticket 7375 ***
(In reply to Marshall Garey from comment #10)
> Since I haven't heard back, I'm assuming testing the patch went fine. I'm
> closing this bug as a duplicate of 7375. In other news, I'm now basically
> done with bug 7375 - I submitted a set of patches to our review team which
> includes a slight variant of the patch I uploaded in this bug.
>
> Please let me know if you have any problems with this patch.

Thanks Marshall, and sorry for the lack of updates. :\

The patch has been in production for about a week now, and it's working great: the limits are now enforced for multi-partition jobs, exactly as expected.

Thanks a lot for the patch! We'll keep running with it until #7375 is merged.

Cheers,
-- Kilian
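(For reference, a simple way to confirm the new behavior is to repeat the multi-partition submission from the original report with a user already over the owners MaxSubmitPU limit; with the patch applied it should now be rejected with QOSMaxSubmitJobPerUserLimit, just like the single-partition case:)

-- 8< -----------------------------------------------------------------------------------
daphna $ sbatch -p pritch,owners --wrap="sleep 1"
-- 8< -----------------------------------------------------------------------------------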