Summary: | Submitting a job to multiple partitions allows bypassing QOS limits | |
---|---|---|---|
Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
Component: | Limits | Assignee: | Marshall Garey <marshall> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | ||
Version: | 20.11.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Stanford | |
Attachments: | 20.11 - patch for Stanford |
Description
Kilian Cavalotti
2021-04-28 13:06:37 MDT
Marshall Garey

Hi Kilian,

I'm pretty sure I know what is causing this. I'm going to do some more research to see if I'm correct and see how I can fix it. I'll let you know when I have more info.

Kilian Cavalotti

(In reply to Marshall Garey from comment #2)
> I'm pretty sure I know what is causing this. I'm going to do some more
> research to see if I'm correct and see how I can fix it. I'll let you know
> when I have more info.

Excellent, glad to hear! Thanks for letting me know.

Cheers,
--
Kilian

Kilian Cavalotti

Hi Marshall,

Just a quick ping to see if you had any updates on this?

Right now, we have a number of users who are able to exceed their configured limits, sometimes by a wide margin: we have a user with over 16,000 jobs in queue while the highest MaxSubmitPerUser QOS limit is 3,000. It looks like the more partitions they submit their jobs to, the more jobs they can enqueue.

This is obviously causing a strong usage imbalance on our system, so any way to restore proper limit enforcement would be very much appreciated. :)

Thanks!
--
Kilian

Marshall Garey

I've looked into how to fix this and it's really tricky. Unfortunately I don't have any more information than that, but thank you for your update and for checking in.

Kilian Cavalotti

(In reply to Marshall Garey from comment #5)
> I've looked into how to fix this and it's really tricky. Unfortunately I
> don't have any more information than that

Argh, that's unfortunate. :(

I don't remember seeing this before 20.11, and we always had users heavily relying on multi-partition submissions. This actually reminds me of https://bugs.schedmd.com/show_bug.cgi?id=3849, where we had the opposite problem of user jobs being rejected for reaching a limit they were actually not even close to.

So, is this a 20.11 regression? Is there a way to escalate this bug to get it prioritized?

Thanks,
--
Kilian

Marshall Garey

> I don't remember seeing this before 20.11, and we always had users heavily
> relying on multi-partition submissions.

You probably got lucky. This definitely exists in all Slurm versions. The issue is that Slurm only looks at the job's QOS and the first partition's QOS when enforcing/applying limits.

> Is there a way to escalate this bug to get it prioritized?

I'll prioritize this bug for the next couple of weeks (hopefully in which time I can get it fixed). I wanted to get this bug fixed by 21.08 anyway. I'll let you know how it's going later in the week.

Marshall Garey

Created attachment 19755 [details]
20.11 - patch for Stanford

Kilian,

This patch checks the job QOS and all partition QOS on job submission for limits. This patch will prevent jobs from being submitted that would violate the MaxSubmitJobs limit for any partition QOS. It will also prevent jobs from being submitted that would violate any other QOS limit if the QOS has the flag DenyOnLimit.

This was actually the easier part to fix. There are still other things that I need to fix that will be harder, but I wanted to get you a patch that would at least help the specific situation you're hitting.

This patch does NOT fix the issue with heterogeneous jobs, since they go through a different function to check for limits and I didn't fix that one just yet.

Also, this patch will not cancel any jobs that have already been submitted that exceed the limit.

Can you try it out and let me know how it works? In the meantime I will continue my work fixing all the other cases. If it works well enough for you, then I will have you continue using this local patch for now and close this bug as a duplicate of bug 7375. (I'll continue making 7375 the top priority for myself.)

- Marshall
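For readers reproducing the scenario above: a multi-partition submission is just a comma-separated partition list on sbatch, and the limit being bypassed is a per-user submit limit on a QOS. The sketch below uses hypothetical names (a QOS called batch_qos and partitions normal, gpu, long) rather than this site's actual configuration, and assumes AccountingStorageEnforce includes limits and qos so that QOS limits are enforced at all.

```bash
# Hypothetical QOS with a per-user submit limit; names and values are
# placeholders. DenyOnLimit causes submissions that would exceed a Max*
# limit to be rejected outright instead of being left pending.
sacctmgr modify qos name=batch_qos set MaxSubmitJobsPerUser=3000 Flags=DenyOnLimit

# Inspect the QOS and its limits.
sacctmgr show qos name=batch_qos

# A multi-partition submission: the job may run in whichever listed
# partition becomes available first, and each partition can carry its own
# partition QOS with its own limits.
sbatch --qos=batch_qos --partition=normal,gpu,long --wrap="sleep 600"

# Count the user's queued jobs against the 3,000 submit limit.
squeue -u "$USER" -h | wc -l
```

Before the patch, only the job's QOS and the first listed partition's QOS were consulted at submit time, so repeating the sbatch call with extra partitions could push a user well past the submit limit, which matches the 16,000-vs-3,000 situation reported above.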
Kilian Cavalotti

Hi Marshall,

(In reply to Marshall Garey from comment #8)
> This patch checks the job QOS and all partition QOS on job submission for
> limits. This patch will prevent jobs from being submitted that would
> violate the MaxSubmitJobs limit for any partition QOS. It will also prevent
> jobs from being submitted that would violate any other QOS limit if the QOS
> has the flag DenyOnLimit.
>
> This was actually the easier part to fix. There are still other things that
> I need to fix that will be harder, but I wanted to get you a patch that
> would at least help the specific situation you're hitting.

Thank you! I will give it a try and let you know how it goes.

> This patch does NOT fix the issue with heterogeneous jobs, since they go
> through a different function to check for limits and I didn't fix that one
> just yet.
>
> Also, this patch will not cancel any jobs that have already been submitted
> that exceed the limit.

Both points noted, and that shouldn't be a problem, thanks!

> Can you try it out and let me know how it works? In the meantime I will
> continue my work fixing all the other cases. If it works well enough for
> you, then I will have you continue using this local patch for now and close
> this bug as a duplicate of bug 7375. (I'll continue making 7375 the top
> priority for myself.)

Sounds great. I'll report back shortly.

Cheers,
--
Kilian

Marshall Garey

Since I haven't heard back, I'm assuming testing the patch went fine. I'm closing this bug as a duplicate of 7375.

In other news, I'm now basically done with bug 7375 - I submitted a set of patches to our review team which includes a slight variant of the patch I uploaded in this bug.

Please let me know if you have any problems with this patch.

*** This ticket has been marked as a duplicate of ticket 7375 ***

Kilian Cavalotti

(In reply to Marshall Garey from comment #10)
> Since I haven't heard back, I'm assuming testing the patch went fine. I'm
> closing this bug as a duplicate of 7375. In other news, I'm now basically
> done with bug 7375 - I submitted a set of patches to our review team which
> includes a slight variant of the patch I uploaded in this bug.
>
> Please let me know if you have any problems with this patch.

Thanks Marshall, and sorry for the lack of updates. :\

The patch has been in production for about a week now, and it's working great: the limits are now enforced for multi-partition jobs, exactly as expected.

Thanks a lot for the patch! We'll keep running with it until #7375 is merged.

Cheers,
--
Kilian
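For completeness, the partition QOS limits that the patch checks come from a QOS attached to each partition in slurm.conf. The sketch below is a minimal, hypothetical example (partition, node, and QOS names are placeholders), not this cluster's real configuration.

```bash
# slurm.conf excerpt (hypothetical names): attach a QOS to each partition so
# that QOS's limits apply to every job routed there, and enforce QOS limits.
#
#   AccountingStorageEnforce=associations,limits,qos
#   PartitionName=normal Nodes=node[001-100] QOS=normal_part_qos
#   PartitionName=gpu    Nodes=gpu[01-08]    QOS=gpu_part_qos

# After editing slurm.conf, apply it and check which QOS a partition carries.
scontrol reconfigure
scontrol show partition normal | grep -i qos

# With the attached patch applied, a submission listing several partitions
# is rejected at submit time if it would exceed MaxSubmitJobs on any of
# those partitions' QOSes, or any other limit on a QOS with Flags=DenyOnLimit.
sbatch --partition=normal,gpu --wrap="sleep 600"
```

As noted in the thread, heterogeneous jobs take a different limit-checking path and are not covered by this local patch; the complete fix was tracked in bug 7375.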