Ticket 9707 - Jobs going to nodes that are not members of the selected partition
Summary: Jobs going to nodes that are not members of the selected partition
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.3
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
 
Reported: 2020-08-31 12:26 MDT by Adam
Modified: 2020-11-12 10:12 MST

See Also:
Site: Yale


Attachments
bug8847_2002_v12.patch (904 bytes, patch)
2020-09-01 10:17 MDT, Felip Moll
bug8847_2002_v13.patch (1.05 KB, patch)
2020-09-01 10:28 MDT, Felip Moll

Description Adam 2020-08-31 12:26:09 MDT
Since we upgraded to 20.02.3, we have seen jobs submitted to one partition end up running on nodes that are not members of that partition.

For example, all of these:

     User        JobID  Partition      State              Submit               Start        NodeList
--------- ------------ ---------- ---------- ------------------- ------------------- ---------------
    ga254 65507956            day  COMPLETED 2020-08-29T11:24:43 2020-08-29T11:24:44          c31n06
    ga254 65507958            day  COMPLETED 2020-08-29T11:24:45 2020-08-29T11:24:46          c31n06
   ch2229 62930363     pi_econ_io  COMPLETED 2020-08-11T08:22:02 2020-08-11T08:22:48       p08r02n40
   ch2229 62967916     pi_econ_io  COMPLETED 2020-08-11T11:56:25 2020-08-11T11:57:04       p08r02n40
   ch2229 63006219     pi_econ_io  COMPLETED 2020-08-11T17:24:47 2020-08-11T17:24:48       p08r02n44
   ch2229 63292472     pi_econ_io  COMPLETED 2020-08-13T17:38:50 2020-08-13T17:39:11       p08r02n36
    lf468 62468450     pi_econ_lp     FAILED 2020-08-06T18:43:30 2020-08-06T18:44:11       p08r02n40
    fd338 64246551     pi_polima+  COMPLETED 2020-08-21T15:27:16 2020-08-25T14:55:09       p08r02n36

None of the above nodes are, or were, members of the partitions listed for their jobs (e.g., c31n06 is not a member of "day").

This does not happen very frequently, but it is a big problem because the owners of the nodes are unhappy with other users' jobs running on their nodes.

Thank you,
Adam
Comment 1 Jason Booth 2020-08-31 12:41:56 MDT
Hi Adam. This looks like a duplicate of an issue reported by two other sites. I will have Felip give you more details.
Comment 2 Adam 2020-08-31 15:11:17 MDT
Glad it is not just us: is there something that we can do about it?

Thanks,
Adam

Comment 3 Michael Hinton 2020-08-31 15:31:21 MDT
Hi Adam,

We do have a patch that will probably fix these partition issues. Would you be interested in trying it out? The only drawback would be that the patch partially reverts a heterogeneous job preemption enhancement in 20.02 (if you don't have many het jobs, all the better). The patch is a one-liner that I feel confident is safe to apply, even if it doesn't end up fixing the issue.

It's been a bit difficult for us to reproduce these partition issues, so we'd be grateful if you decided to try the patch and could verify that it fixes them.

Thanks,
-Michael
Comment 4 Adam 2020-09-01 07:54:17 MDT
Hi Michael,

Sure, we don’t mind trying this out. We just need the patch itself and instructions for applying it.

We should have high confidence after about a week of job activity that the patch corrects the problem (unless it does not).

Thanks!
Adam

Comment 5 Felip Moll 2020-09-01 10:17:57 MDT
Created attachment 15676
bug8847_2002_v12.patch

Hi,

Attached you can find the patch for 20.02.

Assuming you are building from our source and not from the spec file, you just need to apply this patch from the top of the Slurm source directory, e.g.:

]$ ls
aclocal.m4             config.h.in      COPYING       etc              META           slurm
AUTHORS                configure        cscope.files  INSTALL          NEWS           slurm.spec
autom4te.cache         configure.ac     cscope.out    LICENSE.OpenSSL  NEWS.orig      src
auxdir                 contribs         DISCLAIMER    Makefile.am      README.rst     testsuite
file.out               CONTRIBUTING.md  doc           Makefile.in      RELEASE_NOTES
]$ patch -p1 < /tmp/bug8847_2002_v12.patch 
patching file src/plugins/select/cons_common/job_test.c
]$

Then build and install as usual.
Then restart slurmctld.
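
As a cautious variant of the patch step above, the patch can be dry-run before it is applied. The following is a minimal sketch, not part of the official instructions: apply_patch_checked is a hypothetical helper name, it assumes GNU patch (for --dry-run), and it must be run from the top of the Slurm source directory.

```shell
#!/bin/sh
# Hypothetical helper: apply a patch only if a dry run succeeds.
#   $1: path to the patch file (wherever the attachment was saved)
# Assumes GNU patch and that the current directory is the top of the source tree.
apply_patch_checked() {
    patch_file=$1
    if [ ! -f "$patch_file" ]; then
        echo "patch file not found: $patch_file" >&2
        return 2
    fi
    # --dry-run reports what would happen without modifying any file.
    if ! patch -p1 --dry-run < "$patch_file" > /dev/null; then
        echo "patch does not apply cleanly; source tree left untouched" >&2
        return 1
    fi
    # The dry run succeeded, so apply the patch for real.
    patch -p1 < "$patch_file"
}
```

A non-zero return means nothing was changed, so the tree can still be built unpatched or inspected before retrying.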

> We should have high confidence after about a week of job activity that the patch corrects the problem (unless it does not).

It would be great to know the result; we will wait for your feedback.
Also, let us know if you have any questions about applying the patch.
Comment 6 Felip Moll 2020-09-01 10:28:59 MDT
Created attachment 15677
bug8847_2002_v13.patch

Sorry, attaching version 13, which is the latest one.
Comment 7 Adam 2020-09-02 11:47:13 MDT
We have applied the patch and will wait about a week to see whether the problem persists (either way, I’ll get back to you with the results).

Thank you,
Adam

Comment 8 Adam 2020-09-09 11:51:15 MDT
So far so good on this one. I’m going to wait for another week of data as further confirmation before closing the ticket on our side.

No cases of jobs running on any nodes that they should not have been able to run on:
- Time period of 6 days after the patch had been applied (Sept 3-9)
- Checked every single PI-node
- Sample size of 63,414 jobs

Best,
Adam
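
A cross-check like the one described above could be scripted roughly as follows. This is a sketch under assumptions, not the script used at the site: find_misplaced_jobs and both input file layouts are hypothetical, and the sinfo/sacct invocations in the comments are one plausible way to produce the inputs and should be checked against the man pages for your Slurm version.

```shell
#!/bin/sh
# Hypothetical helper: print jobs whose node is not a member of their partition.
#   $1: membership file, one "partition node" pair per line
#       (assumed source: sinfo -h -N -o '%R %N')
#   $2: job file, one "jobid|partition|node" record per line
#       (assumed source: sacct -a -X -n -P -S <start> -E <end> -o JobID,Partition,NodeList)
# Limitation: only single-node jobs are handled. A compressed NodeList such as
# "c31n[06-07]" would first need expanding (e.g. with scontrol show hostnames).
find_misplaced_jobs() {
    awk '
        # First file: remember every (partition, node) pair.
        FILENAME == ARGV[1] { member[$1, $2] = 1; next }
        # Second file: report non-empty records whose pair was never seen.
        NF > 0 {
            split($0, f, "[|]")
            if (!((f[2], f[3]) in member))
                print f[1], f[2], f[3]
        }
    ' "$1" "$2"
}
```

Each output line is "jobid partition node" for a suspect job; an empty result means no job in the sampled period ran on a node outside its partition. Using parsable sacct output also avoids truncated partition names such as "pi_polima+".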



Comment 9 Michael Hinton 2020-09-09 11:59:42 MDT
Ok, great. A modified version of this patch will land in 20.02.5, which should be released soon.

-Michael
Comment 10 Michael Hinton 2020-10-08 11:53:02 MDT
Hi Adam, how is this issue looking? Can we close this out?

Thanks,
-Michael
Comment 11 Adam 2020-10-08 14:46:27 MDT
The one about the jobs being sent to the wrong queue? Yes: the patched 20.02.3 version has shown no problems over two weeks. We _just_ upgraded another system to 20.02.5; I’m sure that one is fine, but we’ll double-check it as well after a few weeks.

Best,
Adam

Comment 12 Michael Hinton 2020-11-02 12:01:53 MST
Hi Adam, what's the current status?

-Michael
Comment 13 Adam 2020-11-11 16:01:44 MST
Hi Michael,

I haven’t looked in a long time, but the last verification period covered 2 weeks and I’ve (since then) had no reason to believe that this is still an issue. If it were still a problem someone on our team probably would have noticed by now.

Thanks!
Adam

Comment 14 Michael Hinton 2020-11-12 10:12:39 MST
Great! Closing out, then.

Thanks,
-Michael