Since we upgraded to 20.02.3, we have seen jobs submitted to one partition end up running on nodes that are not members of that partition. For example, all of these:

User      JobID        Partition  State      Submit              Start               NodeList
--------- ------------ ---------- ---------- ------------------- ------------------- ---------------
ga254     65507956     day        COMPLETED  2020-08-29T11:24:43 2020-08-29T11:24:44 c31n06
ga254     65507958     day        COMPLETED  2020-08-29T11:24:45 2020-08-29T11:24:46 c31n06
ch2229    62930363     pi_econ_io COMPLETED  2020-08-11T08:22:02 2020-08-11T08:22:48 p08r02n40
ch2229    62967916     pi_econ_io COMPLETED  2020-08-11T11:56:25 2020-08-11T11:57:04 p08r02n40
ch2229    63006219     pi_econ_io COMPLETED  2020-08-11T17:24:47 2020-08-11T17:24:48 p08r02n44
ch2229    63292472     pi_econ_io COMPLETED  2020-08-13T17:38:50 2020-08-13T17:39:11 p08r02n36
lf468     62468450     pi_econ_lp FAILED     2020-08-06T18:43:30 2020-08-06T18:44:11 p08r02n40
fd338     64246551     pi_polima+ COMPLETED  2020-08-21T15:27:16 2020-08-25T14:55:09 p08r02n36

None of the above nodes are or were members of any of the listed partitions (e.g., c31n06 is not a member of "day"). This does not happen very frequently, but it is a big problem because the owners of the nodes are unhappy with other users' jobs running on their nodes.

Thank you,
Adam
Hi Adam. This looks like a duplicate of an issue reported by two other sites. I will have Felip give you more details.
Glad it is not just us: is there something that we can do about it? Thanks, Adam
Hi Adam, We do have a patch that will probably fix these partition issues. Would you be interested in trying it out? The only drawback would be that the patch partially reverts a heterogeneous job preemption enhancement in 20.02 (if you don't have many het jobs, all the better). The patch is a one-liner that I feel confident is safe to apply, even if it doesn't end up fixing the issue. It's been a bit difficult for us to reproduce these partition issues, so we'd be grateful if you decided to try the patch and could verify that it fixes them. Thanks, -Michael
Hi Michael, Sure, I don’t think we mind trying this out. Just need the instructions about how to apply the patch (in addition to the patch itself). We should have high confidence after about a week of job activity that the patch corrects the problem (unless it does not). Thanks! Adam
Created attachment 15676 [details] bug8847_2002_v12.patch

Hi,

Attached you can find the patch for 20.02. Assuming you are using our source and not the spec files, you just need to apply this patch file from Slurm's source directory, i.e.:

]$ ls
aclocal.m4      config.h.in      COPYING       etc              META           slurm
AUTHORS         configure        cscope.files  INSTALL          NEWS           slurm.spec
autom4te.cache  configure.ac     cscope.out    LICENSE.OpenSSL  NEWS.orig      src
auxdir          contribs         DISCLAIMER    Makefile.am      README.rst     testsuite
file.out        CONTRIBUTING.md  doc           Makefile.in      RELEASE_NOTES
]$ patch -p1 < /tmp/bug8847_2002_v12.patch
patching file src/plugins/select/cons_common/job_test.c
]$

Then build and install as usual, and restart slurmctld.

> We should have high confidence after about a week of job activity that the patch corrects the problem (unless it does not).

Knowing about the effect would be great. We will keep waiting for feedback. Also, let us know if you have more questions applying the patch.
Created attachment 15677 [details] bug8847_2002_v13.patch Sorry, attaching version 13 which is the latest one.
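For anyone scripting the patch step, the sketch below wraps the same `patch -p1` invocation in Python and runs it with GNU patch's `--dry-run` option first, so the source tree is only modified if the patch applies cleanly. The function names and the paths in the usage comment are illustrative assumptions, not part of the instructions in this thread; `-i` is GNU patch's option for naming the patch file instead of reading it from stdin.

```python
import subprocess


def patch_command(patch_file, dry_run):
    """Build the argument list for `patch -p1`, optionally as a dry run."""
    cmd = ["patch", "-p1"]
    if dry_run:
        # GNU patch: report what would happen without changing any files.
        cmd.append("--dry-run")
    cmd += ["-i", str(patch_file)]
    return cmd


def apply_patch(source_dir, patch_file):
    """Dry-run the patch from source_dir, then apply it for real.

    check=True raises CalledProcessError if the dry run fails, so the
    second (real) invocation is never reached in that case.
    """
    for dry_run in (True, False):
        subprocess.run(patch_command(patch_file, dry_run),
                       cwd=source_dir, check=True)


# Hypothetical usage (both paths are assumptions):
# apply_patch("/usr/local/src/slurm-20.02.3", "/tmp/bug8847_2002_v13.patch")
```

After a successful run, the build, install, and slurmctld restart still have to be done as described above.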
We have applied the patch and will wait for about a week to see if the problem is still ongoing (either way I’ll get back to you with the results). Thank you, Adam
So far so good on this one. I’m going to wait for another week of data as further confirmation before closing the ticket on our side. No cases of jobs running on any nodes that they should not have been able to run on:
- Time period of 6 days after the patch had been applied (Sept 3-9)
- Checked every single PI-node
- Sample size of 63,414 jobs
Best, Adam
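A check like the one described above could be scripted roughly as follows. This is only a sketch under stated assumptions, not the script that was actually used: it assumes job records were exported as pipe-separated rows (for example with `sacct -a -X -P -n -o User,JobID,Partition,State,Submit,Start,NodeList`), that each NodeList is a plain comma-separated list of node names rather than a compressed range such as `c31n[01-06]`, and that a partition-to-nodes mapping is available from the site configuration. The function names and the partition membership in the example are hypothetical.

```python
from collections import namedtuple

Job = namedtuple("Job", "user jobid partition state submit start nodelist")


def parse_sacct(text):
    """Parse pipe-separated rows with exactly seven fields; skip other lines."""
    jobs = []
    for line in text.splitlines():
        fields = line.strip().split("|")
        if len(fields) == 7:
            jobs.append(Job(*fields))
    return jobs


def find_misplaced(jobs, partition_nodes):
    """Return jobs that ran on at least one node outside their partition.

    A job whose partition is missing from partition_nodes is also returned,
    because no node can be confirmed as a member of an unknown partition.
    """
    misplaced = []
    for job in jobs:
        allowed = partition_nodes.get(job.partition, set())
        # Assumes uncompressed, comma-separated node names.
        ran_on = set(job.nodelist.split(","))
        if not ran_on <= allowed:
            misplaced.append(job)
    return misplaced


# Hypothetical membership: pretend "day" contains only c01n01.
partition_nodes = {"day": {"c01n01"}}
rows = (
    "ga254|65507956|day|COMPLETED|2020-08-29T11:24:43|2020-08-29T11:24:44|c31n06\n"
    "userx|1|day|COMPLETED|2020-08-29T11:00:00|2020-08-29T11:00:01|c01n01\n"
)
for job in find_misplaced(parse_sacct(rows), partition_nodes):
    print(job.jobid, job.partition, job.nodelist)  # prints: 65507956 day c31n06
```

An empty result over the full sample corresponds to the "no cases" outcome reported above. Sites whose accounting output uses compressed node ranges would need to expand them first (for example with `scontrol show hostnames`).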
Ok, great. A modified version of this patch will land in 20.02.5, which should be released soon. -Michael
Hi Adam, how is this issue looking? Can we close this out? Thanks, -Michael
The one about the jobs being sent to the wrong queue? Yes for the 20.02.3 patched version (no problems over two weeks), we _just_ upgraded to 20.02.5 on another system: I’m sure that one is fine, but we’ll double check it as well after a few weeks. Best, Adam
Hi Adam, what's the current status? -Michael
Hi Michael, I haven’t looked in a long time, but the last verification period covered 2 weeks and I’ve (since then) had no reason to believe that this is still an issue. If it were still a problem someone on our team probably would have noticed by now. Thanks! Adam
Great! Closing out, then. Thanks, -Michael