Created attachment 4955 [details]
slurm.conf

We still occasionally get errors in the slurmctld log of this type:

Jul 20 11:19:55 adm01 slurmctld[3793]: _slurm_rpc_submit_batch_job JobId=650600 usec=798
Jul 20 16:55:28 adm01 slurmctld[3793]: select_p_job_begin: job 650600 could not find partition cfel for node max-wna036
Jul 20 16:55:28 adm01 slurmctld[3793]: select_p_job_begin: job 650600 could not find partition cfel for node max-wng003
Jul 20 16:55:28 adm01 slurmctld[3793]: error: select_g_job_begin(650600): No error

See also https://bugs.schedmd.com/show_bug.cgi?id=3551

For example, job 650600 had the scheduled nodelist max-cfel017-18,max-wng003 and got executed on max-cfel[003,017-018]. The node max-wng003 is, however, not part of partition cfel and never was. Node max-wng003 was, though, temporarily (and a long time ago) the only member of a shared partition (oversubscription forced).

Questions:
- Why does a node that does not belong to the selected partition appear in the nodelist?
- Does the error have any consequences other than reporting an obviously incorrectly selected node?
- Is there a way/need to clean the state of the scheduler/db?

We also see some problems with scheduling jobs in combination with preemption. It might or might not be related to the problem here, so we will post it separately.
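For context, partition membership in Slurm is determined by the Nodes= list of the PartitionName= entry in slurm.conf; a node absent from that list should never be selected for a job submitted to that partition. A minimal sketch of the relevant configuration shape (node counts, CPU values, and node ranges below are illustrative assumptions, not taken from the attached slurm.conf):

```
# Illustrative slurm.conf fragment (values are assumptions).
# max-wng003 is deliberately NOT in the cfel partition's Nodes= list,
# so it should never appear in a cfel job's nodelist.
NodeName=max-cfel[001-020] CPUs=32 State=UNKNOWN
NodeName=max-wng003 CPUs=32 State=UNKNOWN
PartitionName=cfel Nodes=max-cfel[001-020] Default=NO MaxTime=INFINITE State=UP
```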
Hi,

Could you send me the description of this job (650600)? "scontrol show job" output is probably not available anymore, but maybe something was saved in a log file, or you still have the batch script.

Dominik
Hi,

I captured the job header:

#!/bin/bash
#SBATCH -p cfel                  #partition
#SBATCH -n 128
#SBATCH --time 1-13:00           #time (D-hh:MM)
#SBATCH -o 5x5x5_CG.%N.%j.out    #STDOUT
#SBATCH -e 5x5x5_CG.%N.%j.err    #STDERR
#SBATCH --job-name 1x1x1_CG      #job_name

module load mpi/openmpi-x86_64
mpirun `which mdrun_openmpi` ...

The logs are not available anymore. No particular nodes or features were requested, as far as I can see.

Cheers, Frank.
Hi,

Yesterday we fixed a bug which could cause this error:
https://github.com/SchedMD/slurm/commit/13b78dd2064c8bc7

This will be in the next 17.02 release.

Dominik
Hi,

Let me know if this patch solves your issue.

Dominik
I'm marking this resolved by commit 13b78dd2064c8bc7, which was included in the 17.02.7 maintenance release last week. Please reopen if you're still seeing issues or if there is anything further we can address.

- Tim