Ticket 4019

Summary: could not find partition
Product: Slurm
Reporter: frank.schluenzen
Component: slurmctld
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---
CC: bart, sergey.yakubov, sven.sternberger
Version: 17.02.2
Hardware: Linux
OS: Linux
Site: DESY
Version Fixed: 17.02.7, 17.11.0-pre2
Attachments: slurm.conf

Description frank.schluenzen 2017-07-24 07:35:57 MDT
Created attachment 4955: slurm.conf

We still occasionally get errors of the following type in the slurmctld log:

Jul 20 11:19:55 adm01 slurmctld[3793]: _slurm_rpc_submit_batch_job JobId=650600 usec=798
Jul 20 16:55:28 adm01 slurmctld[3793]: select_p_job_begin: job 650600 could not find partition cfel for node max-wna036
Jul 20 16:55:28 adm01 slurmctld[3793]: select_p_job_begin: job 650600 could not find partition cfel for node max-wng003
Jul 20 16:55:28 adm01 slurmctld[3793]: error: select_g_job_begin(650600): No error

See also https://bugs.schedmd.com/show_bug.cgi?id=3551 
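
To see how often this happens and which nodes are involved, the "could not find partition" messages can be tallied per job. The following is only a minimal sketch, assuming the log lines look exactly like the ones quoted above; the function name `collect_partition_misses` and the regular expression are illustrative and not part of Slurm.

```python
import re
from collections import defaultdict

# Pattern modeled on the select_p_job_begin lines quoted in this ticket.
# Assumption: other Slurm versions may word this message differently.
LINE_RE = re.compile(
    r"select_p_job_begin: job (?P<job>\d+) could not find partition "
    r"(?P<partition>\S+) for node (?P<node>\S+)"
)


def collect_partition_misses(lines):
    """Map (job_id, partition) -> list of nodes reported as not found."""
    misses = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (int(match["job"]), match["partition"])
            misses[key].append(match["node"])
    return dict(misses)


# Sample input: the log lines from this report.
sample = [
    "Jul 20 11:19:55 adm01 slurmctld[3793]: _slurm_rpc_submit_batch_job JobId=650600 usec=798",
    "Jul 20 16:55:28 adm01 slurmctld[3793]: select_p_job_begin: job 650600 could not find partition cfel for node max-wna036",
    "Jul 20 16:55:28 adm01 slurmctld[3793]: select_p_job_begin: job 650600 could not find partition cfel for node max-wng003",
    "Jul 20 16:55:28 adm01 slurmctld[3793]: error: select_g_job_begin(650600): No error",
]

result = collect_partition_misses(sample)
for (job, partition), nodes in result.items():
    print(f"job {job}, partition {partition}: {', '.join(nodes)}")
```

In practice the `lines` argument could be an open file handle on the slurmctld log, since the function only iterates over it.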


For example, job 650600 had a scheduled nodelist of max-cfel017-18, max-wng003, but was executed on max-cfel[003,017-018]. 

However, node max-wng003 is not part of partition cfel and never was. A long time ago it was temporarily the only member of a shared partition (with oversubscription forced).
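
A check like the one described above (which nodes in a job's nodelist lie outside the partition) can be sketched as follows. This is a simplified illustration, not Slurm's own hostlist code: `expand_hostlist` handles only one bracket group per item, the nodelist is written in normalized bracket form, and the partition membership `max-cfel[001-020]` is a made-up placeholder, since the real node list is in the attached slurm.conf.

```python
import re


def expand_hostlist(expr):
    """Expand a simple Slurm-style hostlist such as 'max-cfel[003,017-018]'.

    Supports comma-separated items with at most one bracket group each.
    Nested or multiple bracket groups per item are not handled.
    """
    hosts = []
    # Split on commas that are not inside a bracket group.
    for item in re.split(r",(?![^\[]*\])", expr):
        item = item.strip()
        match = re.fullmatch(r"([^\[\]]+)\[([^\]]+)\](.*)", item)
        if not match:
            hosts.append(item)
            continue
        prefix, ranges, suffix = match.groups()
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            hi = hi or lo
            width = len(lo)  # keep zero padding, e.g. 003
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


def nodes_outside_partition(nodelist, partition_nodes):
    """Return the nodes of `nodelist` that are not in `partition_nodes`."""
    return [h for h in expand_hostlist(nodelist) if h not in partition_nodes]


# Hypothetical membership of partition cfel (placeholder, not from slurm.conf).
cfel_nodes = set(expand_hostlist("max-cfel[001-020]"))

executed = expand_hostlist("max-cfel[003,017-018]")
outside = nodes_outside_partition("max-cfel[017-018],max-wng003", cfel_nodes)
print(executed)
print(outside)
```

On a live system, `scontrol show hostnames` performs the expansion with full hostlist syntax and would be the more reliable choice.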

Questions:
Why does a node that does not belong to the selected partition appear in the nodelist?
Does the error have any consequences other than reporting an obviously incorrectly selected node?
Is there a way, or a need, to clean the state of the scheduler or the database?

We also see some problems with scheduling jobs in combination with preemption. They may or may not be related to this problem, so I will report them separately.
Comment 1 Dominik Bartkiewicz 2017-07-25 04:40:02 MDT
Hi


Could you send me the description of this job (650600)?
"scontrol show job" output is probably no longer available, but maybe
something was saved in a log file, or you still have the batch script.


Dominik
Comment 2 frank.schluenzen 2017-07-25 04:50:12 MDT
Hi, 

I captured the job-header: 

#!/bin/bash 
#SBATCH -p cfel #partition 
#SBATCH -n 128 
#SBATCH --time 1-13:00 #time (D-hh:MM) 
#SBATCH -o 5x5x5_CG.%N.%j.out #STDOUT 
#SBATCH -e 5x5x5_CG.%N.%j.err #STDERR 
#SBATCH --job-name 1x1x1_CG #job_name 
module load mpi/openmpi-x86_64 
mpirun `which mdrun_openmpi` ... 

Logs are not available anymore. There were no particular nodes or features requested as far as I can see. 

Cheers, Frank. 

Comment 10 Dominik Bartkiewicz 2017-08-08 08:18:34 MDT
Hi

Yesterday we fixed a bug that could cause this error:
https://github.com/SchedMD/slurm/commit/13b78dd2064c8bc7
The fix will be in the next 17.02 release.

Dominik
Comment 11 Dominik Bartkiewicz 2017-08-14 02:39:59 MDT
Hi

Let me know if this patch solves your issue.

Dominik
Comment 12 Tim Wickberg 2017-08-22 23:56:01 MDT
I'm marking this resolved by commit 13b78dd2064c8bc7, which was included in the 17.02.7 maintenance release last week.

Please reopen if you're still seeing issues, or there is anything further we can address.

- Tim