Evening. We are seeing jobs occasionally go to the wrong nodes. Consider the definition of 2 partitions:

PartitionName=teamfraser Nodes=clus[001-040] DefaultTime=60 MaxTime=INFINITE State=UP AllowGroups=teamfraser,geodev default=no Priority=10
PartitionName=idle Nodes=clus[001-225,227-326,328-362,373,375,418-573,578-581,586-589,598-665] DefaultTime=60 MaxTime=INFINITE State=UP default=no Priority=5

Now, jobs on the idle queue should not be going onto clus[001-040] unless there are no jobs in teamfraser... BUT:

20140704162831 bud30:Downloads> squeue -aw 'clus[001-040]'
PARTITION  PRIORITY NAME                 USER     ST TIME    NODES NODELIST(REASON) JOBID
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 3:37 1 clus030 3961285_10125
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 3:18 1 clus006 3961285_10150
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 3:18 1 clus012 3961285_10175
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 3:17 1 clus025 3961285_10200
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 3:17 1 clus027 3961285_10225
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 3:17 1 clus029 3961285_10250
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:57 1 clus015 3961285_10275
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:38 1 clus007 3961285_10300
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:38 1 clus017 3961285_10325
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:18 1 clus011 3961285_10350
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:18 1 clus021 3961285_10375
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:17 1 clus028 3961285_10400
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:17 1 clus018 3961285_10425
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:17 1 clus019 3961285_10450
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:17 1 clus002 3961285_10475
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:17 1 clus003 3961285_10500
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:17 1 clus004 3961285_10525
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:17 1 clus005 3961285_10550
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:03 1 clus009 3961285_10575
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:03 1 clus023 3961285_10600
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 2:03 1 clus026 3961285_10625
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 1:18 1 clus020 3961285_10650
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 1:18 1 clus036 3961285_10675
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 0:18 1 clus014 3961285_10700
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 0:18 1 clus022 3961285_10725
idle        800 dp_clsmth            bjornm   R 28:53 1 clus038 3946711_3922
idle        800 dp_clsmth            bjornm   R 1:42 1 clus010 3946821_4114
idle        800 dp_clsmth            bjornm   R 34:20 1 clus034 3947184_4780
idle        800 dp_clsmth            bjornm   R 7:03 1 clus033 3947481_5312
idle        800 dp_clsmth            bjornm   R 18:17 1 clus001 3947888_6062
idle        800 dp_clsmth            bjornm   R 39:50 1 clus035 3947998_6250
idle        800 dp_clsmth            bjornm   R 18:14 1 clus039 3948394_6982
idle        800 dp_clsmth            bjornm   R 1:39 1 clus024 3948548_7252
idle        500 rt_320_tomo2A        michaeld R 2:55:31 1 clus008 3952980_41
idle        500 rt_320_tomo2A        michaeld R 2:49:06 1 clus013 3952980_185
idle        500 rt_320_tomo2A        michaeld R 2:49:06 1 clus016 3952980_186
idle        500 rt_320_tomo2A        michaeld R 2:46:25 1 clus032 3952980_217
idle        500 rt_320_tomo2A        michaeld R 2:23:02 1 clus031 3952980_325
teamfraser  100 tomo4_tomo1_refl     justinh  R 1:45 1 clus037 3966887_3

and teamfraser definitely has jobs pending:

20140704162836 bud30:Downloads> squeue -p teamfraser
PARTITION  PRIORITY NAME                 USER     ST TIME NODES NODELIST(REASON) JOBID
teamfraser 1000 dp_conv_WB_mute_fina kianchee PD 0:00 1 (Resources) 3961285_[10775,10800,10825,10850,10875,10900,10925,10950,10975,11000,11025,11050,11075,11100,11125,11150,11175,11200,11225,11250,11275,11300,11325,11350,11375,11400,11425,11450,11475,11500,11525,11550,11575,11600,11625,11650,11675,11700,11725,11750,11775,11800,11825,11850]
teamfraser 1000 dm_conv_WB_mute_fina kianchee PD 0:00 1 (Dependency) 3961357
teamfraser  500 tomo4_tomo_shim      justinh  PD 0:00 1 (Dependency) 3966987
teamfraser  500 tomo4_tomo_shim      justinh  PD 0:00 1 (Dependency) 3967088
teamfraser  100 tomo4_tomo1_refl     justinh  PD 0:00 1 (Resources) 3966887_[5-100]
teamfraser  100 tomo4_tomo1_refl     justinh  PD 0:00 1 (Priority) 3966988_[1-100]
teamfraser  100 dp_LC_TFDN3x         kianchee PD 0:00 1 (Resources) 3970801_[1-1024]
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 3:44 1 clus030 3961285_10125
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 3:25 1 clus006 3961285_10150
teamfraser 1000 dp_conv_WB_mute_fina kianchee R 3:25 1 clus012 3961285_10175

Bjorn's dp_clsmth jobs should not be getting on. Now, is this the work of the backfill scheduler? Has a job just finished as the backfill scheduler is traversing the empty nodes for the idle queue, and it finds a node? I notice that they appear to get on roughly every 5 minutes, which fits with:

SchedulerParameters=bf_continue,bf_max_job_test=50000,bf_interval=300,bf_resolution=300,defer,max_depend_depth=3,sched_interval=20,batch_sched_delay=10

It isn't a massive issue, but our users do notice, and some sort of explanation would be useful.
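For what it's worth, the suspected race can be sketched as a toy model. This is illustrative Python only, not Slurm's actual backfill code; the function name, the event model, and the placement logic are all invented to show how a node freeing mid-pass could land a lower-priority job:

```python
# Toy model (NOT Slurm source) of the suspected race: the backfill scheduler
# works from the node state seen as it walks the queue. If a node in
# clus[001-040] frees up only after the higher-priority teamfraser job has
# already been tested, a lower-priority idle-partition job tested later in
# the same pass can grab that node.

def backfill_pass(pending_jobs, free_nodes, events_during_pass):
    """Walk pending jobs in priority order against a changing free-node set.

    pending_jobs: list of (priority, partition, name), highest priority first
    free_nodes: nodes free at the start of the pass
    events_during_pass: {job_name: node that frees while that job is tested}
    Returns the (name, node) placements made during this pass.
    """
    placements = []
    free = set(free_nodes)
    for priority, partition, name in pending_jobs:
        # A running job may finish while this candidate is being tested...
        if name in events_during_pass:
            free.add(events_during_pass[name])
        # ...and the newly freed node goes to whichever job is under test,
        # even though a higher-priority job was examined moments earlier.
        if free:
            node = sorted(free)[0]
            free.remove(node)
            placements.append((name, node))
    return placements

# The teamfraser job is tested first but finds nothing free; by the time the
# idle job is tested, clus030 has freed, and the idle job is placed on it.
pending = [(1000, "teamfraser", "dp_conv"), (800, "idle", "dp_clsmth")]
print(backfill_pass(pending, free_nodes=[],
                    events_during_pass={"dp_clsmth": "clus030"}))
# -> [('dp_clsmth', 'clus030')]
```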
I spent about a week working with Brigham Young University on a backfill scheduling problem that may be the same as this one (see bug 911; there are patches there that add more debugging logic and fix some minor problems, but attachment 1011 [details] appears to address the root problem). This patch has not yet been committed to our code base, so I am attaching it here too. As you say, "jobs on the idle queue should not be going onto clus[001-040] unless there are no jobs in teamfraser...", except when jobs are blocked by hitting some limit (e.g. the maximum running jobs for some user), waiting for a dependency, requesting specific nodes that are allocated to some other long-running job, etc.
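To illustrate the distinction being drawn above, here is a hypothetical helper, not part of Slurm. The reason strings follow the values squeue prints in its REASON column, but the exact set of reasons treated as "blocked" is an assumption for illustration:

```python
# Hypothetical classifier: separate pending jobs that are blocked for reasons
# of their own (dependency, holds, per-user limits) from jobs pending only on
# Resources/Priority, which are the ones that should normally keep
# lower-priority partitions off their nodes. The BLOCKED_REASONS set is an
# illustrative assumption, not an authoritative list from Slurm.
BLOCKED_REASONS = {
    "Dependency",
    "JobHeldUser",
    "JobHeldAdmin",
    "ReqNodeNotAvail",
}

def truly_waiting(jobs):
    """Return (jobid, reason) pairs pending only on Resources/Priority."""
    return [(jid, reason) for jid, reason in jobs
            if reason not in BLOCKED_REASONS]

pending = [
    ("3961357", "Dependency"),          # blocked: waiting on another job
    ("3966887_[5-100]", "Resources"),   # genuinely waiting for nodes
    ("3966988_[1-100]", "Priority"),    # genuinely waiting behind others
]
print(truly_waiting(pending))
# -> [('3966887_[5-100]', 'Resources'), ('3966988_[1-100]', 'Priority')]
```

In the squeue output above, the teamfraser array jobs pending with (Resources) fall into the second group, which is why idle jobs landing on their nodes looked wrong.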
Created attachment 1027 [details]
Likely fix for backfill scheduling bug

BYU has been running with this patch for about a week, and their backfill scheduling problems have ceased.
Thanks, I'll get the patch installed. I agree that jobs "can" get blocked, but this wasn't one of those cases :)
We have been running with this patch for the last few days. No complaints from users and I haven't noticed anything going wrong.
(In reply to Stuart Midgley from comment #4)
> We have been running with this patch for the last few days. No complaints
> from users and I haven't noticed anything going wrong.

BYU has also reported the problem fixed with this change, which will be included in version 14.03.5. Closing the ticket.