Ticket 748 - sched/basic: Pending jobs with Reason=None
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.11.x
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-04-23 21:48 MDT by Phil Schwan
Modified: 2014-05-15 09:31 MDT

See Also:
Site: DownUnder GeoSolutions
Version Fixed: 14.03.4



Description Phil Schwan 2014-04-23 21:48:53 MDT
Please consider this abridged squeue output:

>             JOBID PARTITION PRIORITY     NAME     USER ST       TIME  NODES NODELIST(REASON)
>      1055291_4890    teambm 4294884871 dc_CLsmo   bjornm PD       0:00      1 (Dependency)
>           1055292    teambm 4294884870 dm_CLsmo   bjornm PD       0:00      1 (Dependency)
>    1055304_[1-25]    teambm 4294884839 cvg_DRpr    yanaz PD       0:00      1 (Resources)
>  1055329_[1-1150]    teambm 4294884837 ortho_DR    yanaz PD       0:00      1 (Dependency)
>     1058562_[2-3]    teambm 4294884806 dp_01_De    olgab PD       0:00      1 (Resources)
>    1058459_[1-18]    teambm 4294884101 har_rms_ shahrilz PD       0:00      1 (Resources)
>           1058477    teambm 4294884098 foldplot shahrilz PD       0:00      1 (Dependency)
> 1058478_[250,325,    teambm 4294884095 har_rms_ shahrilz PD       0:00      1 (Dependency)
>           1058557    teambm 4294884092 har_dugi shahrilz PD       0:00      1 (Dependency)
>           1058558    teambm 4294884089 histogra shahrilz PD       0:00      1 (Dependency)
>           1058559    teambm 4294884086 RMS_surv shahrilz PD       0:00      1 (Dependency)
>           1058560    teambm 4294884083 Mk_offse shahrilz PD       0:00      1 (Dependency)
>           1068533    teambm 4294879705     bash michaeld PD       0:00      1 (None)
>           1068829    teambm 4294879701     bash michaeld PD       0:00      1 (Resources)
>         776363_75    teambm 4294878859 a_harFM_    yanaz PD       0:00      1 (Priority)
>        778577_113    teambm 4294878858 a_harFM_    yanaz PD       0:00      1 (Priority)
>        781703_167    teambm 4294878809 a_harFM_    yanaz PD       0:00      1 (None)
>        783657_201    teambm 4294878806 a_harFM_    yanaz PD       0:00      1 (None)
>        787108_260    teambm 4294878804 a_harFM_    yanaz PD       0:00      1 (None)
>        788411_283    teambm 4294878797 a_harFM_    yanaz PD       0:00      1 (None)
>        790823_320    teambm 4294878796 a_harFM_    yanaz PD       0:00      1 (None)
>        791149_326    teambm 4294878795 a_harFM_    yanaz PD       0:00      1 (None)
>      1053527_3307    teambm 4294878701 dp_CLsmo   bjornm PD       0:00      1 (None)
>   1070541_[1-972]    teambm 4294878666 dp_LNR2_   karend PD       0:00      1 (None)

Question #1: does Reason=None imply that the scheduler isn't reaching those jobs during its scheduling loop?

Unlike before, we're not seeing slurmctld take 100% CPU (not even close -- under 30%).  And we see nothing in the log to indicate that it's breaking out of the loop due to a timeout (currently at 10 seconds).


Question #2: if it's not reaching those jobs, how is it reaching other jobs that are lower-priority by every measure?

e.g. 1068533 = None.  But the lower-priority 1068829, 776363_75, and 778577_113 all have legit reasons?

Or if it's walking the list of jobs in submit order: 781703_167 = None, but 1058562 has a legit reason.


Example #2 -- these are the only jobs that aren't blocked by Dependencies in the whole partition; created with
squeue -p teambm -o "%.18i %.9P %Q %.8j %.8u %.2t %.10M %.6D %R" -t PD | grep -v Dependency

>             JOBID PARTITION PRIORITY     NAME     USER ST       TIME  NODES NODELIST(REASON)
>         776363_75    teambm 4294878859 a_harFM_    yanaz PD       0:00      1 (Priority)
>        778577_113    teambm 4294878858 a_harFM_    yanaz PD       0:00      1 (Priority)
>        781703_167    teambm 4294878809 a_harFM_    yanaz PD       0:00      1 (None)
>        783657_201    teambm 4294878806 a_harFM_    yanaz PD       0:00      1 (None)
>        787108_260    teambm 4294878804 a_harFM_    yanaz PD       0:00      1 (None)
>        788411_283    teambm 4294878797 a_harFM_    yanaz PD       0:00      1 (None)
>        790823_320    teambm 4294878796 a_harFM_    yanaz PD       0:00      1 (None)
>        791149_326    teambm 4294878795 a_harFM_    yanaz PD       0:00      1 (None)
>   1070541_[1-972]    teambm 4294878666 dp_LNR2_   karend PD       0:00      1 (None)
>           1076470    teambm 4294878453 expt9_to     timb PD       0:00      4 (Resources)
>        1055304_23    teambm 4294878450 cvg_DRpr    yanaz PD       0:00      1 (None)
> 1044325_[228,398, teambm,te 4294879688 cvg_DReg    yanaz PD       0:00      1 (None)

Question #3: am I correct in thinking that the "highest-priority" job -- the one that should go on next -- will have Reason=Resources.  While tasks that are lower in the priority queue will have Reason=Priority?

If so, the fact that it's chosen 1076470 to run next seems wrong?  (We have only one QOS type, so it's not that.)

I'm befuddled (which is not unusual)
Comment 1 Moe Jette 2014-04-24 09:32:41 MDT
(In reply to Phil Schwan from comment #0)
> Please consider this abridged squeue output:
> 
> >             JOBID PARTITION PRIORITY     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >      1055291_4890    teambm 4294884871 dc_CLsmo   bjornm PD       0:00      1 (Dependency)
> >           1055292    teambm 4294884870 dm_CLsmo   bjornm PD       0:00      1 (Dependency)
> >    1055304_[1-25]    teambm 4294884839 cvg_DRpr    yanaz PD       0:00      1 (Resources)
> >  1055329_[1-1150]    teambm 4294884837 ortho_DR    yanaz PD       0:00      1 (Dependency)
> >     1058562_[2-3]    teambm 4294884806 dp_01_De    olgab PD       0:00      1 (Resources)
> >    1058459_[1-18]    teambm 4294884101 har_rms_ shahrilz PD       0:00      1 (Resources)
> >           1058477    teambm 4294884098 foldplot shahrilz PD       0:00      1 (Dependency)
> > 1058478_[250,325,    teambm 4294884095 har_rms_ shahrilz PD       0:00      1 (Dependency)
> >           1058557    teambm 4294884092 har_dugi shahrilz PD       0:00      1 (Dependency)
> >           1058558    teambm 4294884089 histogra shahrilz PD       0:00      1 (Dependency)
> >           1058559    teambm 4294884086 RMS_surv shahrilz PD       0:00      1 (Dependency)
> >           1058560    teambm 4294884083 Mk_offse shahrilz PD       0:00      1 (Dependency)
> >           1068533    teambm 4294879705     bash michaeld PD       0:00      1 (None)
> >           1068829    teambm 4294879701     bash michaeld PD       0:00      1 (Resources)
> >         776363_75    teambm 4294878859 a_harFM_    yanaz PD       0:00      1 (Priority)
> >        778577_113    teambm 4294878858 a_harFM_    yanaz PD       0:00      1 (Priority)
> >        781703_167    teambm 4294878809 a_harFM_    yanaz PD       0:00      1 (None)
> >        783657_201    teambm 4294878806 a_harFM_    yanaz PD       0:00      1 (None)
> >        787108_260    teambm 4294878804 a_harFM_    yanaz PD       0:00      1 (None)
> >        788411_283    teambm 4294878797 a_harFM_    yanaz PD       0:00      1 (None)
> >        790823_320    teambm 4294878796 a_harFM_    yanaz PD       0:00      1 (None)
> >        791149_326    teambm 4294878795 a_harFM_    yanaz PD       0:00      1 (None)
> >      1053527_3307    teambm 4294878701 dp_CLsmo   bjornm PD       0:00      1 (None)
> >   1070541_[1-972]    teambm 4294878666 dp_LNR2_   karend PD       0:00      1 (None)
> 
> Question #1: does Reason=None imply that the scheduler isn't reaching those
> jobs during its scheduling loop?

Correct.


> Unlike before, we're not seeing slurmctld take 100% CPU (not even close --
> under 30%).  And we see nothing in the log to indicate that it's breaking
> out of the loop due to a timeout (currently at 10 seconds).

My best guess is that "SchedulerParameters=bf_max_job_user=100" is causing some jobs not to be tested. Note that each task from a job array counts as a job in this limit calculation, although there is some logic to speed up scheduling of job arrays that should arguably set reasons for the skipped jobs. I'll do that soon. We don't currently have any reason configured to match this condition, but it could be added rather easily. Perhaps something like "SchedulerLimit". What do you think?
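The per-user limit behavior described above can be pictured with a minimal sketch (illustrative Python, not Slurm's actual C code; the function and reason names are invented for the example): jobs past the user's tested-job budget are never examined, so their reason stays None.

```python
def schedule_pass(jobs, max_job_user):
    """jobs: iterable of (user, job_id, array_task_count), in priority order.
    Returns the pending reason left on each job after one scheduling pass."""
    tested_per_user = {}
    reasons = {}
    for user, job_id, ntasks in jobs:
        if tested_per_user.get(user, 0) >= max_job_user:
            reasons[job_id] = "None"    # over the limit: never examined this pass
            continue
        # every task of a job array counts toward the user's limit
        tested_per_user[user] = tested_per_user.get(user, 0) + ntasks
        reasons[job_id] = "Resources"   # examined; the real code picks the true reason
    return reasons

reasons = schedule_pass(
    [("yanaz", 1055304, 25), ("yanaz", 1055329, 1150), ("yanaz", 781703, 1)],
    max_job_user=100)
# 1055304 and 1055329 are tested, but their 1175 tasks exhaust yanaz's
# budget of 100, so 781703 is skipped and left showing Reason=None
```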


> Question #2: if it's not reaching those jobs, how is it reaching other jobs
> that are lower-priority by every measure?
>
> e.g. 1068533 = None.  But the lower-priority 1068829, 776363_75, and
> 778577_113 all have legit reasons?
> 
> Or if it's walking the list of jobs in submit order: 781703_167 = None, but
> 1058562 has a legit reason.

Jobs are checked in priority order.
Some of those jobs have a lower job ID and were submitted earlier, so those jobs may have had their reasons set in earlier scheduler runs. After a job's reason is set, it is not cleared until reset by the scheduling logic. So if a job was blocked by a dependency and that dependency is satisfied, the reason is not cleared until the scheduler gets around to it.
Some of the jobs belong to other users and are not subject to the same "SchedulerParameters=bf_max_job_user=100" limit.
It is also possible that job 1068533 was held, restarted, or something else happened to prevent its reason from being set at the same time as job 1068829.


> Example #2 -- these are the only jobs that aren't blocked by Dependencies in
> the whole partition; created with
> squeue -p teambm -o "%.18i %.9P %Q %.8j %.8u %.2t %.10M %.6D %R" -t PD |
> grep -v Dependency
> 
> >             JOBID PARTITION PRIORITY     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >         776363_75    teambm 4294878859 a_harFM_    yanaz PD       0:00      1 (Priority)
> >        778577_113    teambm 4294878858 a_harFM_    yanaz PD       0:00      1 (Priority)
> >        781703_167    teambm 4294878809 a_harFM_    yanaz PD       0:00      1 (None)
> >        783657_201    teambm 4294878806 a_harFM_    yanaz PD       0:00      1 (None)
> >        787108_260    teambm 4294878804 a_harFM_    yanaz PD       0:00      1 (None)
> >        788411_283    teambm 4294878797 a_harFM_    yanaz PD       0:00      1 (None)
> >        790823_320    teambm 4294878796 a_harFM_    yanaz PD       0:00      1 (None)
> >        791149_326    teambm 4294878795 a_harFM_    yanaz PD       0:00      1 (None)
> >   1070541_[1-972]    teambm 4294878666 dp_LNR2_   karend PD       0:00      1 (None)
> >           1076470    teambm 4294878453 expt9_to     timb PD       0:00      4 (Resources)
> >        1055304_23    teambm 4294878450 cvg_DRpr    yanaz PD       0:00      1 (None)
> > 1044325_[228,398, teambm,te 4294879688 cvg_DReg    yanaz PD       0:00      1 (None)

My best guess is that "SchedulerParameters=bf_max_job_user=100" is causing some jobs to not be tested. Note that each task from a job array counts as a job in this limit calculation.


> Question #3: am I correct in thinking that the "highest-priority" job -- the
> one that should go on next -- will have Reason=Resources.  While tasks that
> are lower in the priority queue will have Reason=Priority?

Ignoring job dependencies, begin times in the future, resource limits, etc., then in the simple case your answer is yes.
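That simple case can be sketched as follows (an illustrative Python model, not Slurm source; names are invented): the highest-priority job that cannot fit gets Resources, and everything queued behind it gets Priority.

```python
def assign_reasons(pending_jobs, free_nodes):
    """pending_jobs: list of (job_id, nodes_wanted), highest priority first.
    In strict priority order, the first job that cannot fit is waiting on
    Resources; everything behind it must wait its turn regardless of fit,
    so it is waiting on Priority."""
    reasons = {}
    blocked = False
    for job_id, nodes in pending_jobs:
        if not blocked and nodes <= free_nodes:
            free_nodes -= nodes            # job would start; no pending reason
            reasons[job_id] = "Running"
        elif not blocked:
            reasons[job_id] = "Resources"  # highest-priority job that can't fit
            blocked = True
        else:
            reasons[job_id] = "Priority"   # queued behind the blocked job
    return reasons

r = assign_reasons([(101, 4), (102, 1), (103, 2)], free_nodes=2)
# 101 wants 4 nodes but only 2 are free -> Resources; 102, 103 -> Priority
```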


> If so, the fact that it's chosen 1076470 to run next seems wrong?  (We have
> only one QOS type, so it's not that.)
> 
> I'm befuddled (which is not unusual)

My best guess is that "SchedulerParameters=bf_max_job_user=100" is causing some jobs to not be tested. Note that each task from a job array counts as a job in this limit calculation.
Comment 2 Moe Jette 2014-04-24 10:13:16 MDT
For job arrays, once one task is NOT scheduled, then all of the others get skipped for performance reasons. I've added logic to propagate the "Reason" field to all of the job array tasks. That will probably eliminate all of the "None" that you reported.

https://github.com/SchedMD/slurm/commit/e4193dda1e0f86595ba78c15626f0942cef3977f
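The idea of the fix can be sketched like this (a hypothetical Python rendering of the behavior; the actual change in the linked commit is in Slurm's C code):

```python
def schedule_array(tasks, try_schedule):
    """tasks: ordered task IDs of one job array.
    try_schedule(tid) returns None if the task starts, else a reason string.
    Once one task fails, the remaining tasks are skipped for performance;
    the fix propagates that task's reason onto all of them instead of
    leaving them at Reason=None."""
    reasons = {}
    for i, tid in enumerate(tasks):
        reason = try_schedule(tid)
        if reason is None:
            reasons[tid] = "Running"
            continue
        # first failure: copy the reason to this and every remaining task
        for rest in tasks[i:]:
            reasons[rest] = reason
        break
    return reasons

r = schedule_array([1, 2, 3, 4], lambda tid: None if tid == 1 else "Resources")
# task 1 starts; tasks 2-4 all show Resources rather than task 2 alone
```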
Comment 3 Moe Jette 2014-04-24 10:28:27 MDT
Based upon David's information about use of the sched/builtin plugin, I believe the change that I made to set the reason for all elements of a job array will probably make all of these issues go away with a couple of exceptions.


(In reply to Phil Schwan from comment #0)
> Please consider this abridged squeue output:
> 
> >             JOBID PARTITION PRIORITY     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >      1055291_4890    teambm 4294884871 dc_CLsmo   bjornm PD       0:00      1 (Dependency)
> >           1055292    teambm 4294884870 dm_CLsmo   bjornm PD       0:00      1 (Dependency)
> >    1055304_[1-25]    teambm 4294884839 cvg_DRpr    yanaz PD       0:00      1 (Resources)
> >  1055329_[1-1150]    teambm 4294884837 ortho_DR    yanaz PD       0:00      1 (Dependency)
> >     1058562_[2-3]    teambm 4294884806 dp_01_De    olgab PD       0:00      1 (Resources)
> >    1058459_[1-18]    teambm 4294884101 har_rms_ shahrilz PD       0:00      1 (Resources)
> >           1058477    teambm 4294884098 foldplot shahrilz PD       0:00      1 (Dependency)
> > 1058478_[250,325,    teambm 4294884095 har_rms_ shahrilz PD       0:00      1 (Dependency)
> >           1058557    teambm 4294884092 har_dugi shahrilz PD       0:00      1 (Dependency)
> >           1058558    teambm 4294884089 histogra shahrilz PD       0:00      1 (Dependency)
> >           1058559    teambm 4294884086 RMS_surv shahrilz PD       0:00      1 (Dependency)
> >           1058560    teambm 4294884083 Mk_offse shahrilz PD       0:00      1 (Dependency)
> >           1068533    teambm 4294879705     bash michaeld PD       0:00      1 (None)
> >           1068829    teambm 4294879701     bash michaeld PD       0:00      1 (Resources)
> >         776363_75    teambm 4294878859 a_harFM_    yanaz PD       0:00      1 (Priority)
> >        778577_113    teambm 4294878858 a_harFM_    yanaz PD       0:00      1 (Priority)
> >        781703_167    teambm 4294878809 a_harFM_    yanaz PD       0:00      1 (None)
> >        783657_201    teambm 4294878806 a_harFM_    yanaz PD       0:00      1 (None)
> >        787108_260    teambm 4294878804 a_harFM_    yanaz PD       0:00      1 (None)
> >        788411_283    teambm 4294878797 a_harFM_    yanaz PD       0:00      1 (None)
> >        790823_320    teambm 4294878796 a_harFM_    yanaz PD       0:00      1 (None)
> >        791149_326    teambm 4294878795 a_harFM_    yanaz PD       0:00      1 (None)
> >      1053527_3307    teambm 4294878701 dp_CLsmo   bjornm PD       0:00      1 (None)
> >   1070541_[1-972]    teambm 4294878666 dp_LNR2_   karend PD       0:00      1 (None)
> 
> Question #1: does Reason=None imply that the scheduler isn't reaching those
> jobs during its scheduling loop?
> 
> Unlike before, we're not seeing slurmctld take 100% CPU (not even close --
> under 30%).  And we see nothing in the log to indicate that it's breaking
> out of the loop due to a timeout (currently at 10 seconds).

It's just not setting the reason for all tasks of a job array.


> Question #2: if it's not reaching those jobs, how is it reaching other jobs
> that are lower-priority by every measure?
> 
> e.g. 1068533 = None.  But the lower-priority 1068829, 776363_75, and
> 778577_113 all have legit reasons?
> 
> Or if it's walking the list of jobs in submit order: 781703_167 = None, but
> 1058562 has a legit reason.

It's working in priority order, but some of the information may be vestigial and not current. The only issue I see above is for job 1068533. My best guess is that it has not yet been tested because it is new or the reason it was blocked has recently been removed (e.g. a job dependency was recently satisfied). The scheduler only runs once per minute by default and quits after 4 seconds of run time.


> Example #2 -- these are the only jobs that aren't blocked by Dependencies in
> the whole partition; created with
> squeue -p teambm -o "%.18i %.9P %Q %.8j %.8u %.2t %.10M %.6D %R" -t PD |
> grep -v Dependency
> 
> >             JOBID PARTITION PRIORITY     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >         776363_75    teambm 4294878859 a_harFM_    yanaz PD       0:00      1 (Priority)
> >        778577_113    teambm 4294878858 a_harFM_    yanaz PD       0:00      1 (Priority)
> >        781703_167    teambm 4294878809 a_harFM_    yanaz PD       0:00      1 (None)
> >        783657_201    teambm 4294878806 a_harFM_    yanaz PD       0:00      1 (None)
> >        787108_260    teambm 4294878804 a_harFM_    yanaz PD       0:00      1 (None)
> >        788411_283    teambm 4294878797 a_harFM_    yanaz PD       0:00      1 (None)
> >        790823_320    teambm 4294878796 a_harFM_    yanaz PD       0:00      1 (None)
> >        791149_326    teambm 4294878795 a_harFM_    yanaz PD       0:00      1 (None)
> >   1070541_[1-972]    teambm 4294878666 dp_LNR2_   karend PD       0:00      1 (None)
> >           1076470    teambm 4294878453 expt9_to     timb PD       0:00      4 (Resources)
> >        1055304_23    teambm 4294878450 cvg_DRpr    yanaz PD       0:00      1 (None)
> > 1044325_[228,398, teambm,te 4294879688 cvg_DReg    yanaz PD       0:00      1 (None)

It's just not setting the reason for all tasks of a job array.


> Question #3: am I correct in thinking that the "highest-priority" job -- the
> one that should go on next -- will have Reason=Resources.  While tasks that
> are lower in the priority queue will have Reason=Priority?
> 
> If so, the fact that it's chosen 1076470 to run next seems wrong?  (We have
> only one QOS type, so it's not that.)
> 
> I'm befuddled (which is not unusual)

If the resources are not available, you might see Resources rather than Priority. Does the queue have 4 nodes up?
Comment 4 Phil Schwan 2014-04-24 12:13:24 MDT
(In reply to Moe Jette from comment #3)
> 
> It's just not setting the reason for all tasks of a job array.

Although these a_* jobs have the same name, I'm 99.9% sure they were actually NOT submitted as an array (because our scripts predate the ability to have different dependencies for each task -- so they were almost certainly each submitted as separate jobs)

Also, if they were all part of a single job array, wouldn't they print on a single line?  This was not an squeue -r

> The only issue I see above is for job 1068533. My best
> guess is that it has not yet been tested because it is new or the reason it
> was blocked has recently been removed (e.g. a job dependency was recently
> satisfied).

It's a good theory, but I don't think it holds water.  I happen to know about that job because a user emailed me when it was submitted.  When I took that squeue snapshot it had been around for a while, and had never had dependencies (he was just trying to get a shell on a big-memory machine)

> > >             JOBID PARTITION PRIORITY     NAME     USER ST       TIME  NODES NODELIST(REASON)
> > >         776363_75    teambm 4294878859 a_harFM_    yanaz PD       0:00      1 (Priority)
> > >        778577_113    teambm 4294878858 a_harFM_    yanaz PD       0:00      1 (Priority)
> > >        781703_167    teambm 4294878809 a_harFM_    yanaz PD       0:00      1 (None)
> > >        783657_201    teambm 4294878806 a_harFM_    yanaz PD       0:00      1 (None)
> > >        787108_260    teambm 4294878804 a_harFM_    yanaz PD       0:00      1 (None)
> > >        788411_283    teambm 4294878797 a_harFM_    yanaz PD       0:00      1 (None)
> > >        790823_320    teambm 4294878796 a_harFM_    yanaz PD       0:00      1 (None)
> > >        791149_326    teambm 4294878795 a_harFM_    yanaz PD       0:00      1 (None)
> > >   1070541_[1-972]    teambm 4294878666 dp_LNR2_   karend PD       0:00      1 (None)
> > >           1076470    teambm 4294878453 expt9_to     timb PD       0:00      4 (Resources)
> > >        1055304_23    teambm 4294878450 cvg_DRpr    yanaz PD       0:00      1 (None)
> > > 1044325_[228,398, teambm,te 4294879688 cvg_DReg    yanaz PD       0:00      1 (None)
> 
> If the resources are not available, you might see Resources rather than
> priority. Does the queue have 4 nodes up?

It most certainly does.  There weren't 4 nodes available to run at that instant, but there were ~220 allocated to other work.
Comment 5 Stuart Midgley 2014-04-24 14:42:45 MDT
FWIW our scheduler parameters are

SchedulerParameters=default_queue_depth=50000,max_depend_depth=3,defer,sched_interval=30,max_sched_time=10,partition_job_depth=1000
Comment 6 Phil Schwan 2014-04-24 14:48:33 MDT
(In reply to Moe Jette from comment #1)
> 
> We really don't have any reason currently
> configured to match this condition, but it could be added rather easily.
> Perhaps something like "SchedulerLimit". What do you think?

I missed this question the first time around -- it sounds good to me.  I support any additional information that promotes better understanding of why something behaves a certain way!
Comment 7 Moe Jette 2014-04-25 03:52:09 MDT
(In reply to Phil Schwan from comment #4)
> (In reply to Moe Jette from comment #3)
> > 
> > It's just not setting the reason for all tasks of a job array.
> 
> Although these a_* jobs have the same name, I'm 99.9% sure they were
> actually NOT submitted as an array (because our scripts predate the ability
> to have different dependencies for each task -- so they were almost
> certainly each submitted as separate jobs)

Either they were submitted as job arrays or something is corrupting memory. Also note that if someone re-runs a specific task of a job array (e.g. "sbatch -a 123 ..."), that will appear as a job array and look like your squeue output.


> Also, if they were all part of a single job array, wouldn't they print on a
> single line?  This was not an squeue -r

Note there are a multitude of job IDs; each looks like a different task of a different job array. A job array's tasks only get printed on a single line if they are for the same job ID and have an identical state (typically pending, since once they start each job record will differ; I do see these are pending).
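That grouping rule can be sketched roughly as follows (illustrative Python, not squeue's actual implementation; it assumes contiguous task IDs for the bracketed range notation): tasks collapse onto one line only when they share a job ID and state, which is why single-task arrays each print on their own line.

```python
from collections import defaultdict

def group_for_display(tasks):
    """tasks: list of (array_job_id, task_id, state).
    Tasks of the same array job with identical state collapse into one
    'jobid_[a-b]' line; distinct job IDs never merge."""
    groups = defaultdict(list)
    for job_id, task_id, state in tasks:
        groups[(job_id, state)].append(task_id)
    lines = []
    for (job_id, state), ids in groups.items():
        if len(ids) == 1:
            lines.append(f"{job_id}_{ids[0]}")
        else:
            # contiguous ranges only, for brevity
            lines.append(f"{job_id}_[{ids[0]}-{ids[-1]}]")
    return lines

print(group_for_display([(1070541, i, "PD") for i in range(1, 973)]))
# -> ['1070541_[1-972]']
print(group_for_display([(778577, 113, "PD"), (781703, 167, "PD")]))
# -> ['778577_113', '781703_167']
```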
Comment 8 Stuart Midgley 2014-04-25 03:59:19 MDT
Moe/Phil, I think you're sort of on different but the same plane...

Yes, each of those a_ jobs was submitted as a job array... but each with a single task.

>        778577_113    teambm 4294878858 a_harFM_    yanaz PD       0:00      1 (Priority)
>        781703_167    teambm 4294878809 a_harFM_    yanaz PD       0:00      1 (None)


Here, there are 2 job arrays 778577_113 and 781703_167.

So, does the explanation about not setting the reason correctly still hold if there is a single task in a job array?
Comment 9 Moe Jette 2014-04-25 04:12:51 MDT
(In reply to Stuart Midgley from comment #8)
> Moe/Phil, I think your sort of on different but the same plane...
> 
> Yes, each of those a_ jobs was submitted as a job array... but each with a
> single task.
> 
> >        778577_113    teambm 4294878858 a_harFM_    yanaz PD       0:00      1 (Priority)
> >        781703_167    teambm 4294878809 a_harFM_    yanaz PD       0:00      1 (None)
> 
> 
> Here, there are 2 job arrays 778577_113 and 781703_167.

That makes sense.


> So, does the explanation about not setting the reason correctly still hold
> if their is a single task in a job array?

Why the jobs are not being tested is then the big question.
Comment 10 Moe Jette 2014-04-25 04:29:03 MDT
(In reply to Phil Schwan from comment #6)
> (In reply to Moe Jette from comment #1)
> > 
> > We really don't have any reason currently
> > configured to match this condition, but it could be added rather easily.
> > Perhaps something like "SchedulerLimit". What do you think?
> 
> I missed this question the first time around -- it sounds good to me.  I
> support any additional information that promotes better understanding of why
> something behaves a certain way!

Done. This will not fix any scheduling problems, but will provide more information to help diagnose what is happening.
Fri Apr 25 09:24:49 2014
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               191     debug      tmp    jette PD       0:00      1 (SchedTimeout)
               194     debug      tmp    jette PD       0:00      1 (SchedTimeout)
               186     debug      tmp    jette  R       0:59      1 tux1
               187     debug      tmp    jette  R       0:59      1 tux1
Comment 11 Moe Jette 2014-04-25 09:16:56 MDT
(In reply to Stuart Midgley from comment #5)
> FWIW our scheduler parameters are
> 
> SchedulerParameters=default_queue_depth=50000,max_depend_depth=3,defer,
> sched_interval=30,max_sched_time=10,partition_job_depth=1000

Assuming that you enable backfill scheduling, I would comment this line out and set some different parameters as follows

default_queue_depth=10  (or just remove this)
You probably want to do most of the job scheduling in the backfill logic rather than the main scheduling loop. The reason is that the backfill scheduler will relinquish locks every couple of seconds to avoid blocking user requests. This change will greatly reduce the time consumed by the main scheduling loop when it runs on events that can permit jobs to start.

sched_interval=30,partition_job_depth=1000
Probably remove these too for the same reason.

max_sched_time=10
Leave this so that when the scheduler does run through all of the jobs (every minute or so), it can get through all of them.

bf_continue
Allow the backfill scheduler to relinquish locks and resume scheduling after processing pending commands.

bf_max_job_test=50000
Some big number so the backfill scheduler runs through all jobs

bf_resolution=300
Reduces overhead in bookkeeping.

The net result is this
SchedulerParameters=bf_continue,bf_max_job_test=50000,bf_resolution=300,defer,max_sched_time=10
SchedulerType=sched/backfill
Comment 12 Phil Schwan 2014-04-25 14:21:33 MDT
> Either they were submitted as job arrays or something is corrupting memory.

Sorry for the confusion -- Stu is 100% right.  They're submitted as single-task arrays.

...

I think we're going to take your advice and enable the backfill scheduler.

But I thought this particular issue would be simpler to understand on the basic scheduler, with way fewer moving parts, before we cut over?
Comment 13 Moe Jette 2014-04-28 04:33:57 MDT
(In reply to Phil Schwan from comment #12)
> I think we're going to take your advice and enable the backfill scheduler.
> 
> But I thought this particular issue would be simpler to understand on the
> basic scheduler, with way fewer moving parts, before we cut over?

Perhaps too few moving parts. The only computers using sched/builtin that I know of are running a bunch of almost identical jobs where FIFO makes sense (e.g. seismic data processing), but not many computers really want FIFO (or perhaps I should say scheduling in strict priority order).
Comment 14 Stuart Midgley 2014-04-28 11:50:54 MDT
Of course... the industry you singled out is exactly what we do! Which is why we are basically using the FIFO scheduler as well.
Comment 15 Moe Jette 2014-04-29 13:26:57 MDT
(In reply to Stuart Midgley from comment #14)
> Of course... the industry you singled out is exactly what we do! Which is
> why we are basically using the FIFO scheduler as well.

Except that your system is heterogeneous (different memory sizes), so strict FIFO is likely not really what you want.
Comment 16 Moe Jette 2014-05-12 06:43:57 MDT
Is this still a problem?
Comment 17 Phil Schwan 2014-05-13 00:09:53 MDT
Not that I can see at first glance, anyway.  I'm willing to call it fixed by your "set the reason for all tasks" patch, unless I can prove otherwise.

Thanks
Comment 18 Moe Jette 2014-05-15 09:31:50 MDT
Fixed in supplied patches