Ticket 1829

Summary: Lower-priority jobs get scheduled before higher-priority ones
Product: Slurm
Reporter: Akmal Madzlan <akmalm>
Component: Scheduling
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Priority: ---
CC: brian, da
Version: 14.11.8
Version Fixed: 14.11.9
Hardware: Linux
OS: Linux
Site: DownUnder GeoSolutions
Attachments: config & log
Bug fix

Description Akmal Madzlan 2015-07-27 21:42:45 MDT
Created attachment 2074 [details]
config & log

Two jobs submitted to multiple partitions (lud36,teamregent,fastio,idle):
4347312 with priority 401 and 4347330 with priority 400.
Occasionally, job 4347330 gets scheduled before 4347312.

Slurm config and logs attached.

[akmalm@lud34 ~]$ squeue -u username -o "%10P %10Q %10i %t"
PARTITION  PRIORITY   JOBID      ST
lud36,team 401        4347312_[9 PD
lud36,team 400        4347330_[1 PD
teamregent 401        4347312_98 R
teamregent 401        4347312_98 R
teamregent 401        4347312_98 R
teamregent 401        4347312_98 R
teamregent 401        4347312_99 R
fastio     400        4347330_15 R
fastio     400        4347330_15 R
lud36      400        4347330_15 R


JobId=4347312 ArrayJobId=4347312 ArrayTaskId=997-1212 JobName=2012_neartracecube
   UserId=michaelp(1309) GroupId=teamregent(2113)
   Priority=401 Nice=0 Account=(null) QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2015-07-28T08:57:26 EligibleTime=2015-07-28T08:57:36
   StartTime=2016-07-27T10:33:02 EndTime=2015-07-28T11:33:02
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=lud36,teamregent,fastio,idle AllocNode:Sid=lud36:22465
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/200jobs/rj.2012_neartracecube.R3zxQg
   WorkDir=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/200jobs
   Comment=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/200jobs/workflow.job 
   StdErr=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/000scratch/logs/2012_neartracecube.o4347312.4294967294
   StdIn=/dev/null
   StdOut=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/000scratch/logs/2012_neartracecube.o4347312.4294967294


JobId=4347330 ArrayJobId=4347330 ArrayTaskId=158-1080 JobName=2014_neartracecube
   UserId=michaelp(1309) GroupId=teamregent(2113)
   Priority=400 Nice=0 Account=(null) QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2015-07-28T09:00:44 EligibleTime=2015-07-28T09:00:46
   StartTime=Unknown EndTime=2015-07-28T11:33:42
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=lud36,teamregent,fastio,idle AllocNode:Sid=lud36:22465
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/200jobs/rj.2014_neartracecube.nbperf
   WorkDir=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/200jobs
   Comment=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/200jobs/workflow.job 
   StdErr=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/000scratch/logs/2014_neartracecube.o4347330.4294967294
   StdIn=/dev/null
   StdOut=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/000scratch/logs/2014_neartracecube.o4347330.4294967294
Comment 1 David Bigagli 2015-07-27 21:54:28 MDT
This output shows that 4347312 has eligible time 2015-07-28T08:57:36,
earlier than the eligible time of 4347330, which is 2015-07-28T09:00:46; this is
correct. Do you know which array element started in an order that you think
is incorrect?

David
Comment 2 Akmal Madzlan 2015-07-28 14:03:24 MDT
here

PARTITION  PRIORITY   JOBID      ST
lud36,team 401        4347312_[9 PD
lud36,team 400        4347330_[1 PD
teamregent 401        4347312_98 R
teamregent 401        4347312_98 R
teamregent 401        4347312_98 R
teamregent 401        4347312_98 R
teamregent 401        4347312_99 R
fastio     400        4347330_15 R
fastio     400        4347330_15 R
lud36      400        4347330_15 R

I thought 4347330 would only start when 4347312 finished?
Comment 3 Moe Jette 2015-07-28 15:16:28 MDT
(In reply to Akmal Madzlan from comment #2)
> here
> 
> PARTITION  PRIORITY   JOBID      ST
> lud36,team 401        4347312_[9 PD
> lud36,team 400        4347330_[1 PD
> teamregent 401        4347312_98 R
> teamregent 401        4347312_98 R
> teamregent 401        4347312_98 R
> teamregent 401        4347312_98 R
> teamregent 401        4347312_99 R
> fastio     400        4347330_15 R
> fastio     400        4347330_15 R
> lud36      400        4347330_15 R
> 
> I thought 4347330 will only start when 4347312 finished?

There is logically a separate queue for each partition. Jobs that can run in multiple partitions have an entry in the queue for each of those partitions. None of that explains what you see, though, unless there are priority differences among the various tasks of the job array.

David will be able to review the logs in a few hours.
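The "one queue per partition" structure Moe describes can be sketched with a toy model (this is illustrative Python, not Slurm source; the job data mirrors the squeue output above):

```python
# Toy model of per-partition queues: a job submitted to several
# partitions appears once in each of those partitions' queues,
# and each queue is ordered by descending priority.
from collections import defaultdict

def build_queues(jobs):
    """jobs: list of (job_id, priority, [partitions]).
    Returns {partition: [(priority, job_id), ...]} with each
    queue sorted highest priority first."""
    queues = defaultdict(list)
    for job_id, prio, parts in jobs:
        for part in parts:
            queues[part].append((prio, job_id))
    for q in queues.values():
        q.sort(key=lambda entry: -entry[0])
    return dict(queues)

jobs = [(4347312, 401, ["lud36", "teamregent", "fastio", "idle"]),
        (4347330, 400, ["lud36", "teamregent", "fastio", "idle"])]
queues = build_queues(jobs)
# Each of the four partitions now holds both jobs, 4347312 first.
print(queues["fastio"])  # [(401, 4347312), (400, 4347330)]
```

So within any single partition's queue the priority order is honored; the surprises come from how the scheduler walks the separate queues.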
Comment 4 David Bigagli 2015-07-28 20:14:58 MDT
The log file indeed shows the jobs being started together in a mixed order,
by both the backfill and the ordinary scheduler. As the scheduler goes through
the list of jobs, the order in which it starts them follows the array task id.
Did you observe this with jobs in one partition only?

David
Comment 5 Akmal Madzlan 2015-07-28 22:34:13 MDT
For this job, I only see it happen in partitions fastio and lud36. The other partitions seem fine.
Comment 6 David Bigagli 2015-07-28 22:38:49 MDT
I can reproduce the behaviour by configuring multiple partitions with non-overlapping
hosts. In other words, if partition A has hosts a[1-2] and partition B has
hosts b[1-2], two jobs from two users submitted with -p A,B can both run, in
different partitions. This seems ok to me since the partition resources do not
overlap. This appears to be your case as well, as most of the hosts in your
partitions do not overlap either, except some in idle and fastio.

If I change the partition configuration so that all partitions use the same
hosts, I get strict first-come, first-served behaviour.
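The two configurations David contrasts can be simulated with a rough sketch (hypothetical Python, not Slurm code; partition names A/B and the single-slot host pools are made up for illustration):

```python
# Rough simulation: highest-priority job is offered a slot first, and
# each job takes the first requested partition with a free host.
def schedule(jobs, free_hosts):
    """jobs: [(job_id, priority, [partitions])];
    free_hosts: {partition: free host count}.
    Returns {job_id: partition} for the jobs that started."""
    started = {}
    free = dict(free_hosts)
    for job_id, prio, parts in sorted(jobs, key=lambda j: -j[1]):
        for part in parts:
            if free.get(part, 0) > 0:
                free[part] -= 1
                started[job_id] = part
                break
    return started

jobs = [(4347312, 401, ["A", "B"]), (4347330, 400, ["A", "B"])]

# Disjoint hosts per partition: both jobs start, each in its own
# partition -- the behaviour observed on this cluster.
print(schedule(jobs, {"A": 1, "B": 1}))  # {4347312: 'A', 4347330: 'B'}

# All partitions backed by the same single host (modeled as one free
# slot): only the higher-priority job starts -- strict FCFS order.
print(schedule(jobs, {"A": 1, "B": 0}))  # {4347312: 'A'}
```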

David
Comment 7 Akmal Madzlan 2015-07-29 14:58:16 MDT
So is it a bug, or is it designed to be like that?
Comment 8 Moe Jette 2015-07-29 15:46:42 MDT
All other things being equal, jobs should start in priority order in each queue. David is looking for a bug.
Comment 9 David Bigagli 2015-08-02 21:56:37 MDT
Assigning to the Scheduler team. Support has provided a clear reproducer, sent
in a separate email.

David
Comment 10 Moe Jette 2015-08-03 10:31:03 MDT
If a job array is submitted to multiple partitions and some, but not all tasks of the job array are started in one partition, then the additional tasks in the job array would not be considered for scheduling in another partition. This is fixed with this commit:
https://github.com/SchedMD/slurm/commit/0a51f0ecad6219824090bb22299f26b413b6dcb7
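The defect can be summarized with a schematic sketch (illustrative Python only; it is not the actual Slurm code or the cited patch, and the field names are invented):

```python
# Schematic of the bug: once some tasks of a job array had started in
# one partition, the pending remainder was no longer considered for
# scheduling in the array's other requested partitions.
def eligible_partitions(array, fixed):
    """array: {'partitions': requested partition list,
               'started_in': partition where earlier tasks of the
               array already ran, or None}.
    Returns the partitions in which pending tasks are considered."""
    if array["started_in"] is not None and not fixed:
        # Pre-fix behavior: remaining tasks pinned to the partition
        # where the first tasks started.
        return [array["started_in"]]
    # Post-fix behavior: every requested partition stays eligible.
    return list(array["partitions"])

arr = {"partitions": ["lud36", "teamregent", "fastio", "idle"],
       "started_in": "teamregent"}
print(eligible_partitions(arr, fixed=False))  # ['teamregent']
print(eligible_partitions(arr, fixed=True))   # all four partitions
```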
Comment 11 Moe Jette 2015-08-03 10:56:13 MDT
Created attachment 2092 [details]
Bug fix

I needed to make a second patch to correct a problem in the previously cited commit. This patch contains the final version of the change (i.e. both commits, merged).