| Summary: | Lower-priority jobs get scheduled before higher-priority ones | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Akmal Madzlan <akmalm> |
| Component: | Scheduling | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DownUnder GeoSolutions | | |
| Version Fixed: | 14.11.9 | | |
| Attachments: | config & log, Bug fix | | |
This output shows that 4347312 has an eligible time of 2015-07-28T08:57:36, earlier than the eligible time of 4347330, which is 2015-07-28T09:00:46; this is correct. Do you know which array element started in an order that you think is incorrect?

David

---

Here:

```
PARTITION  PRIORITY  JOBID      ST
lud36,team 401       4347312_[9 PD
lud36,team 400       4347330_[1 PD
teamregent 401       4347312_98 R
teamregent 401       4347312_98 R
teamregent 401       4347312_98 R
teamregent 401       4347312_98 R
teamregent 401       4347312_99 R
fastio     400       4347330_15 R
fastio     400       4347330_15 R
lud36      400       4347330_15 R
```

I thought 4347330 would only start when 4347312 finished?

---

(In reply to Akmal Madzlan from comment #2)

There is logically a separate queue for each partition. Jobs that can run in multiple partitions have an entry in the queue for each of those partitions. None of that explains what you see, though, unless there are priority differences between the various tasks of the job array. David will be able to review the logs in a few hours.

---

The log file indeed shows the jobs are started together, in a mixed order, by both the backfill and the ordinary scheduler. As the scheduler goes through the list of jobs, the order of starts follows the array id. Did you observe this with jobs in one partition only?

David

---

For this job, I only see it happen in partitions fastio and lud36. Other partitions seem fine.

---

I can reproduce the behaviour by configuring multiple partitions with non-overlapping hosts. In other words, if partition A has hosts a[1-2] and partition B has hosts b[1-2], two jobs from two users submitted with -p A,B can both run, each in a different partition.
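The reproducer described above can be sketched as a minimal configuration. The node names a[1-2] and b[1-2] come from the comment; the CPU counts, partition options, and the job script name are illustrative assumptions, not taken from the actual site config:

```
# slurm.conf fragment (illustrative): two partitions with disjoint hosts
NodeName=a[1-2] CPUs=1 State=UNKNOWN
NodeName=b[1-2] CPUs=1 State=UNKNOWN
PartitionName=A Nodes=a[1-2] State=UP
PartitionName=B Nodes=b[1-2] State=UP
```

Each partition keeps its own queue, so when two users each run something like `sbatch -p A,B job.sh`, the lower-priority job can start in B while the higher-priority job runs in A, because the two partitions' resources do not overlap.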
This seems OK to me, since the partition resources do not overlap. This appears to be your case as well, as most of the hosts in your partitions do not overlap either, except some in idle and fastio. If I change the partition configuration so that all partitions use the same hosts, I get strict first-come, first-served behaviour.

David

---

So is it a bug, or is it designed to work like that?

---

All other things being equal, jobs should start in priority order in each queue. David is looking for a bug.

---

Assigning to the Scheduler team; Support has provided a clear reproducer, sent in a separate email.

David

---

If a job array is submitted to multiple partitions and some, but not all, tasks of the job array are started in one partition, then the additional tasks in the job array would not be considered for scheduling in another partition. This is fixed with this commit:

https://github.com/SchedMD/slurm/commit/0a51f0ecad6219824090bb22299f26b413b6dcb7

---

Created attachment 2092 [details]
Bug fix
I needed to make a second patch to correct a problem in the previously cited commit. This patch contains the final version of the change (i.e. both commits, merged).
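The defect described in the fix comment can be modelled with a short sketch. This is a hypothetical simplification for illustration only: the names `ArrayJob` and `schedule` are invented here and do not correspond to Slurm's actual data structures or code.

```python
from dataclasses import dataclass, field

@dataclass
class ArrayJob:
    """Toy stand-in for a pending job array (illustrative, not Slurm code)."""
    job_id: int
    priority: int
    pending_tasks: list                           # array task ids still waiting
    started_in: set = field(default_factory=set)  # partitions with running tasks

def schedule(queues, free_slots, buggy):
    """Walk each per-partition queue in priority order, starting one array
    task per free slot.  With buggy=True, an array that already started
    tasks in one partition is never considered in the other partitions,
    mirroring the reported behaviour before the fix."""
    started = []
    for part, jobs in queues.items():
        for job in sorted(jobs, key=lambda j: -j.priority):
            if buggy and job.started_in and part not in job.started_in:
                continue  # bug: remaining array tasks skipped in this partition
            while free_slots[part] > 0 and job.pending_tasks:
                task = job.pending_tasks.pop(0)
                job.started_in.add(part)
                free_slots[part] -= 1
                started.append((part, f"{job.job_id}_{task}"))
    return started

# Two arrays, both eligible in both partitions, 4347312 at higher priority.
hi = ArrayJob(4347312, 401, list(range(4)))
lo = ArrayJob(4347330, 400, list(range(4)))
out = schedule({"teamregent": [hi, lo], "fastio": [hi, lo]},
               {"teamregent": 2, "fastio": 2}, buggy=True)
# With the bug, 4347330 tasks start in fastio even though 4347312 still
# has pending tasks; with buggy=False, 4347312 fills both partitions first.
```

Under this model the fixed behaviour simply keeps considering the remaining tasks of the higher-priority array in every partition it was submitted to, which restores strict priority order per queue.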
Created attachment 2074 [details]
config & log

Two jobs were submitted to multiple partitions (lud36,teamregent,fastio,idle): 4347312 with priority 401 and 4347330 with priority 400. Occasionally, job 4347330 gets scheduled before 4347312. The Slurm config and logs are attached.

```
[akmalm@lud34 ~]$ squeue -u username -o "%10P %10Q %10i %t"
PARTITION  PRIORITY  JOBID      ST
lud36,team 401       4347312_[9 PD
lud36,team 400       4347330_[1 PD
teamregent 401       4347312_98 R
teamregent 401       4347312_98 R
teamregent 401       4347312_98 R
teamregent 401       4347312_98 R
teamregent 401       4347312_99 R
fastio     400       4347330_15 R
fastio     400       4347330_15 R
lud36      400       4347330_15 R
```

```
JobId=4347312 ArrayJobId=4347312 ArrayTaskId=997-1212 JobName=2012_neartracecube
   UserId=michaelp(1309) GroupId=teamregent(2113)
   Priority=401 Nice=0 Account=(null) QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2015-07-28T08:57:26 EligibleTime=2015-07-28T08:57:36
   StartTime=2016-07-27T10:33:02 EndTime=2015-07-28T11:33:02
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=lud36,teamregent,fastio,idle AllocNode:Sid=lud36:22465
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/200jobs/rj.2012_neartracecube.R3zxQg
   WorkDir=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/200jobs
   Comment=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/200jobs/workflow.job
   StdErr=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/000scratch/logs/2012_neartracecube.o4347312.4294967294
   StdIn=/dev/null
   StdOut=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2012_neartracecube/000scratch/logs/2012_neartracecube.o4347312.4294967294

JobId=4347330 ArrayJobId=4347330 ArrayTaskId=158-1080 JobName=2014_neartracecube
   UserId=michaelp(1309) GroupId=teamregent(2113)
   Priority=400 Nice=0 Account=(null) QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2015-07-28T09:00:44 EligibleTime=2015-07-28T09:00:46
   StartTime=Unknown EndTime=2015-07-28T11:33:42
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=lud36,teamregent,fastio,idle AllocNode:Sid=lud36:22465
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/200jobs/rj.2014_neartracecube.nbperf
   WorkDir=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/200jobs
   Comment=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/200jobs/workflow.job
   StdErr=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/000scratch/logs/2014_neartracecube.o4347330.4294967294
   StdIn=/dev/null
   StdOut=/l3/maersk/tpONotPr_011/seiTimeProc/prod/2014_neartracecube/000scratch/logs/2014_neartracecube.o4347330.4294967294
```