Ticket 3708 - Job priority & scheduling
Summary: Job priority & scheduling
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 16.05.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
Reported: 2017-04-18 00:39 MDT by James Powell
Modified: 2017-05-18 04:16 MDT

Site: CSIRO
Version Fixed: 17.02.4


Attachments
version, slurm.conf & sdiag output (30.24 KB, text/plain)
2017-04-18 00:39 MDT, James Powell
config info as requested (11.93 KB, text/plain)
2017-04-19 17:27 MDT, James Powell
slurmctld log from time of job #3745846 submit (32.75 MB, application/x-xz)
2017-04-19 17:31 MDT, James Powell

Description James Powell 2017-04-18 00:39:44 MDT
Created attachment 4366
version, slurm.conf & sdiag output

Hi SchedMD,

We're experiencing job scheduling issues, particularly with very small jobs getting ahead of larger jobs when we would not expect them to. As a starting point, is the following expected behaviour: conflicting priority numbers, 10446 vs 1?

e.g.

pow114@cm01:~> sprio -u ho033
          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
        3745846    ho033      10446          9        395         41          1       9000
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
   UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
   Priority=1 Nice=-1000 Account=root QOS=gpu
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=g[087,089,094,098]
   NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=32768,node=4
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
   Features=(null) Gres=gpu:3 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
   WorkDir=/flush2/ho033/namd/hil32k/no-moieties
   StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   StdIn=/dev/null
   StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   Power=

The system seems to be behaving as though the priority reported by "scontrol show job" is the value being used, not the much higher value reported by "sprio".
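
For reference, "sprio -w" prints the configured priority weights that feed these columns, e.g. (the values below are placeholders, not our actual config):

pow114@cm01:~> sprio -w
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
        Weights                  1000      10000       6000       1000      10000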

Will attach slurm.conf & sdiag output

Cheers

James
Comment 1 James Powell 2017-04-18 18:39:20 MDT
Trying to get some long-queued jobs running, I've experimented with manually updating the priority, which has only a temporary effect on the priority reported by scontrol, e.g. (all commands within a few seconds):

pow114@cm01:~> sprio -u ho033
          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
        3745846    ho033      10510         62        406         41          1       9000
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
   UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
   Priority=1 Nice=-1000 Account=root QOS=gpu
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
   StartTime=2017-04-26T10:10:00 EndTime=2017-04-27T10:10:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=g[044,046,048,085]
   NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=32768,node=4
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
   Features=(null) Gres=gpu:3 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
   WorkDir=/flush2/ho033/namd/hil32k/no-moieties
   StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   StdIn=/dev/null
   StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   Power=

pow114@cm01:~> scontrol update job 3745846 priority=20000
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
   UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
   Priority=20000 Nice=0 Account=root QOS=gpu
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=g[044,046,048,085]
   NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=32768,node=4
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
   Features=(null) Gres=gpu:3 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
   WorkDir=/flush2/ho033/namd/hil32k/no-moieties
   StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   StdIn=/dev/null
   StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   Power=
pow114@cm01:~> sprio -u ho033
          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
   UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
   Priority=1 Nice=0 Account=root QOS=gpu
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
   StartTime=2017-04-26T10:10:00 EndTime=2017-04-27T10:10:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=g[044,046,048,085]
   NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=32768,node=4
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
   Features=(null) Gres=gpu:3 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
   WorkDir=/flush2/ho033/namd/hil32k/no-moieties
   StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   StdIn=/dev/null
   StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   Power=
pow114@cm01:~> squeue -u ho033 -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %Q"
             JOBID PARTITION     NAME     USER ST       TIME  NODES PRIORITY
           3745846 h24gpu,gp namd-4no    ho033 PD       0:00      4 1001
pow114@cm01:~> squeue -u ho033 -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %Q"
             JOBID PARTITION     NAME     USER ST       TIME  NODES PRIORITY
           3745846 h24gpu,gp namd-4no    ho033 PD       0:00      4 1

The priority reported by squeue appeared to drain back to 1.

From the scontrol man page I see "Explicitly setting a job's priority clears any previously set nice value and removes the priority/multifactor plugin's ability to manage a job's priority". This does not appear to be the case?

As a "work around" we're using MaxJobs to limit savvy users that are breaking their jobs into many many tiny pieces.
Comment 2 Dominik Bartkiewicz 2017-04-19 04:05:34 MDT
Hi

Could you send me your Lua job submit plugin and the slurmctld log covering job 3745846?
Output from "sacctmgr show qos" would also be useful.


Dominik
Comment 3 James Powell 2017-04-19 17:25:51 MDT
Hi Dominik,

Uploading attachments as requested.

The slurmctld log covers the submit time of job #3745846. Please note I turned on "DebugFlags=Backfill,Priority" a few hours after the job was submitted. I also tried manually setting the priority & nice values, e.g.

cm01:~ # history | grep update | grep 3745846
 1007  2017-04-19 15:24:26 scontrol update  job 3745846 priority=20000
 1019  2017-04-19 15:48:54 scontrol update  job 3745846 priority=20000
 1021  2017-04-19 15:52:35 scontrol update  job 3745846 nice=-20000
 1025  2017-04-19 16:04:58 scontrol update  job 3745846 priority=20000
 1026  2017-04-19 16:05:04 scontrol update  job 3745846 nice=-20000

Current state of job:

cm01:~ # scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
   UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
   Priority=1 Nice=-1000 Account=root QOS=gpu
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
   StartTime=2017-04-24T13:30:11 EndTime=2017-04-25T13:30:11 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=g[033-034,051-052]
   NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=32768,node=4
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
   Features=(null) Gres=gpu:3 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
   WorkDir=/flush2/ho033/namd/hil32k/no-moieties
   StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   StdIn=/dev/null
   StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
   Power=


Cheers

James
Comment 4 James Powell 2017-04-19 17:27:34 MDT
Created attachment 4378
config info as requested
Comment 5 James Powell 2017-04-19 17:31:40 MDT
Created attachment 4379
slurmctld log from time of job #3745846 submit
Comment 6 James Powell 2017-04-20 00:15:00 MDT
I should mention that we have a script run from cron that changes the "nice" factor of jobs, so that each user gets a raised priority for a few of their jobs (a simplified sketch of its shape is at the end of this comment). The idea is that people are happier if something is running, and particularly unhappy when nothing is running.

This has been disabled to remove it from consideration; we're still experiencing the odd behaviour.

e.g.

pow114@cm01:~> squeue -u mcg130 -o "%.9i %.13P %.8j %.8u %.2t %.6D %Q"
    JOBID     PARTITION     NAME     USER ST  NODES PRIORITY
  3824062   h2,h24,defq   run6_G   mcg130 PD      6 1
pow114@cm01:~> scontrol update job 3824062 priority=17000
pow114@cm01:~> squeue -u mcg130 -o "%.9i %.13P %.8j %.8u %.2t %.6D %Q"
    JOBID     PARTITION     NAME     USER ST  NODES PRIORITY
  3824062   h2,h24,defq   run6_G   mcg130 PD      6 17000

...few minutes later...

pow114@cm01:~> squeue -u mcg130 -o "%.9i %.13P %.8j %.8u %.2t %.6D %Q"
    JOBID     PARTITION     NAME     USER ST  NODES PRIORITY
  3824062   h2,h24,defq   run6_G   mcg130 PD      6 1
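
For reference, the now-disabled cron logic was roughly of this shape (a simplified sketch, not our actual script; the job count and nice value are illustrative):

#!/bin/bash
# Give each user with pending jobs a priority boost on their first
# few queued jobs, so everyone tends to have something running soon.
for user in $(squeue -t PD -h -o "%u" | sort -u); do
    for job in $(squeue -u "$user" -t PD -h -o "%i" | head -3); do
        scontrol update job "$job" nice=-1000   # negative nice requires root
    done
done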

Cheers

James
Comment 7 Dominik Bartkiewicz 2017-04-21 04:39:02 MDT
Thanks for the log.

In the log there is something that modifies all jobs every 5 minutes; I suspect this was done by cron, and if I understand correctly, your problem with the floating priority won't disappear even after this script is disabled.

The correct way to get the scheduler to work as you expect is to increase the fairshare factor.
In your slurm.conf, PriorityWeightJobSize is 600 times greater than the rest of the factors.
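
For example, weights of roughly this shape would let fairshare dominate job size (illustrative values only, not a specific recommendation for your site):

PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=10000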

I have a suspicion about what causes the problem and will try to reproduce it.

I also noticed multiple reconfigurations; do you have any automatic script that could be doing this?

Dominik
Comment 8 Dominik Bartkiewicz 2017-04-21 05:59:25 MDT
One thing I forgot: could you set DebugFlags=Priority and try to catch this happening?
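
This can also be enabled at runtime without a restart, e.g.:

cm01:~ # scontrol setdebugflags +priority

(or add Priority to DebugFlags in slurm.conf followed by "scontrol reconfigure").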

Dominik
Comment 9 James Powell 2017-04-23 22:33:30 MDT
Hi Dominik,

Yes, we had a cron job running every 5 minutes, now disabled (it only altered nice values). We haven't been able to get the desired behaviour using the fairshare factor, but will look into it again shortly.


There were multiple reconfigurations, all initiated either by me (after adding logging flags to slurm.conf, for example) or by Bright Cluster Manager; nothing scripted to my knowledge.


We're running with these debug flags;

cm01:~ # grep ^Debug /etc/slurm/slurm.conf 
DebugFlags=Backfill,Priority


The problem jobs have now moved through the queue and executed (I was able to get a higher priority to "stick" after a few days). We continue to monitor for priority "1" jobs.
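
One quick way to watch for these is a filter along these lines (just a sketch; any equivalent check works):

pow114@cm01:~> squeue -t PD -h -o "%.18i %.8u %Q" | awk '$3 == 1'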

Cheers

James
Comment 19 Dominik Bartkiewicz 2017-05-02 07:38:06 MDT
Hi

Do you have any new logs?

Dominik
Comment 20 James Powell 2017-05-02 18:49:04 MDT
Hi Dominik,

The jobs we had priority issues with completed a week or so ago & we've not seen a recurrence of the issue since. I've left debugging flags in place & am monitoring for jobs reporting priority "1" from "squeue".

cheers

James
Comment 23 Dominik Bartkiewicz 2017-05-18 04:16:32 MDT
Hi

We added two patches that fix this issue:
https://github.com/SchedMD/slurm/commit/bf7e0e7b1ca89
https://github.com/SchedMD/slurm/commit/a116884059250

I'm going to close this because I believe this problem has been solved.

Dominik