| Summary: | Job priority & scheduling | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | James Powell <James.Powell> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 16.05.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CSIRO | Version Fixed: | 17.02.4 |
| Attachments: | version, slurm.conf & sdiag output; config info as requested; slurmctld log from time of job #3745846 submit | | |
While trying to get some long-queued jobs running, I've experimented with manually updating their priority, but this has only a temporary effect on the priority reported by scontrol, e.g. (all commands run within a few seconds of each other):
pow114@cm01:~> sprio -u ho033
JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS
3745846 ho033 10510 62 406 41 1 9000
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=1 Nice=-1000 Account=root QOS=gpu
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=2017-04-26T10:10:00 EndTime=2017-04-27T10:10:00 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[044,046,048,085]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=
pow114@cm01:~> scontrol update job 3745846 priority=20000
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=20000 Nice=0 Account=root QOS=gpu
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[044,046,048,085]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=
pow114@cm01:~> sprio -u ho033
JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=1 Nice=0 Account=root QOS=gpu
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=2017-04-26T10:10:00 EndTime=2017-04-27T10:10:00 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[044,046,048,085]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=
pow114@cm01:~> squeue -u ho033 -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %Q"
JOBID PARTITION NAME USER ST TIME NODES PRIORITY
3745846 h24gpu,gp namd-4no ho033 PD 0:00 4 1001
pow114@cm01:~> squeue -u ho033 -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %Q"
JOBID PARTITION NAME USER ST TIME NODES PRIORITY
3745846 h24gpu,gp namd-4no ho033 PD 0:00 4 1
The priority reported by squeue also appears to drain back to 1.
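For what it's worth, sprio's per-factor columns do appear to sum to the total it reports once the job's nice value is taken into account (assuming the standard multifactor formula, in which the nice value is subtracted from the weighted sum):

AGE + FAIRSHARE + JOBSIZE + PARTITION + QOS - Nice
= 62 + 406 + 41 + 1 + 9000 - (-1000)
= 10510

So sprio looks internally consistent; the disagreement is with the Priority=1 reported by scontrol and squeue.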
From the scontrol man page I see: "Explicitly setting a job's priority clears any previously set nice value and removes the priority/multifactor plugin's ability to manage a job's priority." That does not appear to be the case here?
As a "work around" we're using MaxJobs to limit savvy users that are breaking their jobs into many many tiny pieces.
Hi

Could you send me your lua submit plugin and the slurmctld log containing job 3745846? Output from "sacctmgr show qos" would also be useful.

Dominik

Hi Dominik,

Uploading attachments as requested, including the slurmctld log from the submit time of job #3745846. Please note I turned on "DebugFlags=Backfill,Priority" a few hours after the job was submitted.

I also tried manually setting the priority & nice values, e.g.

cm01:~ # history | grep update | grep 3745846
1007 2017-04-19 15:24:26 scontrol update job 3745846 priority=20000
1019 2017-04-19 15:48:54 scontrol update job 3745846 priority=20000
1021 2017-04-19 15:52:35 scontrol update job 3745846 nice=-20000
1025 2017-04-19 16:04:58 scontrol update job 3745846 priority=20000
1026 2017-04-19 16:05:04 scontrol update job 3745846 nice=-20000

Current state of the job:

cm01:~ # scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=1 Nice=-1000 Account=root QOS=gpu
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=2017-04-24T13:30:11 EndTime=2017-04-25T13:30:11 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[033-034,051-052]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=

Cheers
James

Created attachment 4378 [details]
config info as requested
Created attachment 4379 [details]
slurmctld log from time of job #3745846 submit
I should mention that we have a script, run from cron, that changes the nice factor of jobs so that each user gets raised priority for a few of their jobs; the idea is that people are happier when something is running, and particularly unhappy when nothing is running. A rough sketch of what it does follows.
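Purely for illustration, this is the kind of thing the script does (the selection logic and nice value here are made up, not our actual script):

#!/bin/bash
# Hypothetical sketch: boost (lower the nice of) each user's oldest pending job
# so that every user has at least one job with a raised priority.
squeue -h -t PENDING -o "%u %i" -S i | awk '!seen[$1]++ {print $2}' |
while read -r jobid; do
    scontrol update job "$jobid" nice=-1000
done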
The script has been disabled to remove it from consideration, but we're still experiencing the odd behaviour,
e.g.
pow114@cm01:~> squeue -u mcg130 -o "%.9i %.13P %.8j %.8u %.2t %.6D %Q"
JOBID PARTITION NAME USER ST NODES PRIORITY
3824062 h2,h24,defq run6_G mcg130 PD 6 1
pow114@cm01:~> scontrol update job 3824062 priority=17000
pow114@cm01:~> squeue -u mcg130 -o "%.9i %.13P %.8j %.8u %.2t %.6D %Q"
JOBID PARTITION NAME USER ST NODES PRIORITY
3824062 h2,h24,defq run6_G mcg130 PD 6 17000
...few minutes later...
pow114@cm01:~> squeue -u mcg130 -o "%.9i %.13P %.8j %.8u %.2t %.6D %Q"
JOBID PARTITION NAME USER ST NODES PRIORITY
3824062 h2,h24,defq run6_G mcg130 PD 6 1
Cheers
James
Thanks for the log. In the log there is something modifying all jobs every 5 minutes; I suspect this is done by cron. If I understand correctly, your problem with the floating priority will not disappear once this script is disabled. The correct way to get the scheduler to behave as you expect is to increase the fairshare factor; in your slurm.conf, PriorityWeightJobSize is 600 times greater than the rest of the factors. I have a suspicion about what is causing the problem and will try to reproduce it. I also noticed multiple reconfigurations; do you have any automatic script which could be doing this?

Dominik

I forgot: could you set DebugFlags=Priority and try to catch this?

Dominik

Hi Dominik,

Yes, we had a cron job running every 5 minutes; it is now disabled (it only altered nice values). We haven't been able to get the desired behaviour using the fairshare factor, but will look into it again shortly.

There were multiple reconfigurations, all initiated either by me (after adding logging flags to slurm.conf, for example) or by Bright Cluster Manager; nothing scripted to my knowledge.

We're running with these debug flags:

cm01:~ # grep ^Debug /etc/slurm/slurm.conf
DebugFlags=Backfill,Priority

The problem jobs have now moved through the queue and executed (we were able to get a higher priority to "stick" after a few days). We continue to monitor for priority "1" jobs.

Cheers
James

Hi

Do you have any new logs?

Dominik

Hi Dominik,

The jobs we had priority issues with completed a week or so ago, and we've not seen a recurrence of the issue since. I've left the debug flags in place and am monitoring for jobs reporting priority "1" from squeue.

Cheers
James

Hi

We have added two patches that fix this issue:

https://github.com/SchedMD/slurm/commit/bf7e0e7b1ca89
https://github.com/SchedMD/slurm/commit/a116884059250

I'm going to close this because I believe the problem has been solved.

Dominik
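For reference, the fairshare adjustment suggested earlier in this thread is controlled by the multifactor priority weights in slurm.conf. A minimal sketch of a rebalanced configuration, with purely illustrative values (not this site's settings):

# slurm.conf excerpt - illustrative values only
PriorityType=priority/multifactor
PriorityWeightFairshare=100000   # let fairshare dominate the computed priority
PriorityWeightAge=1000
PriorityWeightJobSize=1000       # keep job size from outweighing the other factors
PriorityWeightPartition=1000
PriorityWeightQOS=10000

After editing these weights, "scontrol reconfigure" should apply them without restarting slurmctld.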
Created attachment 4366 [details]
version, slurm.conf & sdiag output

Hi SchedMD,

We're experiencing job scheduling issues, in particular very small jobs getting ahead of larger jobs when we would not expect them to. As a starting point, is the following expected behaviour? The two commands report conflicting priority numbers, 10446 vs 1, e.g.

pow114@cm01:~> sprio -u ho033
JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS
3745846 ho033 10446 9 395 41 1 9000
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=1 Nice=-1000 Account=root QOS=gpu
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[087,089,094,098]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=

The system seems to be behaving as though the priority in use is the value reported by "scontrol show job" (1), not the much higher value reported by "sprio" (10446).

Will attach slurm.conf & sdiag output.

Cheers
James