| Summary: | Job priority & scheduling | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | James Powell <James.Powell> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 16.05.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CSIRO | Version Fixed: | 17.02.4 |
| Attachments: | version, slurm.conf & sdiag output; config info as requested; slurmctld log from time of job #3745846 submit | | |
While trying to get some long-queued jobs running, I've experimented with manually updating their priority, but this has only a temporary effect on the priority reported by scontrol, e.g. (all commands run within a few seconds of each other):
pow114@cm01:~> sprio -u ho033
JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS
3745846 ho033 10510 62 406 41 1 9000
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=1 Nice=-1000 Account=root QOS=gpu
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=2017-04-26T10:10:00 EndTime=2017-04-27T10:10:00 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[044,046,048,085]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=
pow114@cm01:~> scontrol update job 3745846 priority=20000
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=20000 Nice=0 Account=root QOS=gpu
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[044,046,048,085]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=
pow114@cm01:~> sprio -u ho033
JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=1 Nice=0 Account=root QOS=gpu
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=2017-04-26T10:10:00 EndTime=2017-04-27T10:10:00 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[044,046,048,085]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=
pow114@cm01:~> squeue -u ho033 -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %Q"
JOBID PARTITION NAME USER ST TIME NODES PRIORITY
3745846 h24gpu,gp namd-4no ho033 PD 0:00 4 1001
pow114@cm01:~> squeue -u ho033 -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %Q"
JOBID PARTITION NAME USER ST TIME NODES PRIORITY
3745846 h24gpu,gp namd-4no ho033 PD 0:00 4 1
The priority reported by squeue also appears to drain back to 1.
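For what it's worth, sprio's per-factor columns do appear to sum to the total it reports once the job's nice value is taken into account (assuming the standard multifactor formula, in which the nice value is subtracted from the weighted sum):

AGE + FAIRSHARE + JOBSIZE + PARTITION + QOS - Nice
= 62 + 406 + 41 + 1 + 9000 - (-1000)
= 10510

So sprio looks internally consistent; the disagreement is with the Priority=1 reported by scontrol and squeue.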
From the scontrol man page I see: "Explicitly setting a job's priority clears any previously set nice value and removes the priority/multifactor plugin's ability to manage a job's priority." That does not appear to be the case here?
As a "work around" we're using MaxJobs to limit savvy users that are breaking their jobs into many many tiny pieces.
Hi

Could you send me your lua submit plugin and the slurmctld log containing job 3745846? Output from "sacctmgr show qos" would also be useful.

Dominik

Hi Dominik,

Uploading attachments as requested, including the slurmctld log from the submit time of job #3745846. Please note I turned on "DebugFlags=Backfill,Priority" a few hours after the job was submitted.

I also tried manually setting the priority & nice values, e.g.

cm01:~ # history | grep update | grep 3745846
1007 2017-04-19 15:24:26 scontrol update job 3745846 priority=20000
1019 2017-04-19 15:48:54 scontrol update job 3745846 priority=20000
1021 2017-04-19 15:52:35 scontrol update job 3745846 nice=-20000
1025 2017-04-19 16:04:58 scontrol update job 3745846 priority=20000
1026 2017-04-19 16:05:04 scontrol update job 3745846 nice=-20000

Current state of the job:

cm01:~ # scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=1 Nice=-1000 Account=root QOS=gpu
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=2017-04-24T13:30:11 EndTime=2017-04-25T13:30:11 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[033-034,051-052]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=

Cheers
James

Created attachment 4378 [details]
config info as requested
Created attachment 4379 [details]
slurmctld log from time of job #3745846 submit
I should mention that we have a script, run from cron, that changes the nice factor of jobs so that each user gets raised priority for a few of their jobs; the idea is that people are happier when something is running, and particularly unhappy when nothing is running. A rough sketch of what it does follows.
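Purely for illustration, this is the kind of thing the script does (the selection logic and nice value here are made up, not our actual script):

#!/bin/bash
# Hypothetical sketch: boost (lower the nice of) each user's oldest pending job
# so that every user has at least one job with a raised priority.
squeue -h -t PENDING -o "%u %i" -S i | awk '!seen[$1]++ {print $2}' |
while read -r jobid; do
    scontrol update job "$jobid" nice=-1000
done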
The script has been disabled to remove it from consideration, but we're still experiencing the odd behaviour,
e.g.
pow114@cm01:~> squeue -u mcg130 -o "%.9i %.13P %.8j %.8u %.2t %.6D %Q"
JOBID PARTITION NAME USER ST NODES PRIORITY
3824062 h2,h24,defq run6_G mcg130 PD 6 1
pow114@cm01:~> scontrol update job 3824062 priority=17000
pow114@cm01:~> squeue -u mcg130 -o "%.9i %.13P %.8j %.8u %.2t %.6D %Q"
JOBID PARTITION NAME USER ST NODES PRIORITY
3824062 h2,h24,defq run6_G mcg130 PD 6 17000
...few minutes later...
pow114@cm01:~> squeue -u mcg130 -o "%.9i %.13P %.8j %.8u %.2t %.6D %Q"
JOBID PARTITION NAME USER ST NODES PRIORITY
3824062 h2,h24,defq run6_G mcg130 PD 6 1
Cheers
James
Thanks for the log. In the log there is something modifying all jobs every 5 minutes; I suspect this is done by cron. If I understand correctly, your problem with the floating priority will not disappear once this script is disabled. The correct way to get the scheduler to behave as you expect is to increase the fairshare factor; in your slurm.conf, PriorityWeightJobSize is 600 times greater than the rest of the factors. I have a suspicion about what is causing the problem and will try to reproduce it. I also noticed multiple reconfigurations; do you have any automatic script which could be doing this?

Dominik

I forgot: could you set DebugFlags=Priority and try to catch this?

Dominik

Hi Dominik,

Yes, we had a cron job running every 5 minutes; it is now disabled (it only altered nice values). We haven't been able to get the desired behaviour using the fairshare factor, but will look into it again shortly.

There were multiple reconfigurations, all initiated either by me (after adding logging flags to slurm.conf, for example) or by Bright Cluster Manager; nothing scripted to my knowledge.

We're running with these debug flags:

cm01:~ # grep ^Debug /etc/slurm/slurm.conf
DebugFlags=Backfill,Priority

The problem jobs have now moved through the queue and executed (we were able to get a higher priority to "stick" after a few days). We continue to monitor for priority "1" jobs.

Cheers
James

Hi

Do you have any new logs?

Dominik

Hi Dominik,

The jobs we had priority issues with completed a week or so ago, and we've not seen a recurrence of the issue since. I've left the debug flags in place and am monitoring for jobs reporting priority "1" from squeue.

Cheers
James

Hi

We have added two patches that fix this issue:

https://github.com/SchedMD/slurm/commit/bf7e0e7b1ca89
https://github.com/SchedMD/slurm/commit/a116884059250

I'm going to close this because I believe the problem has been solved.

Dominik
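For reference, the fairshare adjustment suggested earlier in this thread is controlled by the multifactor priority weights in slurm.conf. A minimal sketch of a rebalanced configuration, with purely illustrative values (not this site's settings):

# slurm.conf excerpt - illustrative values only
PriorityType=priority/multifactor
PriorityWeightFairshare=100000   # let fairshare dominate the computed priority
PriorityWeightAge=1000
PriorityWeightJobSize=1000       # keep job size from outweighing the other factors
PriorityWeightPartition=1000
PriorityWeightQOS=10000

After editing these weights, "scontrol reconfigure" should apply them without restarting slurmctld.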
Created attachment 4366 [details]
version, slurm.conf & sdiag output

Hi SchedMD,

We're experiencing job scheduling issues, in particular very small jobs getting ahead of larger jobs when we would not expect them to. As a starting point, is the following expected behaviour? The two commands report conflicting priority numbers, 10446 vs 1, e.g.

pow114@cm01:~> sprio -u ho033
JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS
3745846 ho033 10446 9 395 41 1 9000
pow114@cm01:~> scontrol show job 3745846
JobId=3745846 JobName=namd-4nodes
UserId=ho033(303416) GroupId=hpc-users(319125) MCS_label=N/A
Priority=1 Nice=-1000 Account=root QOS=gpu
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-04-18T13:25:03 EligibleTime=2017-04-18T13:25:03
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=h24gpu,gpu AllocNode:Sid=bragg-gpu:31615
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=g[087,089,094,098]
NumNodes=4-4 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=32768,node=4
Socks/Node=* NtasksPerN:B:S:C=16:0:*:1 CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=512M MinTmpDiskNode=0
Features=(null) Gres=gpu:3 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/flush2/ho033/namd/hil32k/no-moieties/job.slurm.4nodes
WorkDir=/flush2/ho033/namd/hil32k/no-moieties
StdErr=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
StdIn=/dev/null
StdOut=/flush2/ho033/namd/hil32k/no-moieties/slurm-3745846.out
Power=

The system seems to be behaving as though the priority in use is the value reported by "scontrol show job" (1), not the much higher value reported by "sprio" (10446).

Will attach slurm.conf & sdiag output.

Cheers
James