Ticket 2639 - priority jobs reserving busy nodes
Summary: priority jobs reserving busy nodes
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 15.08.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
 
Reported: 2016-04-14 09:39 MDT by Michael Gutteridge
Modified: 2016-04-14 09:54 MDT

See Also:
Site: FHCRC - Fred Hutchinson Cancer Research Center


Attachments
slurm configuration (53.09 KB, text/plain)
2016-04-14 09:39 MDT, Michael Gutteridge

Description Michael Gutteridge 2016-04-14 09:39:35 MDT
Created attachment 3001
slurm configuration

I am running into a situation where priority jobs (i.e. jobs at the top of the queue) are reserving busy nodes even though resources appear to be available on nodes running preemptable jobs.

Right now my queue looks like:

$ squeue -t pd | head
          JOBID      USER  ACCOUNT PARTITION QOS      NAME               ST       TIME  NODES CPUS MIN_ NODELIST(R PRIORITY
       38429342   lsycuro fredrick    campus normal   PRODEGE-0-14606646 PD       0:00      1 4    4    (Resources 110001
       38429343   lsycuro fredrick    campus normal   PRODEGE-0-14606646 PD       0:00      1 4    4    (Priority) 110001

That first job looks like this:

$ scontrol show job 38429342
JobId=38429342 JobName=PRODEGE-0-1460664608
   UserId=lsycuro(35247) GroupId=g_lsycuro(35247)
   Priority=110000 Nice=0 Account=fredricks_d QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2016-04-14T13:10:08 EligibleTime=2016-04-14T13:10:08
   StartTime=2016-04-14T19:52:02 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=campus AllocNode:Sid=sphinx:20706
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=gizmof17
   NumNodes=1 NumCPUs=4 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)

It has apparently reserved node gizmof17, which is currently in use by a guaranteed job:

$ squeue -w gizmof17
          JOBID      USER  ACCOUNT PARTITION QOS      NAME               ST       TIME  NODES CPUS MIN_ NODELIST(R PRIORITY
       38355078   yzhuang  huang_y    campus normal   myCover_rv10rp1_it  R 3-19:39:28    201 201  1    gizmof[2-1 10010

meanwhile, nearly identical resources are running preemptable jobs:

$ squeue -w gizmof233
          JOBID      USER  ACCOUNT PARTITION QOS      NAME               ST       TIME  NODES CPUS MIN_ NODELIST(R PRIORITY
       38428440  pbradley bradley_   restart restart  job13m1_996         R    2:50:20      1 1    1    gizmof233  11111
       38428364  pbradley bradley_   restart restart  job13m1_996         R    3:29:27      1 1    1    gizmof233  1
       38428345  pbradley bradley_   restart restart  job13m1_996         R    3:37:27      1 1    1    gizmof233  1
       38426571  pbradley bradley_   restart restart  job13m1_947         R    6:30:32      1 1    1    gizmof233  1

AFAICT, the jobs on gizmof233 should be preemptable, and there are about 10 other identical nodes also running jobs that could be preempted to free sufficient resources for the priority job.

Interestingly, backfill seems to work fine (i.e. jobs can backfill around this priority job and preempt resources on these nodes).

Let me know what other information I can provide.
Comment 1 Danny Auble 2016-04-14 09:48:24 MDT
Hey Michael,

From your configuration it doesn't appear that gizmof233 is part of the campus partition:

PartitionName=campus Default=yes DefaultTime=3-0 MaxTime=30-0 Nodes=gizmof[1-180],gizmof[241-384],gizmog[1-10] PreemptMode=off Priority=10000 QOS=public State=UP

Is that expected, or am I reading it wrong?
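[Editor's note: the gap in that partition's node list is easy to verify by expanding the bracketed hostlist. A minimal sketch below mimics Slurm's hostlist bracket notation in plain Python (it is not a Slurm API; a handler for only a single trailing bracket group per entry is assumed, which is all this Nodes= line uses):]

```python
import re

def expand_hostlist(spec):
    """Expand a Slurm-style hostlist such as 'gizmof[1-180],gizmog[1-10]'
    into individual host names. Simplified: handles at most one bracketed
    range group at the end of each comma-separated entry."""
    hosts = []
    # grab each entry, keeping any trailing [ranges] attached to its prefix
    for part in re.findall(r'[^,\[]+(?:\[[^\]]*\])?', spec):
        m = re.match(r'([^\[]+)\[([^\]]+)\]$', part)
        if not m:
            hosts.append(part)
            continue
        prefix, ranges = m.groups()
        for r in ranges.split(','):
            if '-' in r:
                lo, hi = r.split('-')
                hosts.extend(f'{prefix}{i}' for i in range(int(lo), int(hi) + 1))
            else:
                hosts.append(f'{prefix}{r}')
    return hosts

# Nodes= value from the campus partition definition above
campus = expand_hostlist('gizmof[1-180],gizmof[241-384],gizmog[1-10]')
print('gizmof17' in campus)    # True  - the node the priority job reserved
print('gizmof233' in campus)   # False - falls in the gizmof[181-240] gap
```

gizmof233 lands in the hole between gizmof180 and gizmof241, so the campus scheduler never considers it, which matches the behavior reported.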
Comment 2 Michael Gutteridge 2016-04-14 09:51:58 MDT
Ah, crap. I forgot about the hole in that partition. Sorry, I should have caught that.
Comment 3 Danny Auble 2016-04-14 09:54:26 MDT
No problem, glad it was an easy overlook ;).