Ticket 2639

Summary: priority jobs reserving busy nodes
Product: Slurm
Reporter: Michael Gutteridge <mrg>
Component: Scheduling
Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 15.08.7   
Hardware: Linux   
OS: Linux   
Site: FHCRC - Fred Hutchinson Cancer Research Center
Attachments: slurm configuration

Description Michael Gutteridge 2016-04-14 09:39:35 MDT
Created attachment 3001
slurm configuration

I am running into a situation where priority jobs (i.e. jobs at the top of the list to be run) are reserving busy resources when there appear to be available resources (resources running preemptable jobs).

Right now my queue looks like:

$ squeue -t pd | head
          JOBID      USER  ACCOUNT PARTITION QOS      NAME               ST       TIME  NODES CPUS MIN_ NODELIST(R PRIORITY
       38429342   lsycuro fredrick    campus normal   PRODEGE-0-14606646 PD       0:00      1 4    4    (Resources 110001
       38429343   lsycuro fredrick    campus normal   PRODEGE-0-14606646 PD       0:00      1 4    4    (Priority) 110001

That first job looks like this:

$ scontrol show job 38429342
JobId=38429342 JobName=PRODEGE-0-1460664608
   UserId=lsycuro(35247) GroupId=g_lsycuro(35247)
   Priority=110000 Nice=0 Account=fredricks_d QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2016-04-14T13:10:08 EligibleTime=2016-04-14T13:10:08
   StartTime=2016-04-14T19:52:02 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=campus AllocNode:Sid=sphinx:20706
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=gizmof17
   NumNodes=1 NumCPUs=4 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
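
The fields that matter in this output are JobState, Reason, and SchedNodeList. As a minimal sketch, assuming the whitespace-separated Key=Value layout shown above, they could be pulled out in Python as follows. The helper name `parse_scontrol_job` is hypothetical and not part of Slurm, and the sample text is an excerpt of the output above.

```python
def parse_scontrol_job(text):
    """Split `scontrol show job` output into a dict of Key=Value fields.

    Assumes no value contains whitespace; a value with spaces would be cut
    at the first space by this simple split.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that carry no '='
            fields[key] = value
    return fields


# Excerpt of the output shown above.
sample = """
JobId=38429342 JobName=PRODEGE-0-1460664608
   JobState=PENDING Reason=Resources Dependency=(null)
   Partition=campus AllocNode:Sid=sphinx:20706
   NodeList=(null) SchedNodeList=gizmof17
"""

job = parse_scontrol_job(sample)
print(job["JobState"], job["Reason"], job["SchedNodeList"])
# prints: PENDING Resources gizmof17
```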

It has apparently reserved node gizmof17, which is currently in use by a guaranteed job:

$ squeue -w gizmof17
          JOBID      USER  ACCOUNT PARTITION QOS      NAME               ST       TIME  NODES CPUS MIN_ NODELIST(R PRIORITY
       38355078   yzhuang  huang_y    campus normal   myCover_rv10rp1_it  R 3-19:39:28    201 201  1    gizmof[2-1 10010

Meanwhile, nearly identical resources are running preemptable jobs:

$ squeue -w gizmof233
          JOBID      USER  ACCOUNT PARTITION QOS      NAME               ST       TIME  NODES CPUS MIN_ NODELIST(R PRIORITY
       38428440  pbradley bradley_   restart restart  job13m1_996         R    2:50:20      1 1    1    gizmof233  11111
       38428364  pbradley bradley_   restart restart  job13m1_996         R    3:29:27      1 1    1    gizmof233  1
       38428345  pbradley bradley_   restart restart  job13m1_996         R    3:37:27      1 1    1    gizmof233  1
       38426571  pbradley bradley_   restart restart  job13m1_947         R    6:30:32      1 1    1    gizmof233  1

AFAICT, the jobs on gizmof233 should be preemptable. About 10 other identical nodes are also running jobs that could be preempted to free enough resources for the priority job.

Interestingly, backfill seems to work fine (i.e. jobs can backfill around this priority job and preempt resources on these nodes).

Let me know what other information I can provide.

Comment 1 Danny Auble 2016-04-14 09:48:24 MDT
Hey Michael,

From your configuration, it doesn't appear that gizmof233 is part of the campus partition:

PartitionName=campus Default=yes DefaultTime=3-0 MaxTime=30-0 Nodes=gizmof[1-180],gizmof[241-384],gizmog[1-10] PreemptMode=off Priority=10000 QOS=public State=UP

Is that expected, or am I reading it wrong?
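
The membership question above comes down to expanding the bracketed node ranges on the Nodes= line and checking whether a given node name is among them. On a live cluster, `scontrol show hostnames` can expand such an expression. As a rough offline sketch in Python, assuming only the simple `prefix[a-b,c]` form that appears in this partition definition, the check might look like this. The function name `expand_hostlist` is hypothetical and this is not Slurm's own parser.

```python
import re


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'a[1-3,7],b5' into a set of names.

    Handles only one bracket group per name, which is all the campus
    partition line uses.
    """
    hosts = set()
    # Each match is a name prefix plus an optional [...] group, so commas
    # inside brackets do not split an entry.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", expr):
        m = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", part)
        if not m:
            hosts.add(part)  # plain name without a range
            continue
        prefix, ranges = m.groups()
        for r in ranges.split(","):
            lo, _, hi = r.partition("-")
            hi = hi or lo  # a single number is a range of one
            for n in range(int(lo), int(hi) + 1):
                # zfill keeps any zero padding used in the lower bound
                hosts.add(prefix + str(n).zfill(len(lo)))
    return hosts


# Nodes= value copied from the campus partition line above.
campus = expand_hostlist("gizmof[1-180],gizmof[241-384],gizmog[1-10]")

print("gizmof17" in campus)   # True: the node reserved for the pending job
print("gizmof233" in campus)  # False: falls in the gap between 180 and 241
```

Because gizmof233 is not in the expanded set, jobs submitted to campus cannot be placed there regardless of what is preemptable on that node.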

Comment 2 Michael Gutteridge 2016-04-14 09:51:58 MDT
Ah, crap. I forgot about the hole in that partition. So sorry, I should have caught that.

Comment 3 Danny Auble 2016-04-14 09:54:26 MDT
No problem, glad it was an easy overlook ;).