Ticket 2774

Summary: Job stuck PENDING with ReqNodeNotAvail but all nodes are available
Product: Slurm Reporter: Jeff White <jeff.white>
Component: Scheduling Assignee: Tim Wickberg <tim>
Status: RESOLVED INVALID QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 15.08.7   
Hardware: Linux   
OS: Linux   
Site: Washington State University
Attachments: slurmctld.log.gz
slurmd.log.gz

Description Jeff White 2016-05-26 08:26:38 MDT
Yesterday we noticed a few jobs were not running, with the reason "ReqNodeNotAvail".  We could not see why (nodes in the requested partition had enough free resources to run the jobs), and at some point later the jobs ran without us doing anything to them.  Today we're seeing the same reason on a pending job, but this time the job is assigned to a partition that contains only one node, and that node is definitely 100% available as far as I can see.  Any idea what is going on here?  I'll attach logs; here's what I have for the job that is currently stuck PENDING:

# scontrol show job 74806
JobId=74806 JobName=idv36085
   UserId=jeff.white(8003) GroupId=its_p_sto_qa_hpc_kamiak-its_staff(7000)
   Priority=4294851084 Nice=0 Account=noninvestor QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes: Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2016-05-26T14:11:10 EligibleTime=2016-05-26T14:11:10
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=free_gpu AllocNode:Sid=login-p1n02:48990
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=24 CPUs/Task=24 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=154584,node=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryCPU=6441M MinTmpDiskNode=0
   Features=(null) Gres=gpu:2 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/myjob_jeff.white.36085
   WorkDir=/home/jeff.white
   StdErr=/home/jeff.white/slurm-idv36085.o74806
   StdIn=/dev/null
   StdOut=/home/jeff.white/slurm-idv36085.o74806
   Power= SICP=0

... and the node it /should/ be running on:

# scontrol show node sn3
NodeName=sn3 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.91 Features=(null)
   Gres=gpu:tesla:4
   NodeAddr=sn3 NodeHostName=sn3 Version=15.08
   OS=Linux RealMemory=257854 AllocMem=0 FreeMem=253182 Sockets=2 Boards=1
   State=MAINT ThreadsPerCore=1 TmpDisk=128927 Weight=1 Owner=N/A
   BootTime=2016-05-26T13:57:21 SlurmdStartTime=2016-05-26T14:02:29
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

... which is the only node in this partition:

# scontrol show partition free_gpu
PartitionName=free_gpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=free
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=sn3
   Priority=1000 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=24 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerCPU=6441 MaxMemPerNode=UNLIMITED

... and there is a reservation set on the node (which my sbatch should be requesting to use):

# scontrol show reservation sn3
ReservationName=sn3 StartTime=2016-05-26T09:38:56 EndTime=2017-05-26T09:38:56 Duration=365-00:00:00
   Nodes=sn3 NodeCnt=1 CoreCnt=24 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES
   TRES=cpu=24
   Users=nick.maggio,yunshu.du,gabriel.delacruz,jeff.white Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
Comment 1 Tim Wickberg 2016-05-26 08:36:06 MDT
I wouldn't recommend routinely running jobs in reservations with the MAINT flag set... that flag indicates the hardware may be sporadically unavailable, and changes how the time is accounted for by sreport/sacct as well.
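For comparison, a reservation for routine use could be created without the MAINT flag along these lines (the reservation name here is illustrative, not from the ticket; IGNORE_JOBS and SPEC_NODES match the flags on the existing reservation):

```shell
# Illustrative sketch: a year-long user reservation on sn3 without MAINT.
scontrol create reservation reservationname=sn3_users \
    users=nick.maggio,yunshu.du,gabriel.delacruz,jeff.white \
    nodes=sn3 starttime=now duration=365-00:00:00 \
    flags=ignore_jobs,spec_nodes
```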

The problem here is that the job was not submitted against the reservation. Jobs must be explicitly run under a reservation in Slurm.

If the job had specified "--reservation=sn3", it would have launched when the reservation started. Running 'scontrol update jobid=74806 reservation=sn3' should get the job running now.
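The explicit reservation request can also be baked into the batch script itself. A minimal sketch (file name and program are placeholders, not from the ticket):

```shell
# Hypothetical batch script; the key line is the explicit
# --reservation directive matching the reservation name.
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=free_gpu
#SBATCH --reservation=sn3
#SBATCH --gres=gpu:2
srun ./my_program
EOF
# Submit with: sbatch myjob.sh
```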
Comment 2 Jeff White 2016-05-26 08:36:18 MDT
Created attachment 3154 [details]
slurmctld.log.gz
Comment 3 Jeff White 2016-05-26 08:37:39 MDT
Created attachment 3155 [details]
slurmd.log.gz
Comment 4 Jeff White 2016-05-26 08:47:45 MDT
Well, that was fast...  I found the cause too.  It was a problem with the program being used to generate the sbatch file: it didn't handle the reservation correctly, so it never asked Slurm for it, and it didn't produce an error or log message saying it had ignored the reservation we specified.
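A generator bug like this one is easy to guard against. A minimal sketch in Python (a hypothetical check, not the actual generator from the ticket) that reports any requested option missing from the generated script:

```python
# Hypothetical check: verify that every option the user requested
# actually appears as an #SBATCH directive in the generated script.
def check_directives(script_text, requested):
    """Return the list of requested options missing from the script."""
    directives = set()
    for line in script_text.splitlines():
        line = line.strip()
        if line.startswith("#SBATCH"):
            directives.add(line[len("#SBATCH"):].strip())
    return [opt for opt in requested if opt not in directives]

script = """#!/bin/bash
#SBATCH --partition=free_gpu
#SBATCH --gres=gpu:2
srun ./my_program
"""

# The user asked for a reservation, but the generator dropped it:
missing = check_directives(script, ["--reservation=sn3", "--gres=gpu:2"])
print(missing)  # ['--reservation=sn3']
```

Failing loudly here (or at least logging a warning) would have surfaced the dropped reservation immediately instead of leaving the job stuck PENDING.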