Yesterday we noticed a few jobs were not running, with the reason "ReqNodeNotAvail". We could not see a cause for this (nodes in the requested partition had enough free resources to run the jobs), and at some point later the jobs ran without us doing anything to them. Today we're seeing the same reason on a pending job, but this time it's a job assigned to a partition that contains only one node, and that node is definitely 100% available from what I can see. Any idea what is going on here? I'll attach logs, and here's what I have for the job that is currently stuck PENDING:

# scontrol show job 74806
JobId=74806 JobName=idv36085
   UserId=jeff.white(8003) GroupId=its_p_sto_qa_hpc_kamiak-its_staff(7000)
   Priority=4294851084 Nice=0 Account=noninvestor QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes: Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2016-05-26T14:11:10 EligibleTime=2016-05-26T14:11:10
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=free_gpu AllocNode:Sid=login-p1n02:48990
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1-1 NumCPUs=24 CPUs/Task=24 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=154584,node=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryCPU=6441M MinTmpDiskNode=0
   Features=(null) Gres=gpu:2 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/myjob_jeff.white.36085
   WorkDir=/home/jeff.white
   StdErr=/home/jeff.white/slurm-idv36085.o74806
   StdIn=/dev/null
   StdOut=/home/jeff.white/slurm-idv36085.o74806
   Power= SICP=0
...
and the node it /should/ be running on:

# scontrol show node sn3
NodeName=sn3 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.91 Features=(null)
   Gres=gpu:tesla:4
   NodeAddr=sn3 NodeHostName=sn3 Version=15.08
   OS=Linux RealMemory=257854 AllocMem=0 FreeMem=253182
   Sockets=2 Boards=1
   State=MAINT ThreadsPerCore=1 TmpDisk=128927 Weight=1 Owner=N/A
   BootTime=2016-05-26T13:57:21 SlurmdStartTime=2016-05-26T14:02:29
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
...

which is the only node in this partition:

# scontrol show partition free_gpu
PartitionName=free_gpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=free
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO
   MaxCPUsPerNode=UNLIMITED
   Nodes=sn3
   Priority=1000 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=REQUEUE
   State=UP TotalCPUs=24 TotalNodes=1 SelectTypeParameters=N/A
   DefMemPerCPU=6441 MaxMemPerNode=UNLIMITED
...

and there is a reservation set on the node (which my sbatch should be requesting to use):

# scontrol show reservation sn3
ReservationName=sn3 StartTime=2016-05-26T09:38:56
   EndTime=2017-05-26T09:38:56 Duration=365-00:00:00
   Nodes=sn3 NodeCnt=1 CoreCnt=24 Features=(null)
   PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES TRES=cpu=24
   Users=nick.maggio,yunshu.du,gabriel.delacruz,jeff.white Accounts=(null)
   Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
I wouldn't recommend routinely running jobs in reservations with the MAINT flag set: that flag indicates the hardware may be sporadically unavailable, and it also changes how the time is accounted for by sreport/sacct. The problem here is that the job was not submitted against the reservation. Because the reservation carries the MAINT flag, the node reports State=MAINT and is unavailable to any job outside the reservation, which is why the job shows ReqNodeNotAvail. Jobs must explicitly request a reservation in Slurm; if the job had specified "--reservation=sn3" it would have launched once the reservation started. Running 'scontrol update jobid=74806 reservation=sn3' should get the job running now.
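For reference, a minimal sketch of a batch script that submits directly into the reservation. The script name, job name, and srun command are hypothetical placeholders; the key line is the --reservation directive, and the resource requests mirror the stuck job above:

```shell
#!/bin/bash
#SBATCH --job-name=idv-test          # hypothetical job name
#SBATCH --partition=free_gpu
#SBATCH --reservation=sn3            # request the reservation explicitly
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:2
#SBATCH --time=01:00:00

srun ./my_program                    # placeholder for the real workload
```

The same flag also works on the command line (sbatch --reservation=sn3 job.sh), and for a job that is already pending, 'scontrol update jobid=<id> reservation=sn3' retrofits the reservation as noted above.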
Created attachment 3154 [details] slurmctld.log.gz
Created attachment 3155 [details] slurmd.log.gz
Well, that was fast... I found the cause too. It was a problem with the program being used to generate the sbatch file: it didn't handle the reservation correctly, so it never asked Slurm for it, and it produced no error or log message saying it had ignored the reservation we specified.