Ticket 5305

Summary: JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Scheduling
Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Version: 17.11.7
Hardware: Linux
OS: Linux
Site: DTU Physics
Attachments: slurm.conf

Description Ole.H.Nielsen@fysik.dtu.dk 2018-06-13 05:27:23 MDT
Created attachment 7072 [details]
slurm.conf

With Slurm 17.11.7 we again have jobs that are pending with a Reason=ReqNodeNotAvail,_UnavailableNodes message:

# scontrol show job 592174
JobId=592174 JobName=ktrain.sh
   UserId=schiotz(2851) GroupId=camdfac(1257) MCS_label=N/A
   Priority=102371 Nice=0 Account=camdfac QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:a117,c001,g[079-110],h[001-002],i[002-051] Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=20:00:00 TimeMin=N/A
   SubmitTime=Wed 11:49:30 EligibleTime=Wed 11:49:30
   StartTime=Wed 13:22:50 EndTime=Thu 09:22:50 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=Wed 13:16:34
   Partition=xeon16 AllocNode:Sid=thul:19422
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=h002
   NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=62.50G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:K20Xm:4 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain.sh
   WorkDir=/home/niflheim/schiotz/development/atomic-resolution-tensorflow
   StdErr=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain-592174.out
   StdIn=/dev/null
   StdOut=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain-592174.out
   Power=

This job was submitted with --mem=0, requesting GPUs (Gres=gpu:K20Xm:4) that are available on the scheduled node h002 (which is a member of 3 partitions):

# scontrol show node h002
NodeName=h002 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=16.01
   AvailableFeatures=xeon2670v2,infiniband,xeon16,GPU_K20Xm
   ActiveFeatures=xeon2670v2,infiniband,xeon16,GPU_K20Xm
   Gres=gpu:K20Xm:4
   NodeAddr=h002 NodeHostName=h002 Version=17.11
   OS=Linux 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 
   RealMemory=256000 AllocMem=61440 FreeMem=231352 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=32752 Weight=14412 Owner=N/A MCS_label=N/A
   Partitions=xeon16,xeon16_128,xeon16_256 
   BootTime=Fri 13:24:37 SlurmdStartTime=Tue 14:55:58
   CfgTRES=cpu=16,mem=250G,billing=16
   AllocTRES=cpu=16,mem=60G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


This is most likely the same issue reported in Bug 4932 and Bug 4976, which we had hoped was fixed in 17.11.7.  Could you kindly revisit this problem?  Our current slurm.conf is attached.

Thanks,
Ole
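For reference, the bracketed hostlist in the Reason field expands to individual node names; on a cluster, `scontrol show hostnames 'a117,c001,g[079-110],...'` does this natively. A minimal bash sketch of the same expansion, handling only plain names and single name[lo-hi] ranges (not nested or comma-separated ranges inside brackets):

```shell
# Expand a simple Slurm hostlist such as a117,c001,g[079-110],h[001-002]
# into one node name per line. Sketch only: handles plain names and a
# single zero-padded lo-hi range per bracket, nothing more.
expand_hostlist() {
    local item prefix range lo hi i
    IFS=',' read -ra items <<< "$1"
    for item in "${items[@]}"; do
        if [[ $item == *\[*-*\]* ]]; then
            prefix=${item%%\[*}
            range=${item#*\[}; range=${range%\]}
            lo=${range%-*}; hi=${range#*-}
            # 10# forces base-10 so zero-padded values like 079 are not
            # parsed as octal; printf preserves the padding width of lo.
            for ((i = 10#$lo; i <= 10#$hi; i++)); do
                printf '%s%0*d\n' "$prefix" "${#lo}" "$i"
            done
        else
            printf '%s\n' "$item"
        fi
    done
}

# The list from the Reason field above: 1+1+32+2+50 = 86 nodes
expand_hostlist 'a117,c001,g[079-110],h[001-002],i[002-051]' | wc -l
```

This makes it easy to see at a glance how many (and which) nodes the scheduler considered unavailable for the job.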
Comment 1 Alejandro Sanchez 2018-06-14 03:53:33 MDT
Ole, could you show me the exact sbatch command line and #SBATCH options, as well as the slurmctld.log?
Comment 2 Alejandro Sanchez 2018-06-14 08:17:47 MDT
... also are any of these nodes in a reservation and/or DRAIN/DOWN and/or owned by a user who requested --exclusive=user?

a117,c001,g[079-110],h[001-002],i[002-051]
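These checks can be run directly with the standard Slurm client tools; a hedged sketch (guarded so it degrades gracefully on a machine without Slurm installed):

```shell
# Sketch of the checks asked about above: nodes can appear in
# ReqNodeNotAvail because they are DOWN/DRAIN, sit in a reservation,
# or are held exclusively by another user.
if command -v scontrol >/dev/null 2>&1; then
    sinfo -R                     # nodes with a state reason (DOWN/DRAIN and why)
    scontrol show reservation    # active reservations that may block nodes
    scontrol show node h002      # full state of the scheduled node
else
    echo "Slurm client tools not installed; run this on a login node"
fi
```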
Comment 3 Alejandro Sanchez 2018-06-14 09:14:36 MDT
which job is running on h002 and how much time passed between:

# scontrol show job 592174
and
# scontrol show node h002

I'm wondering if h002 was resumed back to available from down/drained in between the two scontrol requests.
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2018-06-25 05:53:55 MDT
(In reply to Alejandro Sanchez from comment #3)
> which job is running on h002 and how much time passed between:
> 
> # scontrol show job 592174
> and
> # scontrol show node h002
> 
> I'm wondering if h002 was resumed back to available from down/drained in
> between the two scontrol requests.

These commands were issued within a few minutes, and no changes were made to the system in between.  The node a117 was down, and node c001 was drained, but those nodes belong to a completely different partition.

Unfortunately, we have not been able to reproduce this error.  I guess the case should be closed, since we can't come up with a reproducer.
Comment 5 Alejandro Sanchez 2018-06-26 08:57:30 MDT
All right. Once we get a reproducer we'll at least have something to work off of. Please reopen if you encounter this again. Thanks.