Ticket 5305

Summary: JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Scheduling
Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Version: 17.11.7
Hardware: Linux
OS: Linux
Site: DTU Physics
Attachments: slurm.conf

Description Ole.H.Nielsen@fysik.dtu.dk 2018-06-13 05:27:23 MDT
Created attachment 7072 [details]
slurm.conf

With Slurm 17.11.7 we again have jobs that are pending with a Reason=ReqNodeNotAvail,_UnavailableNodes message:

# scontrol show job 592174
JobId=592174 JobName=ktrain.sh
   UserId=schiotz(2851) GroupId=camdfac(1257) MCS_label=N/A
   Priority=102371 Nice=0 Account=camdfac QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:a117,c001,g[079-110],h[001-002],i[002-051] Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=20:00:00 TimeMin=N/A
   SubmitTime=Wed 11:49:30 EligibleTime=Wed 11:49:30
   StartTime=Wed 13:22:50 EndTime=Thu 09:22:50 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=Wed 13:16:34
   Partition=xeon16 AllocNode:Sid=thul:19422
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=h002
   NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=62.50G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:K20Xm:4 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain.sh
   WorkDir=/home/niflheim/schiotz/development/atomic-resolution-tensorflow
   StdErr=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain-592174.out
   StdIn=/dev/null
   StdOut=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain-592174.out
   Power=

This job was submitted with --mem=0, requesting GPUs (Gres=gpu:K20Xm:4) that are available on the scheduled node h002 (which is a member of 3 partitions):

# scontrol show node h002
NodeName=h002 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=16.01
   AvailableFeatures=xeon2670v2,infiniband,xeon16,GPU_K20Xm
   ActiveFeatures=xeon2670v2,infiniband,xeon16,GPU_K20Xm
   Gres=gpu:K20Xm:4
   NodeAddr=h002 NodeHostName=h002 Version=17.11
   OS=Linux 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 
   RealMemory=256000 AllocMem=61440 FreeMem=231352 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=32752 Weight=14412 Owner=N/A MCS_label=N/A
   Partitions=xeon16,xeon16_128,xeon16_256 
   BootTime=Fri 13:24:37 SlurmdStartTime=Tue 14:55:58
   CfgTRES=cpu=16,mem=250G,billing=16
   AllocTRES=cpu=16,mem=60G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


This is most likely the same issue reported in Bug 4932 and Bug 4976, which we had hoped was fixed in 17.11.7.  Could you kindly revisit this problem?  Our current slurm.conf is attached.

Thanks,
Ole
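For reference, the bracketed hostlist in the Reason field expands to individual node names; on a cluster, `scontrol show hostnames 'a117,c001,g[079-110],...'` does this natively. A minimal bash sketch of the same expansion, handling only plain names and single name[lo-hi] ranges (not nested or comma-separated ranges inside brackets):

```shell
# Expand a simple Slurm hostlist such as a117,c001,g[079-110],h[001-002]
# into one node name per line. Sketch only: handles plain names and a
# single zero-padded lo-hi range per bracket, nothing more.
expand_hostlist() {
    local item prefix range lo hi i
    IFS=',' read -ra items <<< "$1"
    for item in "${items[@]}"; do
        if [[ $item == *\[*-*\]* ]]; then
            prefix=${item%%\[*}
            range=${item#*\[}; range=${range%\]}
            lo=${range%-*}; hi=${range#*-}
            # 10# forces base-10 so zero-padded values like 079 are not
            # parsed as octal; printf preserves the padding width of lo.
            for ((i = 10#$lo; i <= 10#$hi; i++)); do
                printf '%s%0*d\n' "$prefix" "${#lo}" "$i"
            done
        else
            printf '%s\n' "$item"
        fi
    done
}

# The list from the Reason field above: 1+1+32+2+50 = 86 nodes
expand_hostlist 'a117,c001,g[079-110],h[001-002],i[002-051]' | wc -l
```

This makes it easy to see at a glance how many (and which) nodes the scheduler considered unavailable for the job.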
Comment 1 Alejandro Sanchez 2018-06-14 03:53:33 MDT
Ole, could you show me the exact sbatch command line and #SBATCH options, as well as the slurmctld.log?
Comment 2 Alejandro Sanchez 2018-06-14 08:17:47 MDT
... also are any of these nodes in a reservation and/or DRAIN/DOWN and/or owned by a user who requested --exclusive=user?

a117,c001,g[079-110],h[001-002],i[002-051]
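These checks can be run directly with the standard Slurm client tools; a hedged sketch (guarded so it degrades gracefully on a machine without Slurm installed):

```shell
# Sketch of the checks asked about above: nodes can appear in
# ReqNodeNotAvail because they are DOWN/DRAIN, sit in a reservation,
# or are held exclusively by another user.
if command -v scontrol >/dev/null 2>&1; then
    sinfo -R                     # nodes with a state reason (DOWN/DRAIN and why)
    scontrol show reservation    # active reservations that may block nodes
    scontrol show node h002      # full state of the scheduled node
else
    echo "Slurm client tools not installed; run this on a login node"
fi
```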
Comment 3 Alejandro Sanchez 2018-06-14 09:14:36 MDT
which job is running on h002 and how much time passed between:

# scontrol show job 592174
and
# scontrol show node h002

I'm wondering if h002 was resumed back to available from down/drained in between the two scontrol requests.
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2018-06-25 05:53:55 MDT
(In reply to Alejandro Sanchez from comment #3)
> which job is running on h002 and how much time passed between:
> 
> # scontrol show job 592174
> and
> # scontrol show node h002
> 
> I'm wondering if h002 was resumed back to available from down/drained in
> between the two scontrol requests.

These commands were issued within a few minutes, and no changes were made to the system in between.  The node a117 was down, and node c001 was drained, but those nodes belong to a completely different partition.

Unfortunately, we have not been able to reproduce this error.  I guess the case should be closed, since we can't come up with a reproducer.
Comment 5 Alejandro Sanchez 2018-06-26 08:57:30 MDT
All right. Once we get a reproducer we'll at least have something to work off of. Please reopen if you encounter this again. Thanks.