Created attachment 7072 [details]
slurm.conf

With Slurm 17.11.7 we again have jobs that are pending with a
Reason=ReqNodeNotAvail,_UnavailableNodes message:

# scontrol show job 592174
JobId=592174 JobName=ktrain.sh
   UserId=schiotz(2851) GroupId=camdfac(1257) MCS_label=N/A
   Priority=102371 Nice=0 Account=camdfac QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:a117,c001,g[079-110],h[001-002],i[002-051] Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=20:00:00 TimeMin=N/A
   SubmitTime=Wed 11:49:30 EligibleTime=Wed 11:49:30
   StartTime=Wed 13:22:50 EndTime=Thu 09:22:50 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=Wed 13:16:34
   Partition=xeon16 AllocNode:Sid=thul:19422
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=h002
   NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=62.50G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:K20Xm:4 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain.sh
   WorkDir=/home/niflheim/schiotz/development/atomic-resolution-tensorflow
   StdErr=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain-592174.out
   StdIn=/dev/null
   StdOut=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain-592174.out
   Power=

This job was submitted with --mem=0, requesting GPUs (Gres=gpu:K20Xm:4) that are available on the scheduled node h002 (which is a member of 3 partitions):

# scontrol show node h002
NodeName=h002 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=16.01
   AvailableFeatures=xeon2670v2,infiniband,xeon16,GPU_K20Xm
   ActiveFeatures=xeon2670v2,infiniband,xeon16,GPU_K20Xm
   Gres=gpu:K20Xm:4
   NodeAddr=h002 NodeHostName=h002 Version=17.11
   OS=Linux 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017
   RealMemory=256000 AllocMem=61440 FreeMem=231352 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=32752 Weight=14412 Owner=N/A MCS_label=N/A
   Partitions=xeon16,xeon16_128,xeon16_256
   BootTime=Fri 13:24:37 SlurmdStartTime=Tue 14:55:58
   CfgTRES=cpu=16,mem=250G,billing=16
   AllocTRES=cpu=16,mem=60G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

This is most likely the same issue reported in Bug 4932 and Bug 4976, which should have been fixed in 17.11.7. Could you kindly revisit this problem?

I attach our current slurm.conf.

Thanks,
Ole
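For reference when comparing such reports, the scontrol output above is a flat series of key=value tokens. The following is a minimal illustrative parser (not part of Slurm, and it assumes values contain no embedded spaces) that pulls out fields such as JobState and Reason from a pasted line:

```python
import re

def parse_scontrol(text):
    """Split scontrol-style whitespace-separated key=value tokens into a dict.

    Simplifying assumption: each value is a single token with no spaces,
    so fields like 'SubmitTime=Wed 11:49:30' would only capture 'Wed'.
    """
    fields = {}
    for match in re.finditer(r'(\S+?)=(\S*)', text):
        fields[match.group(1)] = match.group(2)
    return fields

sample = ("JobId=592174 JobState=PENDING "
          "Reason=ReqNodeNotAvail,_UnavailableNodes:a117,c001,g[079-110]")
job = parse_scontrol(sample)
print(job["JobState"])  # PENDING
print(job["Reason"])    # ReqNodeNotAvail,_UnavailableNodes:a117,c001,g[079-110]
```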
Ole, could you show me the exact sbatch command and the #SBATCH options, as well as the slurmctld.log?
... also, are any of these nodes in a reservation, in a DRAIN/DOWN state, or allocated to a user who requested --exclusive=user?

a117,c001,g[079-110],h[001-002],i[002-051]
Which job is running on h002, and how much time passed between:

# scontrol show job 592174

and

# scontrol show node h002

I'm wondering if h002 was resumed back to available from down/drained in between the two scontrol requests.
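One way to check this is to scan slurmctld.log for h002 state transitions around the two timestamps. A sketch along these lines (the log lines in the here-document are illustrative examples, not output from this cluster) would show whether the node bounced through a not-responding or DOWN state:

```shell
# Illustrative only: scan a slurmctld log for h002 state-change messages.
log=$(mktemp)
cat > "$log" <<'EOF'
[2018-06-13T12:50:01] error: Nodes h002 not responding
[2018-06-13T13:10:12] Node h002 now responding
EOF
# Any hit here suggests the node changed availability between the requests.
grep -E 'h002.*(not responding|now responding|DOWN|DRAIN)' "$log"
rm -f "$log"
```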
(In reply to Alejandro Sanchez from comment #3)
> which job is running on h002 and how much time passed between:
>
> # scontrol show job 592174
> and
> # scontrol show node h002
>
> I'm wondering if h002 was resumed back to available from down/drained in
> between the two scontrol requests.

These commands were issued within a few minutes, and no changes were made to the system in between. Node a117 was down and node c001 was drained, but those nodes belong to a completely different partition.

Unfortunately, we have not been able to reproduce this error. I guess the case should be closed, since we can't come up with a reproducer.
All right. Once we get a reproducer we'll at least have something to work off of. Please reopen if you encounter this again. Thanks.