Created attachment 7072 [details]
slurm.conf

With Slurm 17.11.7 we again have jobs that are pending with a
Reason=ReqNodeNotAvail,_UnavailableNodes message:

# scontrol show job 592174
JobId=592174 JobName=ktrain.sh
   UserId=schiotz(2851) GroupId=camdfac(1257) MCS_label=N/A
   Priority=102371 Nice=0 Account=camdfac QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:a117,c001,g[079-110],h[001-002],i[002-051] Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=20:00:00 TimeMin=N/A
   SubmitTime=Wed 11:49:30 EligibleTime=Wed 11:49:30
   StartTime=Wed 13:22:50 EndTime=Thu 09:22:50 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=Wed 13:16:34
   Partition=xeon16 AllocNode:Sid=thul:19422
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=h002
   NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=62.50G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:K20Xm:4 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain.sh
   WorkDir=/home/niflheim/schiotz/development/atomic-resolution-tensorflow
   StdErr=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain-592174.out
   StdIn=/dev/null
   StdOut=/home/niflheim/schiotz/development/atomic-resolution-tensorflow/ktrain-592174.out
   Power=

This job was submitted with --mem=0, requesting GPUs (Gres=gpu:K20Xm:4) that are available on the scheduled node h002 (which is a member of 3 partitions):

# scontrol show node h002
NodeName=h002 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=16.01
   AvailableFeatures=xeon2670v2,infiniband,xeon16,GPU_K20Xm
   ActiveFeatures=xeon2670v2,infiniband,xeon16,GPU_K20Xm
   Gres=gpu:K20Xm:4
   NodeAddr=h002 NodeHostName=h002 Version=17.11
   OS=Linux 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017
   RealMemory=256000 AllocMem=61440 FreeMem=231352 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=32752 Weight=14412 Owner=N/A MCS_label=N/A
   Partitions=xeon16,xeon16_128,xeon16_256
   BootTime=Fri 13:24:37 SlurmdStartTime=Tue 14:55:58
   CfgTRES=cpu=16,mem=250G,billing=16
   AllocTRES=cpu=16,mem=60G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

This is most likely the same issue reported in Bug 4932 and Bug 4976, which should have been fixed in 17.11.7. Could you kindly revisit this problem?

I attach our current slurm.conf.

Thanks,
Ole
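For reference when comparing such reports, the scontrol output above is a flat series of key=value tokens. The following is a minimal illustrative parser (not part of Slurm, and it assumes values contain no embedded spaces) that pulls out fields such as JobState and Reason from a pasted line:

```python
import re

def parse_scontrol(text):
    """Split scontrol-style whitespace-separated key=value tokens into a dict.

    Simplifying assumption: each value is a single token with no spaces,
    so fields like 'SubmitTime=Wed 11:49:30' would only capture 'Wed'.
    """
    fields = {}
    for match in re.finditer(r'(\S+?)=(\S*)', text):
        fields[match.group(1)] = match.group(2)
    return fields

sample = ("JobId=592174 JobState=PENDING "
          "Reason=ReqNodeNotAvail,_UnavailableNodes:a117,c001,g[079-110]")
job = parse_scontrol(sample)
print(job["JobState"])  # PENDING
print(job["Reason"])    # ReqNodeNotAvail,_UnavailableNodes:a117,c001,g[079-110]
```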
Ole, could you show me the exact sbatch command and the #SBATCH options, as well as the slurmctld.log?
... also, are any of these nodes in a reservation, in a DRAIN/DOWN state, or allocated to a user who requested --exclusive=user?

a117,c001,g[079-110],h[001-002],i[002-051]
Which job is running on h002, and how much time passed between:

# scontrol show job 592174

and

# scontrol show node h002

I'm wondering if h002 was resumed back to available from down/drained in between the two scontrol requests.
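One way to check this is to scan slurmctld.log for h002 state transitions around the two timestamps. A sketch along these lines (the log lines in the here-document are illustrative examples, not output from this cluster) would show whether the node bounced through a not-responding or DOWN state:

```shell
# Illustrative only: scan a slurmctld log for h002 state-change messages.
log=$(mktemp)
cat > "$log" <<'EOF'
[2018-06-13T12:50:01] error: Nodes h002 not responding
[2018-06-13T13:10:12] Node h002 now responding
EOF
# Any hit here suggests the node changed availability between the requests.
grep -E 'h002.*(not responding|now responding|DOWN|DRAIN)' "$log"
rm -f "$log"
```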
(In reply to Alejandro Sanchez from comment #3)
> which job is running on h002 and how much time passed between:
>
> # scontrol show job 592174
> and
> # scontrol show node h002
>
> I'm wondering if h002 was resumed back to available from down/drained in
> between the two scontrol requests.

These commands were issued within a few minutes, and no changes were made to the system in between. Node a117 was down and node c001 was drained, but those nodes belong to a completely different partition.

Unfortunately, we have not been able to reproduce this error. I guess the case should be closed, since we can't come up with a reproducer.
All right. Once we get a reproducer we'll at least have something to work off of. Please reopen if you encounter this again. Thanks.