Ticket 10258 - Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
Summary: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Cloud
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2020-11-19 11:54 MST by Raj
Modified: 2020-11-19 11:59 MST

Site: -Other-


Description Raj 2020-11-19 11:54:53 MST
All,

I am running Slurm 18.08.9 with the elastic computing plugin in the AWS cloud and ran into this weird error; ever since it happened, the worker nodes are not being created and jobs are not executed.

Our setup: we have the controller (slurmctld) running on an AWS C4-class server, and it spins up a worker node when a job is submitted with the sbatch command. When the scheduled job completes, the worker node is terminated. This had been working for the last 4 months but stopped working last Monday (11/16). The error says:

"Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions"

What I am noticing is that the controller seems to be trying to allocate itself as a worker node (I could be wrong). Ever since this happened, the controller has not been spinning up worker nodes or running any scheduled jobs. One more piece of information: when this happened I noticed some AWS capacity issues, but they were resolved within 30 minutes.
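
The usual way to narrow this down is to check whether the controller host is also defined as a compute node, and to ask the scheduler why it considers nodes unavailable (worker1 below is a placeholder node name):

grep -E '^(ControlMachine|SlurmctldHost|NodeName|PartitionName)' /etc/slurm/slurm.conf
sinfo -R                                             # DOWN/DRAINED nodes with recorded reasons
sinfo -N -l                                          # per-node state overview
scontrol show node worker1 | grep -E 'State|Reason'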

Any help to resolve this is appreciated.

Here is the output of scontrol show job for the affected job:
(base) [centos@ip-198-122-102-172 ~]$ scontrol show job 501
JobId=501 JobName=damocles
   UserId=centos(1000) GroupId=centos(1000) MCS_label=N/A
   Priority=4294901755 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2020-11-18T23:51:19 EligibleTime=2020-11-18T23:51:19
   AccrueTime=2020-11-18T23:51:19
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-11-19T18:52:03
   Partition=normal AllocNode:Sid=ip-198-122-102-172:1366
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/efs/damocles_latest/main/data_output/damocles_30aedf_sbatch.sh
   WorkDir=/efs/damocles_latest/main
   StdErr=/efs/damocles_latest/main/data_output/slurm_logfile.501.err
   StdIn=/dev/null
   StdOut=/efs/damocles_latest/main/data_output/slurm_logfile.501.out
   Power=
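
If the nodes turn out to be merely stuck in a DOWN or DRAINED state (for example, left over from the earlier AWS capacity problem) rather than misconfigured, the usual fix is to clear the state by hand; the node names here are placeholders:

scontrol update NodeName=worker[1-10] State=RESUME   # clear DOWN/DRAINED and return nodes to service
scontrol reconfigure                                 # re-read slurm.conf after any config change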