Ticket 10258

Summary: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
Product: Slurm    Reporter: Raj <rmallamp>
Component: Cloud    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Raj 2020-11-19 11:54:53 MST
All,

I am running SLURM 18.08.9 using the elastic plug-in in the AWS cloud and ran into this weird error; ever since it happened, the WORKER nodes are not being created and the jobs are not executed.

Our setup: We have the controller (slurmctld) running on an AWS C4-class server, and it spins up a worker node when a job is submitted using the sbatch command. When the scheduled job completes, the worker node is terminated. This had been working for the last 4 months but stopped working last Monday (11/16). The error says,

"Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions"

What I am noticing is that the controller is trying to allocate itself as a worker node (I could be wrong). Ever since this happened, the controller has not been spinning up worker nodes and has not run any scheduled jobs. One more piece of information: when this happened I noticed some AWS capacity issues, but they were resolved within 30 minutes.

Any help to resolve this is appreciated.
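For reference, the usual first checks for this Reason string are below (a sketch assuming admin access to the cluster's Slurm CLI tools; "ip-10-0-0-5" is a placeholder node name, not one from this site):

```shell
# List every node with its state and the reason slurmctld recorded
# for marking it DOWN or DRAINED.
sinfo -N -o "%N %t %E"

# Inspect one node in detail (placeholder node name).
scontrol show node ip-10-0-0-5

# If a cloud node is stuck DOWN after a transient failure (such as
# the 30-minute AWS capacity issue mentioned above), an admin can
# return it to service so the elastic logic can power it up again.
scontrol update NodeName=ip-10-0-0-5 State=RESUME
```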

Here is the output of scontrol show job <jobnumber>:
(base) [centos@ip-198-122-102-172 ~]$ scontrol show job 501
JobId=501 JobName=damocles
   UserId=centos(1000) GroupId=centos(1000) MCS_label=N/A
   Priority=4294901755 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2020-11-18T23:51:19 EligibleTime=2020-11-18T23:51:19
   AccrueTime=2020-11-18T23:51:19
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-11-19T18:52:03
   Partition=normal AllocNode:Sid=ip-198-122-102-172:1366
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/efs/damocles_latest/main/data_output/damocles_30aedf_sbatch.sh
   WorkDir=/efs/damocles_latest/main
   StdErr=/efs/damocles_latest/main/data_output/slurm_logfile.501.err
   StdIn=/dev/null
   StdOut=/efs/damocles_latest/main/data_output/slurm_logfile.501.out
   Power=
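The job record above is a flat list of key=value tokens, so the Reason can be pulled out with standard text tools for scripting (a minimal sketch; the sample line is copied from the record above, and on a live system it would come from scontrol show job 501 rather than printf):

```shell
# Extract the human-readable Reason from saved scontrol output.
# In this output, spaces inside the Reason value appear as underscores.
printf '%s\n' 'JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)' |
  tr ' ' '\n' |
  grep '^Reason=' |
  cut -d= -f2- |
  tr '_' ' '
```

The pipeline splits the record into one key=value token per line, keeps only the Reason field, drops the key, and restores the spaces.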