Ticket 8416

Summary: Over 300 jobs stuck in Pending state - Reason=PartitionConfig
Product: Slurm Reporter: Sudhakar Lakkaraju <slakkaraju>
Component: Scheduling Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN
Severity: 2 - High Impact    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: ASC
Linux Distro: CentOS
Attachments: Slurm.conf
sdiag.txt
qos.txt

Description Sudhakar Lakkaraju 2020-01-29 10:28:06 MST
Created attachment 12877 [details]
Slurm.conf

Hi,
Over 300 jobs are stuck in the Pending state with Reason=PartitionConfig.
Even though resources are available, the jobs are not getting scheduled.
Please find attached the log files and slurm.conf.
Comment 1 Sudhakar Lakkaraju 2020-01-29 10:31:22 MST
Created attachment 12878 [details]
sdiag.txt
Comment 2 Sudhakar Lakkaraju 2020-01-29 10:31:47 MST
Created attachment 12879 [details]
qos.txt
Comment 3 Sudhakar Lakkaraju 2020-01-29 10:34:18 MST
Hi, 
We are using slurm 17.11.9-2
Comment 4 Sudhakar Lakkaraju 2020-01-29 10:40:37 MST
Here is the link to download slurmctld.log and slurmdbd.log
https://send.firefox.com/download/4f712369abb29078/#yfY5UaT46IMoV-H0ZCVVEw
Comment 5 Jason Booth 2020-01-29 10:58:18 MST
Hi Sudhakar Lakkaraju - our bug system is having trouble associating you with our supported sites. Can you tell me if you work with Chris, Nick or David?
Comment 6 Sudhakar Lakkaraju 2020-01-29 11:00:55 MST
Yes, I work with Dr. David Young; Nick is no longer working with ASC.

From: "bugs@schedmd.com" <bugs@schedmd.com>
Date: Wednesday, January 29, 2020 at 11:58 AM
To: Sudhakar Lakkaraju <slakkaraju@asc.edu>
Subject: [Bug 8416] Over 300 jobs stuck in Pending state -Reason=PartitionConfig

Comment # 5<https://bugs.schedmd.com/show_bug.cgi?id=8416#c5> on bug 8416<https://bugs.schedmd.com/show_bug.cgi?id=8416> from Jason Booth<mailto:jbooth@schedmd.com>

Hi Sudhakar Lakkaraju - out bug system is having trouble associating you with

our supported sites. Can you tell me if you work with Chris, Nick or David?

________________________________
You are receiving this mail because:

  *   You reported the bug.
Comment 8 Marshall Garey 2020-01-29 14:26:48 MST
I'm looking into this.

What version of Slurm are you running? 18.08 and 19.05 are the supported versions of Slurm, and when 20.02 is released next month, 18.08 won't be supported anymore. We strongly recommend you upgrade to a supported version of Slurm.
Comment 9 Sudhakar Lakkaraju 2020-01-29 14:29:30 MST
(In reply to Marshall Garey from comment #8)
> I'm looking into this.
> 
> What version of Slurm are you running? 18.08 and 19.05 are the supported
> versions of Slurm, and when 20.02 is released next month 18.08 won't be
> supported anymore. We strongly recommend you upgrade to a supported version
> of Slurm.

I'm running slurm 17.11.9-2.

We will be upgrading Slurm to 20.02 in the second week of May.
Comment 10 Marshall Garey 2020-01-29 14:40:42 MST
Can you also send the output of scontrol -d show job <jobid> for one of the stuck jobs?
Comment 11 Sudhakar Lakkaraju 2020-01-29 15:00:34 MST
(In reply to Marshall Garey from comment #10)
> Can you also send the output of scontrol -d show job <jobid> for one of the
> stuck jobs?

scontrol show job 364041
JobId=364041 JobName=bestjobshSCRIPT
   UserId=asnzrg(3628) GroupId=analyst(10000) MCS_label=N/A
   Priority=6028 Nice=0 Account=users QOS=express
   JobState=PENDING Reason=PartitionConfig Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2020-01-29T14:52:31 EligibleTime=2020-01-29T14:52:32
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-01-29T15:59:47
   Partition=dmc-ivy-bridge,knl,dmc-haswell,dmc-broadwell,gpu_kepler,gpu_pascal,gpu_volta,dmc-skylake,dmc-sr950 AllocNode:Sid=dmcvlogin4:6999
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=500M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=dmc DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/beegfs/home/asnzrg
   StdErr=/mnt/beegfs/home/asnzrg/bestjobshSCRIPT.o364041
   StdIn=/dev/null
   StdOut=/mnt/beegfs/home/asnzrg/bestjobshSCRIPT.o364041
   Power=
Comment 15 Marshall Garey 2020-01-30 12:00:21 MST
I have (partially) reproduced this on both 17.11 and the newest version of Slurm. The key is that jobs are being submitted to all partitions, including partitions where the job's QOS isn't allowed.

My test:

PartitionName=DEFAULT State=UP MaxTime=600 Default=NO Nodes=ALL
PartitionName=normal allowqos=normal Default=yes
PartitionName=debug allowqos=debug Default=yes
PartitionName=debug2 allowqos=debug2

marshall@voyager:~/slurm-local/17.11/voyager$ sacctmgr show qos format=name
      Name 
---------- 
    normal 
     debug 
    debug2 


I submit a job that takes up the whole cluster, then submit three jobs, each to all partitions and each requesting a different QOS (sketched below).
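
The submissions were roughly as follows - a sketch only; the sleep payload and exact flags are illustrative, though the node count matches the nine-node test cluster shown in the squeue output:

sbatch -N9 -p normal --qos=normal --wrap="sleep 600"              # fills the whole cluster
sbatch -N9 -p normal,debug,debug2 --qos=normal --wrap="sleep 600"
sbatch -N9 -p normal,debug,debug2 --qos=debug  --wrap="sleep 600"
sbatch -N9 -p normal,debug,debug2 --qos=debug2 --wrap="sleep 600"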

When the main scheduler runs, the jobs are pending with reason "Resources." When the backfill scheduler runs, the jobs are pending with reason "PartitionConfig." Here's an output from squeue showing the transition:

Wed Jan 29 15:48:25 2020
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 9 normal,de     wrap marshall PD       0:00      9 (PartitionConfig)
                 8 normal,de     wrap marshall PD       0:00      9 (PartitionConfig)
                 7 normal,de     wrap marshall PD       0:00      9 (PartitionConfig)
                 6    normal     wrap marshall  R       2:57      9 v[1-9]

Wed Jan 29 15:48:26 2020
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 9 normal,de     wrap marshall PD       0:00      9 (Resources)
                 8 normal,de     wrap marshall PD       0:00      9 (Resources)
                 7 normal,de     wrap marshall PD       0:00      9 (Resources)
                 6    normal     wrap marshall  R       2:58      9 v[1-9]


Here is scontrol show job output for the same job, once while the reason is "Resources" and once while it is "PartitionConfig" - note that my job has an estimated StartTime, unlike the job you uploaded, which doesn't.

marshall@voyager:~/slurm-local/17.11/voyager$ scontrol show job 7
JobId=7 JobName=wrap
   UserId=marshall(1017) GroupId=marshall(1017) MCS_label=N/A
   Priority=478 Nice=0 Account=acct QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2020-01-29T15:45:55 EligibleTime=2020-01-29T15:45:55
   StartTime=2020-01-29T16:05:28 EndTime=2020-01-29T16:25:28 Deadline=N/A


marshall@voyager:~/slurm-local/17.11/voyager$ scontrol show job 7
JobId=7 JobName=wrap
   UserId=marshall(1017) GroupId=marshall(1017) MCS_label=N/A
   Priority=478 Nice=0 Account=acct QOS=normal
   JobState=PENDING Reason=PartitionConfig Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2020-01-29T15:45:55 EligibleTime=2020-01-29T15:45:55
   StartTime=2020-01-29T16:05:28 EndTime=2020-01-29T16:25:28 Deadline=N/A


I'm not sure why the job you uploaded doesn't have an estimated start time. However, I suspect this problem is mostly cosmetic and the pending jobs will eventually start - unless they continue to have no start time. Have any of the jobs pending with this "PartitionConfig" reason started since yesterday?

I'll dig into the code to see why the backfill scheduler sets the reason to PartitionConfig while the main scheduler sets it to Resources.

Do your users usually submit jobs to all partitions? Something you could do is use a job submit plugin to verify that the job is only submitted to partitions where its QOS is allowed.
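
For reference, a minimal job_submit/lua sketch of that check follows. It assumes the Lua plugin is enabled and that the partition records passed to the script expose their AllowQos list as a comma-separated string in part.allow_qos; both assumptions should be verified against the job_submit plugin documentation for the Slurm version actually in use.

-- job_submit.lua - sketch only, not a drop-in implementation.
-- Assumes part.allow_qos holds the partition's AllowQos list as a
-- comma-separated string (verify for your Slurm version).

-- Find a partition record by name in the list passed to the script.
local function find_part(part_list, pname)
    for _, part in pairs(part_list) do
        if part.name == pname then
            return part
        end
    end
    return nil
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Nothing to check if the job named no QOS or no partitions.
    if job_desc.qos == nil or job_desc.partition == nil then
        return slurm.SUCCESS
    end

    local kept = {}
    -- Walk the comma-separated partition list the job requested.
    for pname in string.gmatch(job_desc.partition, "[^,]+") do
        local part = find_part(part_list, pname)
        -- Keep the partition if it has no QOS restriction or if the
        -- job's QOS appears in its AllowQos list.
        if part == nil or part.allow_qos == nil or part.allow_qos == "" or
           string.find("," .. part.allow_qos .. ",",
                       "," .. job_desc.qos .. ",", 1, true) then
            table.insert(kept, pname)
        end
    end

    -- Silently drop partitions where the job's QOS could never run.
    if #kept > 0 then
        job_desc.partition = table.concat(kept, ",")
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end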
Comment 16 Marshall Garey 2020-02-10 13:10:00 MST
I don't have a fix yet, but have you been able to try the job submit plugin to work around this issue?
Comment 17 Sudhakar Lakkaraju 2020-02-10 15:46:42 MST
Please close the ticket. Our workaround was to increase the "bf_max_job_test" value; eventually the scheduler started behaving correctly.
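
As a point of reference, that workaround corresponds to a slurm.conf change along these lines. The value is illustrative (chosen here simply to exceed the roughly 300 pending jobs from the original report), and any existing SchedulerParameters entries at the site should be merged in rather than replaced:

# slurm.conf - illustrative value, merge with existing SchedulerParameters
SchedulerParameters=bf_max_job_test=1000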
Comment 18 Marshall Garey 2020-02-10 16:04:37 MST
Okay - closing as infogiven. I still definitely recommend a job submit plugin so that jobs aren't submitted to partitions where they can't run; even if users submit to every partition, the plugin can silently remove the partitions where the job could never run. Trying to schedule jobs in partitions where they can't run wastes CPU cycles, so you'd probably see improved scheduler performance.
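
Enabling the Lua job submit plugin sketched earlier would, in outline, require a slurm.conf entry like the one below, with the script saved as job_submit.lua alongside slurm.conf; confirm the details against the job_submit documentation for the version in use.

# slurm.conf - enable the Lua job submit plugin; the script is expected
# at job_submit.lua in the same directory as slurm.conf
JobSubmitPlugins=lua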