Created attachment 12877 [details] Slurm.conf

Hi,

Over 300 jobs are stuck in the Pending state with Reason=PartitionConfig. Even though resources are available, the jobs are not getting scheduled. Please find the attached log files and slurm.conf.
Created attachment 12878 [details] sdiag.txt
Created attachment 12879 [details] qos.txt
Hi, we are using Slurm 17.11.9-2.
Here is the link to download slurmctld.log and slurmdbd.log https://send.firefox.com/download/4f712369abb29078/#yfY5UaT46IMoV-H0ZCVVEw
Hi Sudhakar Lakkaraju - our bug system is having trouble associating you with our supported sites. Can you tell me if you work with Chris, Nick or David?
Yes, I do work with Dr. David Young. Nick is no longer working with ASC.
I'm looking into this. What version of Slurm are you running? 18.08 and 19.05 are the supported versions of Slurm, and when 20.02 is released next month 18.08 won't be supported anymore. We strongly recommend you upgrade to a supported version of Slurm.
(In reply to Marshall Garey from comment #8)
> I'm looking into this.
>
> What version of Slurm are you running? 18.08 and 19.05 are the supported
> versions of Slurm, and when 20.02 is released next month 18.08 won't be
> supported anymore. We strongly recommend you upgrade to a supported version
> of Slurm.

I'm running Slurm 17.11.9-2. We will be upgrading Slurm to 20.02 in the second week of May.
Can you also send the output of scontrol -d show job <jobid> for one of the stuck jobs?
(In reply to Marshall Garey from comment #10)
> Can you also send the output of scontrol -d show job <jobid> for one of the
> stuck jobs?

scontrol show job 364041

JobId=364041 JobName=bestjobshSCRIPT
   UserId=asnzrg(3628) GroupId=analyst(10000) MCS_label=N/A
   Priority=6028 Nice=0 Account=users QOS=express
   JobState=PENDING Reason=PartitionConfig Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2020-01-29T14:52:31 EligibleTime=2020-01-29T14:52:32
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-01-29T15:59:47
   Partition=dmc-ivy-bridge,knl,dmc-haswell,dmc-broadwell,gpu_kepler,gpu_pascal,gpu_volta,dmc-skylake,dmc-sr950
   AllocNode:Sid=dmcvlogin4:6999
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=500M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=dmc DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/beegfs/home/asnzrg
   StdErr=/mnt/beegfs/home/asnzrg/bestjobshSCRIPT.o364041
   StdIn=/dev/null
   StdOut=/mnt/beegfs/home/asnzrg/bestjobshSCRIPT.o364041
   Power=
I have (partially) reproduced this on both 17.11 and the newest version of Slurm. The key is that jobs are being submitted to all partitions, even if the job's QOS isn't allowed on that partition.

My test:

PartitionName=DEFAULT State=UP MaxTime=600 Default=NO Nodes=ALL
PartitionName=normal allowqos=normal Default=yes
PartitionName=debug allowqos=debug Default=yes
PartitionName=debug2 allowqos=debug2

marshall@voyager:~/slurm-local/17.11/voyager$ sacctmgr show qos format=name
      Name
----------
    normal
     debug
    debug2

I submit a job that takes up the whole cluster, then submit 3 jobs, each to all partitions and each requesting a different QOS. When the main scheduler runs, the jobs are pending with reason "Resources." When the backfill scheduler runs, the jobs are pending with reason "PartitionConfig." Here's output from squeue showing the transition:

Wed Jan 29 15:48:25 2020
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      9 normal,de     wrap marshall PD       0:00      9 (PartitionConfig)
      8 normal,de     wrap marshall PD       0:00      9 (PartitionConfig)
      7 normal,de     wrap marshall PD       0:00      9 (PartitionConfig)
      6    normal     wrap marshall  R       2:57      9 v[1-9]

Wed Jan 29 15:48:26 2020
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      9 normal,de     wrap marshall PD       0:00      9 (Resources)
      8 normal,de     wrap marshall PD       0:00      9 (Resources)
      7 normal,de     wrap marshall PD       0:00      9 (Resources)
      6    normal     wrap marshall  R       2:58      9 v[1-9]

Here is scontrol show job once with reason "Resources" and once with "PartitionConfig" - note, however, that my job has an estimated StartTime, unlike the job you uploaded, which doesn't.
marshall@voyager:~/slurm-local/17.11/voyager$ scontrol show job 7
JobId=7 JobName=wrap
   UserId=marshall(1017) GroupId=marshall(1017) MCS_label=N/A
   Priority=478 Nice=0 Account=acct QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2020-01-29T15:45:55 EligibleTime=2020-01-29T15:45:55
   StartTime=2020-01-29T16:05:28 EndTime=2020-01-29T16:25:28 Deadline=N/A

marshall@voyager:~/slurm-local/17.11/voyager$ scontrol show job 7
JobId=7 JobName=wrap
   UserId=marshall(1017) GroupId=marshall(1017) MCS_label=N/A
   Priority=478 Nice=0 Account=acct QOS=normal
   JobState=PENDING Reason=PartitionConfig Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2020-01-29T15:45:55 EligibleTime=2020-01-29T15:45:55
   StartTime=2020-01-29T16:05:28 EndTime=2020-01-29T16:25:28 Deadline=N/A

I'm not sure why the job you uploaded doesn't have an estimated start time. However, I suspect this problem is mostly cosmetic and the pending jobs will eventually start - unless they continue to have no start time. Have any of the jobs pending with the "PartitionConfig" reason started since yesterday?

I'll dig into the code to see why the backfill scheduler sets the reason to PartitionConfig while the main scheduler sets it to Resources.

Do your users usually submit jobs to all partitions? One thing you could do is use a job submit plugin to verify that a job is only submitted to partitions where its QOS is allowed.
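For illustration, here is a minimal sketch of what such a job_submit/lua plugin could look like. This is only an outline, not tested code: the exact field names exposed to Lua (in particular part.allow_qos, and whether it is a comma-separated string) should be verified against the job_submit/lua plugin source for your Slurm version before using anything like this.

-- job_submit.lua - sketch only; verify field names against your
-- Slurm version's job_submit/lua plugin before deploying.
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Nothing to filter if the job did not name partitions or a QOS.
    if job_desc.partition == nil or job_desc.qos == nil then
        return slurm.SUCCESS
    end
    local kept = {}
    -- Walk the job's comma-separated partition list.
    for name in string.gmatch(job_desc.partition, "[^,]+") do
        for _, part in pairs(part_list) do
            -- Assumption: allow_qos is a comma-separated string;
            -- nil or empty means every QOS is allowed.
            if part.name == name and
               (part.allow_qos == nil or part.allow_qos == "" or
                string.find("," .. part.allow_qos .. ",",
                            "," .. job_desc.qos .. ",", 1, true)) then
                table.insert(kept, name)
            end
        end
    end
    -- Silently drop partitions where the job's QOS is not allowed.
    if #kept > 0 then
        job_desc.partition = table.concat(kept, ",")
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

With the reproduction above, a job submitted with --qos=debug to normal,debug,debug2 would be trimmed to just the debug partition before the scheduler ever sees it.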
I don't have a fix yet, but have you been able to try a job submit plugin to work around this issue?
Please close the ticket. Our workaround was to increase the "bf_max_job_test" value; eventually, jobs started scheduling correctly.
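For anyone landing on this ticket: that parameter is set via SchedulerParameters in slurm.conf. The value below is only illustrative (the default is 100; the right value depends on the site and on backfill cycle times):

# slurm.conf - illustrative; bf_max_job_test (default 100) caps how
# many jobs the backfill scheduler considers per cycle. Raising it
# lets more of the multi-partition jobs be evaluated each cycle.
SchedulerParameters=bf_max_job_test=1000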
Okay - closing as infogiven. I still definitely recommend a job submit plugin so that jobs aren't submitted to partitions where they can't run; even if users keep submitting to all partitions, the plugin can silently remove the partitions where the job could never run. Trying to schedule jobs in partitions where they can't run wastes CPU cycles, so you'd probably see improved scheduler performance.