Ticket 911 - best_fit topology failure : no switch satisfying the request found
Summary: best_fit topology failure : no switch satisfying the request found
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.03.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-06-24 07:55 MDT by Ryan Cox
Modified: 2014-07-08 08:27 MDT

See Also:
Site: BYU - Brigham Young University
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 14.03.5
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
topology from slurm.conf (2.76 KB, text/plain)
2014-06-24 07:55 MDT, Ryan Cox
Details
scontrol show part (5.20 KB, text/plain)
2014-06-25 04:53 MDT, Ryan Cox
Details
scontrol show job (115.98 KB, application/x-gzip)
2014-06-25 04:58 MDT, Ryan Cox
Details
scontrol show job #2 (186.18 KB, application/x-gzip)
2014-06-25 05:36 MDT, Ryan Cox
Details
scontrol show job #3, with sane times (119.22 KB, application/x-gzip)
2014-06-25 06:10 MDT, Ryan Cox
Details
Fix for bad start time (1.34 KB, patch)
2014-06-25 11:29 MDT, Moe Jette
Details | Diff
Log job priority and partitions (656 bytes, patch)
2014-06-26 04:42 MDT, Moe Jette
Details | Diff
start times for rcpacker job 3826001 (15.36 KB, text/plain)
2014-06-26 07:23 MDT, Ryan Cox
Details
comparison of slurm.conf files and scontrol show part (9.50 KB, text/plain)
2014-06-26 07:53 MDT, Ryan Cox
Details
slurmctld.log.gz with backfill and backfillmap debug (139.46 KB, application/x-gzip)
2014-06-26 08:02 MDT, Ryan Cox
Details
scontrol show job (178.24 KB, application/x-gzip)
2014-06-26 08:05 MDT, Ryan Cox
Details
Correct job's partition info (2.31 KB, patch)
2014-06-26 10:16 MDT, Moe Jette
Details | Diff
scontrol show job (87.40 KB, application/x-gzip)
2014-06-30 05:46 MDT, Ryan Cox
Details
Possible fix (1.92 KB, patch)
2014-06-30 09:56 MDT, Moe Jette
Details | Diff

Description Ryan Cox 2014-06-24 07:55:27 MDT
Created attachment 975 [details]
topology from slurm.conf

I'm seeing some odd behavior where jobs are getting an estimated start time of a year from now.  One job in particular seems to have been waiting much longer than it should, possibly because other jobs are getting backfilled in front of it since its estimated start date is in 2015.  I increased the debug level and enabled backfill debug to see if I could figure it out.  I saw this in the logs:
[2014-06-24T09:20:02.402] debug2: found 191 usable nodes from config containing m7-[1-12]-[1-16]
[2014-06-24T09:20:02.402] debug2: found 127 usable nodes from config containing m7-[13-20]-[1-16]
[2014-06-24T09:20:02.402] debug3: _pick_best_nodes: job 3826001 idle_nodes 71 share_nodes 507
[2014-06-24T09:20:02.402] debug2: select_p_job_test for job 3826001
[2014-06-24T09:20:02.402] debug3: cons_res: _vns: node m7-1-2 sharing?
[2014-06-24T09:20:02.402] debug3: cons_res: _vns: node m7-1-3 sharing?
........ cut for brevity  .........
[2014-06-24T09:20:02.405] debug3: cons_res: _vns: node m7-12-16 sharing?
[2014-06-24T09:20:02.405] debug3: cons_res: _vns: node m7-15-9 sharing?
[2014-06-24T09:20:02.405] debug3: cons_res: _vns: node m7-17-15 sharing?
[2014-06-24T09:20:02.405] debug:  job 3826001: best_fit topology failure : no switch satisfying the request found
[2014-06-24T09:20:02.405] debug2: select_p_job_test for job 3826001
[2014-06-24T09:20:02.405] debug3: sched: JobId=3826001. State=PENDING. Reason=Resources. Priority=144705882. Partition=m7.


I'm wondering why switches could be a problem for job 3826001.  Note the StartTime=2015-06-20T17:30:04 and that there is no topology request.

root@sched1:~# scontrol -d show job 3826001
JobId=3826001 Name=test
   UserId=rcpacker(21331) GroupId=rcpacker(21559)
   Priority=144535840 Nice=0 Account=sgorrell QOS=m7_64
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-02:00:00 TimeMin=N/A
   SubmitTime=2014-06-20T11:31:31 EligibleTime=2014-06-20T11:31:31
   StartTime=2015-06-20T17:30:04 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m7 AllocNode:Sid=m7int01:19193
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=35-35 NumCPUs=560 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/panfs/pan.fsl.byu.edu/home/grp/fslg_IFI/step6_harmonic/run_STARCCM+_FSL_Ryan_hb /fslhome/rcpacker/compute/IFI/step5_explicit_mixing/hb_v07_m5000_explicit@018000.sim
   WorkDir=/panfs/pan.fsl.byu.edu/home/grp/fslg_IFI/step6_harmonic
   StdErr=/panfs/pan.fsl.byu.edu/home/grp/fslg_IFI/step6_harmonic/slurm-3826001.out
   StdIn=/dev/null
   StdOut=/panfs/pan.fsl.byu.edu/home/grp/fslg_IFI/step6_harmonic/slurm-3826001.out


I'm attaching the network topology information.  Here is sinfo for the m7 partition.  We have a few nodes draining for something, but other than that I can't think of what other information you need.

root@sched1:~# sinfo -p m7
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
m7           up 16-00:00:0     12   drng m7-1-15,m7-3-[3,5],m7-6-[2,16],m7-7-8,m7-9-[8,10,15],m7-15-16,m7-16-9,m7-20-15
m7           up 16-00:00:0     32    mix m7-2-[5,7],m7-3-[2,8],m7-5-[4,13-15],m7-6-11,m7-7-5,m7-8-[13-14],m7-9-1,m7-10-[7,12],m7-13-8,m7-14-[1,6,14],m7-15-[8-10],m7-16-[3,5-6],m7-17-15,m7-18-[5,8,15],m7-19-[9,11],m7-20-7
m7           up 16-00:00:0    273  alloc m7-1-[2-14,16],m7-2-[1-4,6,8-16],m7-3-[1,4,6-7,9,11-16],m7-4-[1-16],m7-5-[1-3,5-12,16],m7-6-[1,3-10,12-15],m7-7-[1-4,6-7,9-16],m7-8-[1-12,15-16],m7-9-[2-7,9,11-14,16],m7-10-[1-6,8-11,13-16],m7-11-[1-16],m7-12-[1-16],m7-13-[2-7,9-16],m7-14-[2-5,7-13,15-16],m7-15-[1-7,11-15],m7-16-[1-2,4,7-8,10-16],m7-17-[1-14,16],m7-18-[1-4,6-7,9-14,16],m7-19-[1-8,10,12-16],m7-20-[1-6,8-14,16]
m7           up 16-00:00:0      1   idle m7-3-10
Comment 1 Ryan Cox 2014-06-25 03:56:19 MDT
It does appear that the job will never run as long as other jobs exist that can run on the target nodes.  Other jobs are definitely getting backfilled in front of it since the job isn't scheduled to start until mid-2015.

I'm able to reproduce the problem by submitting a job with an identical request:
#SBATCH --time=26:00:00
#SBATCH --ntasks=560
#SBATCH --nodes=35
#SBATCH --mem-per-cpu=4G
#SBATCH -p m7
#SBATCH --exclusive
Comment 2 Moe Jette 2014-06-25 04:08:25 MDT
This message:
job 3826001: best_fit topology failure : no switch satisfying the request found

is clearly misleading. I am changing it to:
job 3826001: best_fit topology failure: no switch currently has sufficient resource to satisfy the request

You will note this job requires 35 nodes, and your logs show only 1 idle node and 32 partially idle nodes, so without preemption this job cannot start. Were you expecting this job to preempt other jobs to start?

====================

Your message just came in this morning. Perhaps this is more of a backfill issue?
Comment 3 Ryan Cox 2014-06-25 04:20:52 MDT
I filed the bug report yesterday and then added a comment today.  I think it is more of a backfill problem.  The job is currently the highest priority job in the system, but lower priority jobs with 6 and 7 day time limits are being backfilled in front of it.  It was submitted 2014-06-20T11:31:31 and we have plenty of nodes to satisfy the request.  None of the required nodes (35 nodes of m7-[13-20]-[1-16]) are idle, and other jobs have started on those nodes as recently as this morning.

I wonder how the job can have a start time of 2015 and how the lower priority jobs can backfill in front of it.  There are no reservations that would preclude it from starting.  Maybe I messed something up?

SchedulerParameters=bf_max_job_user=20,bf_interval=60,default_queue_depth=15,max_job_bf=5000,bf_window=20160,bf_continue,max_sched_time=6,bf_resolution=900


After looking some more, it turns out that another user requesting the same resources also has the exact same 2015 start date:
ryancox@m7int02:~$ scontrol -d show job 3688135
JobId=3688135 Name=NRTN28O2complex_299
   UserId=bemily2(20212) GroupId=bemily2(20222)
   Priority=716 Nice=0 Account=jchansen QOS=m7_64
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=6-16:30:00 TimeMin=N/A
   SubmitTime=2014-06-09T12:56:34 EligibleTime=2014-06-09T12:56:34
   StartTime=2015-06-20T17:30:04 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m7 AllocNode:Sid=m7int02:32448
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=1 Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/qsub_wrapper_modified_script_eUuYSRqODM
   WorkDir=/panfs/pan.fsl.byu.edu/scr/usr/12/bemily2/Aerosols/british/NRTN28O2complex
   StdErr=/panfs/pan.fsl.byu.edu/scr/usr/12/bemily2/Aerosols/british/NRTN28O2complex/NRTN28O2complex_299.e%j
   StdIn=/dev/null
   StdOut=/panfs/pan.fsl.byu.edu/scr/usr/12/bemily2/Aerosols/british/NRTN28O2complex/NRTN28O2complex_299.o3688135


ryancox@m7int02:~$ sinfo -n m7-[13-20]-[1-16] -p m7
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
m7           up 16-00:00:0      3   drng m7-15-16,m7-16-9,m7-20-15
m7           up 16-00:00:0      5    mix m7-13-8,m7-15-9,m7-16-[3,6],m7-17-15
m7           up 16-00:00:0    119  alloc m7-13-[2-7,9-16],m7-14-[1-16],m7-15-[1-8,10-15],m7-16-[1-2,4-5,7-8,10-16],m7-17-[1-14,16],m7-18-[1-16],m7-19-[1-16],m7-20-[1-14,16]

ryancox@m7int02:~$ squeue -l --start -j 3826001
Wed Jun 25 10:15:22 2014
  JOBID PARTITION     NAME     USER ST          START_TIME  NODES NODELIST(REASON)
3826001        m7     test rcpacker PD 2015-06-20T17:30:04     35 (Resources)

ryancox@m7int02:~$ squeue -l -w m7-[13-20]-[1-16] -S \-S | head -5
Wed Jun 25 10:14:41 2014
             JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
           3688134        m7 NRTN28O2  bemily2  RUNNING    1:51:52 6-16:30:00      1 m7-14-13
           3688126        m7 NRTN28O2  bemily2  RUNNING    3:45:45 6-16:30:00      1 m7-14-14
           3688127        m7 NRTN28O2  bemily2  RUNNING    3:45:45 6-16:30:00      1 m7-18-5

ryancox@m7int02:~$ squeue -l -w m7-[13-20]-[1-16] -S e | head -10
Wed Jun 25 10:19:23 2014
             JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
           3687977        m7 NRTN28O2  bemily2  RUNNING 6-15:10:52 6-16:30:00      1 m7-15-14
           3687978        m7 NRTN28O2  bemily2  RUNNING 6-08:31:59 6-16:30:00      1 m7-19-10
           3687979        m7 NRTN28O2  bemily2  RUNNING 6-07:56:57 6-16:30:00      1 m7-20-11
           3687981        m7 NRTN28O2  bemily2  RUNNING 6-05:30:53 6-16:30:00      1 m7-19-2
           3687982        m7 NRTN28O2  bemily2  RUNNING 6-04:37:51 6-16:30:00      1 m7-17-12
           3687983        m7 NRTN28O2  bemily2  RUNNING 6-04:29:12 6-16:30:00      1 m7-18-7
           3687986        m7 NRTN28O2  bemily2  RUNNING 6-03:20:05 6-16:30:00      1 m7-13-6
           3834115        m7 input.sh    jorr3  RUNNING   23:51:30 1-22:00:00      1 m7-13-8
Comment 4 Moe Jette 2014-06-25 04:50:05 MDT
(In reply to Ryan Cox from comment #3)
> I wonder how the job can have a start time of 2015 and how the lower
> priority jobs can backfill in front of it.

That makes two of us.
Do you still have a 16 day time limit on the m7 partition?
Do you have any running jobs with crazy long time limits (perhaps due to the partition limit being overridden by a QoS or association limit, or being reset by a sysadmin)?
I know it will be verbose, but perhaps you could either attach or email me the output of the following commands:
scontrol show part
scontrol show job
Comment 5 Ryan Cox 2014-06-25 04:53:07 MDT
Created attachment 985 [details]
scontrol show part
Comment 6 Ryan Cox 2014-06-25 04:58:17 MDT
Created attachment 986 [details]
scontrol show job

All accounts are set to a max of 7 days but we retained the 16 day limit on some partitions in case we needed to grant an exception.  I think we're done with exceptions at this point, so I may lower it to 7 days in the future.  There are certainly no jobs with a time limit above 7 days at this point.

ryancox@m7int02:~$ squeue -l -S \-l | head -10
Wed Jun 25 10:56:59 2014
             JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
           3851742    m6beta run_all_   ram247  RUNNING       7:45 7-00:00:00      1 m6-19-12
           3851701    m6beta run_all_   ram247  RUNNING      36:17 7-00:00:00      1 m6-31-16
           3851693    m6beta run_all_   ram247  RUNNING      51:25 7-00:00:00      1 m6-31-16
           3851680    m6beta run_all_   ram247  RUNNING    1:02:01 7-00:00:00      1 m6-31-16
           3851675    m6beta run_all_   ram247  RUNNING    1:11:10 7-00:00:00      1 m6-31-16
           3851660    m6beta run_all_   ram247  RUNNING    1:31:36 7-00:00:00      1 m6-31-16
           3851657    m6beta run_all_   ram247  RUNNING    1:44:45 7-00:00:00      1 m6-31-16
           3851645    m6beta run_all_   ram247  RUNNING    1:53:08 7-00:00:00      1 m6-13-9
ryancox@m7int02:~$ squeue -l -w m7-[13-20]-[1-16] -S \-l | head -10
Wed Jun 25 10:57:11 2014
             JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
           3688134        m7 NRTN28O2  bemily2  RUNNING    2:34:22 6-16:30:00      1 m7-14-13
           3688126        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-14-14
           3688127        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-5
           3688128        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-6
           3688129        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-8
           3688130        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-9
           3688131        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-13
           3688132        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-15
Comment 7 Ryan Cox 2014-06-25 05:36:24 MDT
Created attachment 987 [details]
scontrol show job #2

In case it helps, here's another one.  Two people just submitted jobs that are higher priority than rcpacker's: mwpswim and plf1.

You can see that several people ended up with the same start time:
root@sched1:/tmp# zgrep -B6 StartTime=2015-06-20T17:30:04 scontrol_show_job.gz |grep UserId
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=glh43(6466) GroupId=glh43(13733)
   UserId=rcpacker(21331) GroupId=rcpacker(21559)
   UserId=glh43(6466) GroupId=glh43(13733)
   UserId=mwpswim(21328) GroupId=mwpswim(21556)

It's really odd that they're all the same time down to the second.
Comment 8 Moe Jette 2014-06-25 05:39:38 MDT
I'm not sure if this might be related, but what are your QoS preemption rules?
Comment 9 Ryan Cox 2014-06-25 05:39:49 MDT
Check out the EndTime:
root@sched1:/tmp# scontrol show reservation licenseres
ReservationName=licenseres StartTime=2014-06-20T17:30:03 EndTime=2015-06-20T17:30:03 Duration=365-00:00:00
   Nodes= NodeCnt=0 CoreCnt=0 Features=(null) PartitionName=dol4 Flags=LICENSE_ONLY
   Users=root Accounts=(null) Licenses=emptylic:1,ansys_aa_r:5,ansys_aa_r_hpc:4,starccm_ccmpsuite:12,starccm_hpcdomains:318,starccm_star-nx:2,xilinx_ISE:1,xilinx_PlanAhead:1,xilinx_SDK:1,xilinx_Vivado_System_Edition:3 State=ACTIVE

The ones I have checked don't request any licenses though.
Comment 10 Ryan Cox 2014-06-25 05:42:59 MDT
ryancox@m7int02:~$ sacctmgr list qos
      Name   Priority  GraceTime    Preempt PreemptMode                                    Flags UsageThres UsageFactor  GrpCPUs  GrpCPUMins GrpCPURunMins GrpJobs  GrpMem GrpNodes GrpSubmit     GrpWall  MaxCPUs  MaxCPUMins MaxNodes     MaxWall MaxCPUsPU MaxJobsPU MaxNodesPU MaxSubmitPU 
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- -------- ----------- ------------- ------- ------- -------- --------- ----------- -------- ----------- -------- ----------- --------- --------- ---------- ----------- 
    normal          0   00:00:00    standby     cluster                                                        1.000000                                                                                                                                                                        
   standby          0   00:05:00                cluster                                                        0.000000  1000000             1000000000000000                                                 1000000                                                                             
   private          0   00:00:00    standby     cluster                                                        1.000000    20000             1000000000000000                                                                                                                                     
      test      20000   00:00:00    standby     cluster                              DenyOnLimit               1.000000    20000             1000000000000000                                                     512                         01:00:00       256        12                        
     maint      50000   00:00:00    standby     cluster                                                        0.000000  1000000             1000000000000000                                                 1000000                                                                             
preemptab+          0   00:00:05                cluster                                NoReserve               1.000000  1000000                                                                           1000000                                                                             
    bigmem          0   00:00:00    standby     cluster                              DenyOnLimit               1.500000                                                                                                                  5-00:00:00                              3             
     m7_64          0   00:00:00    standby     cluster                                                        1.250000                                                                                                                                                                        


preemptable was a testing thing I need to get rid of.  We're using "standby", though it is almost never used.

PreemptType=preempt/qos
PreemptMode=cancel
Comment 11 Moe Jette 2014-06-25 05:47:24 MDT
(In reply to Ryan Cox from comment #9)
> Check out the EndTime:
> root@sched1:/tmp# scontrol show reservation licenseres
> ReservationName=licenseres StartTime=2014-06-20T17:30:03
> EndTime=2015-06-20T17:30:03 Duration=365-00:00:00
>    Nodes= NodeCnt=0 CoreCnt=0 Features=(null) PartitionName=dol4
> Flags=LICENSE_ONLY
>    Users=root Accounts=(null)
> Licenses=emptylic:1,ansys_aa_r:5,ansys_aa_r_hpc:4,starccm_ccmpsuite:12,
> starccm_hpcdomains:318,starccm_star-nx:2,xilinx_ISE:1,xilinx_PlanAhead:1,
> xilinx_SDK:1,xilinx_Vivado_System_Edition:3 State=ACTIVE
> 
> The ones I have checked don't request any licenses though.

Given the matching times, this must be related although I'm not sure how at this point.
Comment 12 Ryan Cox 2014-06-25 06:10:24 MDT
Created attachment 990 [details]
scontrol show job #3, with sane times

The other odd thing is that sometimes the jobs do get sane start time estimates.  Currently they're all sane, within the next 3-4 days.
Comment 13 Moe Jette 2014-06-25 06:32:08 MDT
Question about your use of licenses: I see _no_ jobs requesting any licenses at all in the first "scontrol show job" output. Is that expected?

I'm wondering if there is something wrong with Slurm's reporting of job requests for licenses, although I don't see any such problem in my testing. If jobs really are using licenses, that could result in the job of interest being delayed until the advanced reservation for licenses ends.
Comment 14 Ryan Cox 2014-06-25 07:00:28 MDT
(In reply to Moe Jette from comment #13)
> Question about your use of licenses: I see _no_ jobs requesting any licenses
> at all in the first "scontrol show job" output. Is that expected?
> 

I imagine that we have quite a few people who don't know that they need to request licenses, so I guess it's not unexpected.  Many of our users don't use commercial code or use code that has a site license (no tracking required).


> I'm wondering if there is something wrong with Slurm's reporting of job
> requests for licenses, although I don't see any such problem in my testing.
> If jobs really are using licenses, that could result in the job of interest
> being delayed until the advanced reservation for licenses end.

I just submitted a job that properly showed a license request in scontrol show job both while pending and when running.

For some reason all job start times are still sane.  I'm trying to figure out what may have changed (jobs, other reservations, etc).  I haven't changed the configuration at all.
Comment 15 Ryan Cox 2014-06-25 07:03:01 MDT
And now they're back to 2015-06-20T17:30:04 again. Right after I hit submit last time.
Comment 16 Moe Jette 2014-06-25 07:06:31 MDT
(In reply to Ryan Cox from comment #15)
> And now they're back to 2015-06-20T17:30:04 again. Right after I hit submit
> last time.

At least we know it's related to an advanced reservation of licenses.
I'm guessing that if you remove that advanced reservation the pending jobs will run. That would just be a workaround until I can track down what must be a bug in Slurm.
Comment 17 Ryan Cox 2014-06-25 07:27:15 MDT
After deleting the license reservation, I saw some different behavior.

ryancox@m7int02:~$ scontrol show reservation
ReservationName=binutils_test StartTime=2014-06-24T16:26:08 EndTime=2015-06-24T16:26:08 Duration=365-00:00:00
   Nodes=m6-30-16 NodeCnt=1 CoreCnt=12 Features=(null) PartitionName=dol4 Flags=MAINT,SPEC_NODES
   Users=lbrown,levijm Accounts=(null) Licenses=(null) State=ACTIVE

ryancox@m7int02:~$ squeue_with_priority |grep 2015
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2015-06-24T16:26:09 (Resources)
ryancox@m7int02:~$ 


I waited until after I was sure a few complete backfill iterations ran.  It looks like it could be a reservation issue in general, not just licenses.
Comment 18 Ryan Cox 2014-06-25 09:02:27 MDT
I just noticed that the debug info in the logs (backfill flag) seems to disagree with the squeue and scontrol show job output when the 2015 value shows up.

root@sched1:~# grep 2015 /var/log/slurm/slurmctld.log |grep start |grep -v ' reservation'
root@sched1:~# grep 3826001 /var/log/slurm/slurmctld.log |grep start
[2014-06-25T12:53:31.238] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T12:54:43.250] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T12:55:55.229] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T12:57:07.241] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T12:58:19.231] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T13:50:08.485] Job 3826001 to start at 2014-06-30T18:15:00, end at 2014-07-01T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T13:55:25.246] Job 3826001 to start at 2014-06-29T00:53:33, end at 2014-06-30T02:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T13:59:40.224] Job 3826001 to start at 2014-06-29T00:53:33, end at 2014-06-30T02:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:36:42.478] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:37:51.465] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:39:00.462] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:40:09.465] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:41:18.453] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:42:27.449] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:43:36.450] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:44:45.452] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:45:54.455] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:47:03.452] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:48:12.510] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:49:21.465] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:50:30.451] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:51:39.452] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:52:48.445] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:53:57.443] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:55:06.442] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:56:15.481] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]


I haven't had the backfill debug enabled the whole time but I know that the 2015 value has happened at some of the points when the debug info was enabled.
Comment 19 Moe Jette 2014-06-25 09:11:15 MDT
(In reply to Ryan Cox from comment #18)
> I just noticed that the debug info in the logs (backfill flag) seems to
> disagree with the squeue and scontrol show job output when the 2015 value
> shows up.
> 
> root@sched1:~# grep 2015 /var/log/slurm/slurmctld.log |grep start |grep -v '
> reservation'
> root@sched1:~# grep 3826001 /var/log/slurm/slurmctld.log |grep start
> [2014-06-25T12:53:31.238] Job 3826001 to start at 2014-06-28T20:53:06, end
> at 2014-06-29T22:45:00 on
> m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],
> m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
> 
> I haven't had the backfill debug enabled the whole time but I know that the
> 2015 value has happened at some of the points when the debug info was
> enabled.

If the job requires resources that a reservation does not free until outside of the time frame considered by the backfill scheduler, it just sets the job's start time to then and moves to the next job without determining where the job might run or generating this message. I'll correct that oversight to improve our view into what is happening.

So far, I have not been able to reproduce anything like what you see unless the pending job requires resources (licenses or nodes) that are in the reservation. I'll keep looking...
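To illustrate the behavior described above, here is a minimal sketch in plain C with hypothetical type and field names (this is not Slurm source code): when a reservation that blocks the job does not release its resources until after the backfill window, the job's estimate is simply set to the reservation's end time and the scheduler moves on without testing node placement or logging a reason.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct job_rec  { int job_id; time_t est_start; };
struct resv_rec { time_t end_time; bool blocks_job; };  /* blocks_job: resv holds resources the job needs */

/* Returns true if the job is deferred past the backfill window and skipped. */
static bool defer_past_window(struct job_rec *job, const struct resv_rec *resv,
                              time_t now, time_t window_secs)
{
        if (resv->blocks_job && (resv->end_time > now + window_secs)) {
                job->est_start = resv->end_time;  /* estimate jumps to the reservation's end */
                return true;                      /* move on; no placement test, no log entry */
        }
        return false;                             /* fall through to the normal placement test */
}

int main(void)
{
        time_t now = time(NULL);
        /* e.g. a license reservation ending about a year from now */
        struct resv_rec licenseres = { now + 365 * 24 * 3600, true };
        struct job_rec  job = { 3826001, 0 };

        if (defer_past_window(&job, &licenseres, now, 20160 * 60))  /* bf_window=20160 minutes */
                printf("job %d deferred to %s", job.job_id, ctime(&job.est_start));
        return 0;
}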
Comment 20 Ryan Cox 2014-06-25 09:33:20 MDT
(In reply to Moe Jette from comment #19)
> (In reply to Ryan Cox from comment #18)
> > I just noticed that the debug info in the logs (backfill flag) seems to
> > disagree with the squeue and scontrol show job output when the 2015 value
> > shows up.
> > 
> > root@sched1:~# grep 2015 /var/log/slurm/slurmctld.log |grep start |grep -v '
> > reservation'
> > root@sched1:~# grep 3826001 /var/log/slurm/slurmctld.log |grep start
> > [2014-06-25T12:53:31.238] Job 3826001 to start at 2014-06-28T20:53:06, end
> > at 2014-06-29T22:45:00 on
> > m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],
> > m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
> > 
> > I haven't had the backfill debug enabled the whole time but I know that the
> > 2015 value has happened at some of the points when the debug info was
> > enabled.
> 
> If the job requires resources that a reservation does not free until outside
> of the time frame considered by the backfill scheduler, it just sets the
> jobs start time to then and moves to the next job without determining where
> the job might run or generating this message. I'll correct that oversight to
> improve our view into what is happening.
> 

That makes more sense, though the weird part is why the reservations seem to affect it at all since they aren't on the same nodes.

Would it help to add some extra code temporarily to the backfill code that prints out extra info for just that job number?  I'm okay trying it out temporarily.  I don't know backfill well enough to attempt it or even know what might help, though I can now do so for the prioritization code if it is called for :)

> So far, I have not been able to reproduce anything like what you see unless
> the pending job requires resources (licenses or nodes) that are in the
> reservation. I'll keep looking...

That's not surprising.  It doesn't seem to be consistent so I can't always reproduce it on demand either.  And it toggles, which is the weird part.  I've tried correlating it with other events but haven't noticed anything so far.
Comment 21 Moe Jette 2014-06-25 09:41:50 MDT
(In reply to Ryan Cox from comment #20)
> Would it help to add some extra code temporarily to the backfill code that
> prints out extra info for just that job number?  I'm okay trying it out
> temporarily. 

Absolutely if you're up for that. Just to make sure that I send you a clean patch, are you using 14.03.4?
Comment 22 Ryan Cox 2014-06-25 09:45:03 MDT
(In reply to Moe Jette from comment #21)
> (In reply to Ryan Cox from comment #20)
> > Would it help to add some extra code temporarily to the backfill code that
> > prints out extra info for just that job number?  I'm okay trying it out
> > temporarily. 
> 
> Absolutely if you're up for that. Just to make sure that I send you a clean
> patch, are you using 14.03.4?

14.03.4-2, except we've applied our LEVEL_BASED patches on top of it.  That shouldn't touch any of the same code and I could resolve conflicts if it did.
Comment 23 Moe Jette 2014-06-25 10:33:17 MDT
In the course of building and testing a patch for you, I discovered an anomaly that could be the root problem. I'm still studying the code, but have the problem area fairly well contained now.
Comment 24 Moe Jette 2014-06-25 11:29:33 MDT
Created attachment 992 [details]
Fix for bad start time

This should fix the problem with the bogus start times. Under some conditions it would set a job's estimated start time to the end time of the reservation that ends next, even if that is a year from now or has nothing to do with the resources that job needs. It should apply cleanly to your code base and is just a few lines.

I'll also be including this in v14.03.5.
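For readers following along, a rough sketch of the kind of guard being described, with hypothetical names; this is not the actual patch (see the attached diff and the commit linked below for the real change). The idea is to use a reservation's end time as the start estimate only when that reservation actually overlaps the resources the job needs.

#include <stdbool.h>
#include <time.h>

struct resv_rec { time_t end_time; bool overlaps_job; };

/* Pick a start-time estimate for a job that failed the backfill test. */
static time_t estimate_start(time_t normal_estimate, const struct resv_rec *next_resv)
{
        /* Buggy behavior: always return next_resv->end_time, even when that
         * reservation (e.g. a license-only reservation ending in 2015) has
         * nothing to do with the job's resources. */
        if (next_resv && next_resv->overlaps_job)
                return next_resv->end_time;  /* the job really must wait for this reservation */
        return normal_estimate;              /* otherwise keep the scheduler's own estimate */
}

int main(void)
{
        time_t now = time(NULL);
        struct resv_rec licenseres = { now + 365 * 24 * 3600, false };  /* unrelated to the job */
        /* With the guard, an unrelated reservation no longer pushes the estimate out a year. */
        return (estimate_start(now + 3600, &licenseres) == now + 3600) ? 0 : 1;
}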
Comment 25 Moe Jette 2014-06-25 11:37:30 MDT
This is the commit with my patch:

https://github.com/SchedMD/slurm/commit/a7054a23d0b6c0ada90c0b1b238c36a7c778dc2e
Comment 26 Ryan Cox 2014-06-26 02:41:25 MDT
I applied the patch.  The start times are now + bf_window (20160 minutes) which looks like the intended behavior when unable to schedule a start time.  There is currently one user who has jobs with a start time much less than that but the rest are now + bf_window.
Comment 27 Moe Jette 2014-06-26 02:44:32 MDT
(In reply to Ryan Cox from comment #26)
> I applied the patch.  The start times are now + bf_window (20160 minutes)
> which looks like the intended behavior when unable to schedule a start time.
> There is currently one user who has jobs with a start time much less than
> that but the rest are now + bf_window.

Great.

Was there some problem with job 3826001 not being scheduled, or does it just keep getting deferred because higher priority jobs keep slipping in ahead of it?
Comment 28 Ryan Cox 2014-06-26 03:31:39 MDT
> Was there some problem with job 3826001 not being scheduled or does it just
> keep getting deferred because higher priority jobs slipping in ahead of it?

Yes.  There are some higher priority jobs that have started in front of it, but it appears that it will never start.

Here is the current top of the queue:
ryancox@m7int02:~$ squeue_with_priority |head
  JOBID  PRIORITY PARTITION     NAME         USER    STATE       TIME   TIMELIMIT    NODES  CPUS          START_TIME NODELIST(REASON)
3862044      1457        m7 fullannu      mwpswim  PENDING       0:00     4:00:00       16   512 2014-07-10T08:46:36 (Resources)
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2014-07-10T08:46:36 (Resources)
3688141       716        m7 NRTN28O2      bemily2  PENDING       0:00  6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
3688142       716        m7 NRTN28O2      bemily2  PENDING       0:00  6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
3688143       716        m7 NRTN28O2      bemily2  PENDING       0:00  6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
3688144       716        m7 NRTN28O2      bemily2  PENDING       0:00  6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
3688145       716        m7 NRTN28O2      bemily2  PENDING       0:00  6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
3688146       716        m7 NRTN28O2      bemily2  PENDING       0:00  6-16:30:00        1    16 2014-07-10T08:46:36 (AssociationResourceLimit)
3688147       716        m7 NRTN28O2      bemily2  PENDING       0:00  6-16:30:00        1    16 2014-07-10T08:46:36 (AssociationResourceLimit)


mwpswim is in the same account as rcpacker and has higher priority currently.  However, mwpswim's job can run on any m7 node and rcpacker can only run on the m7 nodes with 64 GB of memory.

Every time the backfill scheduler runs, the start times for most jobs get pushed back by one minute.  It looks like none of the jobs that require m7 ever get an estimated start time, and they also seem not to reserve any nodes.  I would think that an estimated start time that is greater than the maximum time limit would always result in backfilled jobs starting when they shouldn't.

Maybe I'm misunderstanding something, but if the maximum time limit is 7 days and the bf_window is 14 days (we may want to lower bf_window at some point), shouldn't that mean that the highest priority job will be guaranteed to start within 7 days no matter what, since all running jobs will be done by then?  Currently it looks like they will never start because other jobs backfill in front of them.  Note that glh43 and I just had jobs with lower priorities than the top job start on nodes that the top priority job can run on.  Mine specifies the max time limit while glh43's is 3 days.  Maybe glh43's won't impact mwpswim's job's start time, but I would think that mine definitely does.

ryancox@m7int01:~$ squeue -o "%.18i %.5Q %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" -t r -S -S -p m7 |head -15
             JOBID PRIOR PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
       3823877_195     1        m7     xjob    glh43  RUNNING       1:54 3-00:00:00      1 m7-4-7
           3877338  1000        m7   sbatch  ryancox  RUNNING       3:54 7-00:00:00      1 m7-5-5
       3823877_192     1        m7     xjob    glh43  RUNNING      18:54 3-00:00:00      1 m7-6-14
       3823877_191     1        m7     xjob    glh43  RUNNING      31:54 3-00:00:00      1 m7-3-5
       3823877_190     1        m7     xjob    glh43  RUNNING      35:54 3-00:00:00      1 m7-3-3
           3877319   411        m7 04_Intr_ stevenk3  RUNNING      41:54 2-00:00:00      1 m7-13-8
           3877320   411        m7 04_Intr_ stevenk3  RUNNING      41:54 2-00:00:00      1 m7-16-3
           3877321   411        m7 04_Intr_ stevenk3  RUNNING      41:54 2-00:00:00      1 m7-16-6
       3823877_189     1        m7     xjob    glh43  RUNNING      57:54 3-00:00:00      1 m7-6-2
           3862018  1104        m7 scd4-042 lundgrer  RUNNING    2:58:09 4-00:00:00     16 m7-1-11,m7-2-[4,7],m7-3-14,m7-4-[2,4,10,13],m7-5-[2,4],m7-7-[7,11],m7-8-[7,11],m7-10-[11,16]
           3688137   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30 6-16:30:00      1 m7-13-6
           3688138   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30 6-16:30:00      1 m7-14-16
           3688139   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30 6-16:30:00      1 m7-15-9
           3688140   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30 6-16:30:00      1 m7-18-7


With backfill debug enabled, the following shows that some jobs get start times but not the top two jobs:
root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log 
126
root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log |grep 3862044
root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log |grep 3826001
root@sched1:~# 

I've been running this for a while. The newer job by mwpswim has the same start time as rcpacker every time too.
ryancox@m7int02:~$ while true; do squeue -ho "%.7i %.9Q %.9P %.8j %.12u %.8T %.10M %.11l %.8D %.5C %.19S %R" -S "-p,-S" --state=pending -u rcpacker; sleep 1m; done
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2014-07-10T08:47:36 (Resources)
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2014-07-10T08:48:36 (Resources)
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2014-07-10T08:49:36 (Resources)
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2014-07-10T08:50:36 (Resources)
.......cut for brevity..... the times increased by one minute every minute ....
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2014-07-10T09:24:36 (Resources)
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2014-07-10T09:25:36 (Resources)
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2014-07-10T09:26:36 (Resources)
3826001      1445        m7     test     rcpacker  PENDING       0:00  1-02:00:00       35   560 2014-07-10T09:27:36 (Resources)
Comment 29 Moe Jette 2014-06-26 04:18:47 MDT
Responses inline - Moe

(In reply to Ryan Cox from comment #28)
> > Was there some problem with job 3826001 not being scheduled or does it just
> > keep getting deferred because higher priority jobs slipping in ahead of it?
> 
> Yes.  There are some higher priority jobs that have started in front of it,
> but it appears that it will never start.
> 
> Here is the current top of the queue:
> ryancox@m7int02:~$ squeue_with_priority |head
>   JOBID  PRIORITY PARTITION     NAME         USER    STATE       TIME  
> TIMELIMIT    NODES  CPUS          START_TIME NODELIST(REASON)
> 3862044      1457        m7 fullannu      mwpswim  PENDING       0:00    
> 4:00:00       16   512 2014-07-10T08:46:36 (Resources)
> 3826001      1445        m7     test     rcpacker  PENDING       0:00 
> 1-02:00:00       35   560 2014-07-10T08:46:36 (Resources)
> 3688141       716        m7 NRTN28O2      bemily2  PENDING       0:00 
> 6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
> 3688142       716        m7 NRTN28O2      bemily2  PENDING       0:00 
> 6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
> 3688143       716        m7 NRTN28O2      bemily2  PENDING       0:00 
> 6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
> 3688144       716        m7 NRTN28O2      bemily2  PENDING       0:00 
> 6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
> 3688145       716        m7 NRTN28O2      bemily2  PENDING       0:00 
> 6-16:30:00        1    16 2014-07-10T08:46:36 (Resources)
> 3688146       716        m7 NRTN28O2      bemily2  PENDING       0:00 
> 6-16:30:00        1    16 2014-07-10T08:46:36 (AssociationResourceLimit)
> 3688147       716        m7 NRTN28O2      bemily2  PENDING       0:00 
> 6-16:30:00        1    16 2014-07-10T08:46:36 (AssociationResourceLimit)
> 
> 
> mwpswim is in the same account as rcpacker and has higher priority
> currently.  However, mwpswim's job can run on any m7 node and rcpacker can
> only run on the m7 nodes with 64 GB of memory.

Since mwpswim's job has a higher priority than rcpacker's job, the backfill scheduler will reserve resources to start that job as soon as possible. Assuming there is nothing in mwpswim's job to prevent it from using the 64 GB nodes, they may well get reserved even if doing so delays the initiation of a lower priority job (i.e. rcpacker's job).

I'm not sure if this applies in your case, but the job's priority reported by squeue is only one factor in the sort order of jobs for the backfill scheduler. The jobs are sorted as follows (a sketch of this ordering appears after the list):
1. If job X can preempt job Y, then job X gets preference
2. If job X has ANY reservation and job Y has none, then job X gets preference
3. Priority of the job's partition (jobs submitted to multiple partitions are sorted as a separate job for each partition)
4. Job priority
5. Job ID
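
A compact sketch of that ordering written as a qsort() comparator; the struct and field names here are hypothetical illustrations, not Slurm source.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct bf_job_rec {
        bool     can_preempt;  /* rule 1: job can preempt another job */
        bool     has_resv;     /* rule 2: job has any reservation */
        uint32_t part_prio;    /* rule 3: priority of the job's partition */
        uint32_t job_prio;     /* rule 4: job priority */
        uint32_t job_id;       /* rule 5: lower job id first */
};

static int bf_job_cmp(const void *a, const void *b)
{
        const struct bf_job_rec *x = a, *y = b;

        if (x->can_preempt != y->can_preempt)
                return y->can_preempt ? 1 : -1;            /* preemptors first */
        if (x->has_resv != y->has_resv)
                return y->has_resv ? 1 : -1;               /* jobs holding a reservation next */
        if (x->part_prio != y->part_prio)
                return (y->part_prio > x->part_prio) ? 1 : -1;
        if (x->job_prio != y->job_prio)
                return (y->job_prio > x->job_prio) ? 1 : -1;
        return (x->job_id > y->job_id) - (x->job_id < y->job_id);
}

/* usage: qsort(jobs, njobs, sizeof(struct bf_job_rec), bf_job_cmp);
 * a job submitted to multiple partitions would appear once per partition. */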


> Every time the backfill scheduler runs the start times for most jobs gets
> pushed back by one minute.  It looks like none of the jobs that require m7
> ever get an estimated start time and they also seem not to reserve any
> nodes.  I would think that an estimated start time that is greater than the
> maximum time limit would always result in backfilled jobs starting when they
> shouldn't.
> 
> Maybe I'm misunderstanding something, but if the maximum time limit is 7
> days and the bf_window is 14 days (we may want to lower bf_window at some
> point), shouldn't that mean that the highest priority job will be guaranteed
> to start in 7 days no matter what since all running jobs will be done by
> then?

Assuming you define priority as I describe above rather than just the job's "priority" field, that is correct, with the following exceptions: The job is blocked by
1. an association or QOS limit
2. an advanced reservation
3. required Down nodes
4. job dependencies
5. licenses

Are there any lower "priority" jobs with access to m7 with earlier start times?
I would check them with respect to the job prioritization rules and see if that is messing things up.

I'm considering adding a new DebugFlag that can output a list of schedulable jobs in priority order. I'm thinking that may be really helpful, even if verbose.


> Currently it is looking like they will never start because other jobs
> backfill in front of it. Note that glh43 and I just had jobs with lower
> priorities than the top job start on nodes that the top priority job can run
> on.  Mine specifies the max time limit while glh43's is 3 days.  Maybe
> glh43's won't impact mwpswim's job's start time, but I would think that mine
> definitely does.
>
> ryancox@m7int01:~$ squeue -o "%.18i %.5Q %.9P %.8j %.8u %.8T %.10M %.9l %.6D
> %R" -t r -S -S -p m7 |head -15
>              JOBID PRIOR PARTITION     NAME     USER    STATE       TIME
> TIMELIMIT  NODES NODELIST(REASON)
>        3823877_195     1        m7     xjob    glh43  RUNNING       1:54
> 3-00:00:00      1 m7-4-7
>            3877338  1000        m7   sbatch  ryancox  RUNNING       3:54
> 7-00:00:00      1 m7-5-5
>        3823877_192     1        m7     xjob    glh43  RUNNING      18:54
> 3-00:00:00      1 m7-6-14
>        3823877_191     1        m7     xjob    glh43  RUNNING      31:54
> 3-00:00:00      1 m7-3-5
>        3823877_190     1        m7     xjob    glh43  RUNNING      35:54
> 3-00:00:00      1 m7-3-3
>            3877319   411        m7 04_Intr_ stevenk3  RUNNING      41:54
> 2-00:00:00      1 m7-13-8
>            3877320   411        m7 04_Intr_ stevenk3  RUNNING      41:54
> 2-00:00:00      1 m7-16-3
>            3877321   411        m7 04_Intr_ stevenk3  RUNNING      41:54
> 2-00:00:00      1 m7-16-6
>        3823877_189     1        m7     xjob    glh43  RUNNING      57:54
> 3-00:00:00      1 m7-6-2
>            3862018  1104        m7 scd4-042 lundgrer  RUNNING    2:58:09
> 4-00:00:00     16
> m7-1-11,m7-2-[4,7],m7-3-14,m7-4-[2,4,10,13],m7-5-[2,4],m7-7-[7,11],m7-8-[7,
> 11],m7-10-[11,16]
>            3688137   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30
> 6-16:30:00      1 m7-13-6
>            3688138   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30
> 6-16:30:00      1 m7-14-16
>            3688139   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30
> 6-16:30:00      1 m7-15-9
>            3688140   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30
> 6-16:30:00      1 m7-18-7
> 
> 
> With backfill debug enabled, the following shows that some jobs get start
> times but not the top two jobs:
> root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log 
> 126
> root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log |grep 3862044
> root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log |grep 3826001
> root@sched1:~# 
> 
> I've been running this for a while. The newer job by mwpswim has the same
> start time as rcpacker every time too.
> ryancox@m7int02:~$ while true; do squeue -ho "%.7i %.9Q %.9P %.8j %.12u %.8T
> %.10M %.11l %.8D %.5C %.19S %R" -S "-p,-S" --state=pending -u rcpacker;
> sleep 1m; done
> 3826001      1445        m7     test     rcpacker  PENDING       0:00 
> 1-02:00:00       35   560 2014-07-10T08:47:36 (Resources)
> 3826001      1445        m7     test     rcpacker  PENDING       0:00 
> 1-02:00:00       35   560 2014-07-10T08:48:36 (Resources)
> 3826001      1445        m7     test     rcpacker  PENDING       0:00 
> 1-02:00:00       35   560 2014-07-10T08:49:36 (Resources)
> 3826001      1445        m7     test     rcpacker  PENDING       0:00 
> 1-02:00:00       35   560 2014-07-10T08:50:36 (Resources)
> .......cut for brevity..... the times increased by one minute every minute
> ....
> 3826001      1445        m7     test     rcpacker  PENDING       0:00 
> 1-02:00:00       35   560 2014-07-10T09:24:36 (Resources)
> 3826001      1445        m7     test     rcpacker  PENDING       0:00 
> 1-02:00:00       35   560 2014-07-10T09:25:36 (Resources)
> 3826001      1445        m7     test     rcpacker  PENDING       0:00 
> 1-02:00:00       35   560 2014-07-10T09:26:36 (Resources)
> 3826001      1445        m7     test     rcpacker  PENDING       0:00 
> 1-02:00:00       35   560 2014-07-10T09:27:36 (Resources)
Comment 30 Moe Jette 2014-06-26 04:42:19 MDT
Created attachment 998 [details]
Log job priority and partitions

You might find this small patch helpful to diagnose the problem. With DebugFlags=backfill, it logs the job priorities and partitions being considered rather than just the job id. Here is a sample:

slurmctld: backfill: beginning
slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol4
slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol4
slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol4
slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol3
slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol3
slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol3
slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol2
slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol2
slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol2
slurmctld: backfill: reached end of job queue

This change will be in version 14.03.5. Note that the map of resources available to the backfill scheduler through time is now only available by using DebugFlags=BackfillMap.
Comment 31 Ryan Cox 2014-06-26 07:23:53 MDT
Created attachment 999 [details]
start times for rcpacker job 3826001

I applied the patch and have looked at the logs with Backfill and BackfillMap.  It's getting the right partition, m7.

Several people submitted jobs that got higher priorities than rcpacker (mwpswim cancelled his).  Oddly enough, once there were more jobs in front of rcpacker, he got a saner start time.  I boosted his priority *higher* than everyone else and now his start time is now + bf_window again.

Here is some website output we have that shows the order of users in the queue where Wait=MAX(last job to start, submit time of oldest pending job) in H:MM.  Start is projected start time in H:MM from now.  m6==m6beta, m7==any m7, M7==m7 64GB.

User        Jobs   Procs   Sponsor   Parts    Wait    Start
rcpacker    1      560     sgorrell   M7       143:43  335:59
rroth       4      400     ssc7       m6       0:01    1:49
danilob     1      6       drw32      m6 m7    0:09    1:21
plf1        4814   4814    ekr8       m6 m7    0:00    1:49
marktwe     1      16      jsk24      M7       0:07    21:06
masaki      1      1       mjc22      m6f      0:07     
jhogge10    3      12      db6        m6 m7    0:06    21:06
bemily2     693    11088   jchansen   M7       0:09    237:06
samparas    12     144     dhe        m6 m7    0:11    21:36
fslcollab8  1921   23052   glh43      m6 m7    0:09    21:36
glh43       403    4836    glh43      m6 m7    2:44    21:36
ryancox     1      16      staff      M7       2:43    335:59

I attached the projected start times for rcpacker's job, snapshotted every minute.  Note that there were a few minutes when the time was 2014-06-29 or 2014-06-30.  Quite a few other jobs got in front of him then.  After a while some of them started or were cancelled.  Then his start time went back to 14 days from now.  I boosted his priority part way through and now it's 14 days again consistently.

I can't think of a reason why having more jobs in front of him would cause him to get a sooner start time.  That seems very backwards but I'm struggling to come up with a good, consistent test case.
Comment 32 Ryan Cox 2014-06-26 07:24:42 MDT
I should add that masaki was hitting an association limit.
Comment 33 Moe Jette 2014-06-26 07:38:31 MDT
(In reply to Moe Jette from comment #30)
> slurmctld: backfill: beginning
> slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol4
> slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol4
> slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol4
> slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol3
> slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol3
> slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol3
> slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol2
> slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol2
> slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol2
> slurmctld: backfill: reached end of job queue

Would it be possible to get the slurmctld log entries for a backfill run?  I'm particularly interested in seeing what might be in partition m7 ahead of rcpacker's jobs (i.e. some issue with prioritizing jobs).

The only other thing that I can think of is that some jobs submitted to m7 plus another partition created a reservation on the m7 nodes, but I have not seen a problem of that sort.
Comment 34 Ryan Cox 2014-06-26 07:42:30 MDT
(In reply to Moe Jette from comment #33)
> (In reply to Moe Jette from comment #30)
> > slurmctld: backfill: beginning
> > slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol4
> > slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol4
> > slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol4
> > slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol3
> > slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol3
> > slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol3
> > slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol2
> > slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol2
> > slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol2
> > slurmctld: backfill: reached end of job queue
> 
> Would it be possible to get the slurmctld log entries for a backfill run.
> I'm particularly interested in seeing what might be in partition m7 ahead of
> rcpacker's jobs (i.e. some issue with prioritizing jobs)?
> 

Sure, I'll get that to you in a few minutes.

> The only other thing that I can think of is that some jobs submitted to m7
> plus another partition created a reservation on the m7 nodes, but I have not
> seen a problem of that sort.

Now that I think about it, I do recall seeing some jobs start on m7 nodes while being listed in the m6beta partition, and maybe vice versa.  I forgot about that.

I just checked on that, and wow:
ryancox@m7int01:~$ squeue -l -w m7-[1-20]-[1-16] -p m6beta
Thu Jun 26 13:40:28 2014
             JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
           3828180    m6beta TRSTS22_ jtfuller  RUNNING 2-22:07:44 3-00:00:00      1 m7-17-15
           3845637    m6beta ts1-3fts  yzhang8  RUNNING 1-16:47:05 6-00:00:00      1 m7-7-3
       3563150_614    m6beta     xjob fslcolla  RUNNING 3-08:42:01 7-00:00:00      1 m7-9-4
       3563150_613    m6beta     xjob fslcolla  RUNNING 3-08:46:04 7-00:00:00      1 m7-9-15
       3563150_612    m6beta     xjob fslcolla  RUNNING 3-09:08:26 7-00:00:00      1 m7-8-5
       3563150_611    m6beta     xjob fslcolla  RUNNING 3-09:36:54 7-00:00:00      1 m7-12-13
       3563150_608    m6beta     xjob fslcolla  RUNNING 3-10:44:54 7-00:00:00      1 m7-9-2
       3563150_602    m6beta     xjob fslcolla  RUNNING 3-16:16:09 7-00:00:00      1 m7-9-10
       3563150_601    m6beta     xjob fslcolla  RUNNING 3-17:16:45 7-00:00:00      1 m7-8-1
       3563150_600    m6beta     xjob fslcolla  RUNNING 3-21:19:07 7-00:00:00      1 m7-4-5
       3563150_597    m6beta     xjob fslcolla  RUNNING 3-21:41:21 7-00:00:00      1 m7-2-1
       3563150_595    m6beta     xjob fslcolla  RUNNING 4-06:48:22 7-00:00:00      1 m7-4-15
       3563150_587    m6beta     xjob fslcolla  RUNNING 4-07:12:59 7-00:00:00      1 m7-10-2
       3563150_590    m6beta     xjob fslcolla  RUNNING 4-07:12:59 7-00:00:00      1 m7-6-12
       3563150_593    m6beta     xjob fslcolla  RUNNING 4-07:12:59 7-00:00:00      1 m7-6-16
       3563150_573    m6beta     xjob fslcolla  RUNNING 4-07:14:00 7-00:00:00      1 m7-4-3
       3563150_574    m6beta     xjob fslcolla  RUNNING 4-07:14:00 7-00:00:00      1 m7-4-6
       3563150_577    m6beta     xjob fslcolla  RUNNING 4-07:14:00 7-00:00:00      1 m7-4-12
       3563150_578    m6beta     xjob fslcolla  RUNNING 4-07:14:00 7-00:00:00      1 m7-4-14
       3563150_580    m6beta     xjob fslcolla  RUNNING 4-07:14:00 7-00:00:00      1 m7-5-1
       3563150_583    m6beta     xjob fslcolla  RUNNING 4-07:14:00 7-00:00:00      1 m7-5-8
       3563150_584    m6beta     xjob fslcolla  RUNNING 4-07:14:00 7-00:00:00      1 m7-5-10
       3563150_585    m6beta     xjob fslcolla  RUNNING 4-07:14:00 7-00:00:00      1 m7-5-11
       3563150_558    m6beta     xjob fslcolla  RUNNING 4-07:15:01 7-00:00:00      1 m7-7-2
       3563150_560    m6beta     xjob fslcolla  RUNNING 4-07:15:01 7-00:00:00      1 m7-7-8
       3563150_563    m6beta     xjob fslcolla  RUNNING 4-07:15:01 7-00:00:00      1 m7-7-15
       3563150_567    m6beta     xjob fslcolla  RUNNING 4-07:15:01 7-00:00:00      1 m7-10-5
       3563150_553    m6beta     xjob fslcolla  RUNNING 5-20:52:59 7-00:00:00      1 m7-6-4
       3563150_548    m6beta     xjob fslcolla  RUNNING 5-22:20:32 7-00:00:00      1 m7-3-7
       3563150_547    m6beta     xjob fslcolla  RUNNING 5-23:09:54 7-00:00:00      1 m7-4-9
       3563150_546    m6beta     xjob fslcolla  RUNNING 5-23:19:46 7-00:00:00      1 m7-7-14
       3563150_542    m6beta     xjob fslcolla  RUNNING 6-05:44:35 7-00:00:00      1 m7-6-8
       3563150_539    m6beta     xjob fslcolla  RUNNING 6-07:11:03 7-00:00:00      1 m7-11-9
       3563150_538    m6beta     xjob fslcolla  RUNNING 6-07:16:38 7-00:00:00      1 m7-2-6
       3563150_534    m6beta     xjob fslcolla  RUNNING 6-08:24:31 7-00:00:00      1 m7-5-7
       3563150_532    m6beta     xjob fslcolla  RUNNING 6-09:26:58 7-00:00:00      1 m7-8-9
       3563150_531    m6beta     xjob fslcolla  RUNNING 6-09:45:55 7-00:00:00      1 m7-4-11
       3685461_279    m6beta     xjob    glh43  RUNNING 2-08:54:49 3-00:00:00      1 m7-12-2
       3685461_278    m6beta     xjob    glh43  RUNNING 2-08:56:49 3-00:00:00      1 m7-6-9
       3685461_269    m6beta     xjob    glh43  RUNNING 2-09:38:52 3-00:00:00      1 m7-9-16
       3685461_268    m6beta     xjob    glh43  RUNNING 2-09:43:52 3-00:00:00      1 m7-9-13
       3685461_262    m6beta     xjob    glh43  RUNNING 2-09:57:52 3-00:00:00      1 m7-11-7
       3685461_177    m6beta     xjob    glh43  RUNNING 2-16:52:59 3-00:00:00      1 m7-12-12
       3563150_620    m6beta     xjob fslcolla  RUNNING 3-03:57:51 7-00:00:00      1 m7-11-2
       3823877_176    m6beta     xjob    glh43  RUNNING 1-05:02:54 3-00:00:00      1 m7-12-10
       3823877_175    m6beta     xjob    glh43  RUNNING 1-05:16:33 3-00:00:00      1 m7-12-9
       3823877_170    m6beta     xjob    glh43  RUNNING 1-05:39:37 3-00:00:00      1 m7-2-13
       3823877_166    m6beta     xjob    glh43  RUNNING 1-05:56:22 3-00:00:00      1 m7-2-9
       3823877_156    m6beta     xjob    glh43  RUNNING 1-07:11:24 3-00:00:00      1 m7-12-6
       3823877_157    m6beta     xjob    glh43  RUNNING 1-07:11:24 3-00:00:00      1 m7-2-8
       3823877_151    m6beta     xjob    glh43  RUNNING 1-07:13:33 3-00:00:00      1 m7-5-12
       3823877_139    m6beta     xjob    glh43  RUNNING 1-07:15:04 3-00:00:00      1 m7-12-3
       3823877_141    m6beta     xjob    glh43  RUNNING 1-07:15:04 3-00:00:00      1 m7-12-14
       3823877_144    m6beta     xjob    glh43  RUNNING 1-07:15:04 3-00:00:00      1 m7-5-14
       3823877_146    m6beta     xjob    glh43  RUNNING 1-07:15:04 3-00:00:00      1 m7-6-1
       3823877_148    m6beta     xjob    glh43  RUNNING 1-07:15:04 3-00:00:00      1 m7-6-11
           3877772    m6beta ox-H2SO4 samparas  RUNNING    1:22:57 1-21:00:00      1 m7-1-9
           3877751    m6beta ox-H2SO4 samparas  RUNNING    2:50:58 1-16:00:00      1 m7-13-8
           3877749    m6beta ox-H2SO4 samparas  RUNNING    2:54:57 1-16:00:00      1 m7-9-9
           3877750    m6beta ox-H2SO4 samparas  RUNNING    2:54:57 1-16:00:00      1 m7-16-3
           3877748    m6beta ox-H2SO4 samparas  RUNNING    2:56:28 1-16:00:00      1 m7-16-6
           3877379    m6beta TS1-from    deepa  RUNNING    3:48:52 2-04:00:00      1 m7-3-8
           3851728    m6beta CH_Activ ecahua97  RUNNING 1-03:03:30 4-04:00:00      1 m7-9-1
           3851732    m6beta CH_Activ ecahua97  RUNNING 1-03:03:30 4-04:00:00      1 m7-2-5
           3851733    m6beta CH_Activ ecahua97  RUNNING 1-03:03:30 4-04:00:00      1 m7-6-3
       3823877_196    m6beta     xjob    glh43  RUNNING    3:15:56 3-00:00:00      1 m7-5-5
       3823877_195    m6beta     xjob    glh43  RUNNING    4:18:52 3-00:00:00      1 m7-4-7
       3823877_192    m6beta     xjob    glh43  RUNNING    4:35:52 3-00:00:00      1 m7-6-14
       3823877_191    m6beta     xjob    glh43  RUNNING    4:48:52 3-00:00:00      1 m7-3-5
       3823877_190    m6beta     xjob    glh43  RUNNING    4:52:52 3-00:00:00      1 m7-3-3
       3823877_189    m6beta     xjob    glh43  RUNNING    5:14:52 3-00:00:00      1 m7-6-2
       3823877_187    m6beta     xjob    glh43  RUNNING    6:01:55 3-00:00:00      1 m7-5-13
       3563150_632    m6beta     xjob fslcolla  RUNNING 1-07:15:04 7-00:00:00      1 m7-9-6
       3563150_635    m6beta     xjob fslcolla  RUNNING 1-07:15:04 7-00:00:00      1 m7-1-4
       3563150_639    m6beta     xjob fslcolla  RUNNING 1-07:15:04 7-00:00:00      1 m7-1-13
       3685461_174    m6beta     xjob    glh43  RUNNING 2-23:10:34 3-00:00:00      1 m7-11-13
ryancox@m7int01:~$ squeue -l -w m6-[1-35]-[1-16] -p m7
Thu Jun 26 13:40:40 2014
             JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
ryancox@m7int01:~$ 


ryancox@m7int01:~$ cat /etc/slurm/partitions.conf
PartitionName=Default State=UP MaxTime=16-0 DefaultTime=30:00 shared=yes
PartitionName=dol4 priority=100 Default=yes Nodes=m5-21-[1-16] AllowGroups=dol4nodes,fslstaff
PartitionName=physics priority=90 Nodes=m6-33-[1-5],m6-33-[9-13],m6-34-[1-5],m6-34-[9-13] AllowGroups=rls5physicsnodes,dwn8physicsnodes,ewh5physicsnodes,glh43physicsnodes,mw22physicsnodes,fslstaff
PartitionName=dhe priority=80 Nodes=m6-35-[1-16] AllowGroups=dhenodes
PartitionName=m6beta priority=71 Nodes=m6-[1-2]-[2-16],m6-[3-32]-[1-16]
PartitionName=m6 priority=70
PartitionName=m7 priority=60 Nodes=m7-1-[2-16],m7-[2-12]-[1-16],m7-13-[2-16],m7-[14-20]-[1-16]
PartitionName=m5f priority=40 Nodes=m5f-1-[1-2] MaxTime=5-0
PartitionName=m6f priority=30 Nodes=m6f-1-[1-7] MaxTime=5-0
PartitionName=short priority=25 Nodes=m7-13-1,m6-1-1,m6-2-1,m7-1-1 MaxTime=1:00:00 MaxNodes=1
PartitionName=m5gpu priority=20 Nodes=m5gpu-1-[1-2]
PartitionName=m6gpu priority=10 Nodes=m6-33-8
PartitionName=blankpart priority=1
ryancox@m7int01:~$ cat /etc/slurm/nodes.conf
NodeName=default state=UNKNOWN
NodeName=m5-21-[1-16] sockets=2 corespersocket=4 RealMemory=24576 feature=ib,dol4,m5,intel,nehalem,cpu2800MHz,mem1333MHz,private,beta,sse4.1,sse4.2
NodeName=m6-33-[1-5],m6-33-[9-13],m6-34-[1-5],m6-34-[9-13] sockets=2 corespersocket=6 RealMemory=49152 feature=m6,physics,intel,westmere,cpu2667MHz,mem1333MHz,ib,beta,sse4.1,sse4.2
NodeName=m6-35-[1-16] sockets=2 corespersocket=6 RealMemory=24576 feature=m6,dhe,intel,westmere,private,cpu2667MHz,mem1067MHz,beta,sse4.1,sse4.2
NodeName=m6-[1-32]-[1-16] sockets=2 corespersocket=6 RealMemory=24576 feature=m6,intel,westmere,cpu2667MHz,mem1067MHz,beta,sse4.1,sse4.2
NodeName=m7-[1-12]-[1-16] sockets=2 corespersocket=8 weight=1 RealMemory=32768 feature=m7,ib,intel,sandybridge,cpu2200MHz,mem1600MHz,beta,sse4.1,sse4.2,avx
NodeName=m7-[13-20]-[1-16] sockets=2 corespersocket=8 weight=2 RealMemory=65536 feature=m7,ib,intel,sandybridge,cpu2200MHz,mem1600MHz,beta,sse4.1,sse4.2,avx
NodeName=m5f-1-[1-2] sockets=4 corespersocket=4 RealMemory=131072 feature=amd,barcelona,mem128,cpu2300MHz,mem667MHz,bigmem,nonbeta
NodeName=m6f-1-[1-7] sockets=4 corespersocket=8 RealMemory=262144 feature=intel,nehalemex,cpu1870MHz,mem978MHz,mem256,bigmem,beta,sse4.1,sse4.2
NodeName=m5gpu-1-1 sockets=2 corespersocket=4 gres=gpu:4 RealMemory=32768 feature=gpu,nvidia,t10,s1070,cuda,nondol4,intel,harpertown,cpu2667MHz,mem667MHz,nonbeta,sse4.1
NodeName=m5gpu-1-2 sockets=2 corespersocket=4 gres=gpu:4 RealMemory=32768 feature=gpu,nvidia,t10,s1070,cuda,nondol4,intel,harpertown,cpu2667MHz,mem667MHz,beta,sse4.1
NodeName=m6-33-8 sockets=2 corespersocket=6 gres=gpu RealMemory=24576 feature=nvidia,m2050,intel,westmere,cpu2800MHz,mem1333MHz,cuda,gpu,beta,sse4.1,sse4.2
Comment 35 Moe Jette 2014-06-26 07:45:42 MDT
(In reply to Ryan Cox from comment #34)
> Now that I think about it, I do recall seeing some jobs get started on m7
> nodes but be in the m6beta partition and maybe vice versa.  I forgot about
> that.
> 
> I just checked on that, and wow:
> ryancox@m7int01:~$ squeue -l -w m7-[1-20]-[1-16] -p m6beta
> Thu Jun 26 13:40:28 2014
>              JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT
> NODES NODELIST(REASON)
>            3828180    m6beta TRSTS22_ jtfuller  RUNNING 2-22:07:44
> 3-00:00:00      1 m7-17-15
>            3845637    m6beta ts1-3fts  yzhang8  RUNNING 1-16:47:05
> 6-00:00:00      1 m7-7-3
>        3563150_614    m6beta     xjob fslcolla  RUNNING 3-08:42:01

That's a great clue!
Could you run "scontrol show part" and confirm it matches your slurm.conf?
Comment 36 Ryan Cox 2014-06-26 07:53:22 MDT
Created attachment 1001 [details]
comparison of slurm.conf files and scontrol show part

As far as I can tell, they do match.  Unfortunately I didn't see any jobs that started after the most recent backfill patch that would show which partition the job was assigned to.

Here is what I did see for the most recent one. Nothing too useful as far as I can tell:
root@sched1:~# grep 3877772 /var/log/slurm/slurmctld.log | tail -30
[2014-06-26T12:08:53.008] backfill test for job 3877772
[2014-06-26T12:08:53.020] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:08:54.056] backfill test for job 3877772
[2014-06-26T12:09:54.005] backfill test for job 3877772
[2014-06-26T12:09:54.018] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:09:55.076] backfill test for job 3877772
[2014-06-26T12:10:55.005] backfill test for job 3877772
[2014-06-26T12:10:55.017] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:10:56.036] backfill test for job 3877772
[2014-06-26T12:11:56.005] backfill test for job 3877772
[2014-06-26T12:11:56.019] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:11:57.058] backfill test for job 3877772
[2014-06-26T12:12:57.005] backfill test for job 3877772
[2014-06-26T12:12:57.018] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:12:58.109] backfill test for job 3877772
[2014-06-26T12:13:58.005] backfill test for job 3877772
[2014-06-26T12:13:58.017] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:13:59.057] backfill test for job 3877772
[2014-06-26T12:14:59.005] backfill test for job 3877772
[2014-06-26T12:14:59.018] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:15:00.034] backfill test for job 3877772
[2014-06-26T12:16:00.005] backfill test for job 3877772
[2014-06-26T12:16:00.056] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:16:03.167] backfill test for job 3877772
[2014-06-26T12:17:03.051] backfill test for job 3877772
[2014-06-26T12:17:03.065] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:17:31.326] backfill test for job 3877772
[2014-06-26T12:17:31.327] backfill: Started JobId=3877772 on m7-1-9
[2014-06-26T12:23:08.746] Recovered job 3877772 597
[2014-06-26T12:31:18.328] Recovered job 3877772 597
Comment 37 Ryan Cox 2014-06-26 07:55:35 MDT
And I just did md5sums on our nodes.conf and partitions.conf files.  They are the same on the primary and backup ctld/dbd and on all compute nodes.
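(For anyone repeating this check, something along these lines works, assuming pdsh and dshbak are available; the node range is illustrative:)

md5sum /etc/slurm/nodes.conf /etc/slurm/partitions.conf
# compare against the compute nodes, collapsing identical output
pdsh -w m7-[1-20]-[1-16] md5sum /etc/slurm/nodes.conf /etc/slurm/partitions.conf | dshbak -c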
Comment 38 Ryan Cox 2014-06-26 08:02:54 MDT
Created attachment 1002 [details]
slurmctld.log.gz with backfill and backfillmap debug
Comment 39 Ryan Cox 2014-06-26 08:05:15 MDT
Created attachment 1003 [details]
scontrol show job

In case it's useful
Comment 40 Moe Jette 2014-06-26 08:21:42 MDT
(In reply to Ryan Cox from comment #39)
> Created attachment 1003 [details]
> scontrol show job
> 
> In case it's useful

Actually, it is. I was comparing your latest squeue output against the "scontrol show job" from yesterday and found a few anomalies. This one is particularly interesting.

From Wednesday:
JobId=3851728 Name=CH_Activation_trial3_18_F_TSLBS.com
   UserId=ecahua97(21355) GroupId=ecahua97(21587)
   Priority=317 Nice=0 Account=dhe QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:16:30 TimeLimit=4-04:00:00 TimeMin=N/A
   SubmitTime=2014-06-25T10:36:15 EligibleTime=2014-06-25T10:36:15
   StartTime=2014-06-25T10:36:58 EndTime=2014-06-29T14:36:58
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m7 AllocNode:Sid=m7int01:7866
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=m7-9-1
   BatchHost=m7-9-1

Today/Thursday:
JobId=3851728 Name=CH_Activation_trial3_18_F_TSLBS.com
   UserId=ecahua97(21355) GroupId=ecahua97(21587)
   Priority=317 Nice=0 Account=dhe QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=1-03:27:07 TimeLimit=4-04:00:00 TimeMin=N/A
   SubmitTime=2014-06-25T10:36:15 EligibleTime=2014-06-25T10:36:15
   StartTime=2014-06-25T10:36:58 EndTime=2014-06-29T14:36:58
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m6beta AllocNode:Sid=m7int01:7866
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=m7-9-1
   BatchHost=m7-9-1

"scontrol update job ..." can't even change a job's partition string, so something else is doing it. There aren't many places that modify the string, so hopefully I can track it down soon.
Comment 41 Moe Jette 2014-06-26 09:37:43 MDT
Found it!

The good news is that the scheduling appears to be correct. The output of squeue and scontrol show job simply displays the wrong partition in some situations. The problem occurs when a job is submitted to multiple partitions, the backfill scheduler starts it in a partition that is not first in the list, and then slurmctld is restarted. I should have a fix soon.
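A minimal reproduction sketch, assuming a test system where slurmctld can be restarted freely (the partition names below match this site's config; <jobid> is a placeholder):

# 1. Submit to two partitions, listing the one the job will NOT start in first.
sbatch -p m6beta,m7 --wrap="sleep 3600"
# 2. Arrange for m6beta to be busy so the backfill scheduler starts the job in
#    m7 instead, then confirm where it is running:
squeue -j <jobid> -o "%i %P %N"
# 3. Restart slurmctld (however that is done at your site), then check again:
scontrol show job <jobid> | grep Partition
# Before the fix, the Partition field reverts to the first partition in the
# submission list (m6beta) even though the job is running on m7 nodes.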
Comment 42 Ryan Cox 2014-06-26 09:42:34 MDT
(In reply to Moe Jette from comment #41)
> Found it!
> 
> The good news is that the scheduling appears to be correct. The output in
> squeue or scontrol show job is displaying the wrong partition under some
> situations. The problem is caused when a job is submitted to multiple
> partitions, the job is scheduled by the backfill scheduler in a partition
> that is Not first in the list, then the slurmctld is restarted. I should
> have a fix soon.

Excellent.  So the initial scheduling works, then the partition gets messed up.  Would that potentially affect backfill since some nodes would have jobs "from a different partition" on them?
Comment 43 Moe Jette 2014-06-26 09:44:51 MDT
(In reply to Ryan Cox from comment #42)
> Excellent.  So the initial scheduling works, then the partition gets messed
> up.  Would that potentially affect backfill since some nodes would have jobs
> "from a different partition" on them?

No, we are just not seeing all of the jobs actually associated with m7 that we should be.
Comment 44 Moe Jette 2014-06-26 10:16:28 MDT
Created attachment 1004 [details]
Correct job's partition info

This will correct the name of the partition associated with running jobs under some circumstances. It will not help jobs that are already running; they will continue to be reported as being associated with the incorrect partition name, but jobs started after this patch is applied will be reported in the correct partition.
Comment 45 Ryan Cox 2014-06-26 10:45:55 MDT
Thanks.  We are now running with the patch.  It looks like all the affected jobs will be done in 3-4 days.

Is there anything else I can send you for the backfill issue?  For now I'm creating a reservation to assign rcpacker's job to so it actually gets started.  If we need another test case I'll just create a job like rcpacker's.  Currently, everyone except for one user has start times that are 14 days out so reproducing it shouldn't be too hard.
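For the record, the commands involved look roughly like this (the reservation name, start time, duration, and node count below are made up, not the actual values I used):

scontrol create reservation reservationname=rcpacker_fix \
    starttime=2014-06-29T22:00:00 duration=2-00:00:00 \
    users=rcpacker partitionname=m7 nodecnt=16
scontrol update jobid=<rcpacker_jobid> reservation=rcpacker_fix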

It's too late in the day for me to really do anything else so I'll work on it in the morning.
Comment 46 Moe Jette 2014-06-26 11:39:34 MDT
(In reply to Ryan Cox from comment #45)
> Thanks.  We are now running with the patch.  It looks like all the affected
> jobs will be done in 3-4 days.

That should at least provide a clearer picture of what is happening.


> Is there anything else I can send you for the backfill issue? 

Not now, thanks.


> For now I'm
> creating a reservation to assign rcpacker's job to so it actually gets
> started. 

I've been hoping to solve the problem, but an advanced reservation should at least get rcpacker's job running.


> If we need another test case I'll just create a job like
> rcpacker's.  Currently, everyone except for one user has start times that
> are 14 days out so reproducing it shouldn't be too hard.

Do you have any feeling about the difference between jobs in m7 that are starting and those which are not? Memory size? Node count? Time limit?


> It's too late in the day for me to really do anything else so I'll work on
> it in the morning.
Comment 47 Ryan Cox 2014-06-27 04:17:03 MDT
> Do you have any feeling about the difference between jobs in m7 that are
> starting and those which are not? Memory size? Node count? Time limit?
> 

I think it's mostly single node jobs or jobs with small enough node counts that they happen to get a reservation occasionally and then start.  Even single node jobs can get a start time that is really far away.

The only other pattern I think I'm seeing is that affected jobs need m7 and don't have multiple partitions listed, but I'm not positive about that.

Well, that, and the more big jobs there are queued, the more likely the start times are to be sane for everyone.  Maybe.
Comment 48 Moe Jette 2014-06-27 04:39:38 MDT
(In reply to Moe Jette from comment #44)
> Created attachment 1004 [details]
> 
> This will correct the name of the partition associated with running jobs
> under some circumstances. This will not help you for jobs that are already
> running. They will continue to be reported as being associated with the
> incorrect partition name, but jobs started after this patch will be reported
> in the correct partition.

I believe this patch fixes the underlying backfill scheduling problem too, although I have not been able to reproduce it. The select/cons_res plugin tracks resource usage by partition, which is needed for preemption, oversubscribing resources, and gang scheduling. If a job's partition pointer is wrong (i.e. the nodes are in the m7 partition, but its partition pointer references m6), the select/cons_res plugin is going to get a bit confused. Nodes can be moved between partitions, so I'm not expecting slurmctld to crash or anything like that (or you would have already seen that), but I could see the bookkeeping for a partition getting off.
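One rough way to see whether that per-partition bookkeeping has drifted is to compare the CPUs squeue attributes to running m7 jobs against what sinfo reports as allocated on m7 nodes (a sketch; squeue's %C is CPUs per job, sinfo's %C is allocated/idle/other/total):

squeue -h -t RUNNING -p m7 -o "%C" | awk '{s+=$1} END {print "CPUs of running jobs listed in m7:", s}'
sinfo -h -p m7 -o "%C"

A large gap between the two numbers would be consistent with jobs being booked against the wrong partition.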

Were you able to get rcpacker's job started yet?
Comment 49 Ryan Cox 2014-06-27 07:32:45 MDT
(In reply to Moe Jette from comment #48)
> I believe this patch fixes the underlying backfill scheduling problem too,
> although I have not been able to reproduce it. The select/cons_res plugin
> tracks resource usage by partition, which is needed for preemption,
> oversubscribing resources, and gang scheduling). If a job's partition
> pointer is wrong (i.e. the nodes in are the m7 partition, but it's partition
> pointer references m6), the select/cons_res is going to get a bit confused.
> Nodes can be moved between partitions, so I'm not expecting slurmctld to
> crash or anything like that (or you would have already seen that), but I
> could see the bookkeeping for a partition getting off.
> 

Okay.  That makes sense.  I'll let you know on Monday what I see since most of the jobs in the wrong partition will be done by then, though not all.

By the way, have you considered creating some kind of snapshot mechanism that will capture information about pending and running jobs, node states, associations, etc. to diagnose problems like this?  The test suite can already create different setups, but it might be nice to ask a customer to snapshot their setup and then be able to import it into slurmctld/slurmdbd so you can play around with gdb, extra debug statements, code modifications, or whatever with real data.

I have no idea how feasible that is but it seems like the hardest part would be faking that a currently running job that started 6 hours ago is still running, only now with fake hardware at a different time (maybe all times everywhere have now-snapshot_time added to them).  Well, I'm sure there are lots of pretty hard parts.

> Were you able to get rcpacker's job started yet?

Not yet.  It should start late Sunday night in the advanced reservation I created; it was the earliest time I could find on that hardware.
Comment 50 Moe Jette 2014-06-27 09:15:28 MDT
(In reply to Ryan Cox from comment #49) 
> By the way, have you considered creating some kind of snapshot mechanism
> that will snapshot information about pending and running jobs, node states,
> associations, etc to diagnose problems like this?  The test suite can
> already create different setups but it might be nice to ask a customer to
> snapshot their setup then be able to import it into slurmctld/slurmdbd so
> you can play around with gdb, extra debug statements, code modifications, or
> whatever with real data.

Making a copy of the state save files does this, but the result is pretty difficult to work with.
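For what it's worth, grabbing such a snapshot can be done with something like this (a sketch; StateSaveLocation is whatever slurm.conf sets it to):

# find the state save directory and archive it
STATEDIR=$(scontrol show config | awk -F= '/StateSaveLocation/ {gsub(/ /, "", $2); print $2}')
tar czf /tmp/slurm_state_$(date +%Y%m%dT%H%M%S).tar.gz -C "$STATEDIR" .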
Comment 51 Ryan Cox 2014-06-30 05:46:10 MDT
Created attachment 1010 [details]
scontrol show job

Sadly the problem still exists.  All running jobs were wiped out by an extended power outage last night so 1) I can't find any more jobs with the wrong partition and 2) expect a few typos in what I write...

I'm attaching the most recent scontrol show job output.  mwpswim can run on any m7 node and rcpacker can only run on 64 GB m7 nodes.  Note that we have an advanced reservation starting at noon that I will soon assign rcpacker's job to.

Am I misunderstanding something with how backfill works?  My understanding is that no lower priority job should ever start in such a way that it delays the start time of a higher priority job.  Under current conditions, the higher priority job gets a start time as far away as bf_window (14 days) and any lower priority job is eligible to start before the higher priority job as long as its time limit is less than 14 days.  They are all less than 14 days since 7 days is the association maximum, so anyone can "backfill" in front and consequently delay larger jobs as long as resources are available at the moment for the smaller jobs but not the larger ones.  That is not intentional, right?

(I hope that made sense... I'm a little tired).
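(For reference, bf_window is specified in minutes, so the 14-day window here corresponds to something like the following in slurm.conf; the values are illustrative, not necessarily our exact settings:)

SchedulerType=sched/backfill
SchedulerParameters=bf_window=20160,bf_resolution=60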
Comment 52 Moe Jette 2014-06-30 06:11:04 MDT
(In reply to Ryan Cox from comment #51)
> Am I misunderstanding something with how backfill works?  My understanding
> is that no lower priority job should ever start in such a way that it delays
> the start time of a higher priority job.

That is correct, although a job's priority reported by squeue is only one factor in the sort order of jobs for the backfill scheduler. The jobs are sorted by:
1. If job X can preempt job Y, then job X gets preference
2. If job X has ANY reservation and job Y has none, then job X gets preference
3. Priority of the job's partition (jobs submitted to multiple partitions are sorted as a separate job for each partition)
4. Job priority
5. Job ID

As it goes down the list, resources are reserved for pending jobs based upon when and where they are expected to start.
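As a rough illustration of keys 4 and 5 only (partition priority and reservations are not visible in plain squeue output, and array job ids sort only approximately), pending jobs within a single partition are considered in roughly this order:

squeue -h -t PENDING -p m7 -o "%Q %i" | sort -k1,1nr -k2,2n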

> Under current conditions, the
> higher priority job gets a start time as far away as bf_window (14 days) and
> any lower priority job is eligible to start before the higher priority job
> as long as its time limit is less than 14 days. They are all less than 14
> days since 7 days is the association maximum, so anyone can "backfill" in
> front and consequently delay larger jobs as long as resources are available
> at the moment for the smaller jobs but not the larger ones.  That is not
> intentional, right?

I really thought the patches that you have would put the job of interest at the head of the queue for jobs that can run in m7. Given the incorrect partition information reported on jobs, it was very difficult to see what was happening earlier. I have no idea right now why that isn't happening, other than perhaps some QOS or association limit.
Could you check that?

Perhaps you could enable the Backfill and BackfillMap debug flags for a while and I can match those logs against matching job info.
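If it helps, the flags can be toggled at run time with something like this (flag names as of 14.03; they can also be set permanently via DebugFlags= in slurm.conf):

scontrol setdebugflags +Backfill
scontrol setdebugflags +BackfillMap
# and later, to turn them back off:
scontrol setdebugflags -Backfill
scontrol setdebugflags -BackfillMap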
Comment 53 Moe Jette 2014-06-30 09:56:46 MDT
Created attachment 1011 [details]
Possible fix

This patch removes some logic from the backfill scheduler that was designed to avoid duplicate tests of scheduling a job on the same set of resources as a previous test (e.g. when the changes in available resources since the previous test occur only on nodes outside the partition used by the job being evaluated). I have not been able to come up with a way to reproduce the failure you see, but am suspicious about this code. Removing the code (i.e. applying this patch) will in the worst case slow down the backfill scheduler a bit. In the best case, it fixes the problem that you see.
Comment 54 Moe Jette 2014-07-03 06:25:57 MDT
Do you have any update on this?
Comment 55 Ryan Cox 2014-07-03 07:57:49 MDT
Sorry I haven't gotten back to you.  I haven't seen any problems for the last few days but our queue has a very different makeup now than previously due to a power outage Sunday night.  We have lots more jobs from lots more users waiting a lot longer than normal.

That said, I haven't noticed any weird behavior since Monday.  I did apply the patch in comment 53 but I don't know if that fixed things or if it was because the power outage cleared out all the jobs that were in the wrong partition.

I'll keep an eye on it next week once the queue gets smaller.  Thanks for checking.
Comment 56 Ryan Cox 2014-07-08 07:34:22 MDT
I haven't seen any problems in the last week.  Maybe the last patch and/or fixing the jobs' partitions has solved the problem.
Comment 57 Moe Jette 2014-07-08 08:25:04 MDT
Thanks for the feedback. I've applied the most recent patch to the version 14.03 code branch. It will be in version 14.03.5 when released. Please re-open if you encounter more problems.
Comment 58 Ryan Cox 2014-07-08 08:27:06 MDT
Thanks for all the time and effort you put into this.  I know this was a particularly tough one.