Created attachment 975 [details]
topology from slurm.conf

I'm seeing some odd behavior where jobs are getting an estimated start time of a year from now. The job seems like it has been waiting much longer than it should have, possibly because jobs are getting backfilled in front of it since its start date is in 2015. I increased the debug level and enabled backfill debug to see if I could figure it out. I saw this in the logs:

[2014-06-24T09:20:02.402] debug2: found 191 usable nodes from config containing m7-[1-12]-[1-16]
[2014-06-24T09:20:02.402] debug2: found 127 usable nodes from config containing m7-[13-20]-[1-16]
[2014-06-24T09:20:02.402] debug3: _pick_best_nodes: job 3826001 idle_nodes 71 share_nodes 507
[2014-06-24T09:20:02.402] debug2: select_p_job_test for job 3826001
[2014-06-24T09:20:02.402] debug3: cons_res: _vns: node m7-1-2 sharing?
[2014-06-24T09:20:02.402] debug3: cons_res: _vns: node m7-1-3 sharing?
........ cut for brevity .........
[2014-06-24T09:20:02.405] debug3: cons_res: _vns: node m7-12-16 sharing?
[2014-06-24T09:20:02.405] debug3: cons_res: _vns: node m7-15-9 sharing?
[2014-06-24T09:20:02.405] debug3: cons_res: _vns: node m7-17-15 sharing?
[2014-06-24T09:20:02.405] debug: job 3826001: best_fit topology failure : no switch satisfying the request found
[2014-06-24T09:20:02.405] debug2: select_p_job_test for job 3826001
[2014-06-24T09:20:02.405] debug3: sched: JobId=3826001. State=PENDING. Reason=Resources. Priority=144705882. Partition=m7.

I'm wondering why switches could be a problem for job 3826001. Note the StartTime=2015-06-20T17:30:04 and that there is no topology request.
root@sched1:~# scontrol -d show job 3826001
JobId=3826001 Name=test
   UserId=rcpacker(21331) GroupId=rcpacker(21559)
   Priority=144535840 Nice=0 Account=sgorrell QOS=m7_64
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-02:00:00 TimeMin=N/A
   SubmitTime=2014-06-20T11:31:31 EligibleTime=2014-06-20T11:31:31
   StartTime=2015-06-20T17:30:04 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m7 AllocNode:Sid=m7int01:19193
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=35-35 NumCPUs=560 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/panfs/pan.fsl.byu.edu/home/grp/fslg_IFI/step6_harmonic/run_STARCCM+_FSL_Ryan_hb /fslhome/rcpacker/compute/IFI/step5_explicit_mixing/hb_v07_m5000_explicit@018000.sim
   WorkDir=/panfs/pan.fsl.byu.edu/home/grp/fslg_IFI/step6_harmonic
   StdErr=/panfs/pan.fsl.byu.edu/home/grp/fslg_IFI/step6_harmonic/slurm-3826001.out
   StdIn=/dev/null
   StdOut=/panfs/pan.fsl.byu.edu/home/grp/fslg_IFI/step6_harmonic/slurm-3826001.out

I'm attaching the network topology information. Here is sinfo for the m7 partition. We have a few nodes draining for something, but other than that I can't think of what other information you need.
root@sched1:~# sinfo -p m7
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
m7           up 16-00:00:0     12   drng m7-1-15,m7-3-[3,5],m7-6-[2,16],m7-7-8,m7-9-[8,10,15],m7-15-16,m7-16-9,m7-20-15
m7           up 16-00:00:0     32    mix m7-2-[5,7],m7-3-[2,8],m7-5-[4,13-15],m7-6-11,m7-7-5,m7-8-[13-14],m7-9-1,m7-10-[7,12],m7-13-8,m7-14-[1,6,14],m7-15-[8-10],m7-16-[3,5-6],m7-17-15,m7-18-[5,8,15],m7-19-[9,11],m7-20-7
m7           up 16-00:00:0    273  alloc m7-1-[2-14,16],m7-2-[1-4,6,8-16],m7-3-[1,4,6-7,9,11-16],m7-4-[1-16],m7-5-[1-3,5-12,16],m7-6-[1,3-10,12-15],m7-7-[1-4,6-7,9-16],m7-8-[1-12,15-16],m7-9-[2-7,9,11-14,16],m7-10-[1-6,8-11,13-16],m7-11-[1-16],m7-12-[1-16],m7-13-[2-7,9-16],m7-14-[2-5,7-13,15-16],m7-15-[1-7,11-15],m7-16-[1-2,4,7-8,10-16],m7-17-[1-14,16],m7-18-[1-4,6-7,9-14,16],m7-19-[1-8,10,12-16],m7-20-[1-6,8-14,16]
m7           up 16-00:00:0      1   idle m7-3-10
It does appear that the job will never run as long as other jobs exist that can run on the target nodes. Other jobs are definitely getting backfilled in front of it since the job isn't scheduled to start until mid-2015. I'm able to reproduce the problem by submitting a job with an identical request:

#SBATCH --time=26:00:00
#SBATCH --ntasks=560
#SBATCH --nodes=35
#SBATCH --mem-per-cpu=4G
#SBATCH -p m7
#SBATCH --exclusive
This message:

job 3826001: best_fit topology failure : no switch satisfying the request found

is clearly misleading. I am changing it to:

job 3826001: best_fit topology failure: no switch currently has sufficient resource to satisfy the request

You will note this job requires 35 nodes and there was only 1 idle node and 32 partially idle nodes in your logs, so without preemption this job cannot start. Were you expecting this job to preempt other jobs to start?

====================

Your message just came in this morning. Perhaps this is more of a backfill issue?
I filed the bug report yesterday, then added a comment today. I think it is more of a backfill problem. The job is currently the highest priority job in the system, but lower priority jobs with 6 and 7 day time limits are being backfilled in front of it. It was submitted 2014-06-20T11:31:31 and we have plenty of nodes to satisfy the request. None of the candidate nodes (m7-[13-20]-[1-16], from which it needs 35) are idle, but other jobs have started on them as recently as this morning. I wonder how the job can have a start time of 2015 and how the lower priority jobs can backfill in front of it. There are no reservations that would preclude it from starting. Maybe I messed something up?

SchedulerParameters=bf_max_job_user=20,bf_interval=60,default_queue_depth=15,max_job_bf=5000,bf_window=20160,bf_continue,max_sched_time=6,bf_resolution=900

After looking some more, it turns out that another user requesting the same resources also has the exact same 2015 start date:

ryancox@m7int02:~$ scontrol -d show job 3688135
JobId=3688135 Name=NRTN28O2complex_299
   UserId=bemily2(20212) GroupId=bemily2(20222)
   Priority=716 Nice=0 Account=jchansen QOS=m7_64
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=6-16:30:00 TimeMin=N/A
   SubmitTime=2014-06-09T12:56:34 EligibleTime=2014-06-09T12:56:34
   StartTime=2015-06-20T17:30:04 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m7 AllocNode:Sid=m7int02:32448
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=1 Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/qsub_wrapper_modified_script_eUuYSRqODM
   WorkDir=/panfs/pan.fsl.byu.edu/scr/usr/12/bemily2/Aerosols/british/NRTN28O2complex
   StdErr=/panfs/pan.fsl.byu.edu/scr/usr/12/bemily2/Aerosols/british/NRTN28O2complex/NRTN28O2complex_299.e%j
   StdIn=/dev/null
   StdOut=/panfs/pan.fsl.byu.edu/scr/usr/12/bemily2/Aerosols/british/NRTN28O2complex/NRTN28O2complex_299.o3688135

ryancox@m7int02:~$ sinfo -n m7-[13-20]-[1-16] -p m7
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
m7           up 16-00:00:0      3   drng m7-15-16,m7-16-9,m7-20-15
m7           up 16-00:00:0      5    mix m7-13-8,m7-15-9,m7-16-[3,6],m7-17-15
m7           up 16-00:00:0    119  alloc m7-13-[2-7,9-16],m7-14-[1-16],m7-15-[1-8,10-15],m7-16-[1-2,4-5,7-8,10-16],m7-17-[1-14,16],m7-18-[1-16],m7-19-[1-16],m7-20-[1-14,16]

ryancox@m7int02:~$ squeue -l --start -j 3826001
Wed Jun 25 10:15:22 2014
  JOBID PARTITION     NAME     USER ST          START_TIME  NODES NODELIST(REASON)
3826001        m7     test rcpacker PD 2015-06-20T17:30:04     35 (Resources)

ryancox@m7int02:~$ squeue -l -w m7-[13-20]-[1-16] -S \-S | head -5
Wed Jun 25 10:14:41 2014
  JOBID PARTITION     NAME     USER    STATE       TIME  TIMELIMIT  NODES NODELIST(REASON)
3688134        m7 NRTN28O2  bemily2  RUNNING    1:51:52 6-16:30:00      1 m7-14-13
3688126        m7 NRTN28O2  bemily2  RUNNING    3:45:45 6-16:30:00      1 m7-14-14
3688127        m7 NRTN28O2  bemily2  RUNNING    3:45:45 6-16:30:00      1 m7-18-5

ryancox@m7int02:~$ squeue -l -w m7-[13-20]-[1-16] -S e | head -10
Wed Jun 25 10:19:23 2014
  JOBID PARTITION     NAME     USER    STATE       TIME  TIMELIMIT  NODES NODELIST(REASON)
3687977        m7 NRTN28O2  bemily2  RUNNING 6-15:10:52 6-16:30:00      1 m7-15-14
3687978        m7 NRTN28O2  bemily2  RUNNING 6-08:31:59 6-16:30:00      1 m7-19-10
3687979        m7 NRTN28O2  bemily2  RUNNING 6-07:56:57 6-16:30:00      1 m7-20-11
3687981        m7 NRTN28O2  bemily2  RUNNING 6-05:30:53 6-16:30:00      1 m7-19-2
3687982        m7 NRTN28O2  bemily2  RUNNING 6-04:37:51 6-16:30:00      1 m7-17-12
3687983        m7 NRTN28O2  bemily2  RUNNING 6-04:29:12 6-16:30:00      1 m7-18-7
3687986        m7 NRTN28O2  bemily2  RUNNING 6-03:20:05 6-16:30:00      1 m7-13-6
3834115        m7 input.sh    jorr3  RUNNING   23:51:30 1-22:00:00      1 m7-13-8
(In reply to Ryan Cox from comment #3)
> I wonder how the job can have a start time of 2015 and how the lower
> priority jobs can backfill in front of it.

That makes two of us.

Do you still have a 16 day time limit on the m7 partition? Do you have any running jobs with crazy long time limits (perhaps due to the partition limit being overridden by a QOS or association limit, or being reset by a sysadmin)?

I know it will be verbose, but perhaps you could either attach or email me the output of the following commands:

scontrol show part
scontrol show job
Created attachment 985 [details]
scontrol show part
Created attachment 986 [details]
scontrol show job

All accounts are set to a max of 7 days, but we retained the 16 day limit on some partitions in case we needed to grant an exception. I think we're done with exceptions at this point so I may lower it to 7 days in the future. There are certainly no jobs with a time limit above 7 days at this point.

ryancox@m7int02:~$ squeue -l -S \-l | head -10
Wed Jun 25 10:56:59 2014
  JOBID PARTITION     NAME     USER    STATE       TIME  TIMELIMIT  NODES NODELIST(REASON)
3851742    m6beta run_all_   ram247  RUNNING       7:45 7-00:00:00      1 m6-19-12
3851701    m6beta run_all_   ram247  RUNNING      36:17 7-00:00:00      1 m6-31-16
3851693    m6beta run_all_   ram247  RUNNING      51:25 7-00:00:00      1 m6-31-16
3851680    m6beta run_all_   ram247  RUNNING    1:02:01 7-00:00:00      1 m6-31-16
3851675    m6beta run_all_   ram247  RUNNING    1:11:10 7-00:00:00      1 m6-31-16
3851660    m6beta run_all_   ram247  RUNNING    1:31:36 7-00:00:00      1 m6-31-16
3851657    m6beta run_all_   ram247  RUNNING    1:44:45 7-00:00:00      1 m6-31-16
3851645    m6beta run_all_   ram247  RUNNING    1:53:08 7-00:00:00      1 m6-13-9

ryancox@m7int02:~$ squeue -l -w m7-[13-20]-[1-16] -S \-l | head -10
Wed Jun 25 10:57:11 2014
  JOBID PARTITION     NAME     USER    STATE       TIME  TIMELIMIT  NODES NODELIST(REASON)
3688134        m7 NRTN28O2  bemily2  RUNNING    2:34:22 6-16:30:00      1 m7-14-13
3688126        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-14-14
3688127        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-5
3688128        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-6
3688129        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-8
3688130        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-9
3688131        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-13
3688132        m7 NRTN28O2  bemily2  RUNNING    4:28:15 6-16:30:00      1 m7-18-15
Created attachment 987 [details]
scontrol show job #2

In case it helps, here's another one. Two people just submitted jobs that are higher priority than rcpacker's: mwpswim and plf1. You can see that several people ended up with the same start time:

root@sched1:/tmp# zgrep -B6 StartTime=2015-06-20T17:30:04 scontrol_show_job.gz |grep UserId
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=fslcollab8(20131) GroupId=fslcollab8(20127)
   UserId=glh43(6466) GroupId=glh43(13733)
   UserId=rcpacker(21331) GroupId=rcpacker(21559)
   UserId=glh43(6466) GroupId=glh43(13733)
   UserId=mwpswim(21328) GroupId=mwpswim(21556)

It's really odd that they're all the same time down to the second.
I'm not sure if this might be related, but what are your QoS preemption rules?
Check out the EndTime:

root@sched1:/tmp# scontrol show reservation licenseres
ReservationName=licenseres StartTime=2014-06-20T17:30:03 EndTime=2015-06-20T17:30:03 Duration=365-00:00:00
   Nodes= NodeCnt=0 CoreCnt=0 Features=(null) PartitionName=dol4 Flags=LICENSE_ONLY
   Users=root Accounts=(null)
   Licenses=emptylic:1,ansys_aa_r:5,ansys_aa_r_hpc:4,starccm_ccmpsuite:12,starccm_hpcdomains:318,starccm_star-nx:2,xilinx_ISE:1,xilinx_PlanAhead:1,xilinx_SDK:1,xilinx_Vivado_System_Edition:3 State=ACTIVE

The ones I have checked don't request any licenses though.
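Worth spelling out the correlation: the identical 2015 StartTime on every affected job is exactly one second after this reservation's EndTime. A quick sanity check in plain Python, just comparing the timestamps quoted above:

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

# Timestamps copied verbatim from the outputs in this report.
resv_end = datetime.strptime("2015-06-20T17:30:03", FMT)   # licenseres EndTime
job_start = datetime.strptime("2015-06-20T17:30:04", FMT)  # pending jobs' StartTime

# The bogus StartTime is exactly one second past the reservation's end.
offset = (job_start - resv_end).total_seconds()
print(offset)  # 1.0
```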
ryancox@m7int02:~$ sacctmgr list qos
      Name   Priority  GraceTime    Preempt PreemptMode                                    Flags UsageThres UsageFactor  GrpCPUs  GrpCPUMins GrpCPURunMins GrpJobs  GrpMem GrpNodes GrpSubmit    GrpWall  MaxCPUs  MaxCPUMins MaxNodes     MaxWall MaxCPUsPU MaxJobsPU MaxNodesPU MaxSubmitPU
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- -------- ----------- ------------- ------- ------- -------- --------- ----------- -------- ----------- -------- ----------- --------- --------- ---------- -----------
    normal          0   00:00:00    standby     cluster                                                        1.000000
   standby          0   00:05:00                cluster                                                        0.000000  1000000 1000000000000000 1000000
   private          0   00:00:00    standby     cluster                                                        1.000000    20000 1000000000000000
      test      20000   00:00:00    standby     cluster                              DenyOnLimit                1.000000    20000 1000000000000000      512    01:00:00       256          12
     maint      50000   00:00:00    standby     cluster                                                        0.000000  1000000 1000000000000000 1000000
preemptab+          0   00:00:05                cluster                                NoReserve                1.000000  1000000     1000000
    bigmem          0   00:00:00    standby     cluster                              DenyOnLimit                1.500000                                             5-00:00:00         3
     m7_64          0   00:00:00    standby     cluster                                                        1.250000

preemptable was a testing thing I need to get rid of. We're using "standby", though it is almost never used.

PreemptType=preempt/qos
PreemptMode=cancel
(In reply to Ryan Cox from comment #9)
> Check out the EndTime:
> root@sched1:/tmp# scontrol show reservation licenseres
> ReservationName=licenseres StartTime=2014-06-20T17:30:03
> EndTime=2015-06-20T17:30:03 Duration=365-00:00:00
>    Nodes= NodeCnt=0 CoreCnt=0 Features=(null) PartitionName=dol4
> Flags=LICENSE_ONLY
>    Users=root Accounts=(null)
>    Licenses=emptylic:1,ansys_aa_r:5,ansys_aa_r_hpc:4,starccm_ccmpsuite:12,
> starccm_hpcdomains:318,starccm_star-nx:2,xilinx_ISE:1,xilinx_PlanAhead:1,
> xilinx_SDK:1,xilinx_Vivado_System_Edition:3 State=ACTIVE
>
> The ones I have checked don't request any licenses though.

Given the matching times, this must be related, although I'm not sure how at this point.
Created attachment 990 [details]
scontrol show job #3, with sane times

The other odd thing is that sometimes the jobs do get sane start time estimates. Currently they're all sane, within the next 3-4 days.
Question about your use of licenses: I see _no_ jobs requesting any licenses at all in the first "scontrol show job" output. Is that expected?

I'm wondering if there is something wrong with Slurm's reporting of job requests for licenses, although I don't see any such problem in my testing. If jobs really are using licenses, that could result in the job of interest being delayed until the advanced reservation for licenses ends.
(In reply to Moe Jette from comment #13)
> Question about your use of licenses: I see _no_ jobs requesting any licenses
> at all in the first "scontrol show job" output. Is that expected?

I imagine that we have quite a few people who don't know that they need to request licenses, so I guess it's not unexpected. Many of our users don't use commercial code, or use code that has a site license (no tracking required).

> I'm wondering if there is something wrong with Slurm's reporting of job
> requests for licenses, although I don't see any such problem in my testing.
> If jobs really are using licenses, that could result in the job of interest
> being delayed until the advanced reservation for licenses end.

I just submitted a job that properly showed a license request in scontrol show job, both while pending and when running. For some reason all job start times are still sane. I'm trying to figure out what may have changed (jobs, other reservations, etc.). I haven't changed the configuration at all.
And now they're back to 2015-06-20T17:30:04 again. Right after I hit submit last time.
(In reply to Ryan Cox from comment #15)
> And now they're back to 2015-06-20T17:30:04 again. Right after I hit submit
> last time.

At least we know it's related to an advanced reservation of licenses. I'm guessing that if you remove that advanced reservation the pending jobs will run. That would just be a workaround until I can track down what must be a bug in Slurm.
After deleting the license reservation, I saw some different behavior.

ryancox@m7int02:~$ scontrol show reservation
ReservationName=binutils_test StartTime=2014-06-24T16:26:08 EndTime=2015-06-24T16:26:08 Duration=365-00:00:00
   Nodes=m6-30-16 NodeCnt=1 CoreCnt=12 Features=(null) PartitionName=dol4 Flags=MAINT,SPEC_NODES
   Users=lbrown,levijm Accounts=(null) Licenses=(null) State=ACTIVE
ryancox@m7int02:~$ squeue_with_priority |grep 2015
3826001    1445        m7     test rcpacker  PENDING  0:00 1-02:00:00  35  560 2015-06-24T16:26:09 (Resources)
ryancox@m7int02:~$

I waited until after I was sure a few complete backfill iterations ran. It looks like it could be a reservation issue in general, not just licenses.
I just noticed that the debug info in the logs (backfill flag) seems to disagree with the squeue and scontrol show job output when the 2015 value shows up.

root@sched1:~# grep 2015 /var/log/slurm/slurmctld.log |grep start |grep -v ' reservation'
root@sched1:~# grep 3826001 /var/log/slurm/slurmctld.log |grep start
[2014-06-25T12:53:31.238] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T12:54:43.250] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T12:55:55.229] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T12:57:07.241] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T12:58:19.231] Job 3826001 to start at 2014-06-28T20:53:06, end at 2014-06-29T22:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
[2014-06-25T13:50:08.485] Job 3826001 to start at 2014-06-30T18:15:00, end at 2014-07-01T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T13:55:25.246] Job 3826001 to start at 2014-06-29T00:53:33, end at 2014-06-30T02:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T13:59:40.224] Job 3826001 to start at 2014-06-29T00:53:33, end at 2014-06-30T02:45:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:36:42.478] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:37:51.465] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:39:00.462] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:40:09.465] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:41:18.453] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:42:27.449] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:43:36.450] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:44:45.452] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:45:54.455] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:47:03.452] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:48:12.510] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:49:21.465] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:50:30.451] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:51:39.452] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:52:48.445] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:53:57.443] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:55:06.442] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]
[2014-06-25T14:56:15.481] Job 3826001 to start at 2014-07-02T18:15:00, end at 2014-07-03T20:15:00 on m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,7,9,15],m7-16-[3,6,11-12,15],m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,5,8-9,11]

I haven't had the backfill debug enabled the whole time, but I know that the 2015 value has happened at some of the points when the debug info was enabled.
(In reply to Ryan Cox from comment #18)
> I just noticed that the debug info in the logs (backfill flag) seems to
> disagree with the squeue and scontrol show job output when the 2015 value
> shows up.
>
> root@sched1:~# grep 2015 /var/log/slurm/slurmctld.log |grep start |grep -v ' reservation'
> root@sched1:~# grep 3826001 /var/log/slurm/slurmctld.log |grep start
> [2014-06-25T12:53:31.238] Job 3826001 to start at 2014-06-28T20:53:06, end
> at 2014-06-29T22:45:00 on
> m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],
> m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
>
> I haven't had the backfill debug enabled the whole time but I know that the
> 2015 value has happened at some of the points when the debug info was
> enabled.

If the job requires resources that a reservation does not free until outside of the time frame considered by the backfill scheduler, it just sets the job's start time to then and moves to the next job without determining where the job might run or generating this message. I'll correct that oversight to improve our view into what is happening.

So far, I have not been able to reproduce anything like what you see unless the pending job requires resources (licenses or nodes) that are in the reservation. I'll keep looking...
(In reply to Moe Jette from comment #19)
> (In reply to Ryan Cox from comment #18)
> > I just noticed that the debug info in the logs (backfill flag) seems to
> > disagree with the squeue and scontrol show job output when the 2015 value
> > shows up.
> >
> > root@sched1:~# grep 2015 /var/log/slurm/slurmctld.log |grep start |grep -v ' reservation'
> > root@sched1:~# grep 3826001 /var/log/slurm/slurmctld.log |grep start
> > [2014-06-25T12:53:31.238] Job 3826001 to start at 2014-06-28T20:53:06, end
> > at 2014-06-29T22:45:00 on
> > m7-13-[2,6,8,16],m7-14-[3,5,8,16],m7-15-[1-3,5,9,14-15],m7-16-[3-4,6,11-12],
> > m7-17-[9,12,14-15],m7-18-[4,7,14],m7-19-[2,6,8,10],m7-20-[2,8-9,11]
> >
> > I haven't had the backfill debug enabled the whole time but I know that the
> > 2015 value has happened at some of the points when the debug info was
> > enabled.
>
> If the job requires resources that a reservation does not free until outside
> of the time frame considered by the backfill scheduler, it just sets the
> jobs start time to then and moves to the next job without determining where
> the job might run or generating this message. I'll correct that oversight to
> improve our view into what is happening.

That makes more sense, though the weird part is why the reservations seem to affect it at all, since the reservation isn't on the same nodes. Would it help to add some extra code temporarily to the backfill code that prints out extra info for just that job number? I'm okay trying it out temporarily. I don't know backfill well enough to attempt it or even know what might help, though I can now do so for the prioritization code if it is called for :)

> So far, I have not been able to reproduce anything like what you see unless
> the pending job requires resources (licenses or nodes) that are in the
> reservation. I'll keep looking...

That's not surprising. It doesn't seem to be consistent, so I can't always reproduce it on demand either. And it toggles, which is the weird part.
I've tried correlating it with other events but haven't noticed anything so far.
(In reply to Ryan Cox from comment #20)
> Would it help to add some extra code temporarily to the backfill code that
> prints out extra info for just that job number? I'm okay trying it out
> temporarily.

Absolutely, if you're up for that. Just to make sure that I send you a clean patch, are you using 14.03.4?
(In reply to Moe Jette from comment #21)
> (In reply to Ryan Cox from comment #20)
> > Would it help to add some extra code temporarily to the backfill code that
> > prints out extra info for just that job number? I'm okay trying it out
> > temporarily.
>
> Absolutely if you're up for that. Just to make sure that I send you a clean
> patch, are you using 14.03.4?

14.03.4-2, except we've applied our LEVEL_BASED patches on top of it. That shouldn't touch any of the same code, and I could resolve conflicts if it did.
In the course of building and testing a patch for you, I discovered an anomaly that could be the root problem. I'm still studying the code, but I have the problem area fairly well contained now.
Created attachment 992 [details]
Fix for bad start time

This should fix the problem with the bogus start times. Under some conditions it would set a job's estimated start time to the end time of the reservation that ends next, even if that is a year from now or has nothing to do with the resources that job needs. It should apply cleanly to your code base and is just a few lines. I'll also be including this in v14.03.5.
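For readers following along, here is a rough sketch of the behavior described above. This is hypothetical Python for illustration only, not Slurm's actual C code; the function name, the reservation fields, and the overlap test are all invented:

```python
from datetime import datetime

def estimate_start(job_nodes, reservations, now):
    """Intended behavior: only a reservation that overlaps the resources the
    job needs should push its estimated start time back."""
    latest = now
    for resv in reservations:
        if job_nodes & resv["nodes"]:  # overlap check -- skipped by the buggy path
            latest = max(latest, resv["end"])
    return latest

now = datetime(2014, 6, 25)
reservations = [
    # A year-long, license-only reservation on *no* nodes, as in this report.
    {"nodes": set(), "end": datetime(2015, 6, 20, 17, 30, 3)},
]

# The buggy path effectively used the end time of whichever reservation
# ends next, regardless of whether the job needs anything it holds:
buggy = max([now] + [r["end"] for r in reservations])
fixed = estimate_start({"m7-13-2", "m7-13-6"}, reservations, now)

print(buggy.year, fixed.year)  # 2015 2014
```

This matches the symptom in the report: every pending job inherited the license reservation's 2015 end time even though the reservation held no nodes at all.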
This is the commit with my patch: https://github.com/SchedMD/slurm/commit/a7054a23d0b6c0ada90c0b1b238c36a7c778dc2e
I applied the patch. The start times are now + bf_window (20160 minutes), which looks like the intended behavior when unable to schedule a start time. There is currently one user who has jobs with a start time much earlier than that, but the rest are now + bf_window.
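For reference, the bf_window value quoted in the SchedulerParameters earlier in this report works out to a two-week lookahead:

```python
from datetime import timedelta

# bf_window=20160 (minutes), from the SchedulerParameters in this report.
bf_window = timedelta(minutes=20160)
print(bf_window.days)  # 14
```

So "now + bf_window" start times are estimates exactly 14 days out, the edge of the backfill scheduler's planning horizon.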
(In reply to Ryan Cox from comment #26)
> I applied the patch. The start times are now + bf_window (20160 minutes)
> which looks like the intended behavior when unable to schedule a start time.
> There is currently one user who has jobs with a start time much less than
> that but the rest are now + bf_window.

Great. Was there some problem with job 3826001 not being scheduled, or does it just keep getting deferred because higher priority jobs keep slipping in ahead of it?
> Was there some problem with job 3826001 not being scheduled or does it just
> keep getting deferred because higher priority jobs slipping in ahead of it?

Yes. There are some higher priority jobs that have started in front of it, but it appears that it will never start.

Here is the current top of the queue:

ryancox@m7int02:~$ squeue_with_priority |head
  JOBID PRIORITY PARTITION     NAME     USER    STATE  TIME  TIMELIMIT NODES CPUS          START_TIME NODELIST(REASON)
3862044     1457        m7 fullannu  mwpswim  PENDING  0:00    4:00:00    16  512 2014-07-10T08:46:36 (Resources)
3826001     1445        m7     test rcpacker  PENDING  0:00 1-02:00:00    35  560 2014-07-10T08:46:36 (Resources)
3688141      716        m7 NRTN28O2  bemily2  PENDING  0:00 6-16:30:00     1   16 2014-07-10T08:46:36 (Resources)
3688142      716        m7 NRTN28O2  bemily2  PENDING  0:00 6-16:30:00     1   16 2014-07-10T08:46:36 (Resources)
3688143      716        m7 NRTN28O2  bemily2  PENDING  0:00 6-16:30:00     1   16 2014-07-10T08:46:36 (Resources)
3688144      716        m7 NRTN28O2  bemily2  PENDING  0:00 6-16:30:00     1   16 2014-07-10T08:46:36 (Resources)
3688145      716        m7 NRTN28O2  bemily2  PENDING  0:00 6-16:30:00     1   16 2014-07-10T08:46:36 (Resources)
3688146      716        m7 NRTN28O2  bemily2  PENDING  0:00 6-16:30:00     1   16 2014-07-10T08:46:36 (AssociationResourceLimit)
3688147      716        m7 NRTN28O2  bemily2  PENDING  0:00 6-16:30:00     1   16 2014-07-10T08:46:36 (AssociationResourceLimit)

mwpswim is in the same account as rcpacker and has higher priority currently. However, mwpswim's job can run on any m7 node and rcpacker's can only run on the m7 nodes with 64 GB of memory. Every time the backfill scheduler runs, the start times for most jobs get pushed back by one minute. It looks like none of the jobs that require m7 ever get an estimated start time, and they also seem not to reserve any nodes. I would think that an estimated start time that is greater than the maximum time limit would always result in backfilled jobs starting when they shouldn't.
Maybe I'm misunderstanding something, but if the maximum time limit is 7 days and the bf_window is 14 days (we may want to lower bf_window at some point), shouldn't that mean that the highest priority job will be guaranteed to start in 7 days no matter what, since all running jobs will be done by then? Currently it is looking like they will never start because other jobs backfill in front of it.

Note that glh43 and I just had jobs with lower priorities than the top job start on nodes that the top priority job can run on. Mine specifies the max time limit while glh43's is 3 days. Maybe glh43's won't impact mwpswim's job's start time, but I would think that mine definitely does.

ryancox@m7int01:~$ squeue -o "%.18i %.5Q %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" -t r -S -S -p m7 |head -15
             JOBID PRIOR PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
       3823877_195     1        m7     xjob    glh43  RUNNING       1:54 3-00:00:00     1 m7-4-7
           3877338  1000        m7   sbatch  ryancox  RUNNING       3:54 7-00:00:00     1 m7-5-5
       3823877_192     1        m7     xjob    glh43  RUNNING      18:54 3-00:00:00     1 m7-6-14
       3823877_191     1        m7     xjob    glh43  RUNNING      31:54 3-00:00:00     1 m7-3-5
       3823877_190     1        m7     xjob    glh43  RUNNING      35:54 3-00:00:00     1 m7-3-3
           3877319   411        m7 04_Intr_ stevenk3  RUNNING      41:54 2-00:00:00     1 m7-13-8
           3877320   411        m7 04_Intr_ stevenk3  RUNNING      41:54 2-00:00:00     1 m7-16-3
           3877321   411        m7 04_Intr_ stevenk3  RUNNING      41:54 2-00:00:00     1 m7-16-6
       3823877_189     1        m7     xjob    glh43  RUNNING      57:54 3-00:00:00     1 m7-6-2
           3862018  1104        m7 scd4-042 lundgrer  RUNNING    2:58:09 4-00:00:00    16 m7-1-11,m7-2-[4,7],m7-3-14,m7-4-[2,4,10,13],m7-5-[2,4],m7-7-[7,11],m7-8-[7,11],m7-10-[11,16]
           3688137   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30 6-16:30:00     1 m7-13-6
           3688138   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30 6-16:30:00     1 m7-14-16
           3688139   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30 6-16:30:00     1 m7-15-9
           3688140   716        m7 NRTN28O2  bemily2  RUNNING    4:03:30 6-16:30:00     1 m7-18-7

With backfill debug enabled, the following shows that some jobs get start times but not the top two jobs:
root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log
126
root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log |grep 3862044
root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log |grep 3826001
root@sched1:~#

I've been running this for a while. The newer job by mwpswim has the same start time as rcpacker every time too.

ryancox@m7int02:~$ while true; do squeue -ho "%.7i %.9Q %.9P %.8j %.12u %.8T %.10M %.11l %.8D %.5C %.19S %R" -S "-p,-S" --state=pending -u rcpacker; sleep 1m; done
3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T08:47:36 (Resources)
3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T08:48:36 (Resources)
3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T08:49:36 (Resources)
3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T08:50:36 (Resources)
.......cut for brevity..... the times increased by one minute every minute ....
3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T09:24:36 (Resources)
3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T09:25:36 (Resources)
3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T09:26:36 (Resources)
3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T09:27:36 (Resources)
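As an aside, the `grep -c ... | grep JOBID` pipelines in the session above can never match anything: `grep -c` emits only the total count (a bare number), so the second `grep` has no log lines left to search. The empty output happens to look right here, but the intended per-job counts need the filters reversed. A sketch against a synthetic log (the file path and log lines are made up for illustration):

```shell
# Create a tiny synthetic slurmctld log (contents are illustrative only).
cat > /tmp/slurmctld.sample.log <<'EOF'
[2014-07-10T08:46:36] backfill: Job 3877338 to start at 2014-07-10T09:00:00
[2014-07-10T08:46:36] backfill: Job 3877339 to start at 2014-07-10T09:05:00
[2014-07-10T08:47:36] backfill: Job 3877338 to start at 2014-07-10T09:01:00
EOF

# Wrong order: grep -c prints "3", and the second grep searches that lone
# number, so no job id can ever match (|| true keeps the demo from failing).
grep -c 'start at' /tmp/slurmctld.sample.log | grep 3877338 || true

# Intended: select the job's lines first, count second (prints 2 here).
grep 'start at' /tmp/slurmctld.sample.log | grep -c 3877338
```

So a job with zero hits and a job with many hits both produce empty output from the wrong-order pipeline, which is worth keeping in mind when reading the session above.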
Responses inline - Moe

(In reply to Ryan Cox from comment #28)
> > Was there some problem with job 3826001 not being scheduled or does it just
> > keep getting deferred because higher priority jobs slipping in ahead of it?
>
> Yes. There are some higher priority jobs that have started in front of it,
> but it appears that it will never start.
>
> Here is the current top of the queue:
> ryancox@m7int02:~$ squeue_with_priority |head
> JOBID PRIORITY PARTITION NAME USER STATE TIME TIMELIMIT NODES CPUS START_TIME NODELIST(REASON)
> 3862044 1457 m7 fullannu mwpswim PENDING 0:00 4:00:00 16 512 2014-07-10T08:46:36 (Resources)
> 3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T08:46:36 (Resources)
> 3688141 716 m7 NRTN28O2 bemily2 PENDING 0:00 6-16:30:00 1 16 2014-07-10T08:46:36 (Resources)
> 3688142 716 m7 NRTN28O2 bemily2 PENDING 0:00 6-16:30:00 1 16 2014-07-10T08:46:36 (Resources)
> 3688143 716 m7 NRTN28O2 bemily2 PENDING 0:00 6-16:30:00 1 16 2014-07-10T08:46:36 (Resources)
> 3688144 716 m7 NRTN28O2 bemily2 PENDING 0:00 6-16:30:00 1 16 2014-07-10T08:46:36 (Resources)
> 3688145 716 m7 NRTN28O2 bemily2 PENDING 0:00 6-16:30:00 1 16 2014-07-10T08:46:36 (Resources)
> 3688146 716 m7 NRTN28O2 bemily2 PENDING 0:00 6-16:30:00 1 16 2014-07-10T08:46:36 (AssociationResourceLimit)
> 3688147 716 m7 NRTN28O2 bemily2 PENDING 0:00 6-16:30:00 1 16 2014-07-10T08:46:36 (AssociationResourceLimit)
>
> mwpswim is in the same account as rcpacker and has higher priority
> currently. However, mwpswim's job can run on any m7 node and rcpacker can
> only run on the m7 nodes with 64 GB of memory.

Since mwpswim's job has a higher priority than rcpacker's job, the backfill scheduler will reserve resources to start that job as soon as possible. Assuming there is nothing in mwpswim's job to prevent it from using the 64 GB nodes, they may well get reserved even if doing so delays the initiation of a lower priority job (i.e. rcpacker's job).
I'm not sure if this applies in your case, but the job's priority reported by squeue is only one factor in the sort order of jobs for the backfill scheduler. The jobs are sorted by:
1. If job X can preempt job Y, then job X gets preference
2. If job X has ANY reservation and job Y has none, then job X gets preference
3. Priority of the job's partition (jobs submitted to multiple partitions are sorted as a separate job for each partition)
4. Job priority
5. Job ID

> Every time the backfill scheduler runs the start times for most jobs gets
> pushed back by one minute. It looks like none of the jobs that require m7
> ever get an estimated start time and they also seem not to reserve any
> nodes. I would think that an estimated start time that is greater than the
> maximum time limit would always result in backfilled jobs starting when they
> shouldn't.
>
> Maybe I'm misunderstanding something, but if the maximum time limit is 7
> days and the bf_window is 14 days (we may want to lower bf_window at some
> point), shouldn't that mean that the highest priority job will be guaranteed
> to start in 7 days no matter what since all running jobs will be done by
> then?

Assuming you define priority as I describe above rather than just the job's "priority" field, that is correct, with the following exceptions: the job is blocked by
1. an association or QOS limit
2. an advanced reservation
3. required Down nodes
4. job dependencies
5. licenses

Are there any lower "priority" jobs with access to m7 with earlier start times? I would check them with respect to the job prioritization rules and see if that is messing things up. I'm considering adding a new DebugFlag that can output a list of schedulable jobs in priority order. I'm thinking that may be really helpful, even if verbose.

> Currently it is looking like they will never start because other jobs
> backfill in front of it.
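The five-way ordering above can be illustrated as a multi-key sort. The rows below are hypothetical jobs (columns: can-preempt flag, has-reservation flag, partition priority, job priority, job id); this is invented sample data, not live Slurm state or Slurm's actual implementation:

```shell
# One row per pending job: preempt resv part_prio job_prio jobid (all invented).
cat > /tmp/bf_queue.txt <<'EOF'
0 0 60 1445 3826001
0 0 60 1457 3862044
0 1 60 716 3688141
0 0 71 411 3877319
EOF

# Sort keys mirror the list above: preemptors first, then jobs holding a
# reservation, then partition priority, then job priority (all descending),
# with ascending job id as the tiebreaker.
sort -k1,1nr -k2,2nr -k3,3nr -k4,4nr -k5,5n /tmp/bf_queue.txt
```

In this sample, 3688141 sorts first purely because it holds a reservation despite its low job priority, and 3877319's higher partition priority puts it ahead of both 1400-priority jobs — exactly the kind of ordering that the squeue "priority" column alone would not predict.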
> Note that glh43 and I just had jobs with lower priorities than the top job
> start on nodes that the top priority job can run on. Mine specifies the max
> time limit while glh43's is 3 days. Maybe glh43's won't impact mwpswim's
> job's start time, but I would think that mine definitely does.
>
> ryancox@m7int01:~$ squeue -o "%.18i %.5Q %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" -t r -S -S -p m7 |head -15
> JOBID PRIOR PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
> 3823877_195 1 m7 xjob glh43 RUNNING 1:54 3-00:00:00 1 m7-4-7
> 3877338 1000 m7 sbatch ryancox RUNNING 3:54 7-00:00:00 1 m7-5-5
> 3823877_192 1 m7 xjob glh43 RUNNING 18:54 3-00:00:00 1 m7-6-14
> 3823877_191 1 m7 xjob glh43 RUNNING 31:54 3-00:00:00 1 m7-3-5
> 3823877_190 1 m7 xjob glh43 RUNNING 35:54 3-00:00:00 1 m7-3-3
> 3877319 411 m7 04_Intr_ stevenk3 RUNNING 41:54 2-00:00:00 1 m7-13-8
> 3877320 411 m7 04_Intr_ stevenk3 RUNNING 41:54 2-00:00:00 1 m7-16-3
> 3877321 411 m7 04_Intr_ stevenk3 RUNNING 41:54 2-00:00:00 1 m7-16-6
> 3823877_189 1 m7 xjob glh43 RUNNING 57:54 3-00:00:00 1 m7-6-2
> 3862018 1104 m7 scd4-042 lundgrer RUNNING 2:58:09 4-00:00:00 16 m7-1-11,m7-2-[4,7],m7-3-14,m7-4-[2,4,10,13],m7-5-[2,4],m7-7-[7,11],m7-8-[7,11],m7-10-[11,16]
> 3688137 716 m7 NRTN28O2 bemily2 RUNNING 4:03:30 6-16:30:00 1 m7-13-6
> 3688138 716 m7 NRTN28O2 bemily2 RUNNING 4:03:30 6-16:30:00 1 m7-14-16
> 3688139 716 m7 NRTN28O2 bemily2 RUNNING 4:03:30 6-16:30:00 1 m7-15-9
> 3688140 716 m7 NRTN28O2 bemily2 RUNNING 4:03:30 6-16:30:00 1 m7-18-7
>
> With backfill debug enabled, the following shows that some jobs get start
> times but not the top two jobs:
> root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log
> 126
> root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log |grep 3862044
> root@sched1:~# grep -c 'start at' /var/log/slurm/slurmctld.log |grep 3826001
> root@sched1:~#
>
> I've been running this for a while.
> The newer job by mwpswim has the same start time as rcpacker every time too.
> ryancox@m7int02:~$ while true; do squeue -ho "%.7i %.9Q %.9P %.8j %.12u %.8T %.10M %.11l %.8D %.5C %.19S %R" -S "-p,-S" --state=pending -u rcpacker; sleep 1m; done
> 3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T08:47:36 (Resources)
> 3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T08:48:36 (Resources)
> 3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T08:49:36 (Resources)
> 3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T08:50:36 (Resources)
> .......cut for brevity..... the times increased by one minute every minute ....
> 3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T09:24:36 (Resources)
> 3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T09:25:36 (Resources)
> 3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T09:26:36 (Resources)
> 3826001 1445 m7 test rcpacker PENDING 0:00 1-02:00:00 35 560 2014-07-10T09:27:36 (Resources)
Created attachment 998 [details]
Log job priority and partitions

You might find this small patch helpful to diagnose the problem. With DebugFlags=backfill, it logs the job priorities and partitions being considered rather than just the job id. Here is a sample:

slurmctld: backfill: beginning
slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol4
slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol4
slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol4
slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol3
slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol3
slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol3
slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol2
slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol2
slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol2
slurmctld: backfill: reached end of job queue

This change will be in version 14.03.5. Note that the map of resources available to the backfill scheduler through time is now only available by using DebugFlags=BackfillMap.
Created attachment 999 [details]
start times for rcpacker job 3826001

I applied the patch and have looked at the logs with Backfill and BackfillMap. It's getting the right partition, m7. Several people submitted jobs that got higher priorities than rcpacker (mwpswim cancelled his). Oddly enough, once there were more jobs in front of rcpacker, he got a saner start time. I boosted his priority *higher* than everyone else and now his start time is now + bf_window again.

Here is some website output we have that shows the order of users in the queue, where Wait=MAX(last job to start, submit time of oldest pending job) in H:MM. Start is the projected start time in H:MM from now. m6==m6beta, m7==any m7, M7==m7 64GB.

User Jobs Procs Sponsor Parts Wait Start
rcpacker 1 560 sgorrell M7 143:43 335:59
rroth 4 400 ssc7 m6 0:01 1:49
danilob 1 6 drw32 m6 m7 0:09 1:21
plf1 4814 4814 ekr8 m6 m7 0:00 1:49
marktwe 1 16 jsk24 M7 0:07 21:06
masaki 1 1 mjc22 m6f 0:07
jhogge10 3 12 db6 m6 m7 0:06 21:06
bemily2 693 11088 jchansen M7 0:09 237:06
samparas 12 144 dhe m6 m7 0:11 21:36
fslcollab8 1921 23052 glh43 m6 m7 0:09 21:36
glh43 403 4836 glh43 m6 m7 2:44 21:36
ryancox 1 16 staff M7 2:43 335:59

I attached the projected start times for rcpacker's job, snapshotted every minute. Note that there were a few minutes when the time was 2014-06-29 or 2014-06-30. Quite a few other jobs got in front of him then. After a while some of them started or were cancelled. Then his start time went back to 14 days from now. I boosted his priority part way through and now it's 14 days again consistently. I can't think of a reason why having more jobs in front of him would cause him to get a sooner start time. That seems very backwards, but I'm struggling to come up with a good, consistent test case.
I should add that masaki was hitting an association limit.
(In reply to Moe Jette from comment #30)
> slurmctld: backfill: beginning
> slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol4
> slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol4
> slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol4
> slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol3
> slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol3
> slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol3
> slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol2
> slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol2
> slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol2
> slurmctld: backfill: reached end of job queue

Would it be possible to get the slurmctld log entries for a backfill run? I'm particularly interested in seeing what might be in partition m7 ahead of rcpacker's jobs (i.e. some issue with prioritizing jobs). The only other thing that I can think of is that some jobs submitted to m7 plus another partition created a reservation on the m7 nodes, but I have not seen a problem of that sort.
(In reply to Moe Jette from comment #33)
> (In reply to Moe Jette from comment #30)
> > slurmctld: backfill: beginning
> > slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol4
> > slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol4
> > slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol4
> > slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol3
> > slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol3
> > slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol3
> > slurmctld: backfill test for JobID=1764 Prio=4294901758 Partition=dol2
> > slurmctld: backfill test for JobID=1767 Prio=4294901755 Partition=dol2
> > slurmctld: backfill test for JobID=1768 Prio=4294901754 Partition=dol2
> > slurmctld: backfill: reached end of job queue
>
> Would it be possible to get the slurmctld log entries for a backfill run.
> I'm particularly interested in seeing what might be in partition m7 ahead of
> rcpacker's jobs (i.e. some issue with prioritizing jobs)?

Sure, I'll get that to you in a few minutes.

> The only other thing that I can think of is that some jobs submitted to m7
> plus another partition created a reservation on the m7 nodes, but I have not
> seen a problem of that sort.

Now that I think about it, I do recall seeing some jobs get started on m7 nodes but be in the m6beta partition, and maybe vice versa. I forgot about that.
I just checked on that, and wow:

ryancox@m7int01:~$ squeue -l -w m7-[1-20]-[1-16] -p m6beta
Thu Jun 26 13:40:28 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
3828180 m6beta TRSTS22_ jtfuller RUNNING 2-22:07:44 3-00:00:00 1 m7-17-15
3845637 m6beta ts1-3fts yzhang8 RUNNING 1-16:47:05 6-00:00:00 1 m7-7-3
3563150_614 m6beta xjob fslcolla RUNNING 3-08:42:01 7-00:00:00 1 m7-9-4
3563150_613 m6beta xjob fslcolla RUNNING 3-08:46:04 7-00:00:00 1 m7-9-15
3563150_612 m6beta xjob fslcolla RUNNING 3-09:08:26 7-00:00:00 1 m7-8-5
3563150_611 m6beta xjob fslcolla RUNNING 3-09:36:54 7-00:00:00 1 m7-12-13
3563150_608 m6beta xjob fslcolla RUNNING 3-10:44:54 7-00:00:00 1 m7-9-2
3563150_602 m6beta xjob fslcolla RUNNING 3-16:16:09 7-00:00:00 1 m7-9-10
3563150_601 m6beta xjob fslcolla RUNNING 3-17:16:45 7-00:00:00 1 m7-8-1
3563150_600 m6beta xjob fslcolla RUNNING 3-21:19:07 7-00:00:00 1 m7-4-5
3563150_597 m6beta xjob fslcolla RUNNING 3-21:41:21 7-00:00:00 1 m7-2-1
3563150_595 m6beta xjob fslcolla RUNNING 4-06:48:22 7-00:00:00 1 m7-4-15
3563150_587 m6beta xjob fslcolla RUNNING 4-07:12:59 7-00:00:00 1 m7-10-2
3563150_590 m6beta xjob fslcolla RUNNING 4-07:12:59 7-00:00:00 1 m7-6-12
3563150_593 m6beta xjob fslcolla RUNNING 4-07:12:59 7-00:00:00 1 m7-6-16
3563150_573 m6beta xjob fslcolla RUNNING 4-07:14:00 7-00:00:00 1 m7-4-3
3563150_574 m6beta xjob fslcolla RUNNING 4-07:14:00 7-00:00:00 1 m7-4-6
3563150_577 m6beta xjob fslcolla RUNNING 4-07:14:00 7-00:00:00 1 m7-4-12
3563150_578 m6beta xjob fslcolla RUNNING 4-07:14:00 7-00:00:00 1 m7-4-14
3563150_580 m6beta xjob fslcolla RUNNING 4-07:14:00 7-00:00:00 1 m7-5-1
3563150_583 m6beta xjob fslcolla RUNNING 4-07:14:00 7-00:00:00 1 m7-5-8
3563150_584 m6beta xjob fslcolla RUNNING 4-07:14:00 7-00:00:00 1 m7-5-10
3563150_585 m6beta xjob fslcolla RUNNING 4-07:14:00 7-00:00:00 1 m7-5-11
3563150_558 m6beta xjob fslcolla RUNNING 4-07:15:01 7-00:00:00 1 m7-7-2
3563150_560 m6beta xjob fslcolla RUNNING 4-07:15:01 7-00:00:00 1 m7-7-8
3563150_563 m6beta xjob fslcolla RUNNING 4-07:15:01 7-00:00:00 1 m7-7-15
3563150_567 m6beta xjob fslcolla RUNNING 4-07:15:01 7-00:00:00 1 m7-10-5
3563150_553 m6beta xjob fslcolla RUNNING 5-20:52:59 7-00:00:00 1 m7-6-4
3563150_548 m6beta xjob fslcolla RUNNING 5-22:20:32 7-00:00:00 1 m7-3-7
3563150_547 m6beta xjob fslcolla RUNNING 5-23:09:54 7-00:00:00 1 m7-4-9
3563150_546 m6beta xjob fslcolla RUNNING 5-23:19:46 7-00:00:00 1 m7-7-14
3563150_542 m6beta xjob fslcolla RUNNING 6-05:44:35 7-00:00:00 1 m7-6-8
3563150_539 m6beta xjob fslcolla RUNNING 6-07:11:03 7-00:00:00 1 m7-11-9
3563150_538 m6beta xjob fslcolla RUNNING 6-07:16:38 7-00:00:00 1 m7-2-6
3563150_534 m6beta xjob fslcolla RUNNING 6-08:24:31 7-00:00:00 1 m7-5-7
3563150_532 m6beta xjob fslcolla RUNNING 6-09:26:58 7-00:00:00 1 m7-8-9
3563150_531 m6beta xjob fslcolla RUNNING 6-09:45:55 7-00:00:00 1 m7-4-11
3685461_279 m6beta xjob glh43 RUNNING 2-08:54:49 3-00:00:00 1 m7-12-2
3685461_278 m6beta xjob glh43 RUNNING 2-08:56:49 3-00:00:00 1 m7-6-9
3685461_269 m6beta xjob glh43 RUNNING 2-09:38:52 3-00:00:00 1 m7-9-16
3685461_268 m6beta xjob glh43 RUNNING 2-09:43:52 3-00:00:00 1 m7-9-13
3685461_262 m6beta xjob glh43 RUNNING 2-09:57:52 3-00:00:00 1 m7-11-7
3685461_177 m6beta xjob glh43 RUNNING 2-16:52:59 3-00:00:00 1 m7-12-12
3563150_620 m6beta xjob fslcolla RUNNING 3-03:57:51 7-00:00:00 1 m7-11-2
3823877_176 m6beta xjob glh43 RUNNING 1-05:02:54 3-00:00:00 1 m7-12-10
3823877_175 m6beta xjob glh43 RUNNING 1-05:16:33 3-00:00:00 1 m7-12-9
3823877_170 m6beta xjob glh43 RUNNING 1-05:39:37 3-00:00:00 1 m7-2-13
3823877_166 m6beta xjob glh43 RUNNING 1-05:56:22 3-00:00:00 1 m7-2-9
3823877_156 m6beta xjob glh43 RUNNING 1-07:11:24 3-00:00:00 1 m7-12-6
3823877_157 m6beta xjob glh43 RUNNING 1-07:11:24 3-00:00:00 1 m7-2-8
3823877_151 m6beta xjob glh43 RUNNING 1-07:13:33 3-00:00:00 1 m7-5-12
3823877_139 m6beta xjob glh43 RUNNING 1-07:15:04 3-00:00:00 1 m7-12-3
3823877_141 m6beta xjob glh43 RUNNING 1-07:15:04 3-00:00:00 1 m7-12-14
3823877_144 m6beta xjob glh43 RUNNING 1-07:15:04 3-00:00:00 1 m7-5-14
3823877_146 m6beta xjob glh43 RUNNING 1-07:15:04 3-00:00:00 1 m7-6-1
3823877_148 m6beta xjob glh43 RUNNING 1-07:15:04 3-00:00:00 1 m7-6-11
3877772 m6beta ox-H2SO4 samparas RUNNING 1:22:57 1-21:00:00 1 m7-1-9
3877751 m6beta ox-H2SO4 samparas RUNNING 2:50:58 1-16:00:00 1 m7-13-8
3877749 m6beta ox-H2SO4 samparas RUNNING 2:54:57 1-16:00:00 1 m7-9-9
3877750 m6beta ox-H2SO4 samparas RUNNING 2:54:57 1-16:00:00 1 m7-16-3
3877748 m6beta ox-H2SO4 samparas RUNNING 2:56:28 1-16:00:00 1 m7-16-6
3877379 m6beta TS1-from deepa RUNNING 3:48:52 2-04:00:00 1 m7-3-8
3851728 m6beta CH_Activ ecahua97 RUNNING 1-03:03:30 4-04:00:00 1 m7-9-1
3851732 m6beta CH_Activ ecahua97 RUNNING 1-03:03:30 4-04:00:00 1 m7-2-5
3851733 m6beta CH_Activ ecahua97 RUNNING 1-03:03:30 4-04:00:00 1 m7-6-3
3823877_196 m6beta xjob glh43 RUNNING 3:15:56 3-00:00:00 1 m7-5-5
3823877_195 m6beta xjob glh43 RUNNING 4:18:52 3-00:00:00 1 m7-4-7
3823877_192 m6beta xjob glh43 RUNNING 4:35:52 3-00:00:00 1 m7-6-14
3823877_191 m6beta xjob glh43 RUNNING 4:48:52 3-00:00:00 1 m7-3-5
3823877_190 m6beta xjob glh43 RUNNING 4:52:52 3-00:00:00 1 m7-3-3
3823877_189 m6beta xjob glh43 RUNNING 5:14:52 3-00:00:00 1 m7-6-2
3823877_187 m6beta xjob glh43 RUNNING 6:01:55 3-00:00:00 1 m7-5-13
3563150_632 m6beta xjob fslcolla RUNNING 1-07:15:04 7-00:00:00 1 m7-9-6
3563150_635 m6beta xjob fslcolla RUNNING 1-07:15:04 7-00:00:00 1 m7-1-4
3563150_639 m6beta xjob fslcolla RUNNING 1-07:15:04 7-00:00:00 1 m7-1-13
3685461_174 m6beta xjob glh43 RUNNING 2-23:10:34 3-00:00:00 1 m7-11-13

ryancox@m7int01:~$ squeue -l -w m6-[1-35]-[1-16] -p m7
Thu Jun 26 13:40:40 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
ryancox@m7int01:~$

ryancox@m7int01:~$ cat /etc/slurm/partitions.conf
PartitionName=Default State=UP MaxTime=16-0 DefaultTime=30:00 shared=yes
PartitionName=dol4 priority=100 Default=yes Nodes=m5-21-[1-16] AllowGroups=dol4nodes,fslstaff
PartitionName=physics priority=90 Nodes=m6-33-[1-5],m6-33-[9-13],m6-34-[1-5],m6-34-[9-13] AllowGroups=rls5physicsnodes,dwn8physicsnodes,ewh5physicsnodes,glh43physicsnodes,mw22physicsnodes,fslstaff
PartitionName=dhe priority=80 Nodes=m6-35-[1-16] AllowGroups=dhenodes
PartitionName=m6beta priority=71 Nodes=m6-[1-2]-[2-16],m6-[3-32]-[1-16]
PartitionName=m6 priority=70
PartitionName=m7 priority=60 Nodes=m7-1-[2-16],m7-[2-12]-[1-16],m7-13-[2-16],m7-[14-20]-[1-16]
PartitionName=m5f priority=40 Nodes=m5f-1-[1-2] MaxTime=5-0
PartitionName=m6f priority=30 Nodes=m6f-1-[1-7] MaxTime=5-0
PartitionName=short priority=25 Nodes=m7-13-1,m6-1-1,m6-2-1,m7-1-1 MaxTime=1:00:00 MaxNodes=1
PartitionName=m5gpu priority=20 Nodes=m5gpu-1-[1-2]
PartitionName=m6gpu priority=10 Nodes=m6-33-8
PartitionName=blankpart priority=1

ryancox@m7int01:~$ cat /etc/slurm/nodes.conf
NodeName=default state=UNKNOWN
NodeName=m5-21-[1-16] sockets=2 corespersocket=4 RealMemory=24576 feature=ib,dol4,m5,intel,nehalem,cpu2800MHz,mem1333MHz,private,beta,sse4.1,sse4.2
NodeName=m6-33-[1-5],m6-33-[9-13],m6-34-[1-5],m6-34-[9-13] sockets=2 corespersocket=6 RealMemory=49152 feature=m6,physics,intel,westmere,cpu2667MHz,mem1333MHz,ib,beta,sse4.1,sse4.2
NodeName=m6-35-[1-16] sockets=2 corespersocket=6 RealMemory=24576 feature=m6,dhe,intel,westmere,private,cpu2667MHz,mem1067MHz,beta,sse4.1,sse4.2
NodeName=m6-[1-32]-[1-16] sockets=2 corespersocket=6 RealMemory=24576 feature=m6,intel,westmere,cpu2667MHz,mem1067MHz,beta,sse4.1,sse4.2
NodeName=m7-[1-12]-[1-16] sockets=2 corespersocket=8 weight=1 RealMemory=32768 feature=m7,ib,intel,sandybridge,cpu2200MHz,mem1600MHz,beta,sse4.1,sse4.2,avx
NodeName=m7-[13-20]-[1-16] sockets=2 corespersocket=8 weight=2 RealMemory=65536 feature=m7,ib,intel,sandybridge,cpu2200MHz,mem1600MHz,beta,sse4.1,sse4.2,avx
NodeName=m5f-1-[1-2] sockets=4 corespersocket=4 RealMemory=131072 feature=amd,barcelona,mem128,cpu2300MHz,mem667MHz,bigmem,nonbeta
NodeName=m6f-1-[1-7] sockets=4 corespersocket=8 RealMemory=262144 feature=intel,nehalemex,cpu1870MHz,mem978MHz,mem256,bigmem,beta,sse4.1,sse4.2
NodeName=m5gpu-1-1 sockets=2 corespersocket=4 gres=gpu:4 RealMemory=32768 feature=gpu,nvidia,t10,s1070,cuda,nondol4,intel,harpertown,cpu2667MHz,mem667MHz,nonbeta,sse4.1
NodeName=m5gpu-1-2 sockets=2 corespersocket=4 gres=gpu:4 RealMemory=32768 feature=gpu,nvidia,t10,s1070,cuda,nondol4,intel,harpertown,cpu2667MHz,mem667MHz,beta,sse4.1
NodeName=m6-33-8 sockets=2 corespersocket=6 gres=gpu RealMemory=24576 feature=nvidia,m2050,intel,westmere,cpu2800MHz,mem1333MHz,cuda,gpu,beta,sse4.1,sse4.2
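A mismatch like the one above (jobs reported in m6beta but running on m7-* nodes) is easy to scan for mechanically. Here is a sketch over a canned sample of "partition node" pairs; the rows are invented, though on a live system similar pairs might come from something like `squeue -h -o "%P %N" -t R`:

```shell
# Running jobs as "reported-partition allocated-node" pairs (sample data).
cat > /tmp/running.txt <<'EOF'
m6beta m7-17-15
m6beta m6-12-4
m7 m7-5-5
m6beta m7-7-3
EOF

# Flag jobs reported in m6beta whose allocated node is an m7-* node.
awk '$1 == "m6beta" && $2 ~ /^m7-/ {print $2, "holds a", $1, "job"}' /tmp/running.txt
```

On this sample it flags two of the four jobs; run periodically, a check like this would have surfaced the partition drift long before it showed up in scheduling behavior.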
(In reply to Ryan Cox from comment #34)
> Now that I think about it, I do recall seeing some jobs get started on m7
> nodes but be in the m6beta partition and maybe vice versa. I forgot about
> that.
>
> I just checked on that, and wow:
> ryancox@m7int01:~$ squeue -l -w m7-[1-20]-[1-16] -p m6beta
> Thu Jun 26 13:40:28 2014
> JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
> 3828180 m6beta TRSTS22_ jtfuller RUNNING 2-22:07:44 3-00:00:00 1 m7-17-15
> 3845637 m6beta ts1-3fts yzhang8 RUNNING 1-16:47:05 6-00:00:00 1 m7-7-3
> 3563150_614 m6beta xjob fslcolla RUNNING 3-08:42:01

That's a great clue! Could you run "scontrol show part" and confirm it matches your slurm.conf?
Created attachment 1001 [details]
comparison of slurm.conf files and scontrol show part

As far as I can tell, they do match. Unfortunately I didn't see any jobs that started after the most recent backfill patch that would show what partition the job was assigned. Here is what I did see for the most recent one. Nothing too useful as far as I can tell:

root@sched1:~# grep 3877772 /var/log/slurm/slurmctld.log | tail -30
[2014-06-26T12:08:53.008] backfill test for job 3877772
[2014-06-26T12:08:53.020] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:08:54.056] backfill test for job 3877772
[2014-06-26T12:09:54.005] backfill test for job 3877772
[2014-06-26T12:09:54.018] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:09:55.076] backfill test for job 3877772
[2014-06-26T12:10:55.005] backfill test for job 3877772
[2014-06-26T12:10:55.017] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:10:56.036] backfill test for job 3877772
[2014-06-26T12:11:56.005] backfill test for job 3877772
[2014-06-26T12:11:56.019] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:11:57.058] backfill test for job 3877772
[2014-06-26T12:12:57.005] backfill test for job 3877772
[2014-06-26T12:12:57.018] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:12:58.109] backfill test for job 3877772
[2014-06-26T12:13:58.005] backfill test for job 3877772
[2014-06-26T12:13:58.017] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:13:59.057] backfill test for job 3877772
[2014-06-26T12:14:59.005] backfill test for job 3877772
[2014-06-26T12:14:59.018] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:15:00.034] backfill test for job 3877772
[2014-06-26T12:16:00.005] backfill test for job 3877772
[2014-06-26T12:16:00.056] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:16:03.167] backfill test for job 3877772
[2014-06-26T12:17:03.051] backfill test for job 3877772
[2014-06-26T12:17:03.065] Job 3877772 to start at 2014-06-26T14:16:00, end at 2014-06-28T11:15:00 on m6-32-1
[2014-06-26T12:17:31.326] backfill test for job 3877772
[2014-06-26T12:17:31.327] backfill: Started JobId=3877772 on m7-1-9
[2014-06-26T12:23:08.746] Recovered job 3877772 597
[2014-06-26T12:31:18.328] Recovered job 3877772 597
And I just did md5sums on our nodes.conf and partitions.conf files. They are the same on the primary and backup ctld/dbd and on all compute nodes.
Created attachment 1002 [details] slurmctld.log.gz with backfill and backfillmap debug
Created attachment 1003 [details] scontrol show job In case it's useful
(In reply to Ryan Cox from comment #39)
> Created attachment 1003 [details]
> scontrol show job
>
> In case it's useful

Actually, it is. I was comparing your latest squeue output against the "scontrol show job" from yesterday and found a few anomalies. This one is particularly interesting.

From Wednesday:
JobId=3851728 Name=CH_Activation_trial3_18_F_TSLBS.com
UserId=ecahua97(21355) GroupId=ecahua97(21587)
Priority=317 Nice=0 Account=dhe QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:16:30 TimeLimit=4-04:00:00 TimeMin=N/A
SubmitTime=2014-06-25T10:36:15 EligibleTime=2014-06-25T10:36:15
StartTime=2014-06-25T10:36:58 EndTime=2014-06-29T14:36:58
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=m7 AllocNode:Sid=m7int01:7866
ReqNodeList=(null) ExcNodeList=(null)
NodeList=m7-9-1 BatchHost=m7-9-1

Today/Thursday:
JobId=3851728 Name=CH_Activation_trial3_18_F_TSLBS.com
UserId=ecahua97(21355) GroupId=ecahua97(21587)
Priority=317 Nice=0 Account=dhe QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=1-03:27:07 TimeLimit=4-04:00:00 TimeMin=N/A
SubmitTime=2014-06-25T10:36:15 EligibleTime=2014-06-25T10:36:15
StartTime=2014-06-25T10:36:58 EndTime=2014-06-29T14:36:58
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=m6beta AllocNode:Sid=m7int01:7866
ReqNodeList=(null) ExcNodeList=(null)
NodeList=m7-9-1 BatchHost=m7-9-1

"scontrol update job ..." can't even change a job's partition string, so something else is doing it. There aren't many places that modify the string, so hopefully I can track it down soon.
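The Wednesday-versus-Thursday comparison can be automated by joining two snapshots on JobId. A sketch with the records already reduced to "JobId=... Partition=..." pairs (the file names and contents are illustrative; real `scontrol show job` output would need an extraction step first):

```shell
# Two snapshots, one line per job, sorted by the JobId field (sample data).
cat > /tmp/part.wed <<'EOF'
JobId=3851728 Partition=m7
JobId=3851732 Partition=m7
EOF
cat > /tmp/part.thu <<'EOF'
JobId=3851728 Partition=m6beta
JobId=3851732 Partition=m7
EOF

# Join on JobId and report jobs whose reported partition changed overnight.
join -o 1.1,1.2,2.2 /tmp/part.wed /tmp/part.thu | \
  awk '$2 != $3 {print $1, $2, "->", $3}'
```

On this sample it reports only 3851728, the job whose partition string drifted from m7 to m6beta.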
Found it!

The good news is that the scheduling appears to be correct. The output in squeue or scontrol show job is displaying the wrong partition under some situations. The problem occurs when a job is submitted to multiple partitions, the job is scheduled by the backfill scheduler in a partition that is not first in the list, and then the slurmctld is restarted. I should have a fix soon.
(In reply to Moe Jette from comment #41)
> Found it!
>
> The good news is that the scheduling appears to be correct. The output in
> squeue or scontrol show job is displaying the wrong partition under some
> situations. The problem is caused when a job is submitted to multiple
> partitions, the job is scheduled by the backfill scheduler in a partition
> that is Not first in the list, then the slurmctld is restarted. I should
> have a fix soon.

Excellent. So the initial scheduling works, then the partition gets messed up. Would that potentially affect backfill, since some nodes would have jobs "from a different partition" on them?
(In reply to Ryan Cox from comment #42)
> Excellent. So the initial scheduling works, then the partition gets messed
> up. Would that potentially affect backfill since some nodes would have jobs
> "from a different partition" on them?

No, we are just not seeing all of the jobs actually associated with m7 that we should be.
Created attachment 1004 [details]
Correct job's partition info

This patch will correct the name of the partition associated with running jobs under some circumstances. It will not help for jobs that are already running; they will continue to be reported as being associated with the incorrect partition name, but jobs started after this patch is applied will be reported in the correct partition.
Thanks. We are now running with the patch. It looks like all the affected jobs will be done in 3-4 days. Is there anything else I can send you for the backfill issue?

For now I'm creating a reservation for rcpacker's job so it actually gets started. If we need another test case I'll just create a job like rcpacker's. Currently, everyone except for one user has start times that are 14 days out, so reproducing it shouldn't be too hard.

It's too late in the day for me to really do anything else, so I'll work on it in the morning.
(In reply to Ryan Cox from comment #45)
> Thanks. We are now running with the patch. It looks like all the affected
> jobs will be done in 3-4 days.

That should at least provide a clearer picture of what is happening.

> Is there anything else I can send you for the backfill issue?

Not now, thanks.

> For now I'm creating a reservation to assign rcpacker's job to so it
> actually gets started.

I've been hoping to solve the problem, but an advanced reservation should at least get rcpacker's job run.

> If we need another test case I'll just create a job like rcpacker's.
> Currently, everyone except for one user has start times that are 14 days
> out so reproducing it shouldn't be too hard.

Do you have any feeling about the difference between jobs in m7 that are starting and those which are not? Memory size? Node count? Time limit?

> It's too late in the day for me to really do anything else so I'll work on
> it in the morning.
> Do you have any feeling about the difference between jobs in m7 that are
> starting and those which are not? Memory size? Node count? Time limit?

I think it's mostly single node jobs, or jobs with small enough node counts that they happen to get a reservation occasionally and then start. Even single node jobs can get a start time that is really far away. The only other pattern I think I'm seeing is that affected jobs need m7 and don't have multiple partitions listed, but I'm not positive about that. Well, that, and the more big jobs there are queued, the more likely the start times will be sane for everyone. Maybe.
(In reply to Moe Jette from comment #44)
> Created attachment 1004 [details]
>
> This will correct the name of the partition associated with running jobs
> under some circumstances. This will not help you for jobs that are already
> running. They will continue to be reported as being associated with the
> incorrect partition name, but jobs started after this patch will be reported
> in the correct partition.

I believe this patch fixes the underlying backfill scheduling problem too, although I have not been able to reproduce it. The select/cons_res plugin tracks resource usage by partition, which is needed for preemption, oversubscribing resources, and gang scheduling. If a job's partition pointer is wrong (i.e. the nodes are in the m7 partition, but its partition pointer references m6), select/cons_res is going to get a bit confused. Nodes can be moved between partitions, so I'm not expecting slurmctld to crash or anything like that (or you would have already seen that), but I could see the bookkeeping for a partition getting off.

Were you able to get rcpacker's job started yet?
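The per-partition bookkeeping described above can be pictured as a tally keyed by each job's partition pointer. This toy sketch uses invented numbers and is not cons_res's actual data structure; it only illustrates how a wrong pointer skews the totals:

```shell
# One row per running job: "partition-pointer allocated-cpus" (invented).
# The two 16-CPU jobs actually sit on m7 nodes but carry an m6beta pointer.
cat > /tmp/alloc.txt <<'EOF'
m7 560
m6beta 16
m6beta 16
EOF

# Tally allocated CPUs per partition, as pointer-keyed bookkeeping would.
awk '{sum[$1] += $2} END {for (p in sum) print p, sum[p]}' /tmp/alloc.txt
```

Here the tally credits m7 with only 560 allocated CPUs while the m7 nodes really hold 592, so a scheduler trusting the tally would believe m7 has 32 more free CPUs than it does — the kind of drift that could produce the bogus start-time estimates in this report.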
(In reply to Moe Jette from comment #48)
> I believe this patch fixes the underlying backfill scheduling problem too,
> although I have not been able to reproduce it. The select/cons_res plugin
> tracks resource usage by partition, which is needed for preemption,
> oversubscribing resources, and gang scheduling). If a job's partition
> pointer is wrong (i.e. the nodes in are the m7 partition, but it's partition
> pointer references m6), the select/cons_res is going to get a bit confused.
> Nodes can be moved between partitions, so I'm not expecting slurmctld to
> crash or anything like that (or you would have already seen that), but I
> could see the bookkeeping for a partition getting off.

Okay. That makes sense. I'll let you know on Monday what I see, since most of the jobs in the wrong partition will be done by then, though not all.

By the way, have you considered creating some kind of snapshot mechanism that would capture information about pending and running jobs, node states, associations, etc. to diagnose problems like this? The test suite can already create different setups, but it might be nice to ask a customer to snapshot their setup and then be able to import it into slurmctld/slurmdbd, so you can play around with gdb, extra debug statements, code modifications, or whatever with real data. I have no idea how feasible that is, but it seems like the hardest part would be faking that a currently running job that started 6 hours ago is still running, only now on fake hardware at a different time (maybe all times everywhere have now - snapshot_time added to them). Well, I'm sure there are lots of pretty hard parts.

> Were you able to get rcpacker's job started yet?

Not yet. It should start late Sunday night in the advance reservation I created; it was the earliest time I could find on that hardware.
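The time-shifting idea mentioned above amounts to rebasing every stored timestamp by the offset between import time and snapshot time. Purely illustrative (no such mechanism exists in Slurm), it might look like:

```c
#include <time.h>

/* Illustrative only: replay a snapshot by adding (now - snapshot_time)
 * to every stored timestamp, so a job that had been running 6 hours at
 * snapshot time still appears to have been running 6 hours at import
 * time. Elapsed intervals are preserved by construction. */
static time_t shift_timestamp(time_t stored, time_t snapshot_time, time_t now)
{
    return stored + (now - snapshot_time);
}
```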
(In reply to Ryan Cox from comment #49)
> By the way, have you considered creating some kind of snapshot mechanism
> that will snapshot information about pending and running jobs, node states,
> associations, etc to diagnose problems like this? The test suite can
> already create different setups but it might be nice to ask a customer to
> snapshot their setup then be able to import it into slurmctld/slurmdbd so
> you can play around with gdb, extra debug statements, code modifications,
> or whatever with real data.

Making a copy of the state save files does this, but it's pretty difficult to work with.
Created attachment 1010 [details]
scontrol show job

Sadly, the problem still exists. All running jobs were wiped out by an extended power outage last night, so 1) I can't find any more jobs with the wrong partition and 2) expect a few typos in what I write...

I'm attaching the most recent scontrol show job output. mwpswim can run on any m7 node and rcpacker can only run on 64 GB m7 nodes. Note that we have an advance reservation starting at noon that I will soon assign rcpacker's job to.

Am I misunderstanding something about how backfill works? My understanding is that no lower-priority job should ever start in such a way that it delays the start time of a higher-priority job. Under current conditions, the higher-priority job gets a start time as far away as bf_window (14 days), and any lower-priority job is eligible to start before it as long as its time limit is less than 14 days. They are all less than 14 days, since 7 days is the association maximum, so anyone can "backfill" in front and consequently delay larger jobs as long as resources are available at the moment for the smaller jobs but not the larger ones. That is not intentional, right? (I hope that made sense... I'm a little tired.)
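The concern described above reduces to simple arithmetic (hypothetical helper, not Slurm code): once the blocked job's reserved start has been pushed out to the full bf_window of 14 days, any candidate whose time limit fits before that start appears backfillable on currently idle resources.

```c
#include <stdbool.h>
#include <time.h>

#define BF_WINDOW_SECS (14 * 24 * 3600L)   /* bf_window = 14 days */

/* A candidate appears backfillable if, by its time limit, it would
 * finish before the blocked higher-priority job's reserved start. */
static bool fits_before(time_t now, time_t blocked_start, long limit_secs)
{
    return now + limit_secs <= blocked_start;
}
```

Since the site's association maximum is 7 days, every queued job passes this check against a 14-day-out start, which is why everything can slip in front of the stuck job.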
(In reply to Ryan Cox from comment #51)
> Am I misunderstanding something with how backfill works? My understanding
> is that no lower priority job should ever start in such a way that it
> delays the start time of a higher priority job.

That is correct, although a job's priority reported by squeue is only one factor in the sort order of jobs for the backfill scheduler. The jobs are sorted by:
1. If job X can preempt job Y, then job X gets preference
2. If job X has ANY reservation and job Y has none, then job X gets preference
3. Priority of the job's partition (jobs submitted to multiple partitions are sorted as a separate job for each partition)
4. Job priority
5. Job ID

As it goes down the list, resources are reserved for pending jobs based upon when and where they are expected to start.

> Under current conditions, the higher priority job gets a start time as far
> away as bf_window (14 days) and any lower priority job is eligible to start
> before the higher priority job as long as its time limit is less than 14
> days. They are all less than 14 days since 7 days is the association
> maximum, so anyone can "backfill" in front and consequently delay larger
> jobs as long as resources are available at the moment for the smaller jobs
> but not the larger ones. That is not intentional, right?

I really thought the patches that you have would put the job of interest at the head of the queue for jobs that can run in m7. Given the incorrect partition information reported on jobs, it was very difficult to see what was happening earlier. I have no idea right now why that isn't happening, other than perhaps some QOS or association limit. Could you check that? Perhaps you could enable the Backfill and BackfillMap debug flags for a while, and I can match those logs against the matching job info.
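The five sorting rules above can be sketched as a comparator (a simplified sketch with hypothetical field names; the real logic lives in Slurm's scheduler code and operates on much richer job records):

```c
#include <stdbool.h>

/* Hypothetical, flattened view of one entry in the backfill queue. */
typedef struct {
    bool     can_preempt_other;  /* rule 1: can preempt the other job   */
    bool     has_reservation;    /* rule 2: job has ANY reservation     */
    int      part_priority;      /* rule 3: priority of its partition   */
    unsigned job_priority;       /* rule 4: job priority                */
    unsigned job_id;             /* rule 5: lower job ID wins the tie   */
} queue_rec_t;

/* Return <0 if a should be considered before b, >0 if b before a. */
static int queue_cmp(const queue_rec_t *a, const queue_rec_t *b)
{
    if (a->can_preempt_other != b->can_preempt_other)
        return a->can_preempt_other ? -1 : 1;
    if (a->has_reservation != b->has_reservation)
        return a->has_reservation ? -1 : 1;
    if (a->part_priority != b->part_priority)
        return (a->part_priority > b->part_priority) ? -1 : 1;
    if (a->job_priority != b->job_priority)
        return (a->job_priority > b->job_priority) ? -1 : 1;
    if (a->job_id != b->job_id)
        return (a->job_id < b->job_id) ? -1 : 1;
    return 0;
}
```

Note how rule 2 explains the squeue confusion: a lower-priority job holding any reservation sorts ahead of a higher-priority job without one, so the squeue priority column alone does not predict scheduling order.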
Created attachment 1011 [details]
Possible fix

This patch removes some logic from the backfill scheduler that was designed to avoid duplicate attempts to schedule a job on the same set of resources as a previous test (e.g. when changes to available resources since the previous test occurred only on nodes outside of the partition used by the job being evaluated). I have not been able to come up with a way to reproduce the failure you see, but am suspicious of this code. Removing it (i.e. applying this patch) will in the worst case slow down the backfill scheduler a bit. In the best case, it fixes the problem that you see.
Do you have any update on this?
Sorry I haven't gotten back to you. I haven't seen any problems for the last few days but our queue has a very different makeup now than previously due to a power outage Sunday night. We have lots more jobs from lots more users waiting a lot longer than normal. That said, I haven't noticed any weird behavior since Monday. I did apply the patch in comment 53 but I don't know if that fixed things or if it was because the power outage cleared out all the jobs that were in the wrong partition. I'll keep an eye on it next week once the queue gets smaller. Thanks for checking.
I haven't seen any problems in the last week. Maybe the last patch and/or fixing the jobs' partitions has solved the problem.
Thanks for the feedback. I've applied the most recent patch to the version 14.03 code branch. It will be in version 14.03.5 when released. Please re-open if you encounter more problems.
Thanks for all the time and effort you put into this. I know this was a particularly tough one.