Description
Kilian Cavalotti
2020-03-05 09:45:35 MST
Created attachment 13308 [details]
Reset AccrueTime on "scontrol update job=XX Start=" (v1)

Kilian,

I can't reproduce the issue. The age-based priority is simply the difference between AccrueTime (visible in scontrol show job) and "now". What looks strange to me in the output you attached is that AccrueTime is shifted relative to SubmitTime, but doesn't match EligibleTime/StartTime. The submit time is also not from the day you posted the comment. Is it possible that the job's StartTime was updated with scontrol while the job was pending? That is the only way I can see this happening, and it should be fixed by the attached patch. The patch hasn't passed our Q/A yet, but I think it's safe to apply, and as you know we appreciate users' feedback.

Just for completeness, since it doesn't sound like you'd be interested: if one wants to use "now - SubmitTime" instead of AccrueTime for the age priority factor, this can be achieved with the ACCRUE_ALWAYS flag [1].

cheers,
Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_ACCRUE_ALWAYS

Hi Marcin,

(In reply to Marcin Stolarek from comment #1)
> I can't reproduce the issue.

Which part? The job priority that increases over time for jobs submitted with --begin? Because this is pretty straightforward to reproduce on my end. And we can rule out "scontrol update" scenarios too:

$ sbatch --begin=now+7days --wrap="sleep 1000"
Submitted batch job 62955664
$ squeue -j 62955664 -h -o "%.9i %.8Q %32R"
 62955664    62125 (BeginTime)
$ sleep 500; squeue -j 62955664 -h -o "%.9i %.8Q %32R"
 62955664    62154 (BeginTime)

That job's priority went from 62125 to 62154 in 5mn, despite not being eligible to start for another week.
Here are the full details about the job:

$ scontrol show job 62955664
JobId=62955664 JobName=wrap
   UserId=kilian(215845) GroupId=ruthm(32264) MCS_label=N/A
   Priority=62179 Nice=0 Account=ruthm QOS=normal
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2020-03-09T12:27:29 EligibleTime=2020-03-16T12:27:28
   AccrueTime=2020-03-09T12:27:34
   StartTime=2020-03-16T12:27:28 EndTime=2020-03-16T14:27:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-03-09T12:27:29
   Partition=normal AllocNode:Sid=sh01-ln04:133985
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=6400M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=6400M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/users/kilian
   StdErr=/home/users/kilian/slurm-62955664.out
   StdIn=/dev/null
   StdOut=/home/users/kilian/slurm-62955664.out
   Power=

Here are the priority weights in use:

# sprio -w
  JOBID PARTITION   PRIORITY   SITE     AGE  FAIRSHARE  JOBSIZE  PARTITION     QOS  TRES
Weights                           1  100000     100000  5000000      50000  100000  CPU=0,Mem=0,GRES/gpu

We also have MaxJobsAccruePU=5 and MaxJobsAccruePA=10 on the "normal" (default) QOS. Maybe that can explain the AccrueTime/SubmitTime discrepancy?

Cheers,
--
Kilian

Created attachment 13319 [details]
Fix AccrueTime handling for 20.02 (v2)
Kilian,
Yes, the accrue limits were the key (I should have noticed that when I checked the code earlier). I'm attaching a patch for Slurm 20.02; it will not work on 19.05 because it uses the accrue debug flag, which is missing there.
Do you want to apply the fix on 19.05? If so, I'll prepare a patch for you, but since this isn't critical functionality I don't think it will be merged into 19.05.
cheers,
Marcin
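
For reference, the symptom discussed above can be checked directly from the timestamps in the job record quoted in this ticket. The sketch below parses the time fields and flags a job whose AccrueTime is already set while its EligibleTime is still in the future. The field names come from the scontrol output; the helper names and the observation time are illustrative assumptions, not part of Slurm.

```python
from datetime import datetime

# Time fields copied from the "scontrol show job 62955664" output in this ticket.
SAMPLE = (
    "SubmitTime=2020-03-09T12:27:29 EligibleTime=2020-03-16T12:27:28 "
    "AccrueTime=2020-03-09T12:27:34 StartTime=2020-03-16T12:27:28 Deadline=N/A"
)


def parse_times(text):
    """Return {field: datetime} for Key=timestamp tokens, skipping everything else."""
    times = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue
        try:
            times[key] = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            pass  # not a timestamp, e.g. Deadline=N/A
    return times


def accrues_before_eligible(times, now):
    """Hypothetical helper: True if age is accruing although the job is not yet eligible."""
    accrue = times.get("AccrueTime")
    eligible = times.get("EligibleTime")
    if accrue is None or eligible is None:
        return False
    return accrue <= now < eligible


times = parse_times(SAMPLE)
now = datetime(2020, 3, 9, 12, 40, 0)  # assumed observation time, shortly after submission
print("accrue delay:", times["AccrueTime"] - times["SubmitTime"])
print("accruing before eligible:", accrues_before_eligible(times, now))
```

For the record in this ticket, AccrueTime trails SubmitTime by a few seconds, while EligibleTime is a week away, which is the combination that lets the age factor grow for a job held by --begin.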
Hi Marcin,

Excellent, thanks! We plan to go to 20.02 relatively soon, so I guess we can wait for the patch to be included there and won't need a specific backport for 19.05.

Thank you!
--
Kilian

Kilian,

The issue is fixed by the following commits:
3e1c29f1 - Don't accrue time if job begin time is in the future.
64e9e116 - Remove accrue time when updating a job start/eligible time to future.

Those were merged to the 20.02 branch.

cheers,
Marcin

Excellent, thanks a lot!

Cheers,
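
To illustrate what the two commits change, here is a minimal model of the age contribution to job priority. It is a sketch under stated assumptions, not Slurm's implementation: it assumes the age factor grows linearly from 0 to 1 over PriorityMaxAge and is multiplied by the AGE weight (100000 in this ticket). The 7-day max_age is an assumption, since the site's PriorityMaxAge isn't given in the ticket, and the function name, parameters, and the skip_future_begin switch are purely illustrative.

```python
from datetime import datetime, timedelta


def age_priority(now, accrue_time, begin_time,
                 weight_age=100000,
                 max_age=timedelta(days=7),   # assumed PriorityMaxAge
                 skip_future_begin=True):
    """Illustrative age contribution to priority (not Slurm's actual code).

    skip_future_begin=True models the fixed behaviour described by the commit
    "Don't accrue time if job begin time is in the future".
    """
    if accrue_time is None:
        return 0
    if skip_future_begin and begin_time is not None and begin_time > now:
        return 0
    age = max(now - accrue_time, timedelta(0))
    factor = min(age / max_age, 1.0)  # normalised to the range 0..1
    return int(weight_age * factor)


# Timestamps taken from the job record in this ticket.
accrue = datetime(2020, 3, 9, 12, 27, 34)
begin = datetime(2020, 3, 16, 12, 27, 28)
now = accrue + timedelta(seconds=500)

before_fix = age_priority(now, accrue, begin, skip_future_begin=False)
after_fix = age_priority(now, accrue, begin, skip_future_begin=True)
print("age contribution without the fix:", before_fix)
print("age contribution with the fix:   ", after_fix)
```

The absolute numbers depend on the site's actual PriorityMaxAge and on how often priorities are recalculated, so they aren't expected to match the 62125 to 62154 change reported above. The point is only that the contribution is non-zero without the fix and zero with it while the begin time is still in the future.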