Ticket 8629 - Jobs pending with BeginTime continue to accrue age-based priority
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 19.05.5
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
Reported: 2020-03-05 09:45 MST by Kilian Cavalotti
Modified: 2020-03-18 11:01 MDT
Site: Stanford
Machine Name: Sherlock
Version Fixed: 20.02.1 20.11pre1
Attachments
Reset AccrueTime on "scontrol update job=XX Start=" (v1) (1.18 KB, patch)
2020-03-09 10:52 MDT, Marcin Stolarek
Fix AccrueTime handling for 20.02 (v2) (2.87 KB, patch)
2020-03-10 08:40 MDT, Marcin Stolarek

Description Kilian Cavalotti 2020-03-05 09:45:35 MST
Hi, 

This is a spin-off from #8621, as the symptoms are similar, although the cause may be different, so here's a separate bug report.

Jobs that are submitted with the --begin option seem to continue accruing age-based priority while pending with (BeginTime). It may be less than optimal to have their priority rise while they're not actively looking for an opportunity to run.

Here's an example:

# scontrol show job 62308791
JobId=62308791 JobName=gdrive-backup
   UserId=[...] GroupId=[...] MCS_label=N/A
   Priority=92094 Nice=0 Account=[...] QOS=normal
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2020-02-29T16:45:40 EligibleTime=2020-03-07T16:45:40
   AccrueTime=2020-02-29T22:36:37
   StartTime=2020-03-07T16:45:40 EndTime=2020-03-14T17:45:40 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-02-29T16:45:40
   Partition=[...] AllocNode:Sid=sh01-15n08:2925
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:4:*
   TRES=cpu=4,mem=20G,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=[...]
   WorkDir=[...]
   StdErr=[...]
   StdIn=/dev/null
   StdOut=[...]
   Power=

This job was submitted with "sbatch --begin=now+7days", and its SubmitTime and EligibleTime are consistent with this.

So right now, it's pending with "BeginTime", but its age-based priority still continues to increase over time:

# while true; do date; squeue -j 62308791 -o "%.9i %.8Q %32R"; sleep 300; done
Thu Mar  5 08:10:16 PST 2020
    JOBID PRIORITY NODELIST(REASON)                
 62308791    92119 (BeginTime)                     
Thu Mar  5 08:15:21 PST 2020
    JOBID PRIORITY NODELIST(REASON)                
 62308791    92144 (BeginTime)                     
Thu Mar  5 08:20:53 PST 2020
    JOBID PRIORITY NODELIST(REASON)                
 62308791    92169 (BeginTime)                     
Thu Mar  5 08:25:02 PST 2020
    JOBID PRIORITY NODELIST(REASON)                
 62308791    92193 (BeginTime)                     
Thu Mar  5 08:30:25 PST 2020
    JOBID PRIORITY NODELIST(REASON)
 62308791    92218 (BeginTime)
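
The growth rate implied by those samples can be worked out directly. The following is a minimal sketch in Python that only uses the timestamps and priorities printed above; the function name growth is hypothetical, and the date is written in ISO form for easier parsing.

```python
from datetime import datetime

# (timestamp, priority) pairs copied from the squeue loop output above.
SAMPLES = [
    ("2020-03-05 08:10:16", 92119),
    ("2020-03-05 08:15:21", 92144),
    ("2020-03-05 08:20:53", 92169),
    ("2020-03-05 08:25:02", 92193),
    ("2020-03-05 08:30:25", 92218),
]


def growth(samples):
    """Return (priority_gain, elapsed_seconds) between first and last sample."""
    fmt = "%Y-%m-%d %H:%M:%S"
    t0 = datetime.strptime(samples[0][0], fmt)
    t1 = datetime.strptime(samples[-1][0], fmt)
    return samples[-1][1] - samples[0][1], int((t1 - t0).total_seconds())


gain, elapsed = growth(SAMPLES)
per_5min = gain / elapsed * 300  # normalise to a 5-minute interval
print(f"+{gain} priority in {elapsed} s, about {per_5min:.1f} per 5 minutes")
```

So the job gains roughly 25 priority points every 5 minutes while it is still pending with (BeginTime).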

Yet, trying to get details with sprio fails, as sprio seems to explicitly ignore jobs in that state: 

# sprio -j 62308791
Unable to find jobs matching user/id(s) specified

So it's not clear why the priority of those jobs still increases over time, but they definitely end up at the top of the queue, and when their EligibleTime arrives, they're pretty much skipping the whole line.

Wouldn't it be better if age-based priority accrual were to be suspended until EligibleTime?

Thanks!
-- 
Kilian
Comment 1 Marcin Stolarek 2020-03-09 10:52:22 MDT
Created attachment 13308 [details]
Reset AccrueTime on "scontrol update job=XX Start=" (v1)

Kilian,

I can't reproduce the issue. The age-based priority is simply calculated from the difference between the AccrueTime (visible in scontrol show job) and "now".

What strikes me as strange in the output you attached is that AccrueTime is shifted compared to SubmitTime, but it doesn't match EligibleTime/StartTime. The SubmitTime is also not from the day you posted the comment. Is it possible that the job's StartTime was updated by scontrol while the job was pending? This is the only way I can see this happening, and it should be fixed by the attached patch.

The patch didn't pass our Q/A, but I think it's safe to apply and as you know we appreciate users' feedback.

Just for completeness, since it doesn't sound like you'd be interested: if one wants to use "now - SubmitTime" instead of AccrueTime for the age priority factor, this can be achieved with the ACCRUE_ALWAYS flag [1].

cheers,
Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_ACCRUE_ALWAYS
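
To make the "difference between AccrueTime and now" description concrete, here is a rough sketch in Python of how an age contribution of that kind could be computed. It is not Slurm's actual code: the function name age_priority is hypothetical, the age weight of 100000 is the AGE value shown by "sprio -w" in this ticket, and the 14-day maximum age is purely an assumption chosen for illustration (the site's PriorityMaxAge is not given anywhere in the ticket).

```python
def age_priority(now, accrue_time, weight_age, max_age):
    """Age contribution: weight * min(1, elapsed / max_age), never negative.

    All times are in seconds since the epoch. With ACCRUE_ALWAYS, the
    reference point would be SubmitTime instead of AccrueTime.
    """
    elapsed = max(0, now - accrue_time)
    factor = min(1.0, elapsed / max_age)
    return weight_age * factor


WEIGHT_AGE = 100000    # AGE weight reported by `sprio -w` in this ticket
MAX_AGE = 14 * 86400   # ASSUMPTION: hypothetical maximum age of 14 days

# Priority gained over an extra 300 seconds of accrued age.
delta = (age_priority(600, 0, WEIGHT_AGE, MAX_AGE)
         - age_priority(300, 0, WEIGHT_AGE, MAX_AGE))
print(f"gain per 5 minutes: {delta:.1f}")

# The factor saturates once the job is older than MAX_AGE.
print(age_priority(10**9, 0, WEIGHT_AGE, MAX_AGE))
```

Under these assumptions, any job that has an AccrueTime in the past gains priority at a constant rate until the factor saturates, regardless of whether it is actually eligible to run.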
Comment 2 Kilian Cavalotti 2020-03-09 13:44:52 MDT
Hi Marcin, 

(In reply to Marcin Stolarek from comment #1)
> I can't reproduce the issue. 

Which part? The job priority that increases over time for jobs submitted with --begin? Because this is pretty straightforward to reproduce on my end. And we can rule out "scontrol update" scenarios too:

$ sbatch --begin=now+7days --wrap="sleep 1000"
Submitted batch job 62955664
$ squeue -j 62955664 -h -o "%.9i %.8Q %32R"
 62955664    62125 (BeginTime)
$ sleep 500; squeue -j 62955664 -h -o "%.9i %.8Q %32R"
 62955664    62154 (BeginTime)

That job's priority went from 62125 to 62154 in 5mn, despite not being eligible to start for another week.

Here are the full details about the job:
$ scontrol show job 62955664
JobId=62955664 JobName=wrap
   UserId=kilian(215845) GroupId=ruthm(32264) MCS_label=N/A
   Priority=62179 Nice=0 Account=ruthm QOS=normal
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2020-03-09T12:27:29 EligibleTime=2020-03-16T12:27:28
   AccrueTime=2020-03-09T12:27:34
   StartTime=2020-03-16T12:27:28 EndTime=2020-03-16T14:27:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-09T12:27:29
   Partition=normal AllocNode:Sid=sh01-ln04:133985
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=6400M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=6400M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/users/kilian
   StdErr=/home/users/kilian/slurm-62955664.out
   StdIn=/dev/null
   StdOut=/home/users/kilian/slurm-62955664.out
   Power=

Here are the priority weights in use:
# sprio -w
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS                 TRES
        Weights                               1     100000     100000    5000000      50000     100000 CPU=0,Mem=0,GRES/gpu

And we also have MaxJobsAccruePU=5 and MaxJobsAccruePA=10 on the "normal" (default) QOS, maybe that can explain the AccrueTime/SubmitTime discrepancy?


Cheers,
-- 
Kilian
Comment 3 Marcin Stolarek 2020-03-10 08:40:32 MDT
Created attachment 13319 [details]
Fix AccrueTime handling for 20.02 (v2)

Kilian,

Yes, accrue limits were the key (actually, I should have noticed that when checking the code previously). I'm attaching the patch for Slurm 20.02. It will not work on 19.05 because it uses the accrue debug flag, which is missing there.

Do you want to apply the fix on 19.05? If so, I'll prepare a patch for you, but since it's not critical functionality, I don't think it will be merged into 19.05.

cheers,
Marcin
Comment 6 Kilian Cavalotti 2020-03-10 10:20:55 MDT
Hi Marcin, 

Excellent, thanks!

We plan to go to 20.02 relatively soon, so I guess we can wait for the patch to be included there and won't need a specific backport for 19.05.

Thank you!
-- 
Kilian
Comment 11 Marcin Stolarek 2020-03-18 10:59:25 MDT
Kilian,

The issue is fixed by the following commits:
3e1c29f1 - Don't accrue time if job begin time is in the future.
64e9e116 - Remove accrue time when updating a job start/eligible time to future.

Those were merged to 20.02 branch.
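
The behaviour described by the two commit messages can be summarised with a small toy model in Python. This is only a sketch based on the commit titles above, not the actual Slurm implementation: the function name accrue_start and its arguments are hypothetical, times are plain epoch seconds, and accrue limits such as MaxJobsAccruePU are deliberately left out.

```python
def accrue_start(now, begin_time, accrue_time):
    """Return the AccrueTime a job should carry after a scheduling pass.

    None means the job is not accruing age-based priority.
    """
    if begin_time is not None and begin_time > now:
        # Begin time is in the future: do not accrue, and drop any
        # AccrueTime that was set before the begin time was moved.
        return None
    # Job is eligible: keep an existing AccrueTime, otherwise start now.
    return accrue_time if accrue_time is not None else now


# Job submitted with a begin time still in the future: no accrual.
print(accrue_start(now=100, begin_time=200, accrue_time=50))
# Begin time has passed and nothing was accrued yet: accrual starts now.
print(accrue_start(now=300, begin_time=200, accrue_time=None))
# Already accruing: the original AccrueTime is kept.
print(accrue_start(now=400, begin_time=200, accrue_time=300))
```

In this model a job submitted with --begin only starts aging once its begin time has been reached, which is the behaviour requested in the description.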

cheers,
Marcin
Comment 12 Kilian Cavalotti 2020-03-18 11:01:42 MDT
Excellent, thanks a lot!

Cheers,