Ticket 15459

Summary: Question about job preemption
Product: Slurm
Reporter: Steve Ford <fordste5>
Component: Scheduling
Assignee: Ben Glines <ben.glines>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
Site: MSU
Attachments: Slurm Configuration

Description Steve Ford 2022-11-18 09:48:35 MST
Created attachment 27837
Slurm Configuration

Hello SchedMD,

We have job preemption configured on our system and I am wondering if the preemption logic runs in the main scheduler or if it is only during backfill scheduling. Can you clarify?

Thanks,
Steve
Comment 1 Ben Glines 2022-11-18 14:21:59 MST
Hi Steve,

Jobs will attempt to preempt upon submission, regardless of the main/backfill scheduling.

To demonstrate this, I set bf_interval and sched_interval to high values so that the periodic backfill and main scheduling cycles effectively never run. I then submit a job that preempts another job, and show that it still preempts even though main/backfill scheduling is not happening.

slurm.conf
> PreemptType=preempt/partition_prio
> PreemptMode=REQUEUE
> . . . 
> SchedulerParameters=bf_interval=1000,sched_interval=1000
> . . .
> PartitionName=A Nodes=n-[1-3] Default=YES MaxTime=INFINITE State=UP PriorityTier=1
> PartitionName=B Nodes=n-[1-3] Default=no MaxTime=INFINITE State=UP PriorityTier=2


Submit job to lower priority partition:
> $ sbatch --wrap="sleep 100000" -wn-1 --exclusive --partition=A
> Submitted batch job 570
> $ squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                570         A     wrap benjamin  R       0:02      1 n-1

Submit job to higher priority partition that will preempt previous job:
> $ sbatch --wrap="sleep 100000" -wn-1 --exclusive --partition=B
> Submitted batch job 571
> $ squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                570         A     wrap benjamin PD       0:00      1 (BeginTime)
>                571         B     wrap benjamin  R       0:04      1 n-1

Note that when I submitted the second job (571), it immediately preempted the first (570) and did not wait for the main/backfill scheduling.
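
As a rough illustration of why job 571 is allowed to preempt job 570 with PreemptType=preempt/partition_prio, the decision can be modeled as a comparison of the partitions' PriorityTier values. This is a toy Python sketch and not Slurm code; the names `PRIORITY_TIER` and `can_preempt` are hypothetical and only mirror the partition definitions from the slurm.conf excerpt above.

```python
# Toy model of the preempt/partition_prio rule; NOT Slurm source code.
# PriorityTier values are copied from the slurm.conf excerpt in this comment.
PRIORITY_TIER = {"A": 1, "B": 2}


def can_preempt(preemptor_partition: str, victim_partition: str) -> bool:
    """Return True if a job in preemptor_partition may preempt a job in
    victim_partition, i.e. its partition has a strictly higher PriorityTier."""
    return PRIORITY_TIER[preemptor_partition] > PRIORITY_TIER[victim_partition]


if __name__ == "__main__":
    print(can_preempt("B", "A"))  # job 571 (partition B) vs job 570 (partition A)
    print(can_preempt("A", "B"))  # the reverse direction
```

In this model only the B-over-A direction is allowed, which matches the squeue output above where job 570 is requeued and job 571 runs on n-1.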

In addition to happening upon submission, preemption will also happen for jobs started by the backfill scheduler. There is a minor limitation, though: a job started by backfill may preempt more resources than it requested (whole nodes instead of partial nodes). Read more about this here: https://slurm.schedmd.com/preempt.html#limitations

Let me know if you have any questions about this.
Comment 2 Steve Ford 2022-11-21 15:26:10 MST
Hello Ben,

Thank you for the information. I have another question. I'm wondering what happens when the main scheduler queue hits max_sched_time. Does whatever portion of the job queue that the main scheduler hasn't evaluated stay unevaluated until the queue is smaller or will the main scheduler continue where it left off on the next cycle? 

Thanks,
Steve
Comment 3 Ben Glines 2022-11-23 14:35:20 MST
The portion of the job queue that hasn't been evaluated will stay "unevaluated" until the queue is smaller.

When using priority/multifactor, the main scheduler builds an unordered list of pending jobs, sorts those jobs by priority, and then schedules jobs until it hits max_sched_time. A new queue of jobs is created and sorted every cycle before any scheduling happens. This queue is then freed at the end of the scheduling cycle and is not carried over to the next cycle.

The scheduler places the jobs with the highest priority at the front of the job queue, so that those jobs are scheduled first. Any jobs that the scheduler didn't reach are of lower priority (as defined by the priority weights and options you have set), and thus are not considered for scheduling at that time, but would be as soon as the queue gets smaller.
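
The behavior described above can be sketched as a small simulation. This is only an illustrative model under stated assumptions, not Slurm's implementation: `run_cycle` and `cost_per_job` are hypothetical names, each job evaluation is assumed to cost one abstract time unit, and `max_sched_time` is treated as a budget in those same units rather than real seconds.

```python
# Simplified model of one main-scheduler cycle; NOT Slurm source code.
# Jobs are (job_id, priority) tuples. Assumption: evaluating a job costs a
# fixed, abstract amount of time (cost_per_job).


def run_cycle(pending, max_sched_time, cost_per_job=1.0):
    """Return the ids of the jobs evaluated before the time budget ran out."""
    # A fresh queue is built and sorted every cycle, highest priority first.
    queue = sorted(pending, key=lambda job: job[1], reverse=True)
    elapsed = 0.0
    evaluated = []
    for job_id, _priority in queue:
        if elapsed >= max_sched_time:
            break  # budget exhausted; the rest of the queue is not looked at
        evaluated.append(job_id)
        elapsed += cost_per_job
    # The queue is local to this call, so nothing carries over to the next cycle.
    return evaluated


if __name__ == "__main__":
    pending = [(1, 10), (2, 50), (3, 30), (4, 20), (5, 40)]
    print(run_cycle(pending, max_sched_time=3))  # first cycle
    print(run_cycle(pending, max_sched_time=3))  # next cycle starts from the top again
    # Once the two highest-priority jobs have left the queue, the rest are reached.
    print(run_cycle([(1, 10), (3, 30), (4, 20)], max_sched_time=3))
```

With five pending jobs and a budget of three evaluations, both cycles evaluate the same three highest-priority jobs; the two lowest-priority jobs are only reached once the queue has shrunk.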
Comment 4 Ben Glines 2022-12-05 12:57:49 MST
Do you have any other questions about this? If not, I'll close this out.
Comment 5 Steve Ford 2022-12-05 12:59:06 MST
Hello Ben,

Go ahead and close this request.

Thanks,
Steve
Comment 6 Ben Glines 2022-12-05 13:09:14 MST
Closing now