Ticket 2531

Summary: preemption candidate selection order with multiple partitions
Product: Slurm    Reporter: Michael Gutteridge <mrg>
Component: Scheduling    Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex
Version: 15.08.7   
Hardware: Linux   
OS: Linux   
Site: FHCRC - Fred Hutchinson Cancer Research Center
Attachments: hutch slurm.conf
output of sacctmgr show qos

Description Michael Gutteridge 2016-03-09 04:09:24 MST
I am testing out QOS based preemption in the hopes it will improve our preemption times and it appears that the change from partition-based to qos-based has changed how nodes are selected for preemption.

Some jobs are submitted with multiple partitions (via a job submit plugin)- these will request a private partition and the general purpose public partition named "campus":

    sbatch -p private,campus ...

With partition preemption, the job would preempt jobs on nodes in the private partition (if available), then preempt jobs in the campus partition.  Now it appears that the job does not use that same logic.  My cursory diagnosis suggests that the partition order is alphabetical, in that:

    sbatch -p alpha,campus ...

will preferentially preempt jobs from "alpha" while:

    sbatch -p delta,campus

will preferentially preempt jobs from "campus".

This is despite the fact that I've given "campus" a lower priority than any of the other partitions (not sure if that actually matters, though).  The order of partitions on the command line doesn't seem to make a difference either (again, don't think that matters, but I'm throwing that in there with the other straws I'm grasping).

Not sure what the expected or desired behavior is.  Can you shed some light on this?

Thanks much
Comment 1 Alejandro Sanchez 2016-03-09 18:07:04 MST
Hi Michael,

We're going to check this and come back to you soon.
Comment 2 Alejandro Sanchez 2016-03-09 18:37:35 MST
In the meantime, could you please attach your site's slurm.conf?
Comment 3 Michael Gutteridge 2016-03-10 03:24:07 MST
Created attachment 2850 [details]
hutch slurm.conf
Comment 4 Michael Gutteridge 2016-03-10 03:26:26 MST
Created attachment 2851 [details]
output of sacctmgr show qos
Comment 5 Alejandro Sanchez 2016-03-10 04:57:42 MST
Isn't partition campus configured with PreemptMode=OFF in your slurm.conf? Also, some partition names are not shown in the attachments, so I can't figure out which partitions relate to which QOS.
Comment 6 Michael Gutteridge 2016-03-10 08:18:15 MST
(In reply to Alejandro Sanchez from comment #5)
> Isn't partition campus configured with PreemptMode=OFF in your slurm.conf?

Sorry, probably not really clear on that.  The preemptable jobs are running in a partition called "restart" which overlaps the campus and other partitions.  There's a job submit plugin which gives jobs submitted to the restart partition the "restart" qos.

The "restart" partition does have PreemptMode=requeue

> Also there are some partition names where their name is not shown, so I
> can't figure out which partitions relate to which QOS.

Sorry about that- the partitions are named and I scrubbed them out of an abundance of caution.  These partitions get the "private" QOS for the partition QOS and the default QOS of "normal" for jobs.  The "normal" qos is able to preempt jobs in "restart" and "boneyard_restart" QOS's.

restart and boneyard_restart have a partition QOS of "nolimit" which has... well, no limits.  The submit plugin sets jobs submitted to these partitions to a QOS of the same name.

Hopefully that clears things up a little bit.
Comment 7 Alejandro Sanchez 2016-03-10 22:04:01 MST
The PreemptType=preempt/qos process works as follows:

The plugin first checks that the global PreemptMode!=OFF. In your case it is PreemptMode=REQUEUE, so it's fine and continues. Then the plugin is asked for the submitted job's QOS PreemptMode, just in case it overrides the default.

The plugin then builds a list of preemption job candidates out of all jobs. Every job in the preemption job candidate list must satisfy these 3 conditions:

1. Its job state must be RUNNING or SUSPENDED.
2. Its job QOS must appear in the Preempt parameter of the preemptor job's QOS.
3. It cannot be part of an expansion of the preemptor job (related to job resize).

Then the plugin assigns a priority to every job in that list and orders the list by that priority. The priority is based partly on the QOS priority and partly on the job size. The plugin puts smaller jobs at the top of the preemption queue and uses a sort algorithm to minimize the number of jobs preempted.

For example, to start an 8 node job, the ordered preemption candidates may be 2 node, 4 node and 8 node. Preempting all three jobs would allow the pending job to start, but by reordering the preemption candidates it is possible to start the pending job after preempting only one job.
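The reordering idea in the example above can be sketched like this. Again, this is a simplified illustration of the "minimize preemptions" heuristic, not the actual select/preempt plugin logic; the function names and the nodes-only size model are assumptions.

```python
def order_candidates(sizes):
    """Smaller jobs first, as described above (sizes in nodes)."""
    return sorted(sizes)

def minimal_preemption_set(needed_nodes, sizes):
    """Greedy reorder: if one candidate alone frees enough nodes,
    preempt just that one; otherwise accumulate from the smallest up."""
    ordered = order_candidates(sizes)
    for size in ordered:
        if size >= needed_nodes:
            return [size]  # a single preemption is enough
    chosen, freed = [], 0
    for size in ordered:
        chosen.append(size)
        freed += size
        if freed >= needed_nodes:
            break
    return chosen
```

For the 8-node example: with candidates of 2, 4 and 8 nodes, preempting all three would work, but the reordering finds that preempting only the 8-node job suffices.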

Finally, there are two SchedulerParameters that may be of your interest:
preempt_reorder_count=#
preempt_strict_order
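For reference, both are set in slurm.conf via SchedulerParameters; the values below are only examples, not a recommendation for this site:

```
# slurm.conf (example values only)
SchedulerParameters=preempt_strict_order,preempt_reorder_count=2
```

preempt_strict_order enforces preempting strictly in priority order, while preempt_reorder_count controls how many times the candidate list is reordered when trying to minimize preemptions.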

Hope this helps you understand how this process works.
Comment 8 Michael Gutteridge 2016-03-14 05:09:26 MDT
Ok, that makes sense.  In most cases I'd expect that minimizing the number of jobs preempted would be valuable.

Thanks

m