| Summary: | preemption candidate selection order with multiple partitions | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Gutteridge <mrg> |
| Component: | Scheduling | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | alex |
| Version: | 15.08.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | FHCRC - Fred Hutchinson Cancer Research Center | Alineos Sites: | --- |
| Attachments: | hutch slurm.conf; output of sacctmgr show qos | | |
Hi Michael, we're going to check this and come back to you soon. In the meantime, could you please attach your site's slurm.conf?

Created attachment 2850 [details]
hutch slurm.conf

Created attachment 2851 [details]
output of sacctmgr show qos
Isn't partition campus configured with PreemptMode=OFF in your slurm.conf? Also, some partition names are not shown, so I can't figure out which partitions relate to which QOS.

(In reply to Alejandro Sanchez from comment #5)
> Isn't partition campus configured with PreemptMode=OFF in your slurm.conf?

Sorry, I probably wasn't clear on that. The preemptable jobs run in a partition called "restart", which overlaps the campus and other partitions. A job submit plugin gives jobs submitted to the restart partition the "restart" QOS. The "restart" partition does have PreemptMode=requeue.

> Also there are some partition names where their name is not shown, so I
> can't figure out which partitions relate to which QOS.

Sorry about that: the partitions are named, and I scrubbed the names out of an abundance of caution. These partitions get the "private" QOS as the partition QOS and the default QOS of "normal" for jobs. The "normal" QOS is able to preempt jobs in the "restart" and "boneyard_restart" QOSes. restart and boneyard_restart have a partition QOS of "nolimit", which has... well, no limits. The submit plugin sets jobs submitted to these partitions to a QOS of the same name. Hopefully that clears things up a little bit.

The PreemptType=preempt/qos process works as follows. The plugin first checks that the global PreemptMode is not OFF; in your case it is PreemptMode=REQUEUE, so the check passes and it continues. The plugin is then asked for the submitted job's QOS PreemptMode, in case it overrides the default. The plugin then builds a list of preemption candidates out of all jobs. Every job in the candidate list must satisfy these three conditions:

1. Its job state must be RUNNING or SUSPENDED.
2. Its job QOS must appear in the Preempt parameter of the preemptor job's QOS.
3. It cannot be an expanding part of any preemptor job (related to job resize).
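The three candidate conditions above can be sketched in a few lines. This is a simplified illustration, not Slurm source code; the `QOS` and `Job` records and their field names are assumptions made for the example.

```python
# Hypothetical sketch (not Slurm source) of how preempt/qos filters
# preemption candidates, using simplified Job/QOS records.
from dataclasses import dataclass, field

@dataclass
class QOS:
    name: str
    priority: int = 0
    preempt: set = field(default_factory=set)  # names of QOSes this QOS may preempt

@dataclass
class Job:
    id: int
    state: str                 # e.g. "RUNNING", "SUSPENDED", "PENDING"
    qos: QOS
    node_count: int = 1
    expanding_job_id: int = 0  # id of the job this job is expanding; 0 = none

def preemption_candidates(preemptor: Job, jobs: list) -> list:
    """Return jobs that satisfy the three preempt/qos candidate conditions."""
    out = []
    for j in jobs:
        if j.state not in ("RUNNING", "SUSPENDED"):   # condition 1
            continue
        if j.qos.name not in preemptor.qos.preempt:   # condition 2
            continue
        if j.expanding_job_id == preemptor.id:        # condition 3
            continue
        out.append(j)
    return out
```

With a "normal" QOS whose Preempt list contains "restart", only running or suspended jobs in the "restart" QOS survive the filter.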
The plugin then assigns a priority to every job in that list and orders the list by that priority. The priority is based partly on the QOS priority and partly on the job size: the plugin puts smaller jobs at the top of the preemption queue and uses a sort algorithm to minimize the number of jobs preempted. For example, to start an 8-node job, the ordered preemption candidates might be a 2-node, a 4-node, and an 8-node job. Preempting all three jobs would allow the pending job to start, but by reordering the preemption candidates it is possible to start the pending job after preempting only the 8-node job.

Finally, there are two SchedulerParameters that may be of interest:

- preempt_reorder_count=#
- preempt_strict_order

Hope this helps you understand how this process works.

Ok, that makes sense. In most cases I'd expect that minimizing the number of jobs preempted would be valuable.

Thanks

m
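The ordering and reordering behavior described in the comment above can be sketched as follows. This is a simplified, self-contained model, not Slurm source; the greedy "prefer one large-enough job" pass is only a rough stand-in for the reordering that preempt_strict_order and preempt_reorder_count tune in the real scheduler.

```python
# Hypothetical sketch of candidate ordering and the "minimize preemptions"
# reordering idea. Cand fields are assumptions made for the example.
from collections import namedtuple

Cand = namedtuple("Cand", "id qos_priority node_count")

def order_candidates(cands):
    # Lower-QOS-priority jobs are preempted first; among equals,
    # smaller jobs sort to the top of the preemption queue.
    return sorted(cands, key=lambda c: (c.qos_priority, c.node_count))

def pick_preemptees(cands, nodes_needed):
    ordered = order_candidates(cands)
    # Reordering pass: the smallest single candidate that alone frees
    # enough nodes beats preempting several smaller jobs.
    big_enough = [c for c in ordered if c.node_count >= nodes_needed]
    if big_enough:
        return [min(big_enough, key=lambda c: c.node_count)]
    # Otherwise accumulate candidates in order until enough nodes free up.
    chosen, freed = [], 0
    for c in ordered:
        chosen.append(c)
        freed += c.node_count
        if freed >= nodes_needed:
            break
    return chosen

# The 2/4/8-node example from the comment: an 8-node pending job
# ends up preempting only the single 8-node candidate.
cands = [Cand(1, 0, 2), Cand(2, 0, 4), Cand(3, 0, 8)]
```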
I am testing out QOS-based preemption in the hope that it will improve our preemption times, and it appears that the change from partition-based to QOS-based preemption has changed how nodes are selected for preemption. Some jobs are submitted with multiple partitions (via a job submit plugin); these request a private partition plus the general-purpose public partition named "campus":

sbatch -p private,campus ...

With partition preemption, the job would preempt jobs on nodes in the private partition (if available), then preempt jobs in the campus partition. Now the job does not appear to use that same logic. My cursory diagnosis suggests that the partition order is alphabetical, in that:

sbatch -p alpha,campus ...

will preferentially preempt jobs from "alpha", while:

sbatch -p delta,campus ...

will preferentially preempt jobs from "campus". This is despite the fact that I've given "campus" a lower priority than any of the other partitions (not sure whether that actually matters). The order of partitions on the command line doesn't seem to make a difference either (again, I don't think that matters, but I'm throwing it in there with the other straws I'm grasping).

Not sure what the expected or desired behavior is. Can you shed some light on this? Thanks much
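The reporter's hypothesis above can be stated as a one-liner. This is speculation restated as code, not confirmed Slurm behavior: it assumes the scheduler considers the requested partitions in alphabetical order rather than command-line order.

```python
# Speculative illustration (an assumption, not confirmed Slurm behavior) of
# the reporter's hypothesis: with multiple -p partitions, preemption appears
# to be attempted in alphabetical partition order, not command-line order.
def first_preemption_partition(requested_partitions):
    return sorted(requested_partitions)[0]

# Matches the two observations in the report:
# -p alpha,campus preempts in "alpha"; -p delta,campus preempts in "campus".
```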