Created attachment 5541 [details]
slurm configuration file

Hello,

I am trying to set up a preemptible "test" partition and a "hiprio" partition, where jobs submitted to the hiprio partition will preempt jobs in the test partition (both partitions share the same nodes). I set "PreemptType=preempt/partition_prio", the global "PreemptMode=SUSPEND,GANG", and the PreemptMode for the test partition to "SUSPEND" or (preferably) "CANCEL". I also set hiprio as a higher priority tier partition with OverSubscribe=FORCE:1.

However, when I occupy all nodes of the test partition and then submit a job to the hiprio partition, nothing happens: the hiprio job remains in a pending state (Resources). What am I missing, or doing wrong?

My latest slurm.conf is attached. Let me know if you need anything else.

Thanks!
Rob
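For reference, the setup described above corresponds roughly to slurm.conf lines like the following. This is only a sketch of the described configuration; the node names and tier values are hypothetical, not taken from the attached file:

```ini
# Cluster-wide preemption settings (as described above)
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# Both partitions share the same nodes; hiprio sits in a higher priority tier
PartitionName=test   Nodes=n[001-004] Default=YES PreemptMode=SUSPEND OverSubscribe=FORCE:1 PriorityTier=1
PartitionName=hiprio Nodes=n[001-004] Default=NO  OverSubscribe=FORCE:1 PriorityTier=10
```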
Hi. For PreemptMode=SUSPEND, suspended jobs remain in memory, meaning that the preemptor and the preempted jobs both have to fit in the node(s)' memory at once for the preemption to happen. Have you checked that? For PreemptMode=CANCEL this is not required.
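The memory-fit condition for SUSPEND preemption can be illustrated with a small sketch. The numbers below are hypothetical (7837 MB matches the RealMemory in the example config later in this thread); this is not Slurm's actual scheduling code, just the arithmetic it implies:

```python
def suspend_preemption_fits(node_mem_mb, preempted_mem_mb, preemptor_mem_mb):
    """With PreemptMode=SUSPEND the suspended job stays resident in RAM,
    so both jobs' memory must fit on the node at the same time."""
    return preempted_mem_mb + preemptor_mem_mb <= node_mem_mb

# A 6000 MB job cannot be suspended to make room for a 4000 MB job
# on a 7837 MB node: 6000 + 4000 > 7837.
print(suspend_preemption_fits(7837, 6000, 4000))  # False
# A 3000 MB job can: 3000 + 4000 <= 7837.
print(suspend_preemption_fits(7837, 3000, 4000))  # True
```

With PreemptMode=CANCEL (or REQUEUE) the preempted job vacates the node entirely, so only the preemptor's memory request has to fit.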
Hi Alejandro,

Thank you for your response. Yes, that makes sense for PreemptMode=SUSPEND, so that mode may not work out for us with our configuration. But preemption still does not work with PreemptMode=CANCEL. Our preferred method for this situation is PreemptMode=CANCEL, though PreemptMode=REQUEUE could also work. Are these modes possible with "preempt/partition_prio"? If so, what am I missing that is causing it not to work? If not, is there another way I can get PreemptMode=CANCEL behavior on a per-partition basis?

Thanks,
Rob
Hi Robert. If you don't care about SUSPEND,GANG and you only want to make use of CANCEL and/or REQUEUE, a configuration like this is working for me:

# Cluster-wide config
PreemptType=preempt/partition_prio
PreemptMode=CANCEL

# Compute nodes
NodeName=compute[1-2] SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7837 NodeHostname=ibiza MemSpecLimit=837 State=UNKNOWN Port=61711-61712

# Partitions
PartitionName=lowprio Nodes=ALL Default=YES State=UP PreemptMode=CANCEL PriorityTier=1
PartitionName=medprio Nodes=ALL Default=NO State=UP PreemptMode=REQUEUE PriorityTier=10
PartitionName=hiprio Nodes=ALL Default=NO State=UP PreemptMode=OFF PriorityTier=100

Note that since you don't care about SUSPEND,GANG, you don't need OverSubscribe=FORCE:1 anymore.

$ sbatch --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20017
$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 20017   lowprio wrap alex  R 0:05     2 compute[1-2]

$ sbatch -p medprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20018
$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 20018   medprio wrap alex  R 0:02     2 compute[1-2]

slurmctld: preempted job 20017 has been killed to reclaim resources for job 20018

$ sbatch -p hiprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20019
$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 20018   medprio wrap alex CG 0:00     1 compute1
 20019    hiprio wrap alex PD 0:00     2 (Resources)
$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 20018   medprio wrap alex PD 0:00     2 (BeginTime)
 20019    hiprio wrap alex  R 0:01     2 compute[1-2]

slurmctld: preempted job 20018 has been requeued to reclaim resources for job 20019

Could you please configure your slurm.conf to something similar to what I have, but with the 'test' and 'hiprio' partitions, and then restart Slurm? If it's still not working, could you attach the newly changed slurm.conf and slurmctld.log files? Thanks.
Created attachment 5547 [details]
slurm.conf

Hi Alejandro,

Thank you for the info, and for providing the example. I set up partitions similar to yours below, but it still does not work. When I run "scontrol show partition", I see that PriorityTier never changed, so maybe the problem is related to that:

PartitionName=lowprio
   AllowGroups=hpcrcf AllowAccounts=hpcrcf AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=n[095]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=CANCEL
   State=UP TotalCPUs=28 TotalNodes=1 SelectTypeParameters=NONE
   DefMemPerCPU=3800 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,Mem=1.0G

PartitionName=medprio
   AllowGroups=hpcrcf AllowAccounts=hpcrcf AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=n[095]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=28 TotalNodes=1 SelectTypeParameters=NONE
   DefMemPerCPU=3800 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,Mem=1.0G

PartitionName=hiprio
   AllowGroups=hpcrcf AllowAccounts=hpcrcf AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=n[095]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=28 TotalNodes=1 SelectTypeParameters=NONE
   DefMemPerCPU=3800 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,Mem=1.0G

This was true whether I did "scontrol reconfig" or "systemctl restart slurmctld".
What would prevent PriorityTier from being changed? My latest slurm.conf is attached.

Thanks,
Rob
Hi Robert.

Short answer: remove the "Priority" option from your partition definition lines and only set PriorityJobFactor and/or PriorityTier, then run 'scontrol reconfigure'.

Long answer: the following commit (available since slurm-16-05-0-0pre2 onwards):

https://github.com/SchedMD/slurm/commit/7844563c086919175def

split the partition "Priority" field into "PriorityTier" (used to order partitions for scheduling and preemption) plus "PriorityJobFactor" (used by the priority/multifactor plugin in calculating job priority, which is used to order jobs within a partition for scheduling). What is happening is that since you have "Priority" set to 1, its value is overriding both PriorityJobFactor and PriorityTier:

...
if (!s_p_get_uint16(&p->priority_tier, "PriorityTier", tbl) &&
    !s_p_get_uint16(&p->priority_tier, "PriorityTier", dflt)) {
	p->priority_tier = 1;
}
if (s_p_get_uint16(&tmp_16, "Priority", tbl) ||
    s_p_get_uint16(&tmp_16, "Priority", dflt)) {
	p->priority_job_factor = tmp_16;
	p->priority_tier = tmp_16;
}
...

(code snippet from src/common/read_config.c)
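The precedence in that C snippet can be modeled with a short Python sketch. This is an illustration of the parsing logic only, not Slurm code; the function name and the dict-based option representation are invented for the example:

```python
def resolve_partition_priority(opts):
    """Model of the read_config.c precedence: opts is a dict of options
    parsed from one PartitionName= line in slurm.conf."""
    tier = opts.get("PriorityTier", 1)            # default tier is 1
    job_factor = opts.get("PriorityJobFactor", 1)
    if "Priority" in opts:                        # legacy option wins last,
        job_factor = opts["Priority"]             # overwriting both fields
        tier = opts["Priority"]
    return tier, job_factor

# The reported case: Priority=1 silently clobbers PriorityTier=100.
print(resolve_partition_priority({"PriorityTier": 100, "Priority": 1}))  # (1, 1)
# After removing Priority, PriorityTier takes effect.
print(resolve_partition_priority({"PriorityTier": 100}))                 # (100, 1)
```

This is why the partitions all showed PriorityTier=1 in 'scontrol show partition' regardless of what PriorityTier was set to in slurm.conf.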
Hi Alejandro,

Yes, that was the culprit. Once I removed the Priority option on the partition lines and did "scontrol reconfig", the PriorityTier values changed and PreemptMode=CANCEL started working.

Thanks!
Rob
I'm marking this as resolved. Please reopen if you have further questions.