| Summary: | preemptible partitions | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Robert Yelle <ryelle> |
| Component: | Scheduling | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | alex |
| Version: | 17.02.9 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Oregon | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm configuration file, slurm.conf, ATT00001.htm | | |

Hi. For PreemptMode=SUSPEND, suspended jobs remain in memory, meaning that both the preemptor and the preempted jobs have to fit at once in the node(s)' memory for the preemption to happen. Have you checked that? For PreemptMode=CANCEL this is not required.

Hi Alejandro,

Thank you for your response. Yes, that makes sense for PreemptMode=SUSPEND, so this mode may not work out for us with our configuration. But preemption still does not work with PreemptMode=CANCEL. Our preferred method for this situation is PreemptMode=CANCEL, though PreemptMode=REQUEUE could also work. Are these modes possible with "preempt/partition_prio"? If so, what am I missing that is causing it not to work? If not, is there another way I can get "PreemptMode=CANCEL" behavior on a per-partition basis?

Thanks,
Rob

Hi Robert. If you don't care about SUSPEND,GANG and you only want to make use of CANCEL and/or REQUEUE, a configuration like this is working for me:
# Cluster-wide config
PreemptType=preempt/partition_prio
PreemptMode=CANCEL
# Compute nodes
NodeName=compute[1-2] SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7837 NodeHostname=ibiza MemSpecLimit=837 State=UNKNOWN Port=61711-61712
# Partitions
PartitionName=lowprio Nodes=ALL Default=YES State=UP PreemptMode=CANCEL PriorityTier=1
PartitionName=medprio Nodes=ALL Default=NO State=UP PreemptMode=REQUEUE PriorityTier=10
PartitionName=hiprio Nodes=ALL Default=NO State=UP PreemptMode=OFF PriorityTier=100
Note that since you don't care about SUSPEND,GANG, you don't need OverSubscribe=FORCE:1 anymore.
$ sbatch --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20017
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20017 lowprio wrap alex R 0:05 2 compute[1-2]
$ sbatch -p medprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20018
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20018 medprio wrap alex R 0:02 2 compute[1-2]
slurmctld: preempted job 20017 has been killed to reclaim resources for job 20018
$ sbatch -p hiprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20019
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20018 medprio wrap alex CG 0:00 1 compute1
20019 hiprio wrap alex PD 0:00 2 (Resources)
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20018 medprio wrap alex PD 0:00 2 (BeginTime)
20019 hiprio wrap alex R 0:01 2 compute[1-2]
slurmctld: preempted job 20018 has been requeued to reclaim resources for job 20019
Could you please configure your slurm.conf to something similar to what I have but with the 'test' and 'hiprio' partitions, and then restart slurm?
If it's still not working, could you attach the newly changed slurm.conf and slurmctld.log files?
Thanks.
Created attachment 5547 [details]
slurm.conf
Hi Alejandro,
Thank you for the info, and for providing the example. I set up partitions similar to yours below, but it still does not work. When I run "scontrol show partition", I see that PriorityTier never changed, so maybe the problem is related to that:
PartitionName=lowprio
AllowGroups=hpcrcf AllowAccounts=hpcrcf AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=n[095]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=CANCEL
State=UP TotalCPUs=28 TotalNodes=1 SelectTypeParameters=NONE
DefMemPerCPU=3800 MaxMemPerNode=UNLIMITED
TRESBillingWeights=CPU=1.0,Mem=1.0G
PartitionName=medprio
AllowGroups=hpcrcf AllowAccounts=hpcrcf AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=n[095]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=28 TotalNodes=1 SelectTypeParameters=NONE
DefMemPerCPU=3800 MaxMemPerNode=UNLIMITED
TRESBillingWeights=CPU=1.0,Mem=1.0G
PartitionName=hiprio
AllowGroups=hpcrcf AllowAccounts=hpcrcf AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=n[095]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=28 TotalNodes=1 SelectTypeParameters=NONE
DefMemPerCPU=3800 MaxMemPerNode=UNLIMITED
TRESBillingWeights=CPU=1.0,Mem=1.0G
This was true whether I ran "scontrol reconfig" or "systemctl restart slurmctld". What would prevent PriorityTier from being changed? My latest slurm.conf is attached.
Thanks,
Rob
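The symptom above, PriorityTier stuck at 1 in "scontrol show partition", can be spot-checked by filtering that output down to one name/tier pair per partition. A minimal sketch using sample (hypothetical, abbreviated) scontrol output; on a live system you would pipe the real `scontrol show partition` output through the same awk filter:

```shell
# Sample (hypothetical, abbreviated) 'scontrol show partition' output.
sample='PartitionName=lowprio
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO
PartitionName=hiprio
   PriorityJobFactor=1 PriorityTier=100 RootOnly=NO'

# Print each partition name together with its effective PriorityTier.
printf '%s\n' "$sample" | awk '
  /PartitionName=/ { split($1, a, "="); name = a[2] }
  { for (i = 1; i <= NF; i++)
      if ($i ~ /^PriorityTier=/) { split($i, b, "="); print name, b[2] } }'
```

If every partition prints tier 1 even though slurm.conf sets different PriorityTier values, the configured values are being overridden somewhere.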
Created attachment 5548 [details]
ATT00001.htm
Hi Robert. Short answer: remove the "Priority" option from your partition definition lines and only set PriorityJobFactor and/or PriorityTier, then run 'scontrol reconfigure'.

Long answer: the following commit (available since slurm-16-05-0-0pre2 onwards):
https://github.com/SchedMD/slurm/commit/7844563c086919175def
split the partition's "Priority" field into "PriorityTier" (used to order partitions for scheduling and preemption) and "PriorityJobFactor" (used by the priority/multifactor plugin when calculating job priority, which in turn orders jobs within a partition for scheduling). What is happening is that since you have "Priority" set to 1, its value is overriding both PriorityJobFactor and PriorityTier:

...
if (!s_p_get_uint16(&p->priority_tier, "PriorityTier", tbl) &&
    !s_p_get_uint16(&p->priority_tier, "PriorityTier", dflt)) {
	p->priority_tier = 1;
}
if (s_p_get_uint16(&tmp_16, "Priority", tbl) ||
    s_p_get_uint16(&tmp_16, "Priority", dflt)) {
	p->priority_job_factor = tmp_16;
	p->priority_tier = tmp_16;
}
...
(code snippet from src/common/read_config.c)

Hi Alejandro,

Yes, that was the culprit. Once I removed the Priority option from the partition lines and did "scontrol reconfig", the PriorityTier values changed and PreemptMode=CANCEL started working.

Thanks!
Rob

I'm marking this as resolved. Please reopen if you have further questions.
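For reference, a minimal sketch of the corrected partition lines follows, combining the partition names and node list from this ticket with the fix above (drop "Priority", set PriorityTier explicitly). This is an illustrative fragment, not the site's actual slurm.conf:

```
# slurm.conf fragment (sketch): per-partition preemption without "Priority"
PreemptType=preempt/partition_prio
PreemptMode=CANCEL
PartitionName=lowprio Nodes=n[095] Default=NO State=UP PreemptMode=CANCEL  PriorityTier=1
PartitionName=medprio Nodes=n[095] Default=NO State=UP PreemptMode=REQUEUE PriorityTier=10
PartitionName=hiprio  Nodes=n[095] Default=NO State=UP PreemptMode=OFF     PriorityTier=100
```

With no "Priority" option present, the PriorityTier values are no longer overridden, so jobs in hiprio can preempt medprio and lowprio, and jobs in medprio can preempt lowprio.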
Created attachment 5541 [details]
slurm configuration file

Hello,

I am trying to set up a preemptible "test" partition and a "hiprio" partition, where jobs submitted to the hiprio partition will preempt jobs in the test partition (same nodes in each partition). I set "PreemptType=preempt/partition_prio", the global "PreemptMode=SUSPEND,GANG", and PreemptMode for the test partition to "SUSPEND" or (preferably) "CANCEL", and I set hiprio as a higher-priority-tier partition with OverSubscribe=FORCE:1. However, when I occupy all nodes of the test partition and then submit a job to the hiprio partition, nothing happens: the job requesting the hiprio partition remains in a pending state (Resources). What am I missing, or doing wrong? My latest slurm.conf is attached. Let me know if you need anything else.

Thanks!
Rob