Created attachment 5541 [details]
slurm configuration file

Hello,

I am trying to set up a preemptible "test" partition and a "hiprio" partition, where jobs submitted to the hiprio partition will preempt jobs in the test partition (both partitions share the same nodes). I set "PreemptType=preempt/partition_prio", the global "PreemptMode=SUSPEND,GANG", and the PreemptMode for the test partition to "SUSPEND" or (preferably) "CANCEL". I also set hiprio as a higher priority tier partition with OverSubscribe=FORCE:1.

However, when I occupy all nodes of the test partition and then submit a job to the hiprio partition, nothing happens: the hiprio job remains in a pending state (Resources). What am I missing, or doing wrong?

My latest slurm.conf is attached. Let me know if you need anything else.

Thanks!
Rob
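For reference, the setup described above corresponds roughly to slurm.conf lines like the following. This is only a sketch of the described configuration; the node names and tier values are hypothetical, not taken from the attached file:

```ini
# Cluster-wide preemption settings (as described above)
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# Both partitions share the same nodes; hiprio sits in a higher priority tier
PartitionName=test   Nodes=n[001-004] Default=YES PreemptMode=SUSPEND OverSubscribe=FORCE:1 PriorityTier=1
PartitionName=hiprio Nodes=n[001-004] Default=NO  OverSubscribe=FORCE:1 PriorityTier=10
```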
Hi. For PreemptMode=SUSPEND, suspended jobs remain in memory, meaning that the preemptor and the preempted jobs both have to fit in the node(s)' memory at once for the preemption to happen. Have you checked that? For PreemptMode=CANCEL this is not required.
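The memory-fit condition for SUSPEND preemption can be illustrated with a small sketch. The numbers below are hypothetical (7837 MB matches the RealMemory in the example config later in this thread); this is not Slurm's actual scheduling code, just the arithmetic it implies:

```python
def suspend_preemption_fits(node_mem_mb, preempted_mem_mb, preemptor_mem_mb):
    """With PreemptMode=SUSPEND the suspended job stays resident in RAM,
    so both jobs' memory must fit on the node at the same time."""
    return preempted_mem_mb + preemptor_mem_mb <= node_mem_mb

# A 6000 MB job cannot be suspended to make room for a 4000 MB job
# on a 7837 MB node: 6000 + 4000 > 7837.
print(suspend_preemption_fits(7837, 6000, 4000))  # False
# A 3000 MB job can: 3000 + 4000 <= 7837.
print(suspend_preemption_fits(7837, 3000, 4000))  # True
```

With PreemptMode=CANCEL (or REQUEUE) the preempted job vacates the node entirely, so only the preemptor's memory request has to fit.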
Hi Alejandro,

Thank you for your response. Yes, that makes sense for PreemptMode=SUSPEND, so that mode may not work out for us with our configuration. But preemption still does not work with PreemptMode=CANCEL. Our preferred method for this situation is PreemptMode=CANCEL, though PreemptMode=REQUEUE could also work. Are these modes possible with "preempt/partition_prio"? If so, what am I missing that is causing it not to work? If not, is there another way I can get PreemptMode=CANCEL behavior on a per-partition basis?

Thanks,
Rob
Hi Robert. If you don't care about SUSPEND,GANG and you only want to make use of CANCEL and/or REQUEUE, a configuration like this is working for me:

# Cluster-wide config
PreemptType=preempt/partition_prio
PreemptMode=CANCEL

# Compute nodes
NodeName=compute[1-2] SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7837 NodeHostname=ibiza MemSpecLimit=837 State=UNKNOWN Port=61711-61712

# Partitions
PartitionName=lowprio Nodes=ALL Default=YES State=UP PreemptMode=CANCEL PriorityTier=1
PartitionName=medprio Nodes=ALL Default=NO State=UP PreemptMode=REQUEUE PriorityTier=10
PartitionName=hiprio Nodes=ALL Default=NO State=UP PreemptMode=OFF PriorityTier=100

Note that since you don't care about SUSPEND,GANG, you don't need OverSubscribe=FORCE:1 anymore.

$ sbatch --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20017
$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 20017   lowprio wrap alex  R 0:05     2 compute[1-2]

$ sbatch -p medprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20018
$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 20018   medprio wrap alex  R 0:02     2 compute[1-2]

slurmctld: preempted job 20017 has been killed to reclaim resources for job 20018

$ sbatch -p hiprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20019
$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 20018   medprio wrap alex CG 0:00     1 compute1
 20019    hiprio wrap alex PD 0:00     2 (Resources)
$ squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 20018   medprio wrap alex PD 0:00     2 (BeginTime)
 20019    hiprio wrap alex  R 0:01     2 compute[1-2]

slurmctld: preempted job 20018 has been requeued to reclaim resources for job 20019

Could you please configure your slurm.conf to something similar to what I have, but with the 'test' and 'hiprio' partitions, and then restart Slurm? If it's still not working, could you attach the newly changed slurm.conf and slurmctld.log files? Thanks.
Created attachment 5547 [details]
slurm.conf

Hi Alejandro,

Thank you for the info, and for providing the example. I set up partitions similar to yours below, but it still does not work. When I run "scontrol show partition", I see that PriorityTier never changed, so maybe the problem is related to that:

PartitionName=lowprio
   AllowGroups=hpcrcf AllowAccounts=hpcrcf AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=n[095]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=CANCEL
   State=UP TotalCPUs=28 TotalNodes=1 SelectTypeParameters=NONE
   DefMemPerCPU=3800 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,Mem=1.0G

PartitionName=medprio
   AllowGroups=hpcrcf AllowAccounts=hpcrcf AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=n[095]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=28 TotalNodes=1 SelectTypeParameters=NONE
   DefMemPerCPU=3800 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,Mem=1.0G

PartitionName=hiprio
   AllowGroups=hpcrcf AllowAccounts=hpcrcf AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=n[095]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=28 TotalNodes=1 SelectTypeParameters=NONE
   DefMemPerCPU=3800 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,Mem=1.0G

This was true whether I did "scontrol reconfig" or "systemctl restart slurmctld".
What would prevent PriorityTier from being changed? My latest slurm.conf is attached.

Thanks,
Rob
Hi Robert.

Short answer: remove the "Priority" option from your partition definition lines and only set PriorityJobFactor and/or PriorityTier, then run 'scontrol reconfigure'.

Long answer: the following commit (available since slurm-16-05-0-0pre2 onwards):

https://github.com/SchedMD/slurm/commit/7844563c086919175def

split the partition "Priority" field into "PriorityTier" (used to order partitions for scheduling and preemption) plus "PriorityJobFactor" (used by the priority/multifactor plugin in calculating job priority, which is used to order jobs within a partition for scheduling). What is happening is that since you have "Priority" set to 1, its value is overriding both PriorityJobFactor and PriorityTier:

...
if (!s_p_get_uint16(&p->priority_tier, "PriorityTier", tbl) &&
    !s_p_get_uint16(&p->priority_tier, "PriorityTier", dflt)) {
	p->priority_tier = 1;
}
if (s_p_get_uint16(&tmp_16, "Priority", tbl) ||
    s_p_get_uint16(&tmp_16, "Priority", dflt)) {
	p->priority_job_factor = tmp_16;
	p->priority_tier = tmp_16;
}
...

(code snippet from src/common/read_config.c)
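The precedence in that C snippet can be modeled with a short Python sketch. This is an illustration of the parsing logic only, not Slurm code; the function name and the dict-based option representation are invented for the example:

```python
def resolve_partition_priority(opts):
    """Model of the read_config.c precedence: opts is a dict of options
    parsed from one PartitionName= line in slurm.conf."""
    tier = opts.get("PriorityTier", 1)            # default tier is 1
    job_factor = opts.get("PriorityJobFactor", 1)
    if "Priority" in opts:                        # legacy option wins last,
        job_factor = opts["Priority"]             # overwriting both fields
        tier = opts["Priority"]
    return tier, job_factor

# The reported case: Priority=1 silently clobbers PriorityTier=100.
print(resolve_partition_priority({"PriorityTier": 100, "Priority": 1}))  # (1, 1)
# After removing Priority, PriorityTier takes effect.
print(resolve_partition_priority({"PriorityTier": 100}))                 # (100, 1)
```

This is why the partitions all showed PriorityTier=1 in 'scontrol show partition' regardless of what PriorityTier was set to in slurm.conf.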
Hi Alejandro,

Yes, that was the culprit. Once I removed the Priority option on the partition lines and did "scontrol reconfig", the PriorityTier values changed and PreemptMode=CANCEL started working.

Thanks!
Rob
I'm marking this as resolved. Please reopen if you have further questions.