Comment #3 on bug 4358 from Alejandro Sanchez
Hi Robert. If you don't care about SUSPEND,GANG and only want to make use of
CANCEL and/or REQUEUE, a configuration like the following works for me:
# Cluster-wide config
PreemptType=preempt/partition_prio
PreemptMode=CANCEL

# Compute nodes
NodeName=compute[1-2] SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7837 NodeHostname=ibiza MemSpecLimit=837 State=UNKNOWN Port=61711-61712

# Partitions
PartitionName=lowprio Nodes=ALL Default=YES State=UP PreemptMode=CANCEL PriorityTier=1
PartitionName=medprio Nodes=ALL Default=NO State=UP PreemptMode=REQUEUE PriorityTier=10
PartitionName=hiprio Nodes=ALL Default=NO State=UP PreemptMode=OFF PriorityTier=100
Note that since you don't care about SUSPEND,GANG, you no longer need
OverSubscribe=FORCE:1.
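After editing slurm.conf, a change of PreemptType normally requires restarting slurmctld rather than a plain reconfigure. A quick sanity check of what the controller actually loaded (a sketch; the systemd unit name may differ on your installation):

```shell
# Restart the controller so the new PreemptType takes effect
systemctl restart slurmctld

# Verify the preemption parameters the running controller is using
scontrol show config | grep -i preempt
```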
$ sbatch --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20017
$ squeue
 JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
 20017   lowprio  wrap  alex  R  0:05     2 compute[1-2]
$ sbatch -p medprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20018
$ squeue
 JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
 20018   medprio  wrap  alex  R  0:02     2 compute[1-2]
slurmctld: preempted job 20017 has been killed to reclaim resources for job 20018
$ sbatch -p hiprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20019
$ squeue
 JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
 20018   medprio  wrap  alex CG  0:00     1 compute1
 20019    hiprio  wrap  alex PD  0:00     2 (Resources)
$ squeue
 JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
 20018   medprio  wrap  alex PD  0:00     2 (BeginTime)
 20019    hiprio  wrap  alex  R  0:01     2 compute[1-2]
slurmctld: preempted job 20018 has been requeued to reclaim resources for job 20019
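For completeness, you can confirm that job 20018 was requeued (rather than cancelled and resubmitted) by inspecting its record while the controller still knows about it (a sketch, using the job ID from the example above):

```shell
# Requeue=1 means the job is allowed to be requeued;
# Restarts counts how many times it actually has been
scontrol show job 20018 | grep -oE '(Requeue|Restarts)=[0-9]+'
```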
Could you please change your slurm.conf to something similar to what I have,
but with your 'test' and 'hiprio' partitions, and then restart Slurm?
If it still isn't working, could you attach the newly changed slurm.conf and
the slurmctld.log file?
Thanks.