On Nov 11, 2017, at 5:36 AM, bugs@schedmd.com wrote:


Comment # 3 on bug 4358 from
Hi Robert. If you don't care about SUSPEND,GANG and only want to make use of
CANCEL and/or REQUEUE, a configuration like this works for me:

# Cluster-wide config
PreemptType=preempt/partition_prio
PreemptMode=CANCEL

# Compute nodes
NodeName=compute[1-2] SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7837 NodeHostname=ibiza MemSpecLimit=837 State=UNKNOWN Port=61711-61712
# Partitions
PartitionName=lowprio Nodes=ALL Default=YES State=UP PreemptMode=CANCEL PriorityTier=1
PartitionName=medprio Nodes=ALL Default=NO State=UP PreemptMode=REQUEUE PriorityTier=10
PartitionName=hiprio Nodes=ALL Default=NO State=UP PreemptMode=OFF PriorityTier=100

Note that since you don't care about SUSPEND,GANG, you no longer need
OverSubscribe=FORCE:1.
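After updating slurm.conf and restarting the daemons, it's worth confirming that the controller actually picked up the new preemption settings (a quick sanity check; the grep pattern is just illustrative):

```shell
# Verify the running slurmctld's preemption settings
$ scontrol show config | grep -i preempt
# Expect PreemptMode=CANCEL and PreemptType=preempt/partition_prio here;
# if the old values still appear, slurmctld was not restarted.
```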

$ sbatch --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20017
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20017   lowprio     wrap     alex  R       0:05      2 compute[1-2]
$ sbatch -p medprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20018
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20018   medprio     wrap     alex  R       0:02      2 compute[1-2]

slurmctld: preempted job 20017 has been killed to reclaim resources for job 20018
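If you want to confirm afterwards that job 20017 really was preempted rather than failing on its own, the accounting record shows its final state (assuming accounting storage is enabled on your cluster):

```shell
# Final state of the preempted job in the accounting database
$ sacct -j 20017 --format=JobID,Partition,State,ExitCode
# The job should be recorded with a preempted/cancelled state
# rather than FAILED or COMPLETED.
```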

$ sbatch -p hiprio --exclusive -N2 --wrap "sleep 99999"
Submitted batch job 20019
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20018   medprio     wrap     alex CG       0:00      1 compute1
             20019    hiprio     wrap     alex PD       0:00      2 (Resources)
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20018   medprio     wrap     alex PD       0:00      2 (BeginTime)
             20019    hiprio     wrap     alex  R       0:01      2 compute[1-2]

slurmctld: preempted job 20018 has been requeued to reclaim resources for job 20019
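You can also inspect the requeued job directly; scontrol shows its current state and how many times it has been restarted (field names as printed by scontrol show job):

```shell
# Inspect the requeued job's state and restart count
$ scontrol show job 20018 | grep -E 'JobState|Restarts|Requeue'
# JobState should be PENDING (Reason=BeginTime) until resources free up,
# and Restarts should have incremented after the requeue.
```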

Could you please change your slurm.conf to something similar to mine, but with
your 'test' and 'hiprio' partitions, and then restart Slurm?

If it's still not working, could you attach the newly changed slurm.conf and
slurmctld.log files?

Thanks.

