Created attachment 10542 [details]
scontrol show config for the Dogwood cluster

We have a cluster set aside for MPI jobs where backfill of smaller jobs is overtaking the scheduling of the larger HPC jobs. The Sched* config parameters are as follows:

# scontrol show config | egrep Sched
FastSchedule            = 0
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill

The 1-day backfill window is likely the issue. I would appreciate a recommendation on how to tune backfill so that the larger MPI jobs will still schedule. The config file for this cluster (dogwood) is attached.

The two main partitions are here:

# scontrol show partitions 528_queue
PartitionName=528_queue
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=528_qos
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=YES GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c-206-[1-24],c-207-[1-24],c-208-[1-24],c-209-[1-15],c-201-[20-21],c-204-[17-18,21-24]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4180 TotalNodes=95 SelectTypeParameters=NONE
   DefMemPerCPU=11704 MaxMemPerNode=UNLIMITED

[root@dogwood-sched bin]# scontrol show partitions 2112_queue
PartitionName=2112_queue
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=2112_qos
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=YES GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c-201-[1-24],c-202-[1-24],c-203-[1-24],c-204-[1-24]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4224 TotalNodes=96 SelectTypeParameters=NONE
   DefMemPerCPU=11704 MaxMemPerNode=UNLIMITED
Jenny,

Could you please attach scontrol show job, sdiag, squeue --start, and sinfo outputs?

cheers,
Marcin
Created attachment 10543 [details]
scontrol show job, sdiag, squeue --start, and sinfo outputs
Jenny,

I took a look at the configuration of your cluster and the situation in the queue. Yes, you should increase your bf_window parameter to reflect the maximum time limit allowed on your cluster; I'd suggest setting it to 7 days (bf_window is specified in minutes, so 10080). Based on the time limits of the jobs you have in the queue, I think you can also increase bf_resolution to 10 minutes (600 seconds). The resulting SchedulerParameters line in your slurm.conf:

> SchedulerParameters=bf_window=10080,bf_resolution=600

Checking your scontrol show job output, I've also noticed that you have a number of multi-node jobs waiting in the queue because of their low priority. In your configuration, priority comes mostly from the fair-share factor, with a fairly long utilization history taken into consideration (PriorityDecayHalfLife = 8 days). If you'd like to favor large jobs, you should increase the value of PriorityWeightJobSize in your slurm.conf. If your concern comes mostly from jobs 1199751 and 1199752, then you may also consider tuning the PriorityMaxAge and PriorityWeightAge values.[1]

If you require any further information, feel free to contact me.

cheers,
Marcin

[1] https://slurm.schedmd.com/priority_multifactor.html#age
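For reference, the combined slurm.conf changes discussed above could look like the fragment below. The PriorityWeightJobSize value here is only an illustrative starting point, not a tested recommendation for your site; the right weight depends on how strongly you want job size to outweigh the fair-share factor:

> # Backfill tuning: bf_window in minutes (7 days), bf_resolution in seconds (10 min)
> SchedulerParameters=bf_window=10080,bf_resolution=600
> # Hypothetical starting value; raise or lower to favor large jobs as needed
> PriorityWeightJobSize=100000

After editing slurm.conf on the controller, running "scontrol reconfigure" should pick up these changes without a daemon restart, and you can confirm the new values with "scontrol show config | egrep 'SchedulerParameters|PriorityWeight'".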