Ticket 7207

Summary: backfill for HPC cluster
Product: Slurm    Reporter: Jenny Williams <jennyw>
Component: Scheduling    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: University of North Carolina at Chapel Hill
Attachments: scontrol show config for the Dogwood cluster
scontrol show job,sdiag and squeue --start and sinfo outputs

Description Jenny Williams 2019-06-07 11:59:33 MDT
Created attachment 10542 [details]
scontrol show config for the Dogwood cluster

We have a cluster set aside for MPI jobs where backfill of smaller jobs is overtaking the scheduling of the larger HPC jobs.

The config parameters Sched* are as follows:

# scontrol show config |egrep Sched
FastSchedule            = 0
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill

The default one-day backfill window is likely the issue - I would appreciate a recommendation on how to tune backfill so that the larger MPI jobs will still be scheduled.

The config file for this cluster (dogwood) is attached.

The two main partitions are here:

# scontrol show partitions 528_queue           
PartitionName=528_queue
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=528_qos
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=YES GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c-206-[1-24],c-207-[1-24],c-208-[1-24],c-209-[1-15],c-201-[20-21],c-204-[17-18,21-24]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4180 TotalNodes=95 SelectTypeParameters=NONE
   DefMemPerCPU=11704 MaxMemPerNode=UNLIMITED

[root@dogwood-sched bin]# scontrol show partitions 2112_queue
PartitionName=2112_queue
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=2112_qos
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=YES GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c-201-[1-24],c-202-[1-24],c-203-[1-24],c-204-[1-24]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4224 TotalNodes=96 SelectTypeParameters=NONE
   DefMemPerCPU=11704 MaxMemPerNode=UNLIMITED
Comment 1 Marcin Stolarek 2019-06-07 13:02:06 MDT
Jenny, 

Could you please attach the output of scontrol show job, sdiag, squeue --start, and sinfo?

cheers,
Marcin
Comment 2 Jenny Williams 2019-06-07 13:30:50 MDT
Created attachment 10543 [details]
scontrol show job,sdiag and squeue --start and sinfo outputs
Comment 3 Marcin Stolarek 2019-06-10 03:13:07 MDT
Jenny, 

I took a look at the configuration of your cluster and the situation in the queue. 

Yes, you should increase your bf_window parameter to reflect the maximum time limit allowed on your cluster. I'd suggest setting it to 7 days. Based on the time limits of the jobs you have in the queue, I think you can also increase bf_resolution to 10 minutes. The resulting SchedulerParameters line in your slurm.conf would be:

> SchedulerParameters=bf_window=10080,bf_resolution=600
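For reference: bf_window is specified in minutes and bf_resolution in seconds, so the values above correspond to a 7-day window at a 10-minute resolution. A quick sanity check of the arithmetic:

```python
# bf_window is expressed in minutes, bf_resolution in seconds
# (per the slurm.conf SchedulerParameters documentation).
days = 7
bf_window = days * 24 * 60   # 7 days in minutes -> 10080
bf_resolution = 10 * 60      # 10 minutes in seconds -> 600

print(bf_window, bf_resolution)
```

Note that a larger bf_window makes each backfill cycle more expensive, which is why raising bf_resolution alongside it is a reasonable trade-off.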


Checking your scontrol show job output, I've also noticed that you have a number of multi-node jobs waiting in the queue because of their low priority. In your configuration, priority comes mostly from the fair-share factor, with a quite long utilization history taken into consideration (PriorityDecayHalfLife = 8 days). If you'd like to favor large jobs, you should increase the value of PriorityWeightJobSize in your slurm.conf.
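As an illustration only (the weight values below are hypothetical, not taken from this cluster's config), a multifactor priority section in slurm.conf that emphasizes job size might look like:

```
PriorityType=priority/multifactor
PriorityDecayHalfLife=8-0
PriorityWeightFairshare=100000
PriorityWeightJobSize=50000
```

The relative magnitudes of the PriorityWeight* values determine how much each factor contributes; the right balance depends on your site's policy.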

If your concern is mostly with jobs 1199751 and 1199752, then you may consider tuning the PriorityMaxAge and PriorityWeightAge values.[1]
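For intuition on those two parameters: the multifactor plugin's age factor ramps linearly from 0 to 1 as a job's wait time approaches PriorityMaxAge, then stays capped at 1, and is scaled by PriorityWeightAge. A minimal sketch, with hypothetical parameter values that are not from this cluster:

```python
def age_priority(wait_minutes: float,
                 priority_max_age_minutes: float,
                 priority_weight_age: int) -> int:
    """Age contribution to job priority: the age factor grows
    linearly from 0 to 1 over PriorityMaxAge, then stays at 1."""
    factor = min(wait_minutes / priority_max_age_minutes, 1.0)
    return int(factor * priority_weight_age)

# Hypothetical settings: PriorityMaxAge=7-0 (7 days), PriorityWeightAge=1000.
# A job halfway through the ramp gets half the age weight.
print(age_priority(wait_minutes=3.5 * 24 * 60,
                   priority_max_age_minutes=7 * 24 * 60,
                   priority_weight_age=1000))  # -> 500
```

A shorter PriorityMaxAge or a larger PriorityWeightAge makes long-waiting jobs climb the queue faster.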

If you require any further information, feel free to contact me.

cheers,
Marcin

[1]https://slurm.schedmd.com/priority_multifactor.html#age