Ticket 7207 - backfill for HPC cluster
Summary: backfill for HPC cluster
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
 
Reported: 2019-06-07 11:59 MDT by Jenny Williams
Modified: 2019-06-10 03:13 MDT

See Also:
Site: University of North Carolina at Chapel Hill


Attachments
scontrol show config for the Dogwood cluster (6.82 KB, text/plain)
2019-06-07 11:59 MDT, Jenny Williams
Details
scontrol show job,sdiag and squeue --start and sinfo outputs (33.06 KB, application/x-compressed)
2019-06-07 13:30 MDT, Jenny Williams
Details

Description Jenny Williams 2019-06-07 11:59:33 MDT
Created attachment 10542 [details]
scontrol show config for the Dogwood cluster

We have a cluster set aside for MPI jobs where backfill of smaller jobs is overtaking the scheduling of the larger HPC jobs.

The config parameters Sched* are as follows:

# scontrol show config |egrep Sched
FastSchedule            = 0
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill

The 1 day backfill window is likely the issue - I would appreciate a recommendation on how to tune backfill so that the larger MPI jobs will still schedule.

The config file for this cluster ( dogwood ) is attached.

The two main partitions are here:

# scontrol show partitions 528_queue           
PartitionName=528_queue
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=528_qos
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=YES GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c-206-[1-24],c-207-[1-24],c-208-[1-24],c-209-[1-15],c-201-[20-21],c-204-[17-18,21-24]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4180 TotalNodes=95 SelectTypeParameters=NONE
   DefMemPerCPU=11704 MaxMemPerNode=UNLIMITED

[root@dogwood-sched bin]# scontrol show partitions 2112_queue
PartitionName=2112_queue
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=2112_qos
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=YES GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c-201-[1-24],c-202-[1-24],c-203-[1-24],c-204-[1-24]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4224 TotalNodes=96 SelectTypeParameters=NONE
   DefMemPerCPU=11704 MaxMemPerNode=UNLIMITED
Comment 1 Marcin Stolarek 2019-06-07 13:02:06 MDT
Jenny, 

Could you please attach scontrol show job, sdiag, squeue --start, and sinfo outputs?

cheers,
Marcin
Comment 2 Jenny Williams 2019-06-07 13:30:50 MDT
Created attachment 10543 [details]
scontrol show job,sdiag and squeue --start and sinfo outputs
Comment 3 Marcin Stolarek 2019-06-10 03:13:07 MDT
Jenny, 

I took a look at the configuration of your cluster and the situation in the queue. 

Yes, you should increase your bf_window parameter to reflect the maximal time limit allowed on your cluster; I'd suggest setting it to 7 days. Based on the time limits of the jobs you have in the queue, I think you can also increase bf_resolution to 10 minutes. The resulting SchedulerParameters line in your slurm.conf:

> SchedulerParameters=bf_window=10080,bf_resolution=600
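As a sanity check on the units (bf_window is specified in minutes, bf_resolution in seconds), the suggested values work out like this:

```shell
# bf_window: 7 days expressed in minutes -> 10080
echo $((7 * 24 * 60))
# bf_resolution: 10 minutes expressed in seconds -> 600
echo $((10 * 60))
```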


Checking your scontrol show job output, I've also noticed that a number of multinode jobs are waiting in the queue because of their low priority. In your configuration, priority comes mostly from the fair-share factor, with a fairly long utilization history taken into consideration (PriorityDecayHalfLife = 8 days). If you'd like to favor large jobs, you should increase the value of PriorityWeightJobSize in your slurm.conf.

If your concern comes mostly from jobs 1199751 and 1199752, then you may consider tuning the PriorityMaxAge and PriorityWeightAge values.[1]
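Putting those suggestions together, a hypothetical slurm.conf fragment might look like the following. The weight and age values here are illustrative only, not recommendations from this ticket, and should be tuned against the site's workload:

```
# Backfill tuning discussed above
SchedulerParameters=bf_window=10080,bf_resolution=600

# Illustrative multifactor priority values - tune per site:
PriorityWeightJobSize=100000   # favor large (multinode) jobs
PriorityMaxAge=7-0             # age factor saturates after 7 days
PriorityWeightAge=10000
```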

If you require any further information, feel free to contact me.

cheers,
Marcin

[1]https://slurm.schedmd.com/priority_multifactor.html#age