Ticket 5194

Summary: Advice on the management of short jobs in SLURM
Product: Slurm Reporter: David Baker <d.j.baker>
Component: Configuration Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex
Version: 17.02.8   
Hardware: Linux   
OS: Linux   
Site: OCF Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: Southampton University
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description David Baker 2018-05-21 08:47:09 MDT
Hello,

I wondered if you could please advise me on the management of short jobs in SLURM.

We have recently migrated from a TORQUE/Moab cluster. On that cluster we made use of XFACTOR, which ensures that shorter jobs don't get starved of resources by longer jobs, and it worked very well.

As far as I know, there is no direct equivalent in SLURM, but we'll need to be able to manage our workload so that shorter jobs don't lose out. The obvious thing that comes to mind is to have backfill properly configured, and that may be sufficient.

Some time ago, when I asked the SLURM forum for advice, a user replied indicating that the Slurm priority settings may be significant:

"PriorityFavorSmall=NO
PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME

PriorityFavorSmall and SMALL_RELATIVE_TO_TIME are used by us to favour both short and large jobs.  So if two jobs are equal in size, the shorter of the two is favoured.  Also if two jobs are equal in time, the larger is favoured. We use this as a way to get short jobs in and out of the queues quickly as well as help large jobs (typically MPI) have priority over small serial jobs."

Does the above make sense? More generally, what is your advice on managing a large cluster with a very diverse workload, please?

Best regards,
David
Comment 1 Alejandro Sanchez 2018-05-22 05:15:01 MDT
Hi David.

As you mentioned, having sched/backfill properly configured is key. We usually recommend forcing users to set a --time through a Job Submit plugin. This is a C example but you could use a job_submit.lua equivalent as well:

https://github.com/SchedMD/slurm/blob/slurm-17.11/src/plugins/job_submit/require_timelimit/job_submit_require_timelimit.c

We prefer that each user sets their own estimated --time rather than relying on a DefaultTime. A DefaultTime tends to lead to a bad situation where all jobs have the same TimeLimit, and then backfill doesn't work efficiently.

Our usual starting points for tuning SchedulerParameters are:

bf_continue
bf_window=(enough minutes to cover the highest MaxTime on the cluster)
bf_resolution=(usually at least 600 seconds); if you increase bf_window, make sure to also increase bf_resolution, otherwise the overhead will increase.
bf_min_[age|prio]_reserve could be considered as well.
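
As a rough illustration of how these starting points fit together, here is a minimal Python sketch that builds a SchedulerParameters line from the highest MaxTime on the cluster. The function name, the 7-day MaxTime and the "slices" figure are assumptions made for the example; the slice count is only a crude proxy for backfill overhead, not something Slurm reports.

```python
import math


def backfill_params(max_time_minutes, bf_resolution_seconds=600):
    """Return a SchedulerParameters line and the number of time slices it implies.

    max_time_minutes: highest MaxTime of any partition, in minutes (site-specific).
    bf_resolution_seconds: granularity of the backfill time map, in seconds.
    """
    if max_time_minutes <= 0 or bf_resolution_seconds <= 0:
        raise ValueError("both values must be positive")
    bf_window = max_time_minutes  # bf_window is expressed in minutes
    # Rough proxy for backfill overhead: how many slices the window is cut into.
    slices = math.ceil(bf_window * 60 / bf_resolution_seconds)
    line = (
        "SchedulerParameters=bf_continue,"
        f"bf_window={bf_window},bf_resolution={bf_resolution_seconds}"
    )
    return line, slices


# Hypothetical site whose longest partition allows 7 days (10080 minutes).
line, slices = backfill_params(7 * 24 * 60)
print(line)    # SchedulerParameters=bf_continue,bf_window=10080,bf_resolution=600
print(slices)  # 1008
```

With the same window and a bf_resolution of 60, the slice count would be ten times larger, which is the overhead growth mentioned above.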

In Slurm, all jobs are placed on a single queue, ordered by:

1. Preemption order (preemptor higher priority than preemptee)
2. Advanced reservation (jobs with an advanced reservation are higher priority than other jobs)
3. Partition PriorityTier
4. Job Priority (result of priority/multifactor sum of factors)
5. Job ID
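
The ordering above can be pictured as a multi-key sort. This is only a simplified Python sketch, not Slurm's actual code: the Job class and its fields are hypothetical, preemption is reduced to a boolean (in reality it is a relation between a preemptor and a preemptee), and lower job IDs are assumed to sort first.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    priority: int          # result of priority/multifactor
    priority_tier: int     # PriorityTier of the job's partition
    has_reservation: bool  # job runs in an advanced reservation
    can_preempt: bool      # simplification of the preemptor/preemptee relation


def queue_key(job):
    # False sorts before True, so negate the booleans; negate numbers
    # where a larger value should come first.
    return (
        not job.can_preempt,
        not job.has_reservation,
        -job.priority_tier,
        -job.priority,
        job.job_id,
    )


def order_queue(jobs):
    return sorted(jobs, key=queue_key)


jobs = [
    Job(101, priority=5000, priority_tier=1, has_reservation=False, can_preempt=False),
    Job(102, priority=9000, priority_tier=1, has_reservation=False, can_preempt=False),
    Job(103, priority=100, priority_tier=2, has_reservation=False, can_preempt=False),
    Job(104, priority=100, priority_tier=1, has_reservation=True, can_preempt=False),
]
print([j.job_id for j in order_queue(jobs)])  # [104, 103, 102, 101]
```

Note how job 103 is ahead of job 102 despite a much lower Job Priority, because PriorityTier is compared first.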

Point 4 (Job Priority) can be disaggregated as documented here:

https://slurm.schedmd.com/priority_multifactor.html

Slurm has no exact equivalent to Moab's XFACTOR - Expansion Factor, which looking at their documentation follows this formula:

XFACTOR = 1 + (EffQueueTime / WallClockLimit)
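
To make the effect of that formula concrete, here is a small Python sketch that evaluates it for a short and a long job that have waited the same time. The function name and the sample wait and limit values are made up for the illustration.

```python
def xfactor(eff_queue_time, wall_clock_limit):
    """Moab-style expansion factor; both arguments in the same unit (seconds here)."""
    if wall_clock_limit <= 0:
        raise ValueError("wall_clock_limit must be positive")
    return 1 + eff_queue_time / wall_clock_limit


waited = 2 * 3600  # both jobs have been queued for 2 hours

short_job = xfactor(waited, 1 * 3600)   # 1-hour wall clock limit
long_job = xfactor(waited, 48 * 3600)   # 48-hour wall clock limit

print(short_job)            # 3.0
print(round(long_job, 4))   # 1.0417
```

For the same queue time, the job with the shorter limit gets the much larger factor, which is why XFACTOR protects short jobs.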

Perhaps the closest option to the XFACTOR is the Age Factor:

https://slurm.schedmd.com/priority_multifactor.html#age

In general, the longer a job waits in the queue, the larger its age factor grows. One PriorityFlags value and one slurm.conf parameter also affect this:

ACCRUE_ALWAYS If set, priority age factor will be increased despite job dependencies or holds. If set, it also starts computing the age since the submit time, instead of since the time the job was eligible to run (begin_time).

and

PriorityMaxAge Specifies the job age which will be given the maximum age factor in computing priority.

But currently, the Age Factor in Slurm doesn't take the job's TimeLimit into account the way Moab's XFACTOR does. I've opened a separate sev-5 bug 5202 to consider the addition of such a flag for a future release, but lacking any sponsor we can't estimate when and/or if it will ever be addressed. If you are interested in pursuing that path we could talk about it further outside the bug.
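
A simplified model of the Age Factor shows the difference. This Python sketch assumes the factor grows linearly with queue time and is capped at 1.0 once PriorityMaxAge is reached; the function name and the 7-day PriorityMaxAge are assumptions for the example, and dependencies, holds and ACCRUE_ALWAYS are ignored.

```python
def age_factor(queued_seconds, priority_max_age_seconds):
    """Simplified age factor: linear in queue time, capped at 1.0."""
    if priority_max_age_seconds <= 0:
        raise ValueError("priority_max_age_seconds must be positive")
    return min(queued_seconds / priority_max_age_seconds, 1.0)


max_age = 7 * 24 * 3600  # hypothetical PriorityMaxAge of 7 days

# The job's TimeLimit is not an input at all: a 1-hour job and a 48-hour job
# that have both waited 2 hours receive exactly the same age factor.
print(round(age_factor(2 * 3600, max_age), 4))   # 0.0119
print(age_factor(10 * 24 * 3600, max_age))       # 1.0
```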

Continuing with the advice for the priority/multifactor plugin, we generally recommend ordering each of the PriorityWeight<something> factors from most to least important, then setting them each an order of magnitude apart. This should help some more jobs get scheduled. The weight values should be high enough to get a good set of significant digits, since all the factors are floating point numbers from 0.0 to 1.0. Start at around 1000 or so for those factors you want to make predominant, as stated in the web documentation.

Without any specific site requirements, it probably makes most sense to give the highest weight to the QOS factor and the next highest to the FairShare factor. We also usually recommend setting PriorityFlags=FAIR_TREE.
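
To illustrate the order-of-magnitude spacing, here is a minimal Python sketch of a weighted sum of factors. It is a simplification: the weight values are hypothetical, and the partition, TRES, site and nice terms of the real calculation are left out.

```python
# Hypothetical weights, one order of magnitude apart, QOS first and FairShare second.
WEIGHTS = {
    "qos": 100000,       # PriorityWeightQOS
    "fairshare": 10000,  # PriorityWeightFairshare
    "age": 1000,         # PriorityWeightAge
    "jobsize": 100,      # PriorityWeightJobSize
}


def job_priority(factors, weights=WEIGHTS):
    """Weighted sum of factors, each factor being a float from 0.0 to 1.0."""
    return round(sum(weights[name] * factors.get(name, 0.0) for name in weights))


job_a = {"qos": 0.5, "fairshare": 0.25, "age": 1.0, "jobsize": 0.5}
job_b = {"qos": 0.25, "fairshare": 1.0, "age": 1.0, "jobsize": 1.0}

print(job_priority(job_a))  # 53550
print(job_priority(job_b))  # 36100
```

Job B has the maximum value for every lower-ranked factor, yet job A still wins on its higher QOS factor, which is the intent of spacing the weights this way.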

With regards to the PriorityFavorSmall option and the PriorityFlags SMALL_RELATIVE_TO_TIME:

1. Note that they only take effect if the Job Size factor is set.
2. Here's the documentation related to these options and flags, which I think is pretty well explained:

https://slurm.schedmd.com/priority_multifactor.html#jobsize

Please, let me know if you have further questions and/or if you are interested in sponsoring that flag addition. Thanks!
Comment 2 David Baker 2018-05-25 07:48:55 MDT
Hello,

Thank you for this detailed reply. I’ve taken an additional look through, but I will not be able to get my teeth into this issue until I get back from leave in a week’s time. I will continue the investigation/discussion then.

Thank you for your interest and advice regarding an equivalent to XFACTOR in SLURM.

Best regards,
David

Comment 3 Alejandro Sanchez 2018-06-06 02:18:01 MDT
Hi David. Is there anything else you need from here? Thanks.
Comment 4 David Baker 2018-06-06 02:23:08 MDT
Hello,

Apologies for the late response. I’ve just got back from leave, so I’ll need to catch up with this ticket. I’ll take a look today and see how I get on.

Best regards,
David

Comment 5 Alejandro Sanchez 2018-06-20 04:03:29 MDT
Hi David. Is there anything you need from this bug? Thanks.
Comment 6 David Baker 2018-06-20 08:13:06 MDT
Hi,

My apologies for not getting back to you earlier. I’m afraid I’ve not had much time to look at this area properly, and I would like to revisit this matter once I’m less busy. Could you please put the ticket on hold or close it, depending upon your policy? At this rate I’ll probably have time to look at this next week at the earliest.

Best regards,
David

Comment 7 Alejandro Sanchez 2018-07-17 00:56:28 MDT
David, I'm gonna close this for now. Please, reopen if you have any further questions. Thanks.