Ticket 16789

Summary: Partition Node Preemption Tuning
Product: Slurm
Reporter: Paul Edmon <pedmon>
Component: Scheduling
Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: lyeager, tim
Version: 23.02.2
Hardware: Linux
OS: Linux
Site: Harvard University

Description Paul Edmon 2023-05-22 09:45:11 MDT
We have a group here that has noticed that jobs from lower-priority partitions are causing fragmentation in the higher-priority partition, because when Slurm preempts it tries to preserve the running lower-priority jobs for as long as it can. Instead, they would like jobs from the higher-priority partition to preempt jobs from the lower partitions in a way that reduces fragmentation in the higher-priority partition as much as possible.

I noticed that the latest version of Slurm includes some retooling of its preemption logic (we haven't upgraded to it yet but intend to in a few months). I was wondering whether that upgrade implements the feature described above (I haven't looked in detail at what was changed regarding preemption, only that slurm.conf changed). If it doesn't, do you know of any way I can implement this, or will it need to be a feature request?
Comment 1 Ben Roberts 2023-05-22 13:05:33 MDT
Hi Paul,

We don't have a mechanism that would make preemption prioritize reducing fragmentation when it's looking for eligible jobs to preempt.  It is possible that there are some flags set that could be causing behavior you're not expecting.  You mention that Slurm tries to preserve lower priority jobs as long as possible before preempting them.  This could be due to a setting called PreemptExemptTime, which tells the scheduler not to consider jobs eligible for preemption until they have run for at least X amount of time.
https://slurm.schedmd.com/slurm.conf.html#OPT_PreemptExemptTime

This could be causing the behavior you're describing.  If you unset that parameter then any job would be eligible for preemption at any time, rather than guaranteeing that jobs run for a certain amount of time first.

There are also a few preemption-specific parameters that could affect the behavior.  Among these is a 'youngest_first' option.  If jobs are exempt from preemption for some amount of time and the scheduler also tries to preempt the youngest jobs first, that combination could prevent any preemption from happening for a while.
https://slurm.schedmd.com/slurm.conf.html#OPT_PreemptParameters

There is also an option called 'reorder_count' that controls how many times the scheduler examines the preemptable jobs to try to minimize the number of jobs that are preempted.  This won't necessarily reduce fragmentation of the partition, but it could help by reducing the number of times one or two nodes are left free when preemption happens, which would then be filled by queued jobs.

Let me know if you have any of these parameters set already, or if it sounds like any of them would be beneficial for you.  I'm happy to answer any additional questions you might have about them.
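For reference, these settings live in slurm.conf.  A hypothetical excerpt (option names per the 23.02 slurm.conf documentation; the values shown are invented for illustration and would need to be adapted to your site) might look like this:

```
# Hypothetical example only; adjust names and values for your site.
PreemptType        = preempt/partition_prio
PreemptMode        = REQUEUE
# Leave PreemptExemptTime unset (or zero) so running jobs are
# immediately eligible for preemption.
#PreemptExemptTime = 00:05:00
# strict_order/youngest_first change the order in which candidate jobs
# are considered; reorder_count sets how many passes are made over the
# candidate list to minimize the number of jobs preempted.
PreemptParameters  = strict_order,youngest_first,reorder_count=2
```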

Thanks,
Ben
Comment 2 Paul Edmon 2023-05-22 13:28:07 MDT
Just for my information, if you don't have any preemption-ordering flags 
set, how does the scheduler decide what to preempt? Does it by default 
try to preempt in a way that minimizes fragmentation?

-Paul Edmon-

Comment 3 Ben Roberts 2023-05-22 14:09:23 MDT
It depends on whether you're using QOS or partition based preemption, but you can see a brief description of how it orders the jobs here:
https://slurm.schedmd.com/preempt.html#operation

Let me know if you have questions beyond what's covered in the documentation.

Thanks,
Ben
Comment 4 Paul Edmon 2023-05-22 14:15:11 MDT
Right now we have set (this is for 22.05.7):

pack_serial_at_end,\
preempt_strict_order,\
preempt_youngest_first,\

So I'm guessing that strict_order and youngest_first would increase 
fragmentation caused by lower-priority jobs, as Slurm would be looking 
to preempt those of lowest priority and youngest age first, instead of 
preempting based on what would best fill the nodes of the 
higher-priority partition.

I suppose I can try removing both of these and see whether that makes a 
positive difference. It's sort of a shame, but since the jobs I'm 
preempting are preemptable anyway, giving up strict_order and 
youngest_first in exchange for less fragmentation may be a favorable 
trade-off.

-Paul Edmon-

Comment 5 Paul Edmon 2023-05-23 08:09:00 MDT
Thanks. That's very informative. For the record, we use partition-based 
preemption, not QOS.

I think what we would like to request is a feature that adds an 
additional constraint, or consideration: seeking to minimize 
fragmentation. The order of consideration might go something like this:

Can you schedule without preemption? If so, schedule.

If not, minimize the following, in order:

1. Fragmentation in the partition being scheduled for
2. Number of jobs preempted
3. Size of jobs preempted
4. Priority of jobs preempted
5. Age of jobs preempted

So this adds fragmentation as an additional factor for consideration. It 
would also be nice to be able to choose the relative order of these 
factors when solving the optimization problem.
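To illustrate the requested behavior (purely a sketch in Python, not Slurm code; the Plan fields and cost model are hypothetical), the scheduler would pick the candidate preemption plan with the lexicographically smallest cost tuple, where earlier fields dominate later ones:

```python
# Illustrative sketch only: rank candidate preemption plans by the
# lexicographic ordering of factors 1-5 above. All names are invented.
from dataclasses import dataclass

@dataclass
class Plan:
    fragmentation: int   # free-resource fragments left in the target partition
    jobs_preempted: int  # number of jobs the plan would preempt
    total_size: int      # summed size of preempted jobs (e.g. GPU count)
    max_priority: int    # highest priority among preempted jobs
    total_age: int       # summed runtime of preempted jobs

def cost(plan: Plan) -> tuple:
    # Lexicographic key: fragmentation dominates job count, which
    # dominates size, and so on, mirroring factors 1-5.
    return (plan.fragmentation, plan.jobs_preempted, plan.total_size,
            plan.max_priority, plan.total_age)

def best_plan(plans):
    """Pick the plan with the smallest lexicographic cost."""
    return min(plans, key=cost)
```

Reordering the fields returned by cost() would implement the "choose which factor matters most" knob mentioned above.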

-Paul Edmon-

Comment 6 Ben Roberts 2023-05-23 08:51:34 MDT
Hi Paul,

Let me clarify one thing before discussing internally whether this is something we are interested in and able to add to Slurm.  When you say you want to minimize fragmentation, is that just based on the numerical order of the node names, or are you using a topology tree where you would want the placement of the nodes in switches to be the primary concern?

Thanks,
Ben
Comment 7 Paul Edmon 2023-05-23 08:54:43 MDT
For us it would be by node name, but I could see cases where someone 
would want to start caring about topology. In this instance, though, we 
are just looking for the simplest version of fragmentation minimization, 
namely packing all the jobs onto the minimum number of nodes.
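As a toy sketch of that simplest version (this is not Slurm's algorithm; the GPU counts and node size are invented for illustration), first-fit decreasing packs single-node jobs onto as few nodes as possible:

```python
# Toy sketch only: pack single-node GPU jobs onto as few nodes as
# possible using first-fit decreasing. Not Slurm's actual algorithm.

def pack_jobs(job_gpus, gpus_per_node):
    """Return a list of nodes, each a list of job GPU counts."""
    nodes = []  # each entry: [free_gpus, [job_gpu_counts]]
    for need in sorted(job_gpus, reverse=True):  # biggest jobs first
        for node in nodes:
            if node[0] >= need:          # fits on an existing node
                node[0] -= need
                node[1].append(need)
                break
        else:                            # no room anywhere: open a new node
            nodes.append([gpus_per_node - need, [need]])
    return [jobs for _, jobs in nodes]
```

For example, jobs needing 1, 1, 2, and 4 GPUs pack onto two 4-GPU nodes rather than spreading across three.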

-Paul Edmon-

Comment 11 Ben Roberts 2023-05-29 14:15:45 MDT
Hi Paul,

I'd like to clarify a little more.  When you're talking about fragmentation, you mean at the node level rather than at the CPU level.  Is that right?  As an example, I mean that you're worried about a 5 node job preempting other jobs so that you can get 5 contiguous nodes for this job to start on, as opposed to a case where you want preemption to give you 5 contiguous CPUs on the same node.

Assuming you are talking about preempting sets of contiguous nodes, is this something that your organization has enough interest in to sponsor the development of this functionality?

Thanks,
Ben
Comment 12 Paul Edmon 2023-05-30 07:49:08 MDT
Well, both, actually. In our case we are concerned about GPUs. For 
instance, what we see happening is the following. We have two 
partitions: a GPU partition named gpu, and an underlying partition 
called requeue. The gpu partition is the higher priority of the two, 
and people submit jobs to it that are single-GPU, multi-GPU single-node, 
and multi-GPU multi-node. The requeue partition, on the other hand, can 
accept GPU and CPU jobs, but those jobs must be constrained to a single 
node (so you could have a multi-core or multi-GPU job, but it can't 
span multiple nodes).

In normal operation the scheduler will try to fill up both the gpu 
partition and the requeue partition, with neither preempting the other. 
However, since the gpu partition accepts both single-GPU jobs and 
multi-GPU jobs that don't fill up a node, you can end up in a situation 
like this:

node 1:
  gpu job: 1 GPU
  gpu job: 1 GPU
  requeue job: 1 GPU
  requeue job: 1 GPU

node 2:
  gpu job: 2 GPUs, 1 node
  requeue job: 1 GPU
  requeue job: 1 GPU

pending:
  gpu job: 4 GPUs, 1 node

You can see that in this scenario the pending job could have started if 
all the requeue jobs had ended up on one node and all the gpu jobs on 
the other, but because of fragmentation it's blocked. If the scheduling 
logic had instead prioritized defragmentation when it placed the 2-GPU 
job, it would have preempted the two requeue jobs on node 1 rather than 
just putting the job on the nearest open resources it could find.

Now, in this scenario we are also doing something a little different 
from what I described originally, in that the preemption would happen 
even if there were open slots on, say, node 2, because it would value 
defragmenting over not preempting jobs.

So there are two styles of preemption ordering. The first is where 
defragmentation is a consideration in the hierarchy of deciding which 
jobs to preempt when preemption has to happen, i.e., there are no open 
slots that work without requeuing something. The second is to have it 
preempt regardless of whether there are open slots, because it values 
defragmentation highly and will preempt jobs just to maintain a 
defragmented state (or at least to the extent it can).

We are interested in both scenarios. The first is less impactful for 
jobs in the requeue partition, as the scheduler would still try to 
schedule things until the partition was full and only then preempt, 
with defragmentation as a factor in choosing what to preempt. The 
second is more impactful, as it would always preempt jobs in order to 
reduce fragmentation in the higher-priority partition.

As for funding: the group I'm working with may be amenable to funding 
this, but I need a ballpark estimate (not an actual quote) to go back 
to them with, so they can figure out whether it fits their budget and 
whether they want to pursue this more earnestly.

-Paul Edmon-

Comment 13 Ben Roberts 2023-05-31 11:16:41 MDT
Hi Paul,

I think it would be hard to get support for the behavior you're describing, where jobs on a shared node would be preempted before being placed on available resources on another node.  I do think there is something we could do that would give you this behavior without preemption.  We have the Multi-Category Security (MCS) Plugin that will allow you to have different types of jobs only run on nodes of a similar type.  The plugin was designed with security in mind, but the functionality will also work for the use case you're describing.

It would require that you have unique accounts created for each partition.  You could either require users to select the appropriate account based on the partition they're using, or you could create a submit filter that assigns the appropriate account based on the partition chosen.
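As a sketch of the submit-filter option, a job_submit.lua filter along these lines could map each partition to an account (this is a hypothetical, untested fragment that runs inside slurmctld with JobSubmitPlugins=lua configured; the partition and account names are invented and would need to match your accounting setup):

```lua
-- Hypothetical job_submit.lua sketch: force the job's account to match
-- its partition so the mcs/account plugin keeps each partition's jobs
-- on separate nodes. Partition/account names below are invented.
local part_to_account = {
    gpu     = "gpu_acct",
    requeue = "requeue_acct",
}

function slurm_job_submit(job_desc, part_list, submit_uid)
    local acct = part_to_account[job_desc.partition]
    if acct ~= nil then
        job_desc.account = acct
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```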

With the MCSPlugin configured to look at the accounts, the scheduler would only allow jobs with the same account to run on the same node, which would effectively reduce fragmentation for the case you've described.  Here's an example of how it might look.

I configure Slurm to use the mcs/account plugin with the MCSParameters that enforce the behavior rather than relying on users to opt-in to the behavior.

$ scontrol show config | grep -i mcs
MCSPlugin               = mcs/account
MCSParameters           = enforced,select,privatedata

You can see that I have users in different accounts.  I'm just showing the 'sub1' and 'sub2' accounts for this example.

$ sacctmgr show assoc tree format=cluster,account,user
   Cluster Account                    User 
---------- -------------------- ---------- 
    knight root                            
    knight  root                      root 
    knight  a1                             
    knight   sub1                          
    knight    sub1                     ben 
    knight    sub1                   user1 
    knight    sub1                   user2 
    knight   sub2                          
    knight    sub2                     ben 
    knight    sub2                   user2 
[snip...]

I submit jobs to two different accounts with the same user (ben).  The jobs go to different nodes rather than being allocated to the same one as they normally would.

$ sbatch -n1 -pdebug -Asub1 --wrap='srun sleep 120'
Submitted batch job 11053

$ sbatch -n1 -pdebug -Asub2 --wrap='srun sleep 120'
Submitted batch job 11054

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             11054     debug     wrap      ben  R       0:01      1 node11
             11053     debug     wrap      ben  R       0:05      1 node10

I become user1 and user2 and submit to the sub1 and sub2 accounts respectively.  These jobs go to the nodes that already have a job from the same account running on them.

user1@kitt:~$ sbatch -n1 -pdebug -Asub1 --wrap='srun sleep 120'
Submitted batch job 11055

user2@kitt:~$ sbatch -n1 -pdebug -Asub2 --wrap='srun sleep 120'
Submitted batch job 11056

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             11056     debug     wrap    user2  R       0:20      1 node11
             11054     debug     wrap      ben  R       0:42      1 node11
             11055     debug     wrap    user1  R       0:29      1 node10
             11053     debug     wrap      ben  R       0:46      1 node10

You can find more information on the MCS Plugin here:
https://slurm.schedmd.com/mcs.html

Does this sound like it would work for you to address that aspect of the behavior you want to change?  Feel free to let me know if you have any questions about how implementation might look or scenarios you think it might not cover.

I'm not sure about cost at this point.  I can get in touch with our sales team for more information there, but they would want to know the scope of the work to be done, so I would like to work out whether this sounds like it would cover part of what you are trying to do.

Thanks,
Ben
Comment 14 Paul Edmon 2023-05-31 11:39:14 MDT
Interesting. That's a use of MCS I hadn't thought of. We do have MCS 
turned on, but given the nature of our cluster we couldn't universally 
enforce it. That said, in this case the entire partition is used by a 
single lab, so this might be workable by having their users use MCS 
for their jobs.

I will ask them and see what they think.

-Paul Edmon-

Comment 15 Ben Roberts 2023-06-21 14:13:08 MDT
Hi Paul,

I wanted to follow up and see if you've had a chance to discuss the possibility of using MCS to consolidate different types of jobs.  Let me know if there's anything else I can do to help in this ticket.

Thanks,
Ben
Comment 16 Paul Edmon 2023-06-22 07:25:09 MDT
Yes, the group that I was working with decided to go with the MCS 
solution. Thanks for all the help!

-Paul Edmon-

Comment 17 Ben Roberts 2023-06-22 08:05:54 MDT
Great, I'm glad that worked for you.  Let us know if there's anything else we can do to help.

Thanks,
Ben
Comment 18 Ben Roberts 2023-06-22 08:06:14 MDT
Closing