| Summary: | Partition Node Preemption Tuning | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | Scheduling | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | lyeager, tim |
| Version: | 23.02.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Harvard University | | |
|
Description
Paul Edmon
2023-05-22 09:45:11 MDT

Ben Roberts (Comment 1):

Hi Paul,

We don't have a mechanism that would make preemption prioritize reducing fragmentation when it's looking for eligible jobs to preempt. It is possible that there are some flags set that could be causing behavior you're not expecting. You mention that Slurm tries to preserve lower-priority jobs as long as possible before preempting them. This could be due to a setting called PreemptExemptTime, which tells the scheduler not to consider jobs eligible for preemption until they have run for at least a specified amount of time:
https://slurm.schedmd.com/slurm.conf.html#OPT_PreemptExemptTime

This could be causing the behavior you're describing. If you unset that parameter, any job becomes eligible for preemption at any time, rather than being guaranteed a minimum run time first.

There are also a few preemption-specific parameters that could affect the behavior. Among these is a 'youngest_first' option. If jobs are exempt from preemption for some amount of time and the scheduler tries to preempt the youngest jobs first, that combination could prevent any preemption from happening for a while:
https://slurm.schedmd.com/slurm.conf.html#OPT_PreemptParameters

There is also an option called 'reorder_count' that controls how many times the scheduler reorders the list of preemptable jobs to try to minimize the number of jobs preempted. This won't necessarily reduce fragmentation of the partition, but it could help by reducing how often one or two nodes are left free after preemption, only to be filled by queued jobs.

Let me know if you have any of these parameters set already, or if it sounds like any of them would be beneficial for you. I'm happy to answer any additional questions you might have about them.

Thanks,
Ben

Paul Edmon:

Just for my information: if you don't have any preemption ordering flags set, how does the scheduler decide what to preempt? Does it by default try to preempt in a way that minimizes fragmentation?
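The preemption parameters discussed above (PreemptExemptTime, youngest_first, reorder_count) live in slurm.conf. A hedged sketch of how they might be set follows; the values are illustrative only, not taken from this site's configuration, and the available options vary by Slurm version (see the slurm.conf page linked above):

```
# Illustrative slurm.conf fragment -- not this site's actual config.
# Partition-based preemption, preempted jobs are requeued:
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Jobs must run at least 10 minutes before becoming preemptable.
# Removing this line makes every job immediately eligible.
PreemptExemptTime=00:10:00

# Prefer the youngest jobs as victims, and re-sort the candidate list
# twice to try to minimize the number of jobs preempted.
PreemptParameters=youngest_first,reorder_count=2
```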
Ben Roberts (Comment 3):

It depends on whether you're using QOS-based or partition-based preemption, but you can see a brief description of how preemption orders the candidate jobs here:
https://slurm.schedmd.com/preempt.html#operation

Let me know if you have questions beyond what the documentation covers.

Thanks,
Ben

Paul Edmon:

Right now we have set (this is for 22.05.7):

pack_serial_at_end,\
preempt_strict_order,\
preempt_youngest_first,\

So I'm guessing that strict_order and youngest_first would increase the fragmentation caused by lower-priority jobs, since Slurm would be looking to preempt the lowest-priority and youngest jobs first instead of preempting whatever would best fill the nodes for the higher-priority partition. I suppose I can try removing both of these and see if that makes a positive difference. It's sort of a shame, but since the jobs being preempted are preemptable anyway, trading strict_order and youngest_first for less fragmentation may be a favorable trade-off.

Thanks. That's very informative. For the record, we use partition-based preemption, not QOS. What we would like to request is a feature that adds the additional constraint, or consideration, of minimizing fragmentation. The order of consideration might go something like this: can you schedule without preemption? If so, schedule. If not, minimize the following:

1. Fragmentation in the partition being scheduled for
2. Number of jobs preempted
3. Size of jobs preempted
4. Priority of jobs preempted
5. Age of jobs preempted

So this adds fragmentation as an additional factor for consideration. It would also be nice to be able to order which of these is most important when solving the optimization problem.

Ben Roberts (Comment 6):

Hi Paul,

Let me clarify one thing before discussing internally whether this is something we are interested in and able to add to Slurm. When you say you want to minimize fragmentation, is that just based on the numerical order of the node names, or are you using a topology tree where the placement of the nodes on switches would be the primary concern?

Thanks,
Ben

Paul Edmon:

For us it would be by node name, though I could see cases where someone would start caring about topology. In this instance we are just looking for the simplest version of fragmentation minimization, namely packing all the jobs onto the minimum number of nodes.

Ben Roberts (Comment 11):

Hi Paul,

I'd like to clarify a little more. When you're talking about fragmentation, you mean at the node level rather than at the CPU level, is that right? As an example, I mean that you're worried about a 5-node job preempting other jobs so that it can get 5 contiguous nodes to start on, as opposed to a case where you want preemption to give you 5 contiguous CPUs on the same node.

Assuming you are talking about preempting sets of contiguous nodes, is this something that your organization has enough interest in to sponsor the development of this functionality?

Thanks,
Ben

Paul Edmon:

Both, actually. In our case we are concerned about GPUs. For instance, here is what we see happening. We have two partitions: a GPU partition named gpu and an underlying partition called requeue. The gpu partition is the higher priority of the two, and people submit jobs to it that are single-GPU, multi-GPU single-node, and multi-GPU multi-node. The requeue partition can take both GPU and CPU jobs, but those jobs must be constrained to a single node (so you can have a multi-core or multi-GPU job, but it can't span multiple nodes). In normal operation the scheduler tries to fill up both the gpu partition and the requeue partition, with neither requeuing the other.

However, since the gpu partition can take both single-GPU jobs and multi-GPU jobs that don't fill a node, you can end up in a situation like this:

node 1:
  gpu job: 1 gpu
  gpu job: 1 gpu
  requeue job: 1 gpu
  requeue job: 1 gpu

node 2:
  gpu job: 2 gpu, 1 node
  requeue job: 1 gpu
  requeue job: 1 gpu

Pending:
  gpu job: 4 gpu, 1 node

You can see that in this scenario the pending job could have run if all the requeue jobs had ended up on one node and all the gpu jobs on the other, but because of fragmentation it's blocked. If the requeue logic had instead prioritized defragmentation when it scheduled the 2-GPU job, it would have preempted the two requeue jobs on node 1 rather than just throwing the job onto the nearest open resource it could find. Note that this scenario also goes a bit beyond what I described originally, in that the preemption would happen even if there were open slots on, say, node 2, because the scheduler would value defragmenting over not preempting jobs.

So there are two styles of preemption ordering. In the first, defragmentation is one consideration in the hierarchy for deciding which jobs to preempt when preemption is unavoidable, i.e. there are no open slots that work without requeuing something. In the second, the scheduler preempts regardless of whether there are open slots, because it values defragmentation highly and will preempt jobs to maintain a defragmented state (or at least to the extent it can). We are interested in both. The first is less impactful for jobs in the requeue partition, as the scheduler would still try to schedule work until the partition was full and only then preempt, with defragmentation as a factor in choosing victims. The second is more impactful, as it would always preempt in order to reduce fragmentation in the higher-priority partition.

As for funding: the group I'm working with may be amenable to funding this, but I need a ballpark estimate (not an actual quote) to take back to them so they can figure out whether it fits their budget and whether they want to pursue this more earnestly.

Ben Roberts (Comment 13):

Hi Paul,
I think it would be hard to get support for the behavior you're describing, where jobs on a shared node would be preempted before being placed on available resources on another node. I do think there is something we could do that would give you this behavior without preemption. We have the Multi-Category Security (MCS) Plugin that will allow you to have different types of jobs only run on nodes of a similar type. The plugin was designed with security in mind, but the functionality will also work for the use case you're describing.
It would require that you have unique accounts created for each partition. You could either require users to select the appropriate account based on the partition they're using, or you could create a submit filter that assigns the appropriate account based on the partition chosen.
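Slurm's native mechanism for this kind of submit filter is a job_submit plugin (e.g. JobSubmitPlugins=lua). As a simpler stand-in, here is a hedged sketch of a wrapper script around sbatch that derives the account from the requested partition; the partition and account names are hypothetical, not from this ticket:

```shell
#!/bin/bash
# Hedged sketch: derive the submission account from the partition so that
# mcs/account groups jobs per partition. Names below are illustrative.

account_for_partition() {
    case "$1" in
        gpu)     echo "acct_gpu" ;;
        requeue) echo "acct_requeue" ;;
        *)       echo "" ;;   # unknown partition: no account override
    esac
}

# Example use (commented out so the file can be sourced without submitting):
# part="gpu"
# acct="$(account_for_partition "$part")"
# exec sbatch -p "$part" ${acct:+-A "$acct"} "$@"
```

A real deployment would more likely do this server-side in a job_submit plugin, so users can't bypass the mapping by calling sbatch directly.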
With the MCSPlugin configured to look at the accounts, the scheduler would only allow jobs with the same account to run on the same node, which would effectively reduce fragmentation for the case you've described. Here's an example of how it might look.
I configure Slurm to use the mcs/account plugin with MCSParameters that enforce the behavior rather than relying on users to opt in to it.
$ scontrol show config | grep -i mcs
MCSPlugin = mcs/account
MCSParameters = enforced,select,privatedata
You can see that I have users in different accounts. I'm just showing the 'sub1' and 'sub2' accounts for this example.
$ sacctmgr show assoc tree format=cluster,account,user
Cluster Account User
---------- -------------------- ----------
knight root
knight root root
knight a1
knight sub1
knight sub1 ben
knight sub1 user1
knight sub1 user2
knight sub2
knight sub2 ben
knight sub2 user2
[snip...]
I submit jobs to two different accounts with the same user (ben). The jobs go to different nodes rather than being allocated to the same one as they normally would.
$ sbatch -n1 -pdebug -Asub1 --wrap='srun sleep 120'
Submitted batch job 11053
$ sbatch -n1 -pdebug -Asub2 --wrap='srun sleep 120'
Submitted batch job 11054
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
11054 debug wrap ben R 0:01 1 node11
11053 debug wrap ben R 0:05 1 node10
I become user1 and user2 and submit to the sub1 and sub2 accounts respectively. These jobs go to the nodes that already have a job from the same account running on them.
user1@kitt:~$ sbatch -n1 -pdebug -Asub1 --wrap='srun sleep 120'
Submitted batch job 11055
user2@kitt:~$ sbatch -n1 -pdebug -Asub2 --wrap='srun sleep 120'
Submitted batch job 11056
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
11056 debug wrap user2 R 0:20 1 node11
11054 debug wrap ben R 0:42 1 node11
11055 debug wrap user1 R 0:29 1 node10
11053 debug wrap ben R 0:46 1 node10
You can find more information on the MCS Plugin here:
https://slurm.schedmd.com/mcs.html
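As a side note, if you try this, the MCS label the scheduler attached to each job can be inspected after submission. A hedged sketch using standard Slurm query commands (the job ID is illustrative):

```
# Show the MCS label attached to a single job (appears as MCS_label):
scontrol show job 11053 | grep -o "MCS_label=[^ ]*"

# Or list labels across the queue:
squeue -O jobid,username,mcslabel
```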
Does this sound like it would work for you to address that aspect of the behavior you want to change? Feel free to let me know if you have any questions about how implementation might look or scenarios you think it might not cover.
I'm not sure about cost at this point. I can get in touch with our sales team for more information there, but they would want to know the scope of the work to be done, so I would like to work out whether this sounds like it would cover part of what you are trying to do.
Thanks,
Ben
Paul Edmon:

Interesting. That's a use of MCS I hadn't thought of. We do have MCS turned on, but given the nature of our cluster we couldn't universally enforce it. That said, in this case the entire partition is used by a single lab, so this might be workable by having their users use MCS for their jobs. I will ask them and see what they think.

Ben Roberts (Comment 15):

Hi Paul,

I wanted to follow up and see if you've had a chance to discuss the possibility of using MCS to consolidate different types of jobs. Let me know if there's anything else I can do to help in this ticket.

Thanks,
Ben

Paul Edmon:

Yes, the group that I was working with decided to go with the MCS solution. Thanks for all the help!

Ben Roberts:

Great, I'm glad that worked for you. Let us know if there's anything else we can do to help.

Thanks,
Ben

Closing