Currently EnforcePartLimits=ALL only looks at definitions that are part of a partition, not the nodes in the partition. Thus if a user were to submit a job to two partitions asking for 200G, but one partition had nodes that were only 128G and the other had nodes that were 256G, it would still try to schedule the job in both partitions. From an admin point of view you would want it to reject that job, because one of the partitions can never schedule it and the job may end up blocking the other partition.

The description of MaxMemPerNode for a partition indicates it's really for oversubscription purposes, which means it's not ideal for partitions that are non-uniform. It would be better if EnforcePartLimits also made an internal list of the maximums for each partition and then used those for checking when users submit a job. Or else, if we must use MaxMemPerNode to get this behavior, is it acceptable to use it with a non-uniform partition? Will it oversubscribe?

All we really want is for jobs that can never run in a partition to be denied outright when doing a multi-partition submission. This is clearly already done when submitting to the partitions individually, so if someone does a multi-partition submission it should run the same logic as the individual case, but for each partition being submitted to. I'd rather not have to go through all 100+ partitions we run and set MaxMemPerNode, especially if that will cause oversubscription. This seems like something Slurm should autodetect.
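For reference, a minimal sketch of the kind of setup I mean (the node names, partition names, and memory sizes here are hypothetical, not our actual config):

    # slurm.conf excerpt -- "small" nodes have ~128G, "big" nodes have ~256G
    EnforcePartLimits=ALL

    NodeName=small[01-10] RealMemory=128000 CPUs=32
    NodeName=big[01-10]   RealMemory=256000 CPUs=32

    PartitionName=small_mem Nodes=small[01-10] State=UP
    PartitionName=big_mem   Nodes=big[01-10]   State=UP

    # Multi-partition submission asking for 200G per node; only big_mem can
    # ever satisfy this, yet the job is still queued against both partitions.
    sbatch --partition=small_mem,big_mem --mem=200G --wrap="sleep 60"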
> Currently EnforcePartLimits=ALL only looks at definitions that are part of
> a partition, not the nodes in the partition. Thus if a user were to submit a
> job to two partitions asking for 200G, but one partition had nodes that were
> only 128G and the other had nodes that were 256G, it would still try to
> schedule the job in both partitions. From an admin point of view you would
> want it to reject that job, because one of the partitions can never schedule
> it and the job may end up blocking the other partition.

Yes, this is the current behavior, so the scheduler will waste time trying to schedule the job in a partition that it can never run in. However, the backfill scheduler doesn't reserve resources for jobs on nodes that it can't schedule them on, so it shouldn't block the other partition. In other words, the backfill scheduler will only reserve nodes for a job where the job can run.

So really the concerns are the extra log messages and wasting some time in the scheduler.

> The description of MaxMemPerNode for a partition indicates it's really for
> oversubscription purposes, which means it's not ideal for partitions that
> are non-uniform. It would be better if EnforcePartLimits also made an
> internal list of the maximums for each partition and then used those for
> checking when users submit a job. Or else, if we must use MaxMemPerNode to
> get this behavior, is it acceptable to use it with a non-uniform partition?
> Will it oversubscribe?

No, it won't oversubscribe.

The documentation for MaxMemPerNode mentions that it is useful for an environment where oversubscription is used, but setting MaxMemPerNode doesn't enable oversubscription. So you can set MaxMemPerNode safely.

However, as you pointed out, it's hard to use MaxMemPerNode with a non-homogeneous partition.

We've actually had some discussions about this in bug 7248. One reason we're hesitant to change this is that EnforcePartLimits has never looked at node definitions, only partition definitions. This would be a change in behavior for EnforcePartLimits, and the name "EnforcePartLimits" may not make sense anymore. There's also been a proposal to add another parameter to cause the partition to be dropped or the job to be rejected outright.

This is still undecided and the discussion is happening internally in bug 7248. Actually, there has been some internal disagreement on the conclusion in bug 7248 comment 27. Would it be okay if I mark this as a duplicate of bug 7248?

I did find one issue with MaxMemPerNode. If I set MaxMemPerNode on a partition, the job's reason actually toggles between "MaxMemPerLimit" set by the backfill scheduler and "Resources" set by the main scheduler. I'll look at fixing this. If you're okay making this bug a duplicate of bug 7248, then I'll create a new public bug for this other issue and add you to CC if you want to track it. If you don't want to track it, I'll just create a private bug to handle it.
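If you do want to experiment with MaxMemPerNode in the meantime, a minimal sketch would be something like the following (hypothetical partition and node names; the value is in megabytes, so you would set it to the smallest node size in the partition):

    # slurm.conf: cap per-node memory requests at the smallest node in the
    # partition so over-sized requests are rejected for this partition.
    PartitionName=small_mem Nodes=small[01-10] MaxMemPerNode=128000 State=UP

You can also watch the pending reason for your jobs (e.g. the "MaxMemPerLimit"/"Resources" toggling I mentioned) with something like:

    # %i = job id, %P = partition, %r = pending reason
    squeue -u $USER -o "%.10i %.9P %.20r"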
Sure. That makes sense. Multi-partition submission is tricky business and I get the hesitancy. I look forward to future developments.

-Paul Edmon-
Okay, I've created bug 9085 to track the issue of the job reason toggling, and I'm closing this as a duplicate of bug 7248. Thanks for your input on this issue.

*** This ticket has been marked as a duplicate of ticket 7248 ***