| Summary: | Single core reservation on any node | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Davide Vanzo <davide.vanzo> |
| Component: | Scheduling | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 15.08.11 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Vanderbilt | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Davide Vanzo 2016-07-12 13:39:12 MDT

Comment #1 (Tim Wickberg)

Reservations are always full-node. If you have a reservation for a node, and the node "assigned" to that reservation becomes unavailable, the reservation logic will select another node to use; admittedly this may not work on a tight deadline if there are jobs in the way. There is no way to limit a reservation to just a single core at present.

You could potentially use preemption to ensure that resources become available if necessary: http://slurm.schedmd.com/preempt.html

The "NonStop" extension to Slurm may also be of interest: http://slurm.schedmd.com/nonstop.html . NonStop is not part of the normal Slurm distribution, but it is made available to SchedMD customers on request. Note that this would still reserve whole nodes as spares, not individual cores.

Comment #2 (Davide Vanzo)

I will probably create a QOS that bumps up the job priority. In this case, what is the parameter that corresponds to QOS_factor in the priority formula for the multifactor plugin? I'm asking because from the documentation it seems that "Priority" overrides the whole priority, while "UsageFactor" is only for accounting purposes...

DV

Comment #3 (Tim Wickberg)

(In reply to Davide Vanzo from comment #2)
> I will probably create a QOS that bumps up the job priority. In this case,
> what is the parameter that corresponds to QOS_factor in the priority formula
> for the multifactor plugin? I'm asking because from the documentation it
> seems that "Priority" overrides the whole priority, while "UsageFactor" is
> only for accounting purposes...

You'd want to set PriorityWeightQOS to something (potentially significantly higher than all the other PriorityWeight options), along with a higher Priority in the QOS itself.

Comment #4 (Davide Vanzo)

Tim,
So, my understanding was that the priority field in the QOS corresponds to the multiplication factor of PriorityWeightQOS in calculating the final job priority. However, that seems to work only for priority=0 and priority=1. For higher priority values the QOS contribution to the total priority will always be equal to PriorityWeightQOS. Here is an example:
$ scontrol show config | grep QOS
PriorityWeightQOS = 1000
$ scontrol show config | grep PriorityType
PriorityType = priority/multifactor
Here is the QOS:
Name Priority GraceTime PreemptMode Flags UsageFactor
cms_samtest_hp 100 00:00:00 cluster OverPartQOS 1.000000
and the user association:
User Def Acct Admin Cluster Account Partition Share QOS
vanzod accre Administ+ accre accre production 1 cms_samtest_hp
When I run a job, these are the priority contributions:
$ sprio -j 9478314
JOBID PRIORITY AGE FAIRSHARE JOBSIZE QOS
9478314 1052 0 52 0 1000
So where am I doing it wrong?
Davide
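For reference, a QOS and association like the ones shown above could be created with sacctmgr along these lines. This is only a sketch: the QOS name, flags, and user come from the listings in this ticket, and exact option syntax may vary across Slurm versions.

```shell
# Create a high-priority QOS that overrides the partition QOS
# (names taken from the listings above; commands are illustrative).
sacctmgr -i add qos cms_samtest_hp
sacctmgr -i modify qos cms_samtest_hp set Priority=100 Flags=OverPartQOS

# Attach the QOS to the user's association so their jobs can request it.
sacctmgr -i modify user vanzod set qos+=cms_samtest_hp

# A job would then request it explicitly:
sbatch --qos=cms_samtest_hp job.sh
```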
Comment #5 (Tim Wickberg)

I now get what's happening here.

Each factor in the multifactor priority scheme is the normalized value (0 to 1) for the given aspect, which is then multiplied by the PriorityWeight value.

As you increase the QOS Priority value, the normalization continues to reset the scale; the new larger factor is normalized back to one again, and lower Priority values would start being reduced.

I think you can address this by creating a 'dummy' QOS with an artificially high priority, then adjusting your PriorityWeight to compensate.

Comment #6 (Davide Vanzo)

(In reply to Davide Vanzo from comment #4)
> Tim,
> So, my understanding was that the priority field in the QOS corresponds to
> the multiplication factor of PriorityWeightQOS in calculating the final job
> priority. However, that seems to work only for priority=0 and priority=1.
> For higher priority values the QOS contribution to the total priority will
> always be equal to PriorityWeightQOS. Here is an example:
>
> $ scontrol show config | grep QOS
> PriorityWeightQOS = 1000
>
> $ scontrol show config | grep PriorityType
> PriorityType = priority/multifactor
>
> Here is the QOS:
>
> Name Priority GraceTime PreemptMode Flags UsageFactor
> cms_samtest_hp 100 00:00:00 cluster OverPartQOS 1.000000
>
> and the user association:
>
> User Def Acct Admin Cluster Account Partition Share QOS
> vanzod accre Administ+ accre accre production 1 cms_samtest_hp
>
> When I run a job, these are the priority contributions:
>
> $ sprio -j 9478314
> JOBID PRIORITY AGE FAIRSHARE JOBSIZE QOS
> 9478314 1052 0 52 0 1000
>
> So where am I doing it wrong?
>
> Davide

Now it is clear to me how it works. I would suggest adding this explanation to the multifactor plugin documentation.

You can close the ticket now. Have a great weekend!

Davide

Comment #7 (Davide Vanzo)

(In reply to Tim Wickberg from comment #5)
> I now get what's happening here.
>
> Each factor in the multifactor priority scheme is the normalized value (0 to
> 1) for the given aspect, which is then multiplied by the PriorityWeight
> value.
>
> As you increase the QOS Priority value, the normalization continues to reset
> the scale; this new larger factor is now normalized back to one again. Lower
> priority values would start being reduced.
>
> I think you can address this by creating a 'dummy' QOS with an artificially
> high priority, then adjusting your PriorityWeight to compensate.

Tim,

I reopen this old ticket to see if anything has changed since then. Unfortunately, bumping up the priority does not ensure that the job starts fast enough.

We are deploying Slurm on a new cluster and we are trying to set up a "floating" reservation again. Is there anything new? The only other option would be to have a pool of preemptable dummy jobs that keep resources available for the actual job to run.

Thanks

Davide

Comment #8 (Tim Wickberg)

> We are deploying Slurm on a new cluster and we are trying to set up a
> "floating" reservation again. Is there anything new?
> The only other option would be to have a pool of pre-emptable dummy jobs
> that keep resources available for the actual job to run.

Using a FLEX reservation, along with some additional logic to automatically remove the FLEX flag from the reservation on job submittal, may give you another way to get similar behavior.

Comment #9 (Tim Wickberg)

Some other flags may assist with this as well - take a look at the REPLACE flag.
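The normalization Tim describes in comment #5 can be illustrated with a quick back-of-the-envelope calculation. The numbers come from the sprio output above; the assumption that cms_samtest_hp is the highest-priority QOS on the cluster is mine.

```shell
# Each factor is normalized against the highest value present, then scaled
# by the PriorityWeight. With cms_samtest_hp as the highest-priority QOS,
# its factor is 100/100 = 1.0 regardless of the absolute Priority value,
# so the QOS contribution saturates at PriorityWeightQOS.
qos_priority=100
max_qos_priority=100     # highest QOS Priority currently on the cluster
weight=1000              # PriorityWeightQOS from slurm.conf
awk -v p="$qos_priority" -v m="$max_qos_priority" -v w="$weight" \
    'BEGIN { printf "QOS contribution: %d\n", (p / m) * w }'
# Prints: QOS contribution: 1000

# Adding a 'dummy' QOS with Priority=1000 changes the normalization:
# cms_samtest_hp would then contribute (100 / 1000) * 1000 = 100.
```

This is why the sprio output above always shows 1000 in the QOS column: with a single QOS in play, any nonzero Priority normalizes to 1.0.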
Comment #10 (Davide Vanzo)

Tim,

Could you please provide more specific descriptions of how we could use the suggested options to achieve our goal?

Thanks

DV

Comment #11 (Tim Wickberg)

Sorry, I misspoke previously - I was thinking of TIME_FLOAT instead of FLEX.

Assuming you have an SLA to hold to, you can set up a reservation with TIME_FLOAT for now + (some number of minutes). This would still allow jobs to backfill onto those nodes, but once the TIME_FLOAT flag is removed from the reservation, you'd be able to guarantee that nodes become available within a certain time frame.

With REPLACE, if your concern is that the node the reservation lands on may become unavailable during the reservation, Slurm would pick different nodes as soon as resources freed up again to satisfy the reservation.

I'm a little unclear on your exact use case - if you're able to describe some of the motivation behind this and the required behavior, I may be able to come up with something else. With any of the options I've suggested, there's a bit of external scripting you may need to provide - we're stepping outside of the usual behavior here and you'd be modifying it to suit.

Longer-term, the better I understand the end goal, the easier it would be to make an enhancement request to suit, and the more likely it would get addressed.

- Tim

Comment #12 (Davide Vanzo)

Tim,

Thanks for the explanation.

The underlying problem is the following. We have a test system that submits a job every 5-10 minutes to check certain cluster functionality. If the test run by the job does not return its answer within a few minutes of job submission, the cluster is flagged as inoperative, and this causes issues upstream. Even when such a job is submitted like any other job and bumped to the top of the priority list, we have seen that the wait time was still excessive when cluster utilization was high.

What we need is to have at least one CPU core always available for the test job, no matter on which node.

Davide

Comment #13 (Tim Wickberg)

(In reply to Davide Vanzo from comment #12)
> Tim,
>
> Thanks for the explanation.
>
> The underlying problem is the following. We have a test system that submits
> a job every 5-10 minutes to check certain cluster functionality. If the test
> run by the job does not return its answer within a few minutes of job
> submission, the cluster is flagged as inoperative, and this causes issues
> upstream. Even when such a job is submitted like any other job and bumped to
> the top of the priority list, we have seen that the wait time was still
> excessive when cluster utilization was high.
>
> What we need is to have at least one CPU core always available for the test
> job, no matter on which node.

Okay - that makes sense.

I'd suggest you create a reservation for 5 minutes in the future reserving one core, with TIME_FLOAT set on it. Your monitoring job should then be submitted against that hidden partition, and have a small time limit set on it. That should lead to it being backfilled into the 'shadow' of that reservation, assuming your bf_resolution is small enough for that slot to be found. (If not, move the reservation further into the future.)

I think that'll give you the behavior you're looking for, and it only requires one reservation and no additional scripting.

If you find that user jobs block that out, make sure to have something (a QOS, potentially) in place to increase that job's priority significantly, or consider adding a hidden partition with a higher PriorityTier setting. That could lead to some unexpected scheduling delays on the rest of the system, however - please test carefully. The reservation on its own shouldn't cause any problems.

- Tim

Comment #14 (Davide Vanzo)

Tim,

Let us run some tests and I will let you know if this solution works.

DV

Comment #15 (Tim Wickberg)

I realized I hadn't updated this in a bit - tagging resolved/infogiven now. Please reopen if you have any further questions.

- Tim
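Tim's final suggestion might look something like the following sketch. The reservation name and job script name are invented for illustration, and CoreCnt-based reservations assume a core-aware select plugin (e.g. select/cons_res); exact behavior should be verified against the Slurm version in use.

```shell
# A one-core reservation floating 5 minutes ahead of the current time.
# TIME_FLOAT keeps the start time relative to "now", so the reservation
# never actually starts - it just keeps a one-core hole open that long
# jobs cannot backfill into.
scontrol create reservation ReservationName=healthcheck \
    StartTime=now+5minutes Duration=5 User=vanzod \
    CoreCnt=1 Flags=TIME_FLOAT

# The monitoring job is not submitted against the reservation itself; it
# only needs a time limit short enough to backfill into the reservation's
# 'shadow' before the floating start time.
sbatch --time=00:03:00 healthcheck_job.sh
```

The key constraint, per Tim's note above, is that bf_resolution must be small enough for the backfill scheduler to find the short slot; otherwise the reservation should be moved further into the future.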