Ticket 2893

Summary: Single core reservation on any node
Product: Slurm    Reporter: Davide Vanzo <davide.vanzo>
Component: Scheduling    Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 15.08.11   
Hardware: Linux   
OS: Linux   
Site: Vanderbilt

Description Davide Vanzo 2016-07-12 13:39:12 MDT
Hi guys,
we need to create a reservation for a single core for a job that needs immediate allocation. The problem is that Slurm automatically assigns a specific node to that reservation, while what we would like is to keep one core free no matter on which node. That way, if the reserved node goes down for any reason, the job will still be able to run.
I tried to find a way to do so but I could not find any. Suggestions?

Thanks!

Davide
Comment 1 Tim Wickberg 2016-07-12 13:47:02 MDT
Reservations are always full-node. If you have a reservation for a node and the node "assigned" to that reservation becomes unavailable, the reservation logic will select another node to use - admittedly, this may not work on a tight deadline if there are jobs in the way. There's no way to limit a reservation to just a single core at present.

You could potentially use preemption to ensure that resources become available if necessary:
http://slurm.schedmd.com/preempt.html
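As a sketch of what QOS-based preemption could look like here (the `standby` QOS name and filler-job script are illustrative, not from this ticket):

```shell
# In slurm.conf (requires slurmctld reconfigure/restart):
#   PreemptType=preempt/qos
#   PreemptMode=REQUEUE
# Define a low-priority QOS whose jobs can be preempted and requeued,
# then submit expendable filler jobs under it:
sacctmgr add qos standby set Priority=0 PreemptMode=requeue
sbatch --qos=standby --requeue filler_job.sh
```

The urgent job, submitted under a higher-priority QOS configured to preempt `standby`, could then displace a filler job immediately instead of waiting in the queue.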


The "NonStop" extension to Slurm may also be of interest - 
http://slurm.schedmd.com/nonstop.html .

NonStop is not part of the normal Slurm distribution, but it is made available to SchedMD customers on request. Note that it would still reserve whole nodes as spares, not individual cores.
Comment 2 Davide Vanzo 2016-07-12 14:33:50 MDT
I will probably create a QOS that bumps up the job priority. In this case, what is the parameter that corresponds to QOS_factor in the priority formula for the multifactor plugin? I'm asking because from the documentation it seems that "Priority" overrides the whole priority, while "UsageFactor" is only for accounting purposes...

DV
Comment 3 Tim Wickberg 2016-07-12 14:39:29 MDT
(In reply to Davide Vanzo from comment #2)
> I will probably create a QOS that bumps up the job priority. In this case,
> what is the parameter that corresponds to QOS_factor in the priority formula
> for the multifactor plugin? I'm asking because from the documentation it
> seems that "Priority" overrides the whole priority, while "UsageFactor" is
> only for accounting purposes...

You'd want to set PriorityWeightQOS to a nonzero value (potentially significantly higher than all the other PriorityWeight* settings), along with a higher Priority in the QOS itself.
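A sketch of the commands involved (the `hipri` QOS name is illustrative; the user/weight values mirror this ticket):

```shell
# Create a high-priority QOS and attach it to the user's association
sacctmgr add qos hipri set Priority=100
sacctmgr modify user vanzod set QOS+=hipri
# Give the QOS factor weight in slurm.conf:
#   PriorityWeightQOS=1000
# then apply the change:
scontrol reconfigure
```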
Comment 4 Davide Vanzo 2016-07-13 13:50:09 MDT
Tim,
So, my understanding was that the Priority field in the QOS corresponds to the multiplication factor of PriorityWeightQOS in calculating the final job priority. However, that seems to work only for priority=0 and priority=1. For higher priority values, the QOS contribution to the total priority is always equal to PriorityWeightQOS. Here is an example:

$ scontrol show config | grep QOS
PriorityWeightQOS       = 1000

$ scontrol show config | grep PriorityType
PriorityType            = priority/multifactor

Here is the QOS:

Name            Priority  GraceTime  PreemptMode  Flags        UsageFactor
cms_samtest_hp  100       00:00:00   cluster      OverPartQOS  1.000000

and the user association:

User    Def Acct   Admin      Cluster  Account  Partition   Share  QOS
vanzod  accre      Administ+  accre    accre    production  1      cms_samtest_hp

When I run a job, these are the priority contributions:

$ sprio -j 9478314
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
        9478314       1052          0         52          0       1000

So where am I doing it wrong?

Davide
Comment 5 Tim Wickberg 2016-07-21 14:27:06 MDT
I see what's happening here now.

Each factor in the multifactor priority scheme is the normalized value (0 to 1) for the given aspect, which is then multiplied by the PriorityWeight value.

The QOS factor is normalized against the highest Priority across all defined QOSes, so as you increase a QOS Priority value, the scale resets: the largest value always normalizes back to 1, and every lower-priority QOS is reduced proportionally.
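Concretely, the QOS contribution is PriorityWeightQOS times the job's QOS Priority divided by the highest Priority among all QOSes. A quick sketch of the arithmetic, using the values from this ticket:

```shell
# qos_contrib = PriorityWeightQOS * (qos_priority / max_qos_priority)
weight=1000     # PriorityWeightQOS
max_prio=100    # highest Priority among all defined QOSes
qos_prio=100    # Priority of this job's QOS (cms_samtest_hp)
awk -v w="$weight" -v m="$max_prio" -v p="$qos_prio" \
    'BEGIN { printf "%d\n", w * (p / m) }'
# -> 1000: whichever QOS holds the maximum always contributes the full
#    weight, which is why raising Priority past the max changes nothing.
```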

I think you can address this by creating a 'dummy' QOS with an artificially high priority, then adjusting your PriorityWeight to compensate.
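A sketch of that workaround (the `ceiling` QOS name and the specific numbers are illustrative):

```shell
# 'ceiling' QOS only pins the normalization scale; no jobs run under it.
sacctmgr add qos ceiling set Priority=10000
# Compensate with a larger weight in slurm.conf:
#   PriorityWeightQOS=100000
# A QOS with Priority=100 now contributes 100000 * (100/10000) = 1000,
# and one with Priority=200 contributes 2000 - values above 1 now differ.
```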

(In reply to Davide Vanzo from comment #4)
> Tim,
> So, my understanding was that the priority field in the QOS correspond to
> the multiplication factor of PriorityWeightQOS in calculating the final job
> priority. However that seems to work only for priority=0 and priority=1. For
> higher priority values the QOS contribution to the total priority will
> always be equal to PriorityWeightQOS. Here is an example:
> 
> $ scontrol show config | grep QOS
> PriorityWeightQOS       = 1000
> 
> $ scontrol show config | grep PriorityType
> PriorityType            = priority/multifactor
> 
> Here is the QOS:
> 
> Name            Priority  GraceTime  PreemptMode  Flags        UsageFactor
> cms_samtest_hp  100       00:00:00   cluster      OverPartQOS  1.000000
> 
> and the user association:
> 
> User    Def Acct   Admin      Cluster  Account  Partition   Share  QOS
> vanzod  accre      Administ+  accre    accre    production  1     
> cms_samtest_hp
> 
> When I run a job, this is the priority contributions:
> 
> $ sprio -j 9478314
>           JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
>         9478314       1052          0         52          0       1000
> 
> So where am I doing it wrong?
> 
> Davide
Comment 6 Davide Vanzo 2016-07-22 15:07:05 MDT
Now it is clear to me how it works.
I would suggest adding this explanation to the multifactor plugin documentation.
You can close the ticket now.

Have a great weekend!

Davide


(In reply to Tim Wickberg from comment #5)
> I now get what's happening here.
> 
> Each factor in the multifactor priority scheme is the normalized value (0 to
> 1) for the given aspect, which is then multiplied by the PriorityWeight
> value.
> 
> As you increase the QOS Priority value, the normalization continues to reset
> the scale; this new larger factor is now normalized back to one again. Lower
> priority values would start being reduced.
> 
> I think you can address this by creating a 'dummy' QOS with an artificially
> high priority, then adjusting your PriorityWeight to compensate.
Comment 8 Davide Vanzo 2018-06-20 11:30:15 MDT
Tim,

I'm reopening this old ticket to see if anything has changed since then. Unfortunately, bumping up the priority does not ensure that the job starts fast enough.

We are deploying Slurm on a new cluster and we are trying to set up a "floating" reservation again. Is there anything new?
The only other option would be to have a pool of pre-emptable dummy jobs that keep resources available for the actual job to run.

Thanks

Davide
Comment 9 Tim Wickberg 2018-06-27 11:33:50 MDT
> We are deploying Slurm on a new cluster and we are trying to set up a
> "floating" reservation again. Is there anything new?
> The only other option would be to have a pool of pre-emptable dummy jobs
> that keep resources available for the actual job to run.

Using a FLEX reservation, along with some additional logic to automatically remove the FLEX flag from the reservation on job submission, may give you another way to get similar behavior.

Some other flags may assist with this as well - in particular, take a look at the REPLACE flag.
Comment 10 Davide Vanzo 2018-06-27 12:22:21 MDT
Tim,

Could you please provide more specific descriptions on how we could use the suggested options to achieve our goal?

Thanks

DV
Comment 11 Tim Wickberg 2018-06-27 13:51:39 MDT
Sorry, I misspoke previously - I was thinking of TIME_FLOAT instead of FLEX.

Assuming you have an SLA to hold to, you can set up a reservation with TIME_FLOAT for now + (some number of minutes). Jobs could still backfill onto those nodes, but once the TIME_FLOAT flag was removed from the reservation, you'd be guaranteed that nodes become available within that time frame.

With REPLACE, if your concern is that the node the reservation lands on may become unavailable during the reservation, Slurm will pick different nodes as soon as resources free up again to satisfy the reservation.
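A sketch of what those two reservations might look like (names, users, and times are illustrative):

```shell
# Floating reservation that stays 60 minutes ahead of the current time
# until its TIME_FLOAT flag is removed:
scontrol create reservation ReservationName=canary \
    StartTime=now+60minutes Duration=infinite NodeCnt=1 \
    Users=vanzod Flags=TIME_FLOAT
# Self-healing reservation: re-pick nodes if the assigned one goes down:
scontrol create reservation ReservationName=spare \
    StartTime=now Duration=infinite NodeCnt=1 \
    Users=vanzod Flags=REPLACE
```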

I'm a little unclear on your exact use case - if you're able to describe some of the motivation behind this and required behavior I may be able to come up with something else.

But with any of these I've suggested, there's a bit of external scripting you may need to provide - we're stepping outside of the usual behavior here and you'd be modifying it to suit.

Longer-term, the better I understand the end goal, the easier it would be to make an enhancement request to suit, and the more likely it'd get addressed.

- Tim
Comment 12 Davide Vanzo 2018-06-27 14:34:54 MDT
Tim,

Thanks for the explanation.

The underlying problem is the following. We have a test system that submits a job every 5-10 minutes to check certain cluster functionality. If the test run in the job does not return its answer within a few minutes of job submission, the cluster is flagged as inoperative, which causes issues upstream. Even when such a job is submitted like any other job and bumped to the top of the priority list, we found the wait time was still excessive when cluster utilization was high.

What we need is to have at least one CPU core always available for the test job, no matter on which node.

Davide
Comment 13 Tim Wickberg 2018-06-27 14:40:53 MDT
(In reply to Davide Vanzo from comment #12)
> Tim,
> 
> Thanks for the explanation.
> 
> The underlying problem is the following. We have a test system that submits
> a job every 5-10 minutes to check certain cluster functionality. If the test
> ran in the job does not return its answer back within few minutes from job
> submission, the cluster is flagged as inoperative and this causes issues
> upstream. If such job is submitted like any other job and even if it is
> bumped at the top of the priority list, we experienced that the wait time
> was still excessive when cluster utilization was high.
> 
> What we need is to have at least one CPU core always available for the test
> job, no matter on which node.

Okay - that makes sense.

I'd suggest you create a reservation for 5 minutes in the future reserving one core, with TIME_FLOAT set on it.

Your monitoring job should then be submitted against that reservation, with a small time limit set on it. That should lead to it being backfilled into the 'shadow' of the reservation, assuming your bf_resolution is small enough for that slot to be found. (If not, move the reservation further into the future.)

I think that'll give you the behavior you're looking for, and only require one reservation and no additional scripting.

If you find that user jobs block that out, make sure to have something (a QOS, potentially) in place to increase that job's priority significantly, or consider adding a hidden partition with a higher PriorityTier setting - although that could lead to some unexpected scheduling delays on the rest of the system, so please test carefully. The reservation on its own shouldn't cause any problems.
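Putting the suggestion together as a sketch (reservation name, user, and the health-check script are illustrative):

```shell
# Reservation held 5 minutes ahead of 'now'; short jobs can backfill
# into its shadow as long as they finish before it would start.
scontrol create reservation ReservationName=monitor \
    StartTime=now+5minutes Duration=infinite \
    NodeCnt=1 CoreCnt=1 Users=vanzod Flags=TIME_FLOAT
# Monitoring job with a tight time limit so backfill can place it:
sbatch --time=00:03:00 healthcheck.sh
```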

- Tim
Comment 14 Davide Vanzo 2018-06-27 14:44:52 MDT
Tim,

Let us run some tests and I will let you know if this solution works.

DV
Comment 15 Tim Wickberg 2018-08-20 19:17:11 MDT
I realized I hadn't updated this in a bit - tagging resolved/infogiven now. Please reopen if you have any further questions.

- Tim