Ticket 11385

Summary: Time spent in reserved state
Product: Slurm Reporter: Michael Ver Haag <verhamp1>
Component: Configuration Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
Site: Johns Hopkins University Applied Physics Laboratory Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: Slurm.conf
Slurm.conf cont'd

Description Michael Ver Haag 2021-04-15 07:28:30 MDT
Created attachment 18984 [details]
Slurm.conf

We are trying to assess the effectiveness of our scheduler configuration. 
When we take a look at our environment, we typically see 40-50% of our "sreport cluster utilization" in the reserved state. 
I suspect there are two problems here:
1. Understanding 
The description of this category is too light to understand all the reasons a node would be marked as Reserved. Can we get a deeper description of what the reserved state captures?
2. Function
Ideally we would like to reduce the amount of time in this state and get more into the allocated space. 
Our job distribution has a strong mix of HTC and HPC but we are attempting to be agnostic on priority of that work.
Can we get some guidance or suggestion on how to adjust the scheduler to minimize the wasted cycles there?
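For context, the figures discussed here come from a report along these lines; the date range, cluster name, and numbers in the sample below are hypothetical, only the column layout reflects sreport's cluster utilization output:

```shell
# The report in question (dates illustrative):
#   sreport cluster utilization start=2021-04-01 end=2021-04-15 -t percent
# Hypothetical output, to show where the Reserved column sits:
header='Cluster Allocated Down PLND-Down Idle Reserved Reported'
row='apl 48.12 1.03 0.00 8.85 42.00 100.00'
printf '%s\n%s\n' "$header" "$row"
```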
Comment 1 Michael Ver Haag 2021-04-15 07:29:09 MDT
Created attachment 18985 [details]
Slurm.conf cont'd
Comment 2 Albert Gil 2021-04-19 04:45:49 MDT
Hi Michael,

> 1. Understanding 
> The description of this category is too light to understand all the
> reasons a node would be marked as Reserved. Can we get a deeper
> description of what the reserved state captures?

Yes, one could argue that the Reserved category is actually something quite internal to Slurm, and in general you can group it with Idle.
The main difference is that internally the scheduler has "reserved" the node for a job in the queue, and other jobs with lower priority cannot use it (unless their time limits allow them to be backfilled).

Note that we have a (private) enhancement request on bug 7592 to improve this, and as you can see in bug 9869, we also mention this on slide 39 of the last SLUG roadmap presentation:
https://slurm.schedmd.com/SLUG20/Roadmap.pdf

> 2. Function
> Ideally we would like to reduce the amount of time in this state and get
> more into the allocated space. 
> Our job distribution has a strong mix of HTC and HPC but we are attempting
> to be agnostic on priority of that work.
> Can we get some guidance or suggestion on how to adjust the scheduler to
> minimize the wasted cycles there?

Regarding HTC, in case you haven't seen them, my initial suggestion would be to take a look at these SLUG'19 slides:
https://slurm.schedmd.com/SLUG19/High_Throughput_Computing.pdf

Related to reducing the Reserved time, I would say that the key is backfilling as many jobs as possible.
There are several parameters we could look at, but because most of your partitions have MaxTime=INFINITE, I'm wondering if your users tend to request time limits that are too long.
That is, if TimeLimit(Raw) tends to be much bigger than Elapsed(Raw), then the backfill scheduler won't be able to schedule/backfill some jobs, because they requested much more time than they actually needed.
Could that be your case?
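One quick way to put a number on this, as a sketch: aggregate TimeLimitRaw against ElapsedRaw from sacct. The sacct flags are standard, but the date range and the sample figures below are invented; note that sacct reports TimeLimitRaw in minutes and ElapsedRaw in seconds:

```shell
# Live query would be something like (date range illustrative):
#   sacct -X -a --starttime=2021-04-01 -n -P -o TimeLimitRaw,ElapsedRaw
# TimeLimitRaw is in minutes, ElapsedRaw is in seconds.
# Hypothetical sample standing in for the sacct output:
sample='60|360
1440|5400
10|300'
eff=$(printf '%s\n' "$sample" | awk -F'|' \
  '{ limit += $1 * 60; used += $2 }
   END { printf "%.0f", 100 * used / limit }')
echo "request efficiency: ${eff}%"
```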

Regards,
Albert
Comment 3 Michael Ver Haag 2021-04-21 06:25:16 MDT
I am guessing that the enhancement will be widely available in a future release; do you have a targeted 20.11 release for that feature?

I do recall the presentation on HTC but thank you for the reminder. We have implemented a few of the fixes (i.e. upgrading the scheduler clock speed) to get better throughput.

My first gut instinct was over requesting time. 
We dropped our default partition time to 10 minutes to help address that problem. It doubled our request efficiency, but we are still having a bit of an issue getting users to choose wisely (~9% request efficiency).

We had hoped that the backfill scheduler would take up the slack once we dropped the defaults. Is there a way to easily see which scheduler scheduled a job? I know I can use -F and make several queries with sacct, but is there a way to simply print the scheduler with the -o option? Would that be a simple feature request?

Are there effective mechanisms to determine what types of jobs or which users are causing the scheduler to spend a lot of time in the reserved state?
Comment 4 Albert Gil 2021-04-21 09:41:48 MDT
Hi Michael,

> I am guessing that the enhancement will be widely available in a future
> release; do you have a targeted 20.11 release for that feature?

Such new features are typically targeted for major releases.
This one is targeted for 21.08.

> I do recall the presentation on HTC but thank you for the reminder. We have
> implemented a few of the fixes (i.e. upgrading the scheduler clock speed) to
> get better throughput.

Good!
I'm curious to know if you noticed a performance improvement with that.

> My first gut instinct was over requesting time.

You have a good gut instinct! ;-)

> We dropped our default partition time to 10 minutes to help address that
> problem.

That's a good initial strategy.
But I would say that using MaxTime is also a key element.
With almost all partitions lacking a MaxTime, users don't feel they have any advantage in requesting smaller/adjusted times (unless they know about backfill).
If you define some partition or QoS strategy that allows smaller jobs to have more resources or more priority, then users may start trying to request the smallest possible time, to ensure that their jobs have (even) more chances to be scheduled.
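As a sketch of that kind of strategy, something like the following could work; the QoS names, wall limits, priority values, and user name are purely illustrative, not a recommendation:

```shell
# Two QoS: the short one gets higher priority, so small, well-sized
# requests are rewarded (all names and values hypothetical).
sacctmgr add qos short
sacctmgr modify qos short set MaxWall=02:00:00 Priority=1000
sacctmgr add qos long
sacctmgr modify qos long set MaxWall=7-00:00:00 Priority=100
# Grant users access to both QoS so they can choose:
sacctmgr modify user where name=alice set qos=short,long
# A partition can also restrict which QoS it accepts, in slurm.conf:
#   PartitionName=batch AllowQos=short,long ...
```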

> It doubled our request efficiency, but we are still having a bit of an
> issue getting users to choose wisely (~9% request efficiency).

Does this 9% mean that Elapsed tends to be 9% of the requested TimeLimit?
If so, I think this is the main metric to improve backfill performance and reduce the Reserved time.

> We had hoped that the backfill scheduler would take up the slack once we
> dropped the defaults.

Lowering the default typically forces some users to explicitly request more time than the default, but backfill uses the time requested by each job.
As users have no MaxTime limit and no clear reason to set lower values, they seem to be requesting 10 times more time than they actually need.

> Is there a way to easily see which scheduler
> scheduled a job? I know I can use -F and make several queries with sacct,
> but is there a way to simply print the scheduler with the -o option?

Yes, you can use the "-o Flags" option.
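For example, counting how many jobs were placed by backfill; the job IDs and sample output below are hypothetical, and the flag values shown (SchedMain, SchedBackfill) are my reading of the sacct Flags field, so verify them against your version's man page:

```shell
# Live query (date illustrative):
#   sacct -X --starttime=2021-04-01 -n -P -o JobID,Flags
# Hypothetical sample of that output:
sample='101|SchedBackfill
102|SchedMain
103|SchedBackfill'
backfilled=$(printf '%s\n' "$sample" | grep -c 'SchedBackfill')
echo "jobs placed by backfill: $backfilled"
```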

> Are there effective mechanisms to determine what types of jobs or which
> users are causing the scheduler to spend a lot of time in the reserved
> state?

Well, one could argue that this is almost impossible to tell.
Note that the jobs actually accruing the reserved time cannot always be "blamed" for it.
Let me give an example to clarify:

Imagine that you have 10 idle nodes.
Someone submits a job A requesting 11 nodes for 1 day, and later (or with less priority) another job B is submitted requesting 10 nodes for 2 days.
Technically, job A is the one reserving the nodes until 11 are available, but can we blame it?

Imagine that once both jobs have run and ended, it turns out that job A really ran steps/processes on those 11 nodes for a whole day, but job B only ran steps/processes on 1 node, and only for 1 hour.
In that case, job A cannot be blamed at all; job B is the one to blame, because with the right request it would have been backfilled.

On the other hand, if in the end it turns out that job A ran on a single node for only 1 hour, then we can blame it, because it wasted resources while they were reserved.

So, in general, I would say that the jobs/users that can be blamed for unnecessary reserved time (and less backfill) are the ones that request more than they need, and especially those that do so and have high priority.
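As a rough sketch of that idea, one could rank users by requested versus used time from accounting data; the user names, numbers, and date below are hypothetical, with TimeLimitRaw in minutes and ElapsedRaw in seconds as sacct reports them:

```shell
# Live query (date illustrative):
#   sacct -X -a --starttime=2021-04-01 -n -P -o User,TimeLimitRaw,ElapsedRaw
# Hypothetical sample in place of real accounting data:
sample='alice|60|3600
alice|60|300
bob|1440|600'
worst=$(printf '%s\n' "$sample" | awk -F'|' \
  '{ limit[$1] += $2 * 60; used[$1] += $3 }
   END { for (u in limit) printf "%s %.1f\n", u, limit[u] / used[u] }' \
  | sort -k2 -rn | head -n1 | awk '{ print $1 }')
echo "largest over-requester: $worst"
```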

Regards,
Albert
Comment 5 Albert Gil 2021-05-03 08:35:31 MDT
Hi Michael,

Just following up on this ticket.
I hope that my comments answered some of your questions.

Do you still need further support on this?

Regards,
Albert
Comment 6 Michael Ver Haag 2021-05-04 11:50:05 MDT
We have not had the chance to implement a hard time limit yet, but we do have a monthly patching requirement that ensures we are not encountering very long run-time requests (just pretty long ones). I will work on putting together a metric to understand the distribution of request times, to see what we can turn the limit down to.
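One way to sketch the distribution Michael mentions is to bucket requested time limits by hour; the sample values and date below are invented, and TimeLimitRaw is in minutes:

```shell
# Live query (date illustrative):
#   sacct -X -a --starttime=2021-05-01 -n -P -o TimeLimitRaw
# Hypothetical sample of requested limits, in minutes:
sample='30
90
90
600'
hist=$(printf '%s\n' "$sample" | awk \
  '{ count[int($1 / 60)]++ }
   END { for (b in count) printf "<=%dh:%d ", b + 1, count[b] }')
echo "$hist"
```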
Comment 7 Albert Gil 2021-05-05 03:24:09 MDT
Hi Michael,

> We have not had the chance to implement a hard time limit yet, but we do
> have a monthly patching requirement that ensures we are not encountering
> very long run-time requests (just pretty long ones). I will work on
> putting together a metric to understand the distribution of request times,
> to see what we can turn the limit down to.

Ok.
Please note that the main idea is not to *reduce the limit*, but to *encourage users to request shorter times* (or times closer to what they actually need). You can do this in many ways, for example with several QoS, where the ones with smaller time limits are also the ones with higher priority, or allow access to more resources/partitions, etc.
This way the end user can see the benefit of not asking for much more time than they really need, and that will help the scheduler optimize the utilization of resources.

Regards,
Albert
Comment 8 Albert Gil 2021-05-13 02:44:43 MDT
Hi Michael,

If this is ok for you I'm closing this ticket as infogiven, but please don't hesitate to reopen it if/when you need further support.

Regards,
Albert