We have jobs that use local disks as scratch space. We need to ensure there is sufficient free space (in real time) when such a job is dispatched to a given node. Is such a thing built in, similar to node memory? If a plugin is needed, could you provide an example?

Thanks
George
(In reply to George Hwa from comment #0)
> We have jobs that use local disks as scratch space. We need to ensure there
> is sufficient free space (in real time) when such a job is dispatched to a
> given node. Is such a thing built in, similar to node memory? If a plugin is
> needed, could you provide an example?
>
> Thanks
> George

Hi George,

One option would be to set a Prolog script (see Prolog in slurm.conf) which could check the disk space on the node for each step run for the job. It could return an error if the disk was full, which would result in the node being set to a DRAIN state and the job being requeued in a held state, unless nohold_on_prolog_fail was configured in SchedulerParameters, which would allow the job to be rescheduled on other node(s). But I don't recommend this approach.

My recommendation is to look at the nodes' status periodically with a health check program and drain the unhealthy ones. This program can run every HealthCheckInterval, before and after a job submission, and can mark the nodes to some desired state if something is wrong with them. One of the most widely used programs is LBNL Node Health Check (NHC) (external to SchedMD), a set of bash scripts that is easy to set up and checks the memory, disk, network interfaces, and so on, on a node. Using this approach you will ensure the health of your cluster, and also drain the nodes which have filled their disk space. Just look at 'man slurm.conf' for these parameters.

If instead what you want (I don't think so) is to account for used disk space, we should look at other approaches, like using a GRES.

Tell me if the NHC approach works for you.
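For completeness, the Prolog idea above could be sketched roughly like this. The mount point (/tmp here, likely /local or similar at your site) and the threshold are placeholders, and the script is an untested illustration, not a supported recipe:

```shell
#!/bin/bash
# Rough Prolog sketch (untested): refuse to start the job if the local
# scratch filesystem is low on free space. Mount point and threshold
# are placeholders -- adjust both for your site.
SCRATCH=/tmp
MIN_FREE_KB=1024   # placeholder: 1 MB; a real site might use 10 GB or more

# Free space, in 1K blocks, of the filesystem holding $1 (POSIX df output).
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

if [ "$(free_kb "$SCRATCH")" -lt "$MIN_FREE_KB" ]; then
    echo "prolog: not enough free space on $SCRATCH" >&2
    exit 1   # a non-zero Prolog exit drains the node and requeues the job
fi
```

As noted above, the side effects of a Prolog failure (DRAIN plus a held, requeued job) are why I don't recommend this approach.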
Felip,

GRES is what I'm looking for. The minimum requirement is to periodically get the "free" disk space and have SLURM hold the job if the available space is less than what that job requested. The nice-to-have is that SLURM would deduct the requested amount when dispatching the job, so as not to dispatch too many jobs to that node before all the jobs start competing for the available space. I'd expect it to work just like the memory resource.

Thanks
George
(In reply to George Hwa from comment #2)
> Felip,
>
> GRES is what I'm looking for.
> The minimum requirement is to periodically get the "free" disk space and let
> SLURM hold the job if available space is less than requested by that job.

That's achieved by the previously discussed solutions.

> The nice to have is that SLURM would deduct the requested amount at
> dispatching the job so as not to dispatch too many jobs to that node before
> all the jobs start to compete for the available space. I'd expect it to work
> just like the memory resource.

That's possible, but it requires a considerable amount of effort compared to the other solution, since GRES does not enforce any limit on its own. As mentioned, the GRES mechanism seems a good fit for your use case to manage the disk resources among jobs:

http://slurm.schedmd.com/gres.html

Let's say we create a new GRES type called 'storage' (disk is already used as a TRES). You'd want to set in slurm.conf:

GresTypes=storage
NodeName=node[000-099] ... Gres=storage:100GB
NodeName=node[100-200] ... Gres=storage:200GB

You will also need a gres.conf file that matches this config and looks like:

NodeName=node[000-099] Name=storage Count=100G
NodeName=node[100-200] Name=storage Count=200G

After that you'll need to restart slurmctld and slurmd to pick up the changes. Then you'll be able to specify your job requests as:

sbatch --gres=storage:1G

and have Slurm track the amount of space that has been allocated. You may want to force setting the storage, or set a default if the job didn't request any; this can be achieved by a simple job submit plugin (Lua). After that you will have a new resource on the nodes which will limit and manage the requests.

But note that, unlike the GPU plugin, your 'storage' GRES won't enforce anything on the node itself. You will need to build a 'storage' plugin to do that, or you could optionally manage this through an epilog script.
That means a job will still be able to use all the storage it wants.

To develop the plugin you can take inspiration from the others found in slurm/src/plugins/gres/, especially the GPU one. Note that SchedMD hasn't developed one, due to the complexity that can exist across different sites and setups. It's up to you, when you design the plugin, to decide which filesystems to control and which method to use for each of them (quota? over which fs: ext, gpfs, lustre...?). I know there was an attempt to introduce quota through cgroups, but it wasn't successful afaik:

https://lkml.org/lkml/2009/2/22/13

You can see some (much) older requests like yours, and other sites using a GRES for tmp storage, e.g.:

Bug 1671
Bug 2142
Bug 2549
Bug 5645

I recommend reading through:

https://slurm.schedmd.com/gres_design.html
https://slurm.schedmd.com/gres_plugins.html
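As a rough illustration of the epilog alternative mentioned above: an Epilog that deletes the job's scratch directory would make completed jobs give their 'storage' GRES back in reality, not only in Slurm's accounting. The /tmp/scratch/<jobid> layout below is purely an assumption for illustration; a real Epilog must match wherever your jobs actually write:

```shell
#!/bin/bash
# Hypothetical Epilog sketch: remove the job's scratch directory when the
# job ends. The /tmp/scratch/<jobid> layout is an assumption.
SCRATCH_ROOT=${SCRATCH_ROOT:-/tmp/scratch}

cleanup_job_scratch() {
    jobid=$1
    # Refuse an empty job id: 'rm -rf "$SCRATCH_ROOT/"' would be dangerous.
    [ -n "$jobid" ] || return 1
    rm -rf "${SCRATCH_ROOT:?}/${jobid}"
}

# Slurm exports SLURM_JOB_ID in the Epilog environment.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    cleanup_job_scratch "$SLURM_JOB_ID"
fi
```

Note this only cleans up after the fact; it still does not prevent a running job from exceeding its request.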
Thanks for the clarification.

Controlling (limiting) actual disk space usage for jobs is NOT a requirement. The primary objective is to "hold" jobs in the queue until sufficient free space is available. Jobs are supposed to clean up after completion, completely freeing up the space they used. However, jobs may fail, leaving scratch files behind (useful for debugging). Thus real-time monitoring and updating of free space is also a MUST.

I'll start setting up GRES based on your suggestion. We'll try a STORAGE GRES only (w/o real-time updating) to see how it works out.

Thanks
George
Again, the objective is not to limit the amount of disk space used by jobs, but rather to have the scheduler hold the jobs until adequate free space is available.
(In reply to George Hwa from comment #5)
> Again the objective is not to limit the amount of disk space used by jobs.
> But rather have the scheduler hold the jobs until adequate free space is
> available.

Note that the example I put in comment 3 creates a virtual GRES called 'storage' (it could be called anything), and it does not really track the real disk space. For example:

NodeName=node[000-099] Name=storage Count=2G

job 1: sbatch --gres=storage:1G
job 2: sbatch --gres=storage:1G
job 3: sbatch --gres=storage:1G

Job 3 will remain Pending until job 1 or job 2 finishes. But running job 1 and then job 2 won't ensure the disk space is really available, since we're not actually limiting anything. That means job 1 can use more than 1G in reality, even if Slurm thinks it will only use 1G. As you know, without enforced limits, users can do whatever they want.

I can imagine other, simpler solutions, like setting a quota from the Prolog depending on how the job was submitted (I should investigate exactly how to do that).

Let me know how your tests go.
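To sketch the Prolog-plus-quota idea: the Prolog would first need to discover how much 'storage' the job requested before it could apply a quota. The helper below is hypothetical (the GRES name 'storage' matches the earlier example); the scontrol line and the quota step are illustrative comments only, since the quota mechanism depends entirely on the filesystem:

```shell
#!/bin/bash
# Hypothetical helper for a Prolog that wants to know the job's requested
# 'storage' GRES. The quota step itself (xfs_quota, lfs setquota, ...) is
# filesystem-specific and left as a comment.

# Extract the amount from a GRES string such as "gres:storage:1G".
# Prints nothing if no storage GRES was requested.
parse_storage_request() {
    printf '%s\n' "$1" | grep -oE 'storage:[0-9]+[KMGT]?' | cut -d: -f2
}

# In a real Prolog one might fetch the string with something like:
#   gres=$(scontrol show job "$SLURM_JOB_ID" | grep -oE 'gres:storage:[0-9]+[KMGT]?')
#   limit=$(parse_storage_request "$gres")
# and then apply $limit as a quota for $SLURM_JOB_USER on the scratch fs.
```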
I've been thinking about other alternatives, but none fits your needs. What I think you really need is the GRES setup plus an enforcement mechanism.

This is how memory works: there's the information about how much memory the node has, and Slurm schedules based on that. Then, on the node, cgroup (or the jobacctgather/linux plugin) controls the jobs' memory usage.

Start with the GRES setup; then it may be worth developing the plugin to control disk usage, which could be as simple as setting a quota on a filesystem. If you don't control how much disk is used you may have oversubscription. That's the same as would happen with memory if you disabled the jobacctgather/linux or cgroup plugins.
Hi George,

Have you been able to apply the solution? If everything is clear I will close this bug as INFOGIVEN.
Closing as INFOGIVEN. If you have more questions related to this issue, please mark it as OPEN again.
Hi Felip,

Sorry, it's been a while since I last had a chance to work on this issue.

We do not want to "enforce" the disk space usage, per se. Our usage of and objective for SLURM is probably quite different from the typical one. If a user says he needs 10G of disk space to run a job, we want SLURM to dispatch the job to a node that actually has 10+G of disk space at the time of sending the job to the node. If the job uses more and kills itself, too bad. But subsequent jobs that do require a certain amount of disk space would not be dispatched to that node. We don't necessarily need to "drain" the node, since there are jobs that do not require disk space and should still be able to run on it.

Note that we are trying to manage the "local" disk space usage on compute nodes, where jobs use local disk space for temp files.

So we really just want a mechanism to put the "df /local" output into some kind of GRES, say "localstorage", that SLURM would then use when deciding whether a node has the necessary resources to run a job that requires a certain amount of disk space (if requested). Some jobs require local disk space to run, just as a GPU job requires a GPU resource, while others may not require local disk space at all.

We don't want to "drain" a node when local disk space runs out, since that would require an administrator to "undrain" it. We just want the self-cleaning mechanism to kick in, and after disk space eventually gets freed up, SLURM is free to dispatch jobs to that node again.

We have other legacy clusters that use SGE (Sun Grid Engine), and we've been able to accomplish such a feat quite easily there.

Thanks
George
(In reply to George Hwa from comment #10)
> We do not want to "enforce" the disk space usage, per se. Our
> usage/objective of SLURM is probably quite different from typical of others.
> If a user says he need 10G disk space to run a job, we want SLURM to
> dispatch the job to a node that actually has 10+G disk space at the time of
> sending the job to the node. If the job uses more and kills itself, too bad.
> But subsequent jobs that do require certain amount disk space would not be
> dispatched to that node. We don't necessarily need to "drain" the node since
> there are jobs that do not require disk space and should be able to run on
> that node.
>
> So we really just want to have a mechanism to put the "df /local" output
> into some kind of GRES, say "localstorage" that SLURM would then use when
> deciding whether a node has the necessary resource to run a job that
> requires certain amount of disk space (if requested).
>
> We don't want to "drain" a node while local disk space runs out since that
> would require an administrator to "undrain" it.

Hi George,

I do understand your request. Unfortunately, there's no way to schedule based on real free disk space.
Something similar happens with memory and CPU: Slurm schedules based on the configured and consumed resources, but not on the real-time usage of those resources (which may be affected by processes outside Slurm's control).

I understand SGE implements this by using an h_fsize consumable resource plus a load sensor, as explained here:

https://docs.oracle.com/cd/E19957-01/820-0698/6ncdvjclk/index.html#i1000033

Is that what you were using? We have the "consumable resources" part (GRES for us), but not a "load sensor" which updates this GRES in real time.

This is enhancement territory for us, I'm afraid. Let me know if you want me to discuss internally whether we should consider this as an enhancement.

Thanks
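For reference, the "load sensor" half is sometimes approximated outside Slurm with a cron or health-check script that measures free space and rewrites the node's GRES count via scontrol. Whether 'scontrol update ... Gres=' is accepted depends on your Slurm version and on how the GRES is defined, so treat this as an untested sketch of the idea, not a supported mechanism:

```shell
#!/bin/bash
# Untested "load sensor" sketch: measure real free space on the local
# scratch filesystem, to be pushed into Slurm as this node's 'storage'
# GRES count. The mount point is a placeholder.
SCRATCH=/tmp   # e.g. /local at your site

# Free space of $1 in whole gigabytes, from POSIX df (1K blocks).
free_gb() {
    df -Pk "$1" | awk 'NR==2 {printf "%d\n", $4 / (1024 * 1024)}'
}

# Intended periodic use on each node (illustrative; may be rejected by
# slurmctld depending on configuration):
#   scontrol update NodeName="$(hostname -s)" Gres="storage:$(free_gb "$SCRATCH")G"
```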
Yes, load sensor + consumable is exactly what we are looking for. Is there a way for me to implement my own load sensor (free local disk) and update the consumable GRES periodically?

Also, using real-time resource metrics to complement the configured+requested resource calculation is HIGHLY desirable (I always thought all schedulers, SLURM included, did that by default :). This is a more tolerant way to allow misbehaving jobs to continue while avoiding oversubscribing resources by dispatching more jobs. We do expect applications (user jobs) to observe the resource usage policy. However, there are always cases where unexpected misbehavior happens unintentionally. In production environments, it is more important to let jobs run through, as much as possible, than to enforce very strict policies.

Thanks!
Hi George,

What we are discussing here is including a new feature in Slurm to update a disk consumable resource (GRES) in real time and make the scheduler allocate nodes based on it. I can see several problems with this:

- Real scheduling may become ineffective, because disk space varies over time, and by the time a job is due to run it may need to be postponed.
- An enforcement mechanism would be needed to do this correctly, like cgroups for memory and CPUs. So, quota; but due to the variety of filesystems that is not a trivial matter.
- We would have to identify which filesystem(s) and mountpoints can/must be taken into account.
- Actually, there's no such real-time control with memory or CPUs either; this would be something completely new. I.e., we do not schedule based on measured memory consumption or CPU load.

After discussing this internally, this enhancement won't be possible at the moment. I would suggest looking at the alternatives I proposed, and if they don't fit, you would have to look into another custom, unsupported solution.

Thanks for your understanding, and sorry for the late response. I am closing the issue now.

Regards