Ticket 14241 - RFE: new flag to allocate subset of GRES on exclusive nodes
Summary: RFE: new flag to allocate subset of GRES on exclusive nodes
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 22.05.0
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-06-03 14:47 MDT by Felix Abecassis
Modified: 2023-01-26 04:12 MST
CC: 4 users

See Also:
Site: NVIDIA (PSLA)
Version Fixed: 23.02pre1
Target Release: 23.02
DevPrio: 1 - Paid


Description Felix Abecassis 2022-06-03 14:47:35 MDT
The OverSubscribe=EXCLUSIVE documentation (https://slurm.schedmd.com/slurm.conf.html#OPT_EXCLUSIVE) states the following:
> EXCLUSIVE
>    Allocates entire nodes to jobs even with SelectType=select/cons_res or SelectType=select/cons_tres configured.
>    Jobs that run in partitions with OverSubscribe=EXCLUSIVE will have exclusive access to all allocated nodes.
>    These jobs are allocated all CPUs and GRES on the nodes, but they are only allocated as much memory as they ask for.

We have a use case where jobs allocate nodes exclusively, but we don't want them to be allocated all of the GRES of one custom type. The reason is that enabling this GRES requires extra work, and only a limited number of nodes can use this GRES concurrently on the cluster.
Therefore we don't want to "bill" a user for a GRES that is optionally available on a node but is not in use by the current job, and we want to enforce resource limits such as GrpTRES based on the actual usage.

There is the no_consume flag today, but it allocates 0 GRES, so it doesn't work with accounting limits. For this use case I believe we would need a new flag (it could be called "consume_required") that would be used this way:
NodeName=ioctl Gres=widget:consume_required:10
PartitionName=debug Nodes=ioctl OverSubscribe=EXCLUSIVE

$ srun --gres=widget:4 -p debug # Will consume 4/10 widget GRES
$ srun -p debug # Will consume 0/10 widget GRES

This could perhaps be achieved with licenses, but this is inherently a per-node resource, unlike licenses, which are per-cluster: node A might support 10 widgets while node B supports 20.
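To illustrate the mismatch: a license is a single cluster-wide count declared in slurm.conf, whereas the capacity described above would have to live in each node definition (node names, counts, and the proposed flag name are illustrative only):

Licenses=widget:10

NodeName=nodeA Gres=widget:consume_required:10
NodeName=nodeB Gres=widget:consume_required:20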
Comment 1 Felix Abecassis 2022-06-03 15:15:13 MDT
"consume_requested" might be a better name; I was thinking of `ReqTres`, but "Req" stands for "Requested", not "Required".
Comment 4 Jason Booth 2022-06-08 16:31:14 MDT
Felix, this would be possible; however, it would require sponsorship as paid development by Nvidia. Is this something you are interested in sponsoring?
Comment 5 Tim Wickberg 2022-06-21 14:46:17 MDT
Updating ticket metadata to reflect status as a potential future enhancement.
Comment 32 Tim Wickberg 2023-01-23 16:10:15 MST
Hey Felix -

We're working on wrapping this up, but stumbled on one subtle implementation detail that we wanted to check with you on.

Each Gres can have an (optional) Type field. Common device definitions look like:

Name=gpu Type=k20 File=/dev/nvidia0

Internally, the flags field - where this new "explicit" flag is being added - is mapped to the Gres name, not to each individual (Gres,Type) tuple. This means that, in our current implementation, if you specify any Explicit Gres like:

Name=gpu Type=k20 File=/dev/nvidia0 Flags=Explicit

then the Explicit flag applies not only to the k20 type, but to all Gres=gpu defined on the node. So any further definition like:

Name=gpu Type=h100 File=/dev/nvidia1 

would automatically inherit the "explicit" flag and be treated as such in the configuration. We're hoping that's not an issue for your expected use case, but we wanted to confirm with you in case you have a use for this flag that doesn't match this behavior.
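Concretely, with a gres.conf along these lines (device paths are just examples):

Name=gpu Type=k20 File=/dev/nvidia0 Flags=Explicit
Name=gpu Type=h100 File=/dev/nvidia1

both the k20 and the h100 would behave as Explicit, since the flag is tracked per Gres name rather than per (Gres,Type) tuple, and an exclusive job would only be allocated the GPUs it explicitly requests.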

- Tim
Comment 33 Felix Abecassis 2023-01-23 16:16:32 MST
Thanks for asking, I think that's fine.
Comment 46 Marcin Stolarek 2023-01-26 04:12:21 MST
Felix,

I'm happy to let you know that the requested feature has been merged into our public repository[1] and will be part of the Slurm 23.02 release.

I'll go ahead and mark the ticket as fixed. Should you have any questions, please don't hesitate to reopen.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/75be81090106b9b083698e66e8821f0113af72b1