Ticket 9964 - Add support for managing multiple device files as a single GRES
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 20.11.x
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Tim Wickberg
Reported: 2020-10-08 23:33 MDT by Tim Wickberg
Modified: 2020-10-09 09:54 MDT (History)
Site: SchedMD
Version Fixed: 20.11.0pre1
Target Release: 20.11


Description Tim Wickberg 2020-10-08 23:33:41 MDT
Certain cards, potentially including the A100 running with MIG, will present multiple device files that should be managed as a single GRES. For cgroup device enforcement to work with such cards, a new gres.conf option is needed.

Note that this is different from the current GRES syntax that supports a range of device entries, such as:

Name=gpu Type=k20 File=/dev/nvidia[0-3]

which establishes four separate gpu/k20 GRES on the node, one per device file in the range.

The new syntax instead will look like:

Name=gpu Type=newcard MultipleFiles=/dev/nvidia0,/dev/other-device-entry

which will establish a single gpu/newcard GRES managing the pair of device files.
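
To make the difference between the two syntaxes concrete, here is a minimal Python sketch of how each line maps device files onto GRES entries. It is only an illustration, not Slurm's actual parser (which is written in C): the helper names expand_range() and gres_from_line() are hypothetical, and the sketch assumes a simplified grammar in which File= takes a single path with an optional bracketed range and MultipleFiles= takes a plain comma-separated list.

```python
import re


def expand_range(path):
    """Expand a bracketed range such as /dev/nvidia[0-3] into individual paths."""
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", path)
    if not m:
        return [path]
    prefix, lo, hi, suffix = m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)
    return [f"{prefix}{i}{suffix}" for i in range(lo, hi + 1)]


def gres_from_line(line):
    """Return one list of device files per GRES described by a gres.conf line.

    Hypothetical helper: simplified grammar, illustration only.
    """
    opts = dict(tok.split("=", 1) for tok in line.split())
    if "MultipleFiles" in opts:
        # A single GRES that owns every listed device file.
        return [opts["MultipleFiles"].split(",")]
    # File=: each device file in the range becomes its own GRES.
    return [[f] for f in expand_range(opts["File"])]


file_style = gres_from_line("Name=gpu Type=k20 File=/dev/nvidia[0-3]")
multi_style = gres_from_line(
    "Name=gpu Type=newcard MultipleFiles=/dev/nvidia0,/dev/other-device-entry"
)

print(len(file_style), file_style)
print(len(multi_style), multi_style)
```

With File=, the number of GRES equals the number of device files; with MultipleFiles=, the GRES count stays at one regardless of how many files are listed.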
Comment 1 Tim Wickberg 2020-10-08 23:36:31 MDT
The three core commits to enable this follow. This will be in 20.11 when released:

commit ff4bf3e085e0f8638e1e9cba7e1437665f2cd8c9
Author:     Tim Wickberg <tim@schedmd.com>
AuthorDate: Thu Oct 8 23:34:16 2020 -0600

    gres.conf - add new MultipleFiles configuration option.
    
    Bug 9964.

commit d3d7b7d516d702f74731ea7f5143a0676a7036de
Author:     Tim Wickberg <tim@schedmd.com>
AuthorDate: Thu Oct 8 23:24:26 2020 -0600

    Allow get_devices() to return more gres_device_t entries than we have GRES.
    
    So that we can support multiple device files mapped into a single GRES
    entry.

commit 5fbb2ca90aaec157defd85f485569b94e9c8f61c
Author:     Tim Wickberg <tim@schedmd.com>
AuthorDate: Thu Oct 8 23:21:57 2020 -0600

    Add an index value to gres_device_t.
    
    Needed to add support for managing access to multiple device files
    underneath a single GRES. In such a case the index value will let
    us map the GRES allocated bitmap back to the gres_device_t entries.
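
The role of the index value can be sketched in Python as follows. This is an assumption-laden illustration rather than the Slurm implementation: the GresDevice class and allowed_paths() function are hypothetical stand-ins for gres_device_t and the cgroup device logic, the bitmap is modelled as a plain list of booleans, and /dev/nvidia1 and /dev/another-device-entry are made-up paths for a second GRES.

```python
from dataclasses import dataclass


@dataclass
class GresDevice:
    """Hypothetical stand-in for gres_device_t: one entry per device file."""
    index: int  # position of the owning GRES in the allocation bitmap
    path: str   # device file path


def allowed_paths(devices, allocated_bitmap):
    """Return every device file whose owning GRES is set in the bitmap."""
    return [d.path for d in devices if allocated_bitmap[d.index]]


# Two GRES, each backed by two device files: there are more device
# entries (4) than GRES (2), so the list position of an entry no longer
# identifies its GRES. The index field carries that mapping instead.
devices = [
    GresDevice(0, "/dev/nvidia0"),
    GresDevice(0, "/dev/other-device-entry"),
    GresDevice(1, "/dev/nvidia1"),
    GresDevice(1, "/dev/another-device-entry"),
]

# A job allocated only the second GRES.
granted = allowed_paths(devices, [False, True])
print(granted)
```

Without the index, the second bit of the bitmap would be matched against the second device entry, which here belongs to the first GRES.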