Ticket 9964

Summary: Add support for managing multiple device files as a single GRES
Product: Slurm
Reporter: Tim Wickberg <tim>
Component: GPU
Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
QA Contact:
Severity: 5 - Enhancement
Priority: ---
CC: fabecassis, jbernauer, jess, lyeager
Version: 20.11.x
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9965
Site: SchedMD
Version Fixed: 20.11.0pre1
Target Release: 20.11

Description Tim Wickberg 2020-10-08 23:33:41 MDT
Certain cards, potentially including the A100 running with MIG, will present multiple device files that should be managed as a single GRES. For cgroup device enforcement to work with such cards, a new gres.conf option is needed.

Note that this is different from the current GRES syntax that supports a range of device entries, such as:

Name=gpu Type=k20 File=/dev/nvidia[0-3]

which establishes four separate gpu/k20 GRES on the node, one each for /dev/nvidia0 through /dev/nvidia3.

The new syntax instead will look like:

Name=gpu Type=newcard MultipleFiles=/dev/nvidia0,/dev/other-device-entry

which will establish a single gpu/newcard GRES managing the pair of device files.
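
To make the difference between the two syntaxes concrete, here is a minimal Python sketch. It is not Slurm code: expand_gres_line is a hypothetical helper, it only understands a single [low-high] range in File= and a plain comma-separated list in MultipleFiles=, and it ignores every other gres.conf option. It returns one list of device paths per GRES the line would define.

```python
import re


def expand_gres_line(line):
    """Hypothetical helper: return one list of device paths per GRES."""
    # Split "Key=Value" tokens into a dict; real gres.conf parsing is richer.
    fields = dict(tok.split("=", 1) for tok in line.split())

    if "MultipleFiles" in fields:
        # Every listed file belongs to the same, single GRES.
        return [fields["MultipleFiles"].split(",")]

    spec = fields["File"]
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if m is None:
        # A plain path with no range defines exactly one GRES.
        return [[spec]]

    prefix, lo, hi, suffix = m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)
    # A range defines one GRES per device file, bounds inclusive.
    return [[f"{prefix}{i}{suffix}"] for i in range(lo, hi + 1)]


per_file = expand_gres_line("Name=gpu Type=k20 File=/dev/nvidia[0-3]")
grouped = expand_gres_line(
    "Name=gpu Type=newcard MultipleFiles=/dev/nvidia0,/dev/other-device-entry"
)
print(len(per_file), "GRES from File=")
print(len(grouped), "GRES from MultipleFiles=")
```

Under these assumptions, the File= line yields one single-file GRES per index in the range, while the MultipleFiles= line yields one GRES that owns both device files.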
Comment 1 Tim Wickberg 2020-10-08 23:36:31 MDT
The three core commits to enable this follow. This will be in 20.11 when released:

commit ff4bf3e085e0f8638e1e9cba7e1437665f2cd8c9
Author:     Tim Wickberg <tim@schedmd.com>
AuthorDate: Thu Oct 8 23:34:16 2020 -0600

    gres.conf - add new MultipleFiles configuration option.
    
    Bug 9964.

commit d3d7b7d516d702f74731ea7f5143a0676a7036de
Author:     Tim Wickberg <tim@schedmd.com>
AuthorDate: Thu Oct 8 23:24:26 2020 -0600

    Allow get_devices() to return more gres_device_t entries than we have GRES.
    
    So that we can support multiple device files mapped into a single GRES
    entry.

commit 5fbb2ca90aaec157defd85f485569b94e9c8f61c
Author:     Tim Wickberg <tim@schedmd.com>
AuthorDate: Thu Oct 8 23:21:57 2020 -0600

    Add an index value to gres_device_t.
    
    Needed to add support for managing access to multiple device files
    underneath a single GRES. In such a case the index value will let
    us map the GRES allocated bitmap back to the gres_device_t entries.
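
The role of the index value can be illustrated with a short Python sketch. This is an assumption-laden model, not the actual gres_device_t layout or Slurm's bitmap type: GresDevice and allowed_paths are hypothetical names, the allocated bitmap is modelled as a list of booleans with one entry per GRES, and /dev/nvidia1 is an invented second GRES added only so that the bitmap has more than one bit.

```python
from dataclasses import dataclass


@dataclass
class GresDevice:
    """Hypothetical stand-in for a device entry; not Slurm's real struct."""
    index: int  # which GRES (bit position) this device file belongs to
    path: str


def allowed_paths(devices, allocated):
    """Return the device files whose owning GRES is set in the bitmap."""
    return [d.path for d in devices if allocated[d.index]]


# Three device entries but only two GRES: entries 0 and 1 share index 0.
devices = [
    GresDevice(0, "/dev/nvidia0"),
    GresDevice(0, "/dev/other-device-entry"),
    GresDevice(1, "/dev/nvidia1"),  # assumed second GRES, for illustration
]

allocated = [True, False]  # the job was allocated GRES 0 only
granted = allowed_paths(devices, allocated)
print(granted)
```

Because both files of the first GRES carry the same index, a single allocated bit is enough to grant access to the pair, which is the mapping the commit message describes.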