Ticket 16188 - cgroup device list not properly enforced with MIG
Summary: cgroup device list not properly enforced with MIG
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 22.05.8
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Ben Glines
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-03-03 15:38 MST by Kilian Cavalotti
Modified: 2023-03-07 12:30 MST

See Also:
Site: Stanford
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Kilian Cavalotti 2023-03-03 15:38:28 MST
Hi SchedMD,

For MIG-enabled GPUs, it looks like the ConstrainDevices cgroup enforcement only does part of the job :)

More precisely, it correctly allows access to the allocated MIG device and denies access to the other ones, but only for the MIG instances located on the same GPU as the allocated one. Access to all the other GPUs on the node is still allowed.

Here's an illustration:

* gres.conf is just "AutoDetect=nvml", i.e. the entire file is the single line below:
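-- 8< -----------------------------------------------------------
# gres.conf (entire contents)
AutoDetect=nvml
-- 8< -----------------------------------------------------------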

* compute node has 8 GPUs, each partitioned into 4 MIG instances:
-- 8< -----------------------------------------------------------
# nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-4f7d29a7-23c1-9347-446d-799ab0b6325c)
  MIG 1g.6gb      Device  0: (UUID: MIG-e347bd88-9df4-5c04-a3d2-6a46bcabace3)
  MIG 1g.6gb      Device  1: (UUID: MIG-d4f18614-934f-59b0-bb3d-1e4d707eeb16)
  MIG 1g.6gb      Device  2: (UUID: MIG-73bd67d3-52df-586a-b463-e1d9bd663366)
  MIG 1g.6gb      Device  3: (UUID: MIG-3a5959ec-0d62-52e2-aaa0-e6b188cb8a52)
GPU 1: NVIDIA A30 (UUID: GPU-fa2bf1a3-113a-af3a-03b9-f0789fcc9f09)
  MIG 1g.6gb      Device  0: (UUID: MIG-4befcd31-a551-5e88-8233-9816ed5ffd8d)
  MIG 1g.6gb      Device  1: (UUID: MIG-68cdb6a8-18d3-5526-a43f-73a162c8b710)
  MIG 1g.6gb      Device  2: (UUID: MIG-4feac1d5-fdbe-5fa9-9cf2-0ff39f37d1fb)
  MIG 1g.6gb      Device  3: (UUID: MIG-bd5efd99-d811-553c-b5af-a48e207f18cc)
GPU 2: NVIDIA A30 (UUID: GPU-c1447669-520e-4e8e-c332-aa2090a85c77)
  MIG 1g.6gb      Device  0: (UUID: MIG-193ff5e0-7b97-58a4-9d1d-5f09e7f95ad2)
  MIG 1g.6gb      Device  1: (UUID: MIG-88655199-0699-52e3-aeb6-fccd792af898)
  MIG 1g.6gb      Device  2: (UUID: MIG-6faea62a-d14c-5bd5-8679-ea89f58b4570)
  MIG 1g.6gb      Device  3: (UUID: MIG-2d0f6fd5-59d6-51d8-8bcf-bccd298be4ba)
GPU 3: NVIDIA A30 (UUID: GPU-890e9696-cd25-935d-9795-11e238c4151e)
  MIG 1g.6gb      Device  0: (UUID: MIG-c033c49e-d78d-5792-8260-9fa99b3c15bf)
  MIG 1g.6gb      Device  1: (UUID: MIG-3a11955a-f8eb-5928-a4bc-cfcd1a3b0c24)
  MIG 1g.6gb      Device  2: (UUID: MIG-42c6e8e5-25db-5425-a1d4-60ddf54058cc)
  MIG 1g.6gb      Device  3: (UUID: MIG-6687ff99-9bc8-5a5e-9cf4-9041ddacb25b)
GPU 4: NVIDIA A30 (UUID: GPU-ac772b5a-123a-dc76-9480-5998f435fe84)
  MIG 1g.6gb      Device  0: (UUID: MIG-87e5d835-8046-594a-b237-ccc770b868ef)
  MIG 1g.6gb      Device  1: (UUID: MIG-caffac87-b4d9-5617-a0ff-f12aae1d0104)
  MIG 1g.6gb      Device  2: (UUID: MIG-4588ffa0-bf9b-5692-b0ef-876a6c136442)
  MIG 1g.6gb      Device  3: (UUID: MIG-9a70d283-e956-5db7-a3c2-3671a3e4d326)
GPU 5: NVIDIA A30 (UUID: GPU-02f2db56-6e23-11e7-a621-639f2668d474)
  MIG 1g.6gb      Device  0: (UUID: MIG-1818d0d2-d0a7-5239-b03a-3eb755aa36d4)
  MIG 1g.6gb      Device  1: (UUID: MIG-3c6baef7-e7e0-5204-ab47-20a9906bef31)
  MIG 1g.6gb      Device  2: (UUID: MIG-b3a4648d-713f-5dbd-9fd9-8187df4cc63c)
  MIG 1g.6gb      Device  3: (UUID: MIG-46950239-a5ca-57ae-9f28-6ad15392aa5a)
GPU 6: NVIDIA A30 (UUID: GPU-ec9dcdfd-4063-08bf-e4bd-33233822a900)
  MIG 1g.6gb      Device  0: (UUID: MIG-b867d322-9ae9-5cf1-9802-03974629d99f)
  MIG 1g.6gb      Device  1: (UUID: MIG-18780e41-9a19-5093-94d0-b6ff56dbaa01)
  MIG 1g.6gb      Device  2: (UUID: MIG-4d7b88f7-6a49-5993-9913-269a1231d303)
  MIG 1g.6gb      Device  3: (UUID: MIG-e205d6ce-7881-536b-b043-bc2704049fdf)
GPU 7: NVIDIA A30 (UUID: GPU-afcc1fc7-b089-30a4-02cd-d0ca1d30d47d)
  MIG 1g.6gb      Device  0: (UUID: MIG-41a8f143-32e9-5e38-87c2-66c3b003049a)
  MIG 1g.6gb      Device  1: (UUID: MIG-e86f9207-0d01-5c9b-86a3-52b898dfcc97)
  MIG 1g.6gb      Device  2: (UUID: MIG-1e3010f6-d86a-5626-bd6c-fa95a81fb3ec)
  MIG 1g.6gb      Device  3: (UUID: MIG-3bf76b84-d6fe-5288-84f8-836234687af2)
-- 8< -----------------------------------------------------------


When submitting a job that requests 1 GPU:
- CUDA_VISIBLE_DEVICES is correctly set to the MIG instance allocated to the job
- access to the other MIG instances on the same GPU is denied (only 1 MIG instance is visible on the GPU where the allocated one is, here GPU 0)
- access to all the other GPUs (and their respective MIG instances) is still allowed:
-- 8< -----------------------------------------------------------
$ salloc -w sh03-17n15 -p test -G 1
...
$ echo $CUDA_VISIBLE_DEVICES
MIG-e347bd88-9df4-5c04-a3d2-6a46bcabace3

$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-4f7d29a7-23c1-9347-446d-799ab0b6325c)
  MIG 1g.6gb      Device  0: (UUID: MIG-e347bd88-9df4-5c04-a3d2-6a46bcabace3)
GPU 1: NVIDIA A30 (UUID: GPU-c1447669-520e-4e8e-c332-aa2090a85c77)
  MIG 1g.6gb      Device  0: (UUID: MIG-193ff5e0-7b97-58a4-9d1d-5f09e7f95ad2)
  MIG 1g.6gb      Device  1: (UUID: MIG-88655199-0699-52e3-aeb6-fccd792af898)
  MIG 1g.6gb      Device  2: (UUID: MIG-6faea62a-d14c-5bd5-8679-ea89f58b4570)
  MIG 1g.6gb      Device  3: (UUID: MIG-2d0f6fd5-59d6-51d8-8bcf-bccd298be4ba)
GPU 2: NVIDIA A30 (UUID: GPU-890e9696-cd25-935d-9795-11e238c4151e)
  MIG 1g.6gb      Device  0: (UUID: MIG-c033c49e-d78d-5792-8260-9fa99b3c15bf)
  MIG 1g.6gb      Device  1: (UUID: MIG-3a11955a-f8eb-5928-a4bc-cfcd1a3b0c24)
  MIG 1g.6gb      Device  2: (UUID: MIG-42c6e8e5-25db-5425-a1d4-60ddf54058cc)
  MIG 1g.6gb      Device  3: (UUID: MIG-6687ff99-9bc8-5a5e-9cf4-9041ddacb25b)
GPU 3: NVIDIA A30 (UUID: GPU-ac772b5a-123a-dc76-9480-5998f435fe84)
  MIG 1g.6gb      Device  0: (UUID: MIG-87e5d835-8046-594a-b237-ccc770b868ef)
  MIG 1g.6gb      Device  1: (UUID: MIG-caffac87-b4d9-5617-a0ff-f12aae1d0104)
  MIG 1g.6gb      Device  2: (UUID: MIG-4588ffa0-bf9b-5692-b0ef-876a6c136442)
  MIG 1g.6gb      Device  3: (UUID: MIG-9a70d283-e956-5db7-a3c2-3671a3e4d326)
GPU 4: NVIDIA A30 (UUID: GPU-02f2db56-6e23-11e7-a621-639f2668d474)
  MIG 1g.6gb      Device  0: (UUID: MIG-1818d0d2-d0a7-5239-b03a-3eb755aa36d4)
  MIG 1g.6gb      Device  1: (UUID: MIG-3c6baef7-e7e0-5204-ab47-20a9906bef31)
  MIG 1g.6gb      Device  2: (UUID: MIG-b3a4648d-713f-5dbd-9fd9-8187df4cc63c)
  MIG 1g.6gb      Device  3: (UUID: MIG-46950239-a5ca-57ae-9f28-6ad15392aa5a)
GPU 5: NVIDIA A30 (UUID: GPU-ec9dcdfd-4063-08bf-e4bd-33233822a900)
  MIG 1g.6gb      Device  0: (UUID: MIG-b867d322-9ae9-5cf1-9802-03974629d99f)
  MIG 1g.6gb      Device  1: (UUID: MIG-18780e41-9a19-5093-94d0-b6ff56dbaa01)
  MIG 1g.6gb      Device  2: (UUID: MIG-4d7b88f7-6a49-5993-9913-269a1231d303)
  MIG 1g.6gb      Device  3: (UUID: MIG-e205d6ce-7881-536b-b043-bc2704049fdf)
GPU 6: NVIDIA A30 (UUID: GPU-afcc1fc7-b089-30a4-02cd-d0ca1d30d47d)
  MIG 1g.6gb      Device  0: (UUID: MIG-41a8f143-32e9-5e38-87c2-66c3b003049a)
  MIG 1g.6gb      Device  1: (UUID: MIG-e86f9207-0d01-5c9b-86a3-52b898dfcc97)
  MIG 1g.6gb      Device  2: (UUID: MIG-1e3010f6-d86a-5626-bd6c-fa95a81fb3ec)
  MIG 1g.6gb      Device  3: (UUID: MIG-3bf76b84-d6fe-5288-84f8-836234687af2)
-- 8< -----------------------------------------------------------

Here's the relevant portion of the slurmd debug logs:
-- 8< -----------------------------------------------------------
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _memcg_initialize: job: alloc=8000MB mem.limit=8000MB memsw.limit=8000MB job_swappiness=18446744073709551614
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _memcg_initialize: step: alloc=8000MB mem.limit=8000MB memsw.limit=8000MB job_swappiness=18446744073709551614
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.allow: adding c 195:0 rwm(/dev/nvidia0)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.allow: adding c 237:30 rwm(/dev/nvidia-caps/nvidia-cap30)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.allow: adding c 237:31 rwm(/dev/nvidia-caps/nvidia-cap31)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 195:1 rwm(/dev/nvidia1)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:165 rwm(/dev/nvidia-caps/nvidia-cap165)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:166 rwm(/dev/nvidia-caps/nvidia-cap166)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:39 rwm(/dev/nvidia-caps/nvidia-cap39)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:40 rwm(/dev/nvidia-caps/nvidia-cap40)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:174 rwm(/dev/nvidia-caps/nvidia-cap174)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:175 rwm(/dev/nvidia-caps/nvidia-cap175)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:48 rwm(/dev/nvidia-caps/nvidia-cap48)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:49 rwm(/dev/nvidia-caps/nvidia-cap49)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:183 rwm(/dev/nvidia-caps/nvidia-cap183)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:184 rwm(/dev/nvidia-caps/nvidia-cap184)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:57 rwm(/dev/nvidia-caps/nvidia-cap57)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:58 rwm(/dev/nvidia-caps/nvidia-cap58)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:192 rwm(/dev/nvidia-caps/nvidia-cap192)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: task/cgroup: _handle_device_access: GRES: job devices.deny: adding c 237:193 rwm(/dev/nvidia-caps/nvidia-cap193)
Mar 03 14:11:47 sh03-17n15.int slurmstepd[5164]: debug:  cgroup/v1: _oom_event_monitor: started.
-- 8< -----------------------------------------------------------

But there are many more devices on the node:
-- 8< -----------------------------------------------------------
# find /dev/nvidia-caps/ -type c | wc -l
66
-- 8< -----------------------------------------------------------

I guess the code that sets up the devices cgroup will also need to consider all the other GPUs present on the node, and deny access to those as well?

Thanks!
--
Kilian
Comment 2 Ben Glines 2023-03-07 09:56:30 MST
Hi Kilian,

Could you reply with an updated slurm.conf?
Comment 3 Kilian Cavalotti 2023-03-07 10:01:36 MST
Hi Ben,

(In reply to Ben Glines from comment #2)
> Could you reply with an updated slurm.conf?

Can I send you our slurm.conf directly? Attaching it here would require me to make this ticket private, and that would make it unavailable to other sites that may be interested.

Or I can attach here the specific parts you're interested in?

Thanks!
--
Kilian
Comment 5 Ben Glines 2023-03-07 10:03:41 MST
(In reply to Kilian Cavalotti from comment #3)
> Hi Ben,
> 
> (In reply to Ben Glines from comment #2)
> > Could you reply with an updated slurm.conf?
> 
> Can I send you our slurm.conf directly? Attaching it here would require me
> to make that ticket private, and that would make it unavailable to other
> sites that may be interested.
> 
> Or I can attach here the specific parts you're interested in?
> 
> Thanks!
> --
> KIlian

I'm mostly just interested in your node definition for sh03-17n15, specifically your Gres specification.
Comment 6 Kilian Cavalotti 2023-03-07 10:06:24 MST
(In reply to Ben Glines from comment #5)
> I'm mostly just interested in your node definition for sh03-17n15,
> specifically your Gres specification.

That I can do, no problem. Here it is:
-- 8< ------------------------------------------
# SH3_G8FP64m | MLN | 32c | 256GB | 8x A30
NodeName=sh03-17n[14-15] \
    Sockets=2 CoresPerSocket=16 \
    RealMemory=256000 \
    Gres=gpu:8 \
    Weight=164431 \
    Feature="IB:HDR,CPU_MNF:AMD,CPU_GEN:MLN,CPU_SKU:7543P,CPU_FRQ:2.75GHz,GPU_GEN:AMP,GPU_BRD:TESLA,GPU_SKU:A30,GPU_MEM:24GB,GPU_CC:8.0,CLASS:SH3_G8FP64m"
-- 8< ------------------------------------------

and the partition definition:
-- 8< ------------------------------------------
PartitionName=test \
    DefMemPerCPU=8000 \
    DefCPUPerGpu=1 \
    AllowGroups=sh_sysadm \
    PriorityTier=10000 \
    PriorityJobFactor=10000 \
    Nodes=sh02-01n[59-60],sh03-01n[71-72],sh03-17n[14-15]
-- 8< ------------------------------------------

Thanks!
--
Kilian
Comment 7 Ben Glines 2023-03-07 10:11:16 MST
Looks like your issue might be that you only specified 8 GPUs here.

Each MIG instance is really treated as its own GPU, as mentioned in our docs: https://slurm.schedmd.com/gres.html#MIG_Management

The example there also shows what a node definition looks like with a GPU that has no MIGs configured and a GPU with 2 MIGs configured.

Try changing your Gres specification to the number of MIGs you have and let me know if that fixes things.
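
For example, based on the node definition in comment 6 (a sketch only; with 8 A30s carrying 4 MIG instances each, that would be 32 GRES of type gpu):
-- 8< -----------------------------------------------------------
# sketch: same node definition, with Gres counting MIG instances (8 GPUs x 4 MIGs = 32)
NodeName=sh03-17n[14-15] \
    Sockets=2 CoresPerSocket=16 \
    RealMemory=256000 \
    Gres=gpu:32 \
    Weight=164431 \
    Feature="IB:HDR,CPU_MNF:AMD,CPU_GEN:MLN,CPU_SKU:7543P,CPU_FRQ:2.75GHz,GPU_GEN:AMP,GPU_BRD:TESLA,GPU_SKU:A30,GPU_MEM:24GB,GPU_CC:8.0,CLASS:SH3_G8FP64m"
-- 8< -----------------------------------------------------------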
Comment 8 Kilian Cavalotti 2023-03-07 11:57:51 MST
(In reply to Ben Glines from comment #7)
> Looks like your issue might be that you only specified 8 gpus here.
> 
> Each MIG instance is really treated as its own GPU instance as mentioned in
> our docs: https://slurm.schedmd.com/gres.html#MIG_Management
> 
> The example there also gives insight into how a node definition would look
> like with a GPU without any MIGs, and a GPU with 2 MIGs configured.
> 
> Try changing your Gres specification to the number of MIGs you have and let
> me know if that fixes things.

Ooh, I totally missed that! Changing the node definition to have `Gres=gpu:32` indeed fixes the problem. 

I somehow assumed that the AutoDetect=nvml part in gres.conf would have taken care of enumerating all the existing GPUs on the node, and forgot about the Gres option in the node definition.

Looking back at the documentation, it actually all makes sense, except maybe for this sentence, which I'm not exactly sure what it means:
"""
The sanity-check AutoDetect mode is not supported for MIGs. 
"""

Thanks!
--
Kilian
Comment 9 Ben Glines 2023-03-07 12:12:26 MST
(In reply to Kilian Cavalotti from comment #8)
> Looking back at the documentation, it actually all makes sense. Except maybe
> this: I'm just not exactly sure what it means?
> """
> The sanity-check AutoDetect mode is not supported for MIGs. 
> """

From https://slurm.schedmd.com/gres.html#AutoDetect:
> By default, all system-detected devices are added to the node. However, if Type and 
> File in gres.conf match a GPU on the system, any other properties explicitly specified 
> (e.g. Cores or Links) can be double-checked against it. If the system-detected GPU 
> differs from its matching GPU configuration, then the GPU is omitted from the node with 
> an error. This allows gres.conf to serve as an optional sanity check and notifies 
> administrators of any unexpected changes in GPU properties.

You can use gres.conf to make sure that what NVML detects is what you actually expect, e.g. if NVML does not detect a GPU that you configured in gres.conf, then slurmd for that node will fatal(). This sort of double-checking/sanity-checking does not work for MIGs.
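
For illustration only (hypothetical node name, device files, and core ranges, not taken from your config), a gres.conf along these lines is what that sanity-check mode refers to: the explicitly listed Type/File/Cores entries get cross-checked against what NVML detects, but that cross-check is not available for MIG devices:
-- 8< -----------------------------------------------------------
# hypothetical example: AutoDetect plus explicit entries acting as a sanity check
AutoDetect=nvml
NodeName=node01 Name=gpu Type=a30 File=/dev/nvidia0 Cores=0-15
NodeName=node01 Name=gpu Type=a30 File=/dev/nvidia1 Cores=16-31
-- 8< -----------------------------------------------------------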
Comment 10 Kilian Cavalotti 2023-03-07 12:30:01 MST
(In reply to Ben Glines from comment #9)
> You can use gres.conf to make sure that what NVML detects is what you
> actually expect, e.g. if NVML does not detect a GPU that you configured in
> gres.conf, then slurmd for that node will fatal(). This sort of
> double-checking/sanity-checking does not work for MIGs.

Got it, thanks for the explanation!

Cheers,
--
Kilian