| Summary: | Fun with cgroups and CUDA | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | Limits | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.03.10 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Stanford | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 14.11.4 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | Fix for CUDA env vars in Slurm v14.03 | ||
Description
Kilian Cavalotti
2015-02-02 10:07:40 MST
Comment 1
Moe Jette

You can define specific file names in gres.conf and Slurm will use those numbers for the CUDA device numbering. For example:

Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

will cause Slurm to use CUDA devices 2 and 3, not 0 and 1.

What is in your gres.conf file and what are the GPU device names?

Comment 2
Kilian Cavalotti

Hi Moe,

(In reply to Moe Jette from comment #1)
> You can define specific file names in gres.conf and Slurm will use those
> numbers for the CUDA device numbering. For example:
> Name=gpu File=/dev/nvidia2
> Name=gpu File=/dev/nvidia3
> will cause Slurm to use CUDA devices 2 and 3, not 0 and 1.

Well, I want to be able to use all the GPUs, but enforce cgroup permissions so that users can only use the ones allocated to their jobs. Let me explain with an example:

0. node1 is a 4-GPU node, currently idle, with the cgroup devices subsystem enabled

1. userA submits a 2-GPU job on node1:
   - Slurm sets up a cgroup and allows access to /dev/nvidia0 and /dev/nvidia1
   - Slurm sets CUDA_VISIBLE_DEVICES=0,1
   - nvidia-smi -L returns info about GPU 0 and GPU 1
   - deviceQuery can access the 2 GPUs, and reports 2 GPUs available

2. userB submits a 2-GPU job on node1:
   - Slurm sets up a cgroup and allows access to /dev/nvidia2 and /dev/nvidia3
   - Slurm sets CUDA_VISIBLE_DEVICES=2,3
   - nvidia-smi -L returns info about /dev/nvidia2 and /dev/nvidia3, but lists them as 0 and 1 (since they are the first available in that context)
   - deviceQuery tries to access GPUs 2 and 3, as per CUDA_VISIBLE_DEVICES, but fails, because only GPU 0 and GPU 1 exist in its context, so it reports 0 usable devices.

The problem is the way CUDA enumerates the GPUs: it always considers that the first GPU it can access is GPU 0.

> What is in your gres.conf file and what are the GPU device names?
It looks like this (4 GPUs in 16-CPU nodes):

# 4-GPU nodes
NodeName=gpu-9-[6-9] Name=gpu File=/dev/nvidia0 CPUs=[0-7]
NodeName=gpu-9-[6-9] Name=gpu File=/dev/nvidia1 CPUs=[0-7]
NodeName=gpu-9-[6-9] Name=gpu File=/dev/nvidia2 CPUs=[8-15]
NodeName=gpu-9-[6-9] Name=gpu File=/dev/nvidia3 CPUs=[8-15]

Comment 3
David Bigagli

Hi,
Could you set CUDA_VISIBLE_DEVICES based on what nvidia-smi returns before invoking deviceQuery?

David

Comment 4
Kilian Cavalotti

(In reply to David Bigagli from comment #3)
> Hi,
> Could you set CUDA_VISIBLE_DEVICES based on what nvidia-smi returns
> before invoking deviceQuery?

Yes, I could manually override CUDA_VISIBLE_DEVICES:
- either set it to 0,1,...,N-1, where N is the number of requested GPUs, since it will always be the same for CUDA, which assumes the first GPU it can access is always 0, the second always 1, and so on;
- or unset it altogether, since the only GPUs reported by nvidia-smi will be those allowed in the cgroup, so CUDA_VISIBLE_DEVICES doesn't really make sense anymore.

The point of my "question" was to make sure that you're aware that CUDA 7 introduces a change of behavior which, when used in conjunction with cgroups, may make the current Slurm feature of automatically setting CUDA_VISIBLE_DEVICES less suitable for jobs.

I guess ideally, CUDA_VISIBLE_DEVICES shouldn't be automatically set when using the cgroup devices subsystem *and* CUDA 7, but I'm not sure that's doable at the Slurm level. One can certainly set CUDA_VISIBLE_DEVICES manually at the beginning of a job script, but that will probably be worth at least a note in the documentation, since it would be overriding the scheduler's setting.

Comment 5
David Bigagli

Yes indeed, it is very good to know and worth investigating. I was just in my hacker state of mind :-)

David

Comment 6
Moe Jette

Created attachment 1604 [details]
Fix for CUDA env vars in Slurm v14.03
This patch should fix the problem. It was built against our version 14.03 code branch. I do not expect that we will have any more releases of version 14.03, but this will be in our next release, version 14.11.4. If you have an opportunity to test this, I would appreciate confirmation that the fix works.
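As an aside, the manual workaround discussed in comments #3 and #4 — rebuilding CUDA_VISIBLE_DEVICES from the GPUs the job can actually see — could be sketched in a job script roughly like this. This is only an illustrative sketch, and it assumes ConstrainDevices=yes so that nvidia-smi lists only the job's allocated GPUs:

```shell
# Build "0,1,...,N-1" for N visible GPUs: with CUDA >= 7 and the cgroup
# devices subsystem, the GPUs a job can see are always renumbered from 0.
renumber_devices() {
    n=$1
    [ "$n" -gt 0 ] && seq -s, 0 $(( n - 1 ))
}

# In a real job script, N would come from counting the devices the
# cgroup exposes, e.g.:
#   export CUDA_VISIBLE_DEVICES=$(renumber_devices "$(nvidia-smi -L | wc -l)")
renumber_devices 2   # prints 0,1
```

This only papers over the enumeration mismatch from inside the job; the real fix belongs in the scheduler, as discussed below in the thread.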
Comment 7
Kilian Cavalotti

Hi Moe,

(In reply to Moe Jette from comment #6)
> Created attachment 1604 [details]
> Fix for CUDA env vars in Slurm v14.03
>
> This patch should fix the problem. It was built against our version 14.03
> code branch. I do not expect that we will have any more releases of version
> 14.03, but this will be in our next release of version 14.11.4. If you have
> an opportunity to test this, I would appreciate confirmation that this fix
> works.

Thanks for the patch! Let me make sure I understand how this works: it does indeed seem to fix the problem when using cgroups, by setting CUDA_VISIBLE_DEVICES to 0,...,N-1 (N being the number of requested GPUs) whatever the actually allocated GPUs are, is that right?

What I've seen:

* jobA:
-- 8< --------------------------------------------------------------------------
$ srun --gres gpu:2 --pty bash
[gpu ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
$ nvidia-smi -L
GPU 0: GeForce GTX TITAN Black (UUID: GPU-c2b0abac-8798-0fc5-94a9-9f15836fe1b1)
GPU 1: GeForce GTX TITAN Black (UUID: GPU-5e5ac012-7a28-7831-aa5f-00843872f01c)
-- 8< --------------------------------------------------------------------------

* jobB:
-- 8< --------------------------------------------------------------------------
$ srun --gres gpu:2 --pty bash
[gpu ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
$ nvidia-smi -L
GPU 0: GeForce GTX TITAN Black (UUID: GPU-73c0ea2c-b5d2-dc2c-d9f7-4a06c0f78f91)
GPU 1: GeForce GTX TITAN Black (UUID: GPU-6485856f-8031-d3a1-2273-da68b6639308)
-- 8< --------------------------------------------------------------------------

So CUDA_VISIBLE_DEVICES=0,1 in both cases, which matches what nvidia-smi reports, so that's good.

But now, if you disable the cgroup devices subsystem (ConstrainDevices=no in cgroup.conf), it breaks compatibility: it still sets CUDA_VISIBLE_DEVICES=0,1 in both cases, except that GPU 0 and GPU 1 are now the same devices in both jobs.

* jobA with ConstrainDevices=no:
-- 8< --------------------------------------------------------------------------
$ srun --gres gpu:2 --pty bash
[gpu ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
$ nvidia-smi -L
GPU 0: GeForce GTX TITAN Black (UUID: GPU-c2b0abac-8798-0fc5-94a9-9f15836fe1b1)
GPU 1: GeForce GTX TITAN Black (UUID: GPU-5e5ac012-7a28-7831-aa5f-00843872f01c)
GPU 2: GeForce GTX TITAN Black (UUID: GPU-73c0ea2c-b5d2-dc2c-d9f7-4a06c0f78f91)
GPU 3: GeForce GTX TITAN Black (UUID: GPU-6485856f-8031-d3a1-2273-da68b6639308)
-- 8< --------------------------------------------------------------------------

* jobB with ConstrainDevices=no:
-- 8< --------------------------------------------------------------------------
$ srun --gres gpu:2 --pty bash
[gpu ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
$ nvidia-smi -L
GPU 0: GeForce GTX TITAN Black (UUID: GPU-c2b0abac-8798-0fc5-94a9-9f15836fe1b1)
GPU 1: GeForce GTX TITAN Black (UUID: GPU-5e5ac012-7a28-7831-aa5f-00843872f01c)
GPU 2: GeForce GTX TITAN Black (UUID: GPU-73c0ea2c-b5d2-dc2c-d9f7-4a06c0f78f91)
GPU 3: GeForce GTX TITAN Black (UUID: GPU-6485856f-8031-d3a1-2273-da68b6639308)
-- 8< --------------------------------------------------------------------------

So now both jobs will actually use the same two GPUs, which is not good.

I guess there's some sort of combination matrix required here, to illustrate what the scheduler needs to set. For instance, for jobB, that would be:

CUDA version | ConstrainDevices=no      | ConstrainDevices=yes
-------------+--------------------------+-------------------------
< 7.0        | CUDA_VISIBLE_DEVICES=2,3 | breaks CUDA
>= 7.0       | CUDA_VISIBLE_DEVICES=2,3 | CUDA_VISIBLE_DEVICES=0,1 (or not set)

I hope this makes some sense.

Comment 8
Moe Jette

(In reply to Kilian Cavalotti from comment #7)
> I guess there's some sort of combination matrix required here, to illustrate
> what the scheduler needs to set. For instance, for jobB, that would be:
>
> CUDA version | ConstrainDevices=no      | ConstrainDevices=yes
> -------------+--------------------------+-------------------------
> < 7.0        | CUDA_VISIBLE_DEVICES=2,3 | breaks CUDA
> >= 7.0       | CUDA_VISIBLE_DEVICES=2,3 | CUDA_VISIBLE_DEVICES=0,1 (or not set)
>
> I hope this makes some sense.

Thanks. That makes good sense; unfortunately it will require moving around a fair bit of code so that the ConstrainDevices information is available from the gres plugin. The changes are quite straightforward, but it's not as simple as the patch I attached earlier today.

Comment 9
Kilian Cavalotti

> Thanks. That makes good sense, unfortunately it will require moving around a
> fair bit of code so that the ConstrainDevices information is available from
> the gres plugin. The changes are quite straightforward, but it's not as
> simple as the patch I attached earlier today.
I understand and I really appreciate the effort.
In the meantime, I also reported this to NVIDIA, hoping that they would be willing to either change, or at least introduce a new, absolute GPU numbering scheme, that would match the /dev/nvidiaX indices.
Thanks!
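The combination matrix above can be summarized as a small decision rule: with ConstrainDevices=yes (and CUDA >= 7), expose 0..N-1; otherwise, expose the real device indices. A minimal sketch, with an illustrative function name that is not part of Slurm's actual code:

```shell
# cuda_visible_devices CONSTRAIN GPU [GPU...]
# CONSTRAIN is "yes" or "no" (mirroring ConstrainDevices in cgroup.conf);
# the remaining arguments are the global device indices allocated to the job.
cuda_visible_devices() {
    constrain=$1; shift
    if [ "$constrain" = "yes" ]; then
        # The device cgroup hides all other GPUs, so the CUDA runtime
        # renumbers the visible ones from 0: expose 0..N-1.
        seq -s, 0 $(( $# - 1 ))
    else
        # No cgroup confinement: all GPUs are visible, so the real
        # device indices must be used.
        echo "$*" | tr ' ' ','
    fi
}

# jobB from the matrix: allocated /dev/nvidia2 and /dev/nvidia3
cuda_visible_devices yes 2 3   # prints 0,1
cuda_visible_devices no  2 3   # prints 2,3
```

This is the behavior the eventual fix implements by making the ConstrainDevices setting visible to the gres plugin.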
Comment 10
Moe Jette

This should be fixed in v14.11.4 when released, probably in a few days.

Move the cgroup.conf read logic into a module that can be used from the gres/gpu plugin:
https://github.com/SchedMD/slurm/commit/c6b13b0ee4e60368c14aa8f0f819281b8c3605e2

Control setting of CUDA_VISIBLE_DEVICES:
https://github.com/SchedMD/slurm/commit/da2fba48e3042f0ed89d58ce8abb00ea1f6a9323

Comment 11
Kilian Cavalotti

(In reply to Moe Jette from comment #10)
> This should be fixed in v14.11.4 when released, probably in a few days.
>
> Move the cgroup.conf read logic into a module that can be used from the gres/gpu
> plugin:
> https://github.com/SchedMD/slurm/commit/c6b13b0ee4e60368c14aa8f0f819281b8c3605e2
>
> Control setting of CUDA_VISIBLE_DEVICES:
> https://github.com/SchedMD/slurm/commit/da2fba48e3042f0ed89d58ce8abb00ea1f6a9323

That looks awesome, thank you. I'll give it a try when 14.11.4 is released.

Thanks again!

Comment 12
Moe Jette

Fixed in v14.11.4, which is currently available.