| Summary: | Fun with cgroups and CUDA | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | Limits | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.03.10 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Stanford | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 14.11.4 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | Fix for CUDA env vars in Slurm v14.03 | ||
Description
Kilian Cavalotti
2015-02-02 10:07:40 MST
Comment 1
Moe Jette

You can define specific file names in gres.conf and Slurm will use those numbers for the CUDA device numbering. For example:

Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

will cause Slurm to use CUDA devices 2 and 3, not 0 and 1.

What is in your gres.conf file and what are the GPU device names?

Comment 2
Kilian Cavalotti

Hi Moe,

(In reply to Moe Jette from comment #1)
> You can define specific file names in gres.conf and Slurm will use those
> numbers for the CUDA device numbering. For example:
> Name=gpu File=/dev/nvidia2
> Name=gpu File=/dev/nvidia3
> will cause Slurm to use CUDA devices 2 and 3, not 0 and 1.

Well, I want to be able to use all the GPUs, but enforce cgroup permissions so that users can only use the ones allocated to their jobs. Let me explain with an example:

0. node1 is a 4-GPU node, currently idle, with the cgroup devices subsystem enabled

1. userA submits a 2-GPU job on node1:
   - Slurm sets up a cgroup and allows access to /dev/nvidia0 and /dev/nvidia1
   - Slurm sets CUDA_VISIBLE_DEVICES=0,1
   - nvidia-smi -L returns info about GPU 0 and GPU 1
   - deviceQuery can access the 2 GPUs, and reports 2 GPUs available

2. userB submits a 2-GPU job on node1:
   - Slurm sets up a cgroup and allows access to /dev/nvidia2 and /dev/nvidia3
   - Slurm sets CUDA_VISIBLE_DEVICES=2,3
   - nvidia-smi -L returns info about /dev/nvidia2 and /dev/nvidia3, but lists them as 0 and 1 (since they are the first available in that context)
   - deviceQuery tries to access GPUs 2 and 3, as per CUDA_VISIBLE_DEVICES, but fails, because only GPU 0 and GPU 1 exist in its context, so it reports 0 usable devices.

The problem is the way CUDA enumerates the GPUs: it always considers that the first GPU it can access is GPU 0.

> What is in your gres.conf file and what are the GPU device names?
It looks like this (4 GPUs in 16-CPU nodes):

# 4-GPU nodes
NodeName=gpu-9-[6-9] Name=gpu File=/dev/nvidia0 CPUs=[0-7]
NodeName=gpu-9-[6-9] Name=gpu File=/dev/nvidia1 CPUs=[0-7]
NodeName=gpu-9-[6-9] Name=gpu File=/dev/nvidia2 CPUs=[8-15]
NodeName=gpu-9-[6-9] Name=gpu File=/dev/nvidia3 CPUs=[8-15]

Comment 3
David Bigagli

Hi,
Could you set CUDA_VISIBLE_DEVICES based on what nvidia-smi returns before invoking deviceQuery?

David

Comment 4
Kilian Cavalotti

(In reply to David Bigagli from comment #3)
> Hi,
> Could you set CUDA_VISIBLE_DEVICES based on what nvidia-smi returns
> before invoking deviceQuery?

Yes, I could manually override CUDA_VISIBLE_DEVICES:
- either set it to 0,1,...,N-1, where N is the number of requested GPUs, since it will always be the same for CUDA, which assumes the first GPU it can access is always 0, the second always 1, and so on;
- or unset it altogether, since the only GPUs reported by nvidia-smi will be those allowed in the cgroup, so CUDA_VISIBLE_DEVICES doesn't really make sense anymore.

The point of my "question" was to make sure that you're aware that CUDA 7 introduces a change of behavior which, when used in conjunction with cgroups, may make the current Slurm feature of automatically setting CUDA_VISIBLE_DEVICES less suitable for jobs.

I guess ideally, CUDA_VISIBLE_DEVICES shouldn't be automatically set when using the cgroup devices subsystem *and* CUDA 7, but I'm not sure that's doable at the Slurm level. One can certainly set CUDA_VISIBLE_DEVICES manually at the beginning of a job script, but that will probably be worth at least a note in the documentation, since it would be overriding the scheduler's setting.

Comment 5
David Bigagli

Yes indeed, it is very good to know and worth investigating. I was just in my hacker state of mind :-)

David

Comment 6
Moe Jette

Created attachment 1604 [details]
Fix for CUDA env vars in Slurm v14.03
This patch should fix the problem. It was built against our version 14.03 code branch. I do not expect that we will have any more releases of version 14.03, but this will be in our next release, version 14.11.4. If you have an opportunity to test this, I would appreciate confirmation that the fix works.
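As an aside, the manual workaround discussed in comments #3 and #4 — rebuilding CUDA_VISIBLE_DEVICES from the GPUs the job can actually see — could be sketched in a job script roughly like this. This is only an illustrative sketch, and it assumes ConstrainDevices=yes so that nvidia-smi lists only the job's allocated GPUs:

```shell
# Build "0,1,...,N-1" for N visible GPUs: with CUDA >= 7 and the cgroup
# devices subsystem, the GPUs a job can see are always renumbered from 0.
renumber_devices() {
    n=$1
    [ "$n" -gt 0 ] && seq -s, 0 $(( n - 1 ))
}

# In a real job script, N would come from counting the devices the
# cgroup exposes, e.g.:
#   export CUDA_VISIBLE_DEVICES=$(renumber_devices "$(nvidia-smi -L | wc -l)")
renumber_devices 2   # prints 0,1
```

This only papers over the enumeration mismatch from inside the job; the real fix belongs in the scheduler, as discussed below in the thread.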
Comment 7
Kilian Cavalotti

Hi Moe,

(In reply to Moe Jette from comment #6)
> Created attachment 1604 [details]
> Fix for CUDA env vars in Slurm v14.03
>
> This patch should fix the problem. It was built against our version 14.03
> code branch. I do not expect that we will have any more releases of version
> 14.03, but this will be in our next release of version 14.11.4. If you have
> an opportunity to test this, I would appreciate confirmation that this fix
> works.

Thanks for the patch! Let me make sure I understand how this works: it does indeed seem to fix the problem when using cgroups, by setting CUDA_VISIBLE_DEVICES to 0,...,N-1 (N being the number of requested GPUs) whatever the actually allocated GPUs are, is that right?

What I've seen:

* jobA:
-- 8< --------------------------------------------------------------------------
$ srun --gres gpu:2 --pty bash
[gpu ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
$ nvidia-smi -L
GPU 0: GeForce GTX TITAN Black (UUID: GPU-c2b0abac-8798-0fc5-94a9-9f15836fe1b1)
GPU 1: GeForce GTX TITAN Black (UUID: GPU-5e5ac012-7a28-7831-aa5f-00843872f01c)
-- 8< --------------------------------------------------------------------------

* jobB:
-- 8< --------------------------------------------------------------------------
$ srun --gres gpu:2 --pty bash
[gpu ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
$ nvidia-smi -L
GPU 0: GeForce GTX TITAN Black (UUID: GPU-73c0ea2c-b5d2-dc2c-d9f7-4a06c0f78f91)
GPU 1: GeForce GTX TITAN Black (UUID: GPU-6485856f-8031-d3a1-2273-da68b6639308)
-- 8< --------------------------------------------------------------------------

So CUDA_VISIBLE_DEVICES=0,1 in both cases, which matches what nvidia-smi reports, so that's good.

But now, if you disable the cgroup devices subsystem (ConstrainDevices=no in cgroup.conf), it breaks compatibility: it still sets CUDA_VISIBLE_DEVICES=0,1 in both cases, except that GPU 0 and GPU 1 are now the same devices in both jobs.

* jobA with ConstrainDevices=no:
-- 8< --------------------------------------------------------------------------
$ srun --gres gpu:2 --pty bash
[gpu ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
$ nvidia-smi -L
GPU 0: GeForce GTX TITAN Black (UUID: GPU-c2b0abac-8798-0fc5-94a9-9f15836fe1b1)
GPU 1: GeForce GTX TITAN Black (UUID: GPU-5e5ac012-7a28-7831-aa5f-00843872f01c)
GPU 2: GeForce GTX TITAN Black (UUID: GPU-73c0ea2c-b5d2-dc2c-d9f7-4a06c0f78f91)
GPU 3: GeForce GTX TITAN Black (UUID: GPU-6485856f-8031-d3a1-2273-da68b6639308)
-- 8< --------------------------------------------------------------------------

* jobB with ConstrainDevices=no:
-- 8< --------------------------------------------------------------------------
$ srun --gres gpu:2 --pty bash
[gpu ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
$ nvidia-smi -L
GPU 0: GeForce GTX TITAN Black (UUID: GPU-c2b0abac-8798-0fc5-94a9-9f15836fe1b1)
GPU 1: GeForce GTX TITAN Black (UUID: GPU-5e5ac012-7a28-7831-aa5f-00843872f01c)
GPU 2: GeForce GTX TITAN Black (UUID: GPU-73c0ea2c-b5d2-dc2c-d9f7-4a06c0f78f91)
GPU 3: GeForce GTX TITAN Black (UUID: GPU-6485856f-8031-d3a1-2273-da68b6639308)
-- 8< --------------------------------------------------------------------------

So now both jobs will actually use the same two GPUs, which is not good.

I guess there's some sort of combination matrix required here, to illustrate what the scheduler needs to set. For instance, for jobB, that would be:

CUDA version | ConstrainDevices=no      | ConstrainDevices=yes
-------------+--------------------------+-------------------------
< 7.0        | CUDA_VISIBLE_DEVICES=2,3 | breaks CUDA
>= 7.0       | CUDA_VISIBLE_DEVICES=2,3 | CUDA_VISIBLE_DEVICES=0,1 (or not set)

I hope this makes some sense.

Comment 8
Moe Jette

(In reply to Kilian Cavalotti from comment #7)
> I guess there's some sort of combination matrix required here, to illustrate
> what the scheduler needs to set. For instance, for jobB, that would be:
>
> CUDA version | ConstrainDevices=no      | ConstrainDevices=yes
> -------------+--------------------------+-------------------------
> < 7.0        | CUDA_VISIBLE_DEVICES=2,3 | breaks CUDA
> >= 7.0       | CUDA_VISIBLE_DEVICES=2,3 | CUDA_VISIBLE_DEVICES=0,1 (or not set)
>
> I hope this makes some sense.

Thanks. That makes good sense; unfortunately it will require moving around a fair bit of code so that the ConstrainDevices information is available from the gres plugin. The changes are quite straightforward, but it's not as simple as the patch I attached earlier today.

Comment 9
Kilian Cavalotti

> Thanks. That makes good sense, unfortunately it will require moving around a
> fair bit of code so that the ConstrainDevices information is available from
> the gres plugin. The changes are quite straightforward, but it's not as
> simple as the patch I attached earlier today.
I understand and I really appreciate the effort.
In the meantime, I also reported this to NVIDIA, hoping that they would be willing to either change, or at least introduce a new, absolute GPU numbering scheme, that would match the /dev/nvidiaX indices.
Thanks!
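The combination matrix above can be summarized as a small decision rule: with ConstrainDevices=yes (and CUDA >= 7), expose 0..N-1; otherwise, expose the real device indices. A minimal sketch, with an illustrative function name that is not part of Slurm's actual code:

```shell
# cuda_visible_devices CONSTRAIN GPU [GPU...]
# CONSTRAIN is "yes" or "no" (mirroring ConstrainDevices in cgroup.conf);
# the remaining arguments are the global device indices allocated to the job.
cuda_visible_devices() {
    constrain=$1; shift
    if [ "$constrain" = "yes" ]; then
        # The device cgroup hides all other GPUs, so the CUDA runtime
        # renumbers the visible ones from 0: expose 0..N-1.
        seq -s, 0 $(( $# - 1 ))
    else
        # No cgroup confinement: all GPUs are visible, so the real
        # device indices must be used.
        echo "$*" | tr ' ' ','
    fi
}

# jobB from the matrix: allocated /dev/nvidia2 and /dev/nvidia3
cuda_visible_devices yes 2 3   # prints 0,1
cuda_visible_devices no  2 3   # prints 2,3
```

This is the behavior the eventual fix implements by making the ConstrainDevices setting visible to the gres plugin.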
Comment 10
Moe Jette

This should be fixed in v14.11.4 when released, probably in a few days.

Move the cgroup.conf read logic into a module that can be used from the gres/gpu plugin:
https://github.com/SchedMD/slurm/commit/c6b13b0ee4e60368c14aa8f0f819281b8c3605e2

Control setting of CUDA_VISIBLE_DEVICES:
https://github.com/SchedMD/slurm/commit/da2fba48e3042f0ed89d58ce8abb00ea1f6a9323

Comment 11
Kilian Cavalotti

(In reply to Moe Jette from comment #10)
> This should be fixed in v14.11.4 when released, probably in a few days.
>
> Move the cgroup.conf read logic into a module that can be used from the gres/gpu
> plugin:
> https://github.com/SchedMD/slurm/commit/c6b13b0ee4e60368c14aa8f0f819281b8c3605e2
>
> Control setting of CUDA_VISIBLE_DEVICES:
> https://github.com/SchedMD/slurm/commit/da2fba48e3042f0ed89d58ce8abb00ea1f6a9323

That looks awesome, thank you. I'll give it a try when 14.11.4 is released.

Thanks again!

Comment 12
Moe Jette

Fixed in v14.11.4, which is currently available.