| Summary: | Control of GPU visibility variables | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Matt Ezell <ezellma> |
| Component: | GPU | Assignee: | Director of Support <support> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | CC: | tim |
| Priority: | --- | Version: | 20.11.4 |
| Hardware: | Linux | OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8596 | | |
| Site: | ORNL-OLCF | | |
| Version Fixed: | 21.08.0 | Target Release: | future |
| Attachments: | Disable setting CVD, v1, v2 | | |
Description
Matt Ezell
2021-03-15 19:46:12 MDT
Slurm currently sets CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, and GPU_DEVICE_ORDINAL for tasks that request GPUs. We've run into an issue where this causes ROCm-based applications to fail.

A simplified example (without Slurm even in the picture) is:

bash-4.4$ ./simple_hip
We have 4 devices
bash-4.4$ ROCR_VISIBLE_DEVICES=1 ./simple_hip
We have 1 devices
bash-4.4$ CUDA_VISIBLE_DEVICES=1 ./simple_hip
We have 1 devices
bash-4.4$ ROCR_VISIBLE_DEVICES=1 CUDA_VISIBLE_DEVICES=1 ./simple_hip
We have 0 devices
Segmentation fault (core dumped)

I opened a bug with AMD, but they claim that this situation (specifying both CVD and RVD) is illegal. It seems that the ROCm runtime processes RVD first, and then CVD. So RVD=1 selects the second GPU only, but then CVD=1 tries to select the second (remaining) GPU, which doesn't exist.

I'm still working with AMD to understand why the current behavior is what it is, but I wanted to go ahead and document that Slurm currently isn't 100% compatible with AMD GPUs and see what path we have (from the Slurm-only side) to make progress.

Does it make sense to add controls for which environment variables get set (CUDA, ROCR, both, neither)? If modern ROCm works with CUDA_VISIBLE_DEVICES, maybe just stop setting ROCR_VISIBLE_DEVICES?

Michael Hinton

Hi Matt,

(In reply to Matt Ezell from comment #0)
> Slurm currently sets CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, and
> GPU_DEVICE_ORDINAL for tasks that request GPUs. We've run into an issue
> where this causes ROCm-based applications to fail.
>
> A simplified example (without Slurm even in the picture) is:
>
> bash-4.4$ ./simple_hip
> We have 4 devices
> bash-4.4$ ROCR_VISIBLE_DEVICES=1 ./simple_hip
> We have 1 devices
> bash-4.4$ CUDA_VISIBLE_DEVICES=1 ./simple_hip
> We have 1 devices
> bash-4.4$ ROCR_VISIBLE_DEVICES=1 CUDA_VISIBLE_DEVICES=1 ./simple_hip
> We have 0 devices
> Segmentation fault (core dumped)
>
> I opened a bug with AMD, but they claim that this situation (specifying both
> CVD and RVD) is illegal. It seems that the ROCm runtime processes RVD first,
> and then CVD. So RVD=1 selects the second GPU only, but then CVD=1 tries to
> select the second (remaining) GPU, which doesn't exist.
>
> I'm still working with AMD to understand why the current behavior is what it
> is, but I wanted to go ahead and document that Slurm currently isn't 100%
> compatible with AMD GPUs and see what path we have (from the Slurm-only
> side) to make progress.

Good to know! And also that's bizarre that it works like that. Thanks for reporting this.

> Does it make sense to add controls for which environment variables get set
> (CUDA, ROCR, both, neither)? If modern ROCm works with CUDA_VISIBLE_DEVICES,
> maybe just stop setting ROCR_VISIBLE_DEVICES?

Could I send you a patch to try out the second option? Do you have any more information about what AMD GPUs are affected by this issue (e.g. when did AMD GPUs start looking at CUDA_VISIBLE_DEVICES)? Removing ROCR_VISIBLE_DEVICES might work in your case, but if it doesn't work in all cases, we might be forced to go with the first option and add in a control for which env vars get set...

Thanks,
-Michael

Matt Ezell

(In reply to Michael Hinton from comment #1)
> Could I send you a patch to try out the second option? Do you have any more
> information about what AMD GPUs are affected by this issue (e.g. when did
> AMD GPUs start looking at CUDA_VISIBLE_DEVICES)? Removing
> ROCR_VISIBLE_DEVICES might work in your case, but if it doesn't work in all
> cases, we might be forced to go with the first option and add in a control
> for which env vars get set...

Our current workaround is to srun a wrapper script that just unsets one of the environment variables:

$ cat wrap.sh
#!/bin/bash
unset CUDA_VISIBLE_DEVICES
exec $*

Is the patch just to comment out https://github.com/SchedMD/slurm/blob/4c1ccec1aed42701c893c97f3bf386852dc073c1/src/plugins/gres/gpu/gres_gpu.c#L139 ?

Curiously enough, based on an inspection of the code, it seems that ROCR_VISIBLE_DEVICES is not set for the epilog.

I think this is all happening in the ROCr runtime, so all AMD GPUs are likely impacted the same based on ROCm version.

It appears that CUDA_VISIBLE_DEVICES is actually a synonym for HIP_VISIBLE_DEVICES (not ROCR_VISIBLE_DEVICES) and only impacts HIP programs. git blame shows this has been there at least since 2016.

Michael Hinton

(In reply to Matt Ezell from comment #2)
> Is the patch just to comment out
> https://github.com/SchedMD/slurm/blob/4c1ccec1aed42701c893c97f3bf386852dc073c1/src/plugins/gres/gpu/gres_gpu.c#L139 ?

Yes :)

> Curiously enough, based on an inspection of the code, it seems that
> ROCR_VISIBLE_DEVICES is not set for the epilog.

This seemed familiar to me, and it turns out that it is actually a known issue that I submitted a patch for a while back, but it has yet to be reviewed. I will see if I can get this prioritized for you.

> I think this is all happening in the ROCr runtime, so all AMD GPUs are
> likely impacted the same based on ROCm version.
>
> It appears that CUDA_VISIBLE_DEVICES is actually a synonym for
> HIP_VISIBLE_DEVICES (not ROCR_VISIBLE_DEVICES) and only impacts HIP
> programs. git blame shows this has been there at least since 2016.

Is this the ROCm code? Do you have a reference to the code you are referring to?

-Michael

Matt Ezell

(In reply to Michael Hinton from comment #3)
> This seemed familiar to me, and it turns out that it is actually a known
> issue that I submitted a patch for a while back, but it has yet to be
> reviewed. I will see if I can get this prioritized for you.

We allocate nodes exclusively, so anything we do in the epilog is for all devices - so not necessarily a priority for us, just something I noticed.

> Is this the ROCm code? Do you have a reference to the code you are referring
> to?

The HIP layer reading {HIP,CUDA}_VISIBLE_DEVICES is at https://github.com/ROCm-Developer-Tools/HIP/blob/2080cc113a2d767352b512b9d24c0620b6dee790/src/hip_hcc.cpp#L1354 with the actual work done in HIP_VISIBLE_DEVICES_callback.

The ROCr code reads the environment variable at https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/36b56afdd88f235c3ee28c86aa9075597ab3ea4c/src/core/util/flag.h#L76 with much of the actual work happening in https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/36b56afdd88f235c3ee28c86aa9075597ab3ea4c/src/core/runtime/amd_filter_device.cpp

Matt Ezell

AMD says "Setting both RVD and CVD is typically unnecessary and may be harmful." They explained a hypothetical situation where a code might need to use RVD and CVD simultaneously, with them set to different values. So I don't expect the behavior to change in future versions of ROCm.

They also said that they can't guarantee that CVD will work for all programming models and use cases, so we need RVD set.

Unfortunately it seems that we will need some way to control which env vars get set.

Michael Hinton

(In reply to Matt Ezell from comment #5)
> AMD says "Setting both RVD and CVD is typically unnecessary and may be
> harmful." They explained a hypothetical situation where a code might need
> to use RVD and CVD simultaneously, with them set to different values. So I
> don't expect the behavior to change in future versions of ROCm.
>
> They also said that they can't guarantee that CVD will work for all
> programming models and use cases, so we need RVD set.
>
> Unfortunately it seems that we will need some way to control which env vars
> get set.

Ok, that is very good to know. I'll go ahead and mark this as an enhancement, and we'll discuss this internally. Thanks!

-Michael

Michael Hinton

Matt,

It's my understanding that GPU_DEVICE_ORDINAL is for AMD OpenCL, which is superseded by AMD ROCm and ROCR_VISIBLE_DEVICES for AMD GPUs. Is this correct? Is there anything tricky that we should be aware of with GPU_DEVICE_ORDINAL vs. ROCR_VISIBLE_DEVICES?

Matt Ezell

(In reply to Michael Hinton from comment #8)
> It's my understanding that GPU_DEVICE_ORDINAL is for AMD OpenCL, which is
> superseded by AMD ROCm and ROCR_VISIBLE_DEVICES for AMD GPUs. Is this
> correct? Is there anything tricky that we should be aware of with
> GPU_DEVICE_ORDINAL vs. ROCR_VISIBLE_DEVICES?
Yes, same behavior for an OpenCL program:

bash-4.4$ srun -n1 /opt/rocm-4.0.1/opencl/bin/clinfo |grep "Number of devices"
  Number of devices: 4
bash-4.4$ srun -n1 /bin/bash -c 'ROCR_VISIBLE_DEVICES=1 /opt/rocm-4.0.1/opencl/bin/clinfo' |grep "Number of devices"
  Number of devices: 1
bash-4.4$ srun -n1 /bin/bash -c 'ROCR_VISIBLE_DEVICES=1 GPU_DEVICE_ORDINAL=1 /opt/rocm-4.0.1/opencl/bin/clinfo' |grep "Number of devices"
  Number of devices: 0

Matt Ezell

Created attachment 19041 [details]
Disable setting CVD
This is the patch we are running on Spock, our EA system. Obviously it's not suitable to be merged upstream.
Has there been any discussion that you can share about the "right" way to address this?
What if we added a field to gres.conf called EnvVars? gpu/nvml and gpu/rsmi could set it to the correct value if AutoDetect were in use. If unset, it could default to the current behavior (set all three variables that are currently set).
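For illustration only, a gres.conf entry under this proposal might look like the line below. EnvVars is the proposed (not yet existing) field, the value follows the plugin-name convention used in the v1 patch later in this thread, and the node name, GPU type, and device files are hypothetical:

# Hypothetical AMD node: only set the ROCm runtime variable, not CUDA_VISIBLE_DEVICES
NodeName=amd[01-04] Name=gpu Type=mi100 File=/dev/dri/renderD[128-131] EnvVars=rsmi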
Matt Ezell

(In reply to Matt Ezell from comment #10)
> Has there been any discussion that you can share about the "right" way to
> address this?
>
> What if we added a field to gres.conf called EnvVars? gpu/nvml and gpu/rsmi
> could set it to the correct value if AutoDetect were in use. If unset, it
> could default to the current behavior (set all three variables that are
> currently set).

I'd love to avoid having to patch this for the entire 21.08 release cycle. Has there been any work on this that is under review?

Michael Hinton

Hi Matt,

Unfortunately, this won't make it into the 21.08 release, and there is nothing actively under review. We've already frozen the feature addition process and are working hard to get existing sponsored feature requests in. We'll have to pick this back up after 21.08 is released.

Thanks,
-Michael

Matt Ezell

(In reply to Michael Hinton from comment #15)
> Unfortunately, this won't make it into the 21.08 release, and there is
> nothing actively under review. We've already frozen the feature addition
> process and are working hard to get existing sponsored feature requests in.
> We'll have to pick this back up after 21.08 is released.

Then it's probably worth adding something to the docs that multi-AMD-GPU nodes are effectively unusable/unsupported in 21.08.

Michael Hinton

(In reply to Matt Ezell from comment #16)
> Then it's probably worth adding something to the docs that multi-AMD-GPU
> nodes are effectively unusable/unsupported in 21.08.

I 100% agree that we should document this limitation, since we are currently not planning on fixing it in 21.08, and since it is broken in the prior releases as well.

I do like your approach in comment 10, and we were discussing that internally, but then other things for this release took priority. Normally, adding a new parameter to gres.conf would be a change that probably has to wait until the next major release (22.05), but if it is backwards-compatible, and if it doesn't change anything at the RPC layer (e.g. if we only add a flag instead of an entire new field), you may be able to convince Tim to commit it to a 21.08 minor release. It *does* seem to be more of a bug fix than an enhancement, and there isn't really a great workaround.

-Michael

Michael Hinton

Created attachment 20547 [details]
v1
Surprise! Here is a v1 that I think will do the trick. Matt, could you test it out and see if it works for you while Tim reviews this?
We are going to try to sneak this into 21.08, especially since it doesn't mess with any RPCs.
----------------------------
Here's how I tested this patch. My machine has a single NVIDIA GPU that gets picked up with AutoDetect:
slurm.conf (abbreviated)
**************
SelectType=select/cons_tres
SelectTypeParameters=cr_core_memory
GresTypes=gpu
AccountingStorageTRES=gres/gpu,gres/gpu:rtx
NodeName=DEFAULT Gres=gpu:rtx:4
NodeName=test[1-2] NodeAddr=localhost Port=21082-21083
gres.conf: (Only uncomment one test at a time)
**************
# 1) Default (all envs set)
NodeName=test1 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2
NodeName=test1 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5
NodeName=test2 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2
NodeName=test2 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5
# # 2) No envs set
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2 EnvVars=none
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=none
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2 EnvVars=none
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=bogus
# # 3) RSMI on test1, NVML on test2
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2 EnvVars=rsmi
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=rsmi
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2 EnvVars=nvml
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=nvml
# # 4) Test AutoDetect and set a combo of envs
# Autodetect=nvml
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty1 Cores=0-2 EnvVars=rsmi
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty1 Cores=0-2 EnvVars=opencl,rsmi
# NodeName=test[1-2] Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=nvml
Commands
******************
srun --gpus=1 -w test1 env | grep -i "cuda\|rocr\|ordinal"
srun --gpus=1 -w test2 env | grep -i "cuda\|rocr\|ordinal"
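Assuming the task is allocated the first GPU (the exact index depends on which device is assigned), the expected output for test case 1, where all three variables are set, would look something like:

CUDA_VISIBLE_DEVICES=0
GPU_DEVICE_ORDINAL=0
ROCR_VISIBLE_DEVICES=0

For test case 2 (EnvVars=none or an unrecognized value), the same grep should print nothing.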
Overview
************
If EnvVars is not set at all, then it will default to the current behavior (set all three of CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, and ROCR_VISIBLE_DEVICES). If any of `nvml`, `rsmi`, or `opencl` are specified in EnvVars, then the corresponding envs will be set. If EnvVars is set but does not contain any of `nvml`, `rsmi`, or `opencl`, then no envs will be set (so you can do "none" or "bogus" to stop setting all envs).
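Spelled out, the intended mapping between EnvVars keywords and environment variables is:

nvml                      -> CUDA_VISIBLE_DEVICES
rsmi                      -> ROCR_VISIBLE_DEVICES
opencl                    -> GPU_DEVICE_ORDINAL
none (or any other value) -> no GPU environment variables
EnvVars not specified     -> all three, as today

(This is the intended behavior; see the v2 review below, where the rsmi and opencl cases were found to be swapped.)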
EnvVars applies to the node the GPUs are on, not to the individual GPUs themselves. Slurm applies EnvVars cumulatively across a node's GRES lines.
Slurm already basically assumes that nodes will not have both AMD and NVIDIA GPUs. EnvVars adheres to that assumption.
I'll work on a v2 that includes documentation and testing.
Thanks,
-Michael
Michael Hinton

Oh, I should note that v1 also supports AutoDetect, so that should fully work.

Matt Ezell

(In reply to Michael Hinton from comment #20)
> Surprise! Here is a v1 that I think will do the trick. Matt, could you test
> it out and see if it works for you while Tim reviews this?

Awesome news! We have a downtime tomorrow for software upgrades, so I'll try to target this for Wednesday. Thanks so much!

Michael Hinton

Created attachment 20559 [details]
v2
Recent changes on master required a rebase of v1, so here is v2. Hopefully it applies cleanly and works - another build issue on master is preventing me from confirming that. Let me know if you run into problems. Thanks!
Matt Ezell

(In reply to Michael Hinton from comment #24)
> Created attachment 20559 [details]
> v2
>
> Recent changes on master required a rebase of v1, so here is v2. Hopefully
> it applies cleanly and works - another build issue on master is preventing
> me from confirming that. Let me know if you run into problems. Thanks!

_set_env appears to have rsmi and opencl backwards:

@@ -138,12 +148,15 @@ static void _set_env(char ***env_ptr, bitstr_t *gres_bit_alloc,
 	}
 
 	if (local_list) {
-		env_array_overwrite(
-			env_ptr, "CUDA_VISIBLE_DEVICES", local_list);
-		env_array_overwrite(
-			env_ptr, "GPU_DEVICE_ORDINAL", local_list);
-		env_array_overwrite(
-			env_ptr, "ROCR_VISIBLE_DEVICES", local_list);
+		if (node_flags & GRES_CONF_ENV_NVML)
+			env_array_overwrite(env_ptr, "CUDA_VISIBLE_DEVICES",
+					    local_list);
+		if (node_flags & GRES_CONF_ENV_RSMI)
+			env_array_overwrite(env_ptr, "GPU_DEVICE_ORDINAL",
+					    local_list);
+		if (node_flags & GRES_CONF_ENV_OPENCL)
+			env_array_overwrite(env_ptr, "ROCR_VISIBLE_DEVICES",
+					    local_list);
 		xfree(local_list);
 		*already_seen = true;
 	}

If I set it to rsmi, I get GPU_DEVICE_ORDINAL, and if I set it to opencl I get ROCR_VISIBLE_DEVICES. Other than that, it seems to work for me. Thanks!

Note: I tested master on a VM, without a GPU, so I didn't test autodetect. We aren't currently using autodetect on our cluster.

Michael Hinton

Hey Matt, this is now in 21.08 rc1 with commits https://github.com/SchedMD/slurm/compare/0705ded00d4e...0c7fb08c4e7d.

We made some changes compared to v2: instead of a dedicated EnvVars field, we simply added on to the preexisting Flags field in gres.conf:

Flags
    Optional flags that can be specified to change configured behavior of the GRES. Allowed values at present are:
    ...
    nvidia_gpu_env    Set environment variable CUDA_VISIBLE_DEVICES for all GPUs on the specified node(s).
    amd_gpu_env       Set environment variable ROCR_VISIBLE_DEVICES for all GPUs on the specified node(s).
    opencl_env        Set environment variable GPU_DEVICE_ORDINAL for all GPUs on the specified node(s).
    no_gpu_env        Set no GPU-specific environment variables.

We also added ROCR_VISIBLE_DEVICES to the prolog/epilog, like CUDA_VISIBLE_DEVICES.

We are planning on tweaking things before 21.08 is released, but you can go ahead and play around with this.

Thanks,
Michael

Matt Ezell

My testing against 21.08.0rc1 shows the correct environment variable being set on our cluster (I'm using Flags=amd_gpu_env). Thanks!

Michael Hinton

Matt, we've added some follow-up commits here: https://github.com/SchedMD/slurm/compare/6cb01bd89a92...5ddec8654a07. The most important commit was https://github.com/SchedMD/slurm/commit/6b890ed5b70c. This made it so env flags propagate to subsequent GRES lines, and so Flags will override AutoDetected values.

I'll go ahead and close this out. Thanks!

-Michael
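As a usage sketch of the released syntax (the node name, GPU type, and device files below are illustrative only, not taken from this report), a gres.conf entry for an AMD GPU node might look like:

NodeName=amd[01-04] Name=gpu Type=mi100 File=/dev/dri/renderD[128-131] Flags=amd_gpu_env

With amd_gpu_env, only ROCR_VISIBLE_DEVICES is set for jobs on those nodes, which avoids the CUDA_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES conflict described at the top of this report.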