| Summary: | Control of GPU visibility variables | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Matt Ezell <ezellma> |
| Component: | GPU | Assignee: | Director of Support <support> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | CC: | tim |
| Priority: | --- | Version: | 20.11.4 |
| Hardware: | Linux | OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8596 | | |
| Site: | ORNL-OLCF | | |
| Version Fixed: | 21.08.0 | Target Release: | future |
| Attachments: | Disable setting CVD, v1, v2 | | |
Description
Matt Ezell
2021-03-15 19:46:12 MDT
Slurm currently sets CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, and GPU_DEVICE_ORDINAL for tasks that request GPUs. We've run into an issue where this causes ROCm-based applications to fail.

A simplified example (without Slurm even in the picture) is:

bash-4.4$ ./simple_hip
We have 4 devices
bash-4.4$ ROCR_VISIBLE_DEVICES=1 ./simple_hip
We have 1 devices
bash-4.4$ CUDA_VISIBLE_DEVICES=1 ./simple_hip
We have 1 devices
bash-4.4$ ROCR_VISIBLE_DEVICES=1 CUDA_VISIBLE_DEVICES=1 ./simple_hip
We have 0 devices
Segmentation fault (core dumped)

I opened a bug with AMD, but they claim that this situation (specifying both CVD and RVD) is illegal. It seems that the ROCm runtime processes RVD first, and then CVD. So RVD=1 selects the second GPU only, but then CVD=1 tries to select the second (remaining) GPU, which doesn't exist.

I'm still working with AMD to understand why the current behavior is what it is, but I wanted to go ahead and document that Slurm currently isn't 100% compatible with AMD GPUs and see what path we have (from the Slurm-only side) to make progress.

Does it make sense to add controls for which environment variables get set (CUDA, ROCR, both, neither)? If modern ROCm works with CUDA_VISIBLE_DEVICES, maybe just stop setting ROCR_VISIBLE_DEVICES?

Michael Hinton

Hi Matt,

(In reply to Matt Ezell from comment #0)
> Slurm currently sets CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, and
> GPU_DEVICE_ORDINAL for tasks that request GPUs. We've run into an issue
> where this causes ROCm-based applications to fail.
>
> A simplified example (without Slurm even in the picture) is:
>
> bash-4.4$ ./simple_hip
> We have 4 devices
> bash-4.4$ ROCR_VISIBLE_DEVICES=1 ./simple_hip
> We have 1 devices
> bash-4.4$ CUDA_VISIBLE_DEVICES=1 ./simple_hip
> We have 1 devices
> bash-4.4$ ROCR_VISIBLE_DEVICES=1 CUDA_VISIBLE_DEVICES=1 ./simple_hip
> We have 0 devices
> Segmentation fault (core dumped)
>
> I opened a bug with AMD, but they claim that this situation (specifying both
> CVD and RVD) is illegal. It seems that the ROCm runtime processes RVD first,
> and then CVD. So RVD=1 selects the second GPU only, but then CVD=1 tries to
> select the second (remaining) GPU, which doesn't exist.
>
> I'm still working with AMD to understand why the current behavior is what it
> is, but I wanted to go ahead and document that Slurm currently isn't 100%
> compatible with AMD GPUs and see what path we have (from the Slurm-only
> side) to make progress.

Good to know! And also that's bizarre that it works like that. Thanks for reporting this.

> Does it make sense to add controls for which environment variables get set
> (CUDA, ROCR, both, neither)? If modern ROCm works with CUDA_VISIBLE_DEVICES,
> maybe just stop setting ROCR_VISIBLE_DEVICES?

Could I send you a patch to try out the second option? Do you have any more information about what AMD GPUs are affected by this issue (e.g. when did AMD GPUs start looking at CUDA_VISIBLE_DEVICES)? Removing ROCR_VISIBLE_DEVICES might work in your case, but if it doesn't work in all cases, we might be forced to go with the first option and add in a control for which env vars get set...

Thanks,
-Michael

Matt Ezell

(In reply to Michael Hinton from comment #1)
> Could I send you a patch to try out the second option? Do you have any more
> information about what AMD GPUs are affected by this issue (e.g. when did
> AMD GPUs start looking at CUDA_VISIBLE_DEVICES)? Removing
> ROCR_VISIBLE_DEVICES might work in your case, but if it doesn't work in all
> cases, we might be forced to go with the first option and add in a control
> for which env vars get set...

Our current workaround is to srun a wrapper script that just unsets one of the environment variables:

$ cat wrap.sh
#!/bin/bash
unset CUDA_VISIBLE_DEVICES
exec $*

Is the patch just to comment out https://github.com/SchedMD/slurm/blob/4c1ccec1aed42701c893c97f3bf386852dc073c1/src/plugins/gres/gpu/gres_gpu.c#L139 ?

Curiously enough, based on an inspection of the code, it seems that ROCR_VISIBLE_DEVICES is not set for the epilog.

I think this is all happening in the ROCr runtime, so all AMD GPUs are likely impacted the same based on ROCm version.

It appears that CUDA_VISIBLE_DEVICES is actually a synonym for HIP_VISIBLE_DEVICES (not ROCR_VISIBLE_DEVICES) and only impacts HIP programs. git blame shows this has been there at least since 2016.

Michael Hinton

(In reply to Matt Ezell from comment #2)
> Is the patch just to comment out
> https://github.com/SchedMD/slurm/blob/4c1ccec1aed42701c893c97f3bf386852dc073c1/src/plugins/gres/gpu/gres_gpu.c#L139 ?

Yes :)

> Curiously enough, based on an inspection of the code, it seems that
> ROCR_VISIBLE_DEVICES is not set for the epilog.

This seemed familiar to me, and it turns out that it is actually a known issue that I submitted a patch for a while back, but it has yet to be reviewed. I will see if I can get this prioritized for you.

> I think this is all happening in the ROCr runtime, so all AMD GPUs are
> likely impacted the same based on ROCm version.
>
> It appears that CUDA_VISIBLE_DEVICES is actually a synonym for
> HIP_VISIBLE_DEVICES (not ROCR_VISIBLE_DEVICES) and only impacts HIP
> programs. git blame shows this has been there at least since 2016.

Is this the ROCm code? Do you have a reference to the code you are referring to?

-Michael

Matt Ezell

(In reply to Michael Hinton from comment #3)
> This seemed familiar to me, and it turns out that it is actually a known
> issue that I submitted a patch for a while back, but it has yet to be
> reviewed. I will see if I can get this prioritized for you.

We allocate nodes exclusively, so anything we do in the epilog is for all devices - so not necessarily a priority for us, just something I noticed.

> Is this the ROCm code? Do you have a reference to the code you are referring
> to?

The HIP layer reading {HIP,CUDA}_VISIBLE_DEVICES is at https://github.com/ROCm-Developer-Tools/HIP/blob/2080cc113a2d767352b512b9d24c0620b6dee790/src/hip_hcc.cpp#L1354 with the actual work done in HIP_VISIBLE_DEVICES_callback.

The ROCr code reads the environment variable at https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/36b56afdd88f235c3ee28c86aa9075597ab3ea4c/src/core/util/flag.h#L76 with much of the actual work happening in https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/36b56afdd88f235c3ee28c86aa9075597ab3ea4c/src/core/runtime/amd_filter_device.cpp

Matt Ezell

AMD says "Setting both RVD and CVD is typically unnecessary and may be harmful." They explained a hypothetical situation where a code might need to use RVD and CVD simultaneously, with them set to different values. So I don't expect the behavior to change in future versions of ROCm.

They also said that they can't guarantee that CVD will work for all programming models and use cases, so we need RVD set.

Unfortunately it seems that we will need some way to control which env vars get set.

Michael Hinton

(In reply to Matt Ezell from comment #5)
> AMD says "Setting both RVD and CVD is typically unnecessary and may be
> harmful." They explained a hypothetical situation where a code might need
> to use RVD and CVD simultaneously, with them set to different values. So I
> don't expect the behavior to change in future versions of ROCm.
>
> They also said that they can't guarantee that CVD will work for all
> programming models and use cases, so we need RVD set.
>
> Unfortunately it seems that we will need some way to control which env vars
> get set.

Ok, that is very good to know. I'll go ahead and mark this as an enhancement, and we'll discuss this internally. Thanks!

-Michael

Michael Hinton

Matt,

It's my understanding that GPU_DEVICE_ORDINAL is for AMD OpenCL, which is superseded by AMD ROCm and ROCR_VISIBLE_DEVICES for AMD GPUs. Is this correct? Is there anything tricky that we should be aware of with GPU_DEVICE_ORDINAL vs. ROCR_VISIBLE_DEVICES?

Matt Ezell

(In reply to Michael Hinton from comment #8)
> It's my understanding that GPU_DEVICE_ORDINAL is for AMD OpenCL, which is
> superseded by AMD ROCm and ROCR_VISIBLE_DEVICES for AMD GPUs. Is this
> correct? Is there anything tricky that we should be aware of with
> GPU_DEVICE_ORDINAL vs. ROCR_VISIBLE_DEVICES?
Yes, same behavior for an OpenCL program:

bash-4.4$ srun -n1 /opt/rocm-4.0.1/opencl/bin/clinfo |grep "Number of devices"
  Number of devices: 4
bash-4.4$ srun -n1 /bin/bash -c 'ROCR_VISIBLE_DEVICES=1 /opt/rocm-4.0.1/opencl/bin/clinfo' |grep "Number of devices"
  Number of devices: 1
bash-4.4$ srun -n1 /bin/bash -c 'ROCR_VISIBLE_DEVICES=1 GPU_DEVICE_ORDINAL=1 /opt/rocm-4.0.1/opencl/bin/clinfo' |grep "Number of devices"
  Number of devices: 0

Matt Ezell

Created attachment 19041 [details]
Disable setting CVD
This is the patch we are running on Spock, our EA system. Obviously it's not suitable to be merged upstream.
Has there been any discussion that you can share about the "right" way to address this?
What if we added a field to gres.conf called EnvVars? gpu/nvml and gpu/rsmi could set it to the correct value if AutoDetect were in use. If unset, it could default to the current behavior (set all three variables that are currently set).
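For illustration only, a gres.conf entry under this proposal might look like the line below. EnvVars is the proposed (not yet existing) field, the value follows the plugin-name convention used in the v1 patch later in this thread, and the node name, GPU type, and device files are hypothetical:

# Hypothetical AMD node: only set the ROCm runtime variable, not CUDA_VISIBLE_DEVICES
NodeName=amd[01-04] Name=gpu Type=mi100 File=/dev/dri/renderD[128-131] EnvVars=rsmi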
Matt Ezell

(In reply to Matt Ezell from comment #10)
> Has there been any discussion that you can share about the "right" way to
> address this?
>
> What if we added a field to gres.conf called EnvVars? gpu/nvml and gpu/rsmi
> could set it to the correct value if AutoDetect were in use. If unset, it
> could default to the current behavior (set all three variables that are
> currently set).

I'd love to avoid having to patch this for the entire 21.08 release cycle. Has there been any work on this that is under review?

Michael Hinton

Hi Matt,

Unfortunately, this won't make it into the 21.08 release, and there is nothing actively under review. We've already frozen the feature addition process and are working hard to get existing sponsored feature requests in. We'll have to pick this back up after 21.08 is released.

Thanks,
-Michael

Matt Ezell

(In reply to Michael Hinton from comment #15)
> Unfortunately, this won't make it into the 21.08 release, and there is
> nothing actively under review. We've already frozen the feature addition
> process and are working hard to get existing sponsored feature requests in.
> We'll have to pick this back up after 21.08 is released.

Then it's probably worth adding something to the docs that multi-AMD-GPU nodes are effectively unusable/unsupported in 21.08.

Michael Hinton

(In reply to Matt Ezell from comment #16)
> Then it's probably worth adding something to the docs that multi-AMD-GPU
> nodes are effectively unusable/unsupported in 21.08.

I 100% agree that we should document this limitation, since we are currently not planning on fixing it in 21.08, and since it is broken in the prior releases as well.

I do like your approach in comment 10, and we were discussing that internally, but then other things for this release took priority. Normally, adding a new parameter to gres.conf would be a change that probably has to wait until the next major release (22.05), but if it is backwards-compatible, and if it doesn't change anything at the RPC layer (e.g. if we only add a flag instead of an entire new field), you may be able to convince Tim to commit it to a 21.08 minor release. It *does* seem to be more of a bug fix than an enhancement, and there isn't really a great workaround.

-Michael

Michael Hinton

Created attachment 20547 [details]
v1
Surprise! Here is a v1 that I think will do the trick. Matt, could you test it out and see if it works for you while Tim reviews this?
We are going to try to sneak this into 21.08, especially since it doesn't mess with any RPCs.
----------------------------
Here's how I tested this patch. My machine has a single NVIDIA GPU that gets picked up with AutoDetect:
slurm.conf (abbreviated)
**************
SelectType=select/cons_tres
SelectTypeParameters=cr_core_memory
GresTypes=gpu
AccountingStorageTRES=gres/gpu,gres/gpu:rtx
NodeName=DEFAULT Gres=gpu:rtx:4
NodeName=test[1-2] NodeAddr=localhost Port=21082-21083
gres.conf: (Only uncomment one test at a time)
**************
# 1) Default (all envs set)
NodeName=test1 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2
NodeName=test1 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5
NodeName=test2 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2
NodeName=test2 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5
# # 2) No envs set
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2 EnvVars=none
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=none
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2 EnvVars=none
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=bogus
# # 3) RSMI on test1, NVML on test2
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2 EnvVars=rsmi
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=rsmi
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty[0-1] Cores=0-2 EnvVars=nvml
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=nvml
# # 4) Test AutoDetect and set a combo of envs
# Autodetect=nvml
# NodeName=test1 Name=gpu Type=rtx File=/dev/tty1 Cores=0-2 EnvVars=rsmi
# NodeName=test2 Name=gpu Type=rtx File=/dev/tty1 Cores=0-2 EnvVars=opencl,rsmi
# NodeName=test[1-2] Name=gpu Type=rtx File=/dev/tty[2-3] Cores=3-5 EnvVars=nvml
Commands
******************
srun --gpus=1 -w test1 env | grep -i "cuda\|rocr\|ordinal"
srun --gpus=1 -w test2 env | grep -i "cuda\|rocr\|ordinal"
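Assuming the task is allocated the first GPU (the exact index depends on which device is assigned), the expected output for test case 1, where all three variables are set, would look something like:

CUDA_VISIBLE_DEVICES=0
GPU_DEVICE_ORDINAL=0
ROCR_VISIBLE_DEVICES=0

For test case 2 (EnvVars=none or an unrecognized value), the same grep should print nothing.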
Overview
************
If EnvVars is not set at all, then it will default to the current behavior (set all three of CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, and ROCR_VISIBLE_DEVICES). If any of `nvml`, `rsmi`, or `opencl` are specified in EnvVars, then the corresponding envs will be set. If EnvVars is set but does not contain any of `nvml`, `rsmi`, or `opencl`, then no envs will be set (so you can do "none" or "bogus" to stop setting all envs).
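Spelled out, the intended mapping between EnvVars keywords and environment variables is:

nvml                      -> CUDA_VISIBLE_DEVICES
rsmi                      -> ROCR_VISIBLE_DEVICES
opencl                    -> GPU_DEVICE_ORDINAL
none (or any other value) -> no GPU environment variables
EnvVars not specified     -> all three, as today

(This is the intended behavior; see the v2 review below, where the rsmi and opencl cases were found to be swapped.)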
EnvVars applies to the node the GPUs are on, not to the individual GPUs themselves. Slurm applies EnvVars cumulatively across a node's GRES lines.
Slurm already basically assumes that nodes will not have both AMD and NVIDIA GPUs. EnvVars adheres to that assumption.
I'll work on a v2 that includes documentation and testing.
Thanks,
-Michael
Michael Hinton

Oh, I should note that v1 also supports AutoDetect, so that should fully work.

Matt Ezell

(In reply to Michael Hinton from comment #20)
> Surprise! Here is a v1 that I think will do the trick. Matt, could you test
> it out and see if it works for you while Tim reviews this?

Awesome news! We have a downtime tomorrow for software upgrades, so I'll try to target this for Wednesday. Thanks so much!

Michael Hinton

Created attachment 20559 [details]
v2
Recent changes on master required a rebase of v1, so here is v2. Hopefully it applies cleanly and works - another build issue on master is preventing me from confirming that. Let me know if you run into problems. Thanks!
Matt Ezell

(In reply to Michael Hinton from comment #24)
> Created attachment 20559 [details]
> v2
>
> Recent changes on master required a rebase of v1, so here is v2. Hopefully
> it applies cleanly and works - another build issue on master is preventing
> me from confirming that. Let me know if you run into problems. Thanks!

_set_env appears to have rsmi and opencl backwards:

@@ -138,12 +148,15 @@ static void _set_env(char ***env_ptr, bitstr_t *gres_bit_alloc,
 	}
 
 	if (local_list) {
-		env_array_overwrite(
-			env_ptr, "CUDA_VISIBLE_DEVICES", local_list);
-		env_array_overwrite(
-			env_ptr, "GPU_DEVICE_ORDINAL", local_list);
-		env_array_overwrite(
-			env_ptr, "ROCR_VISIBLE_DEVICES", local_list);
+		if (node_flags & GRES_CONF_ENV_NVML)
+			env_array_overwrite(env_ptr, "CUDA_VISIBLE_DEVICES",
+					    local_list);
+		if (node_flags & GRES_CONF_ENV_RSMI)
+			env_array_overwrite(env_ptr, "GPU_DEVICE_ORDINAL",
+					    local_list);
+		if (node_flags & GRES_CONF_ENV_OPENCL)
+			env_array_overwrite(env_ptr, "ROCR_VISIBLE_DEVICES",
+					    local_list);
 		xfree(local_list);
 		*already_seen = true;
 	}

If I set it to rsmi, I get GPU_DEVICE_ORDINAL, and if I set it to opencl I get ROCR_VISIBLE_DEVICES. Other than that, it seems to work for me. Thanks!

Note: I tested master on a VM, without a GPU, so I didn't test autodetect. We aren't currently using autodetect on our cluster.

Michael Hinton

Hey Matt, this is now in 21.08 rc1 with commits https://github.com/SchedMD/slurm/compare/0705ded00d4e...0c7fb08c4e7d.

We made some changes compared to v2: instead of a dedicated EnvVars field, we simply added on to the preexisting Flags field in gres.conf:

Flags
    Optional flags that can be specified to change configured behavior of the GRES. Allowed values at present are:
    ...
    nvidia_gpu_env    Set environment variable CUDA_VISIBLE_DEVICES for all GPUs on the specified node(s).
    amd_gpu_env       Set environment variable ROCR_VISIBLE_DEVICES for all GPUs on the specified node(s).
    opencl_env        Set environment variable GPU_DEVICE_ORDINAL for all GPUs on the specified node(s).
    no_gpu_env        Set no GPU-specific environment variables.

We also added ROCR_VISIBLE_DEVICES to the prolog/epilog, like CUDA_VISIBLE_DEVICES.

We are planning on tweaking things before 21.08 is released, but you can go ahead and play around with this.

Thanks,
Michael

Matt Ezell

My testing against 21.08.0rc1 shows the correct environment variable being set on our cluster (I'm using Flags=amd_gpu_env). Thanks!

Michael Hinton

Matt, we've added some follow-up commits here: https://github.com/SchedMD/slurm/compare/6cb01bd89a92...5ddec8654a07. The most important commit was https://github.com/SchedMD/slurm/commit/6b890ed5b70c. This made it so env flags propagate to subsequent GRES lines, and so Flags will override AutoDetected values.

I'll go ahead and close this out. Thanks!

-Michael
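As a usage sketch of the released syntax (the node name, GPU type, and device files below are illustrative only, not taken from this report), a gres.conf entry for an AMD GPU node might look like:

NodeName=amd[01-04] Name=gpu Type=mi100 File=/dev/dri/renderD[128-131] Flags=amd_gpu_env

With amd_gpu_env, only ROCR_VISIBLE_DEVICES is set for jobs on those nodes, which avoids the CUDA_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES conflict described at the top of this report.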