Ticket 10894

Summary: AMD GPU plugin sets incorrect environment variables
Product: Slurm Reporter: Taras Shapovalov <taras.shapovalov>
Component: GPU Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
Severity: 6 - No support contract    
Priority: --- CC: ezellma, jbooth, tim
Version: 20.02.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10827
https://bugs.schedmd.com/show_bug.cgi?id=11027
https://bugs.schedmd.com/show_bug.cgi?id=10933
Site: -Other-

Description Taras Shapovalov 2021-02-18 09:49:17 MST
Instead of the GPU index, the plugin sets index + 128, for example:

SLURM_STEP_GPUS=128
CUDA_VISIBLE_DEVICES=128
GPU_DEVICE_ORDINAL=128
ROCR_VISIBLE_DEVICES=128

while these should all be 0. For index=1 it is 129.

GPU autodetection is enabled. Tested on CentOS 7 and RHEL 8.
When the DCGM (NVIDIA) plugin is enabled, this issue does not happen.
Comment 2 Michael Hinton 2021-03-04 16:03:25 MST
Hi Taras,

Could you please give us more information? Could you set SlurmdDebug=debug2, restart your slurmd, and post the output of the slurmd log? That should show what AutoDetect is detecting. I expect that you will see some warnings about the minor numbers not matching the GPU IDs. 

Could you attach your slurm.conf and gres.conf? Could you also attach the output of `lstopo-no-graphics` and `nvidia-smi topo -m` on the GPU node?

If the minor numbers do not match the GPU IDs, you will get incorrect GPU task binding (i.e. CUDA_VISIBLE_DEVICES will be wrong) and possibly even incorrect GPU allocation. Slurm implicitly expects the minor numbers to match the GPU IDs reported by NVML, which is more of an inherent limitation than a bug. It appears that AMD nodes with NVIDIA GPUs enumerate the GPUs in reverse order, so the minor numbers don't match the GPU IDs; I want to confirm whether that is the case for you.

Thanks,
-Michael
Comment 3 Taras Shapovalov 2021-03-04 23:35:56 MST
Hi Michael,

Unfortunately I temporarily don't have access to an AMD GPU cluster. But I can say that I have seen this on two different clusters with different types of AMD GPUs.

Taras
Comment 5 Matt Ezell 2021-03-05 10:31:25 MST
We aren't currently using autodetection, but here's what /dev/dri looks like:

[root@lyra14 ~]# ls -al /dev/dri/
total 0
drwxr-xr-x  3 root root        240 Feb 19 11:06 .
drwxr-xr-x 20 root root       3160 Feb 24 09:42 ..
drwxrwxrwx  2 root root        220 Feb 19 11:06 by-path
crwxrwxrwx  1 root video  226,   0 Feb 19 11:06 card0
crwxrwxrwx  1 root video  226,   1 Feb 19 11:06 card1
crwxrwxrwx  1 root video  226,   2 Feb 19 11:06 card2
crwxrwxrwx  1 root video  226,   3 Feb 19 11:06 card3
crwxrwxrwx  1 root root   226,   4 Feb 19 11:06 card4
crwxrwxrwx  1 root render 226, 128 Feb 19 11:06 renderD128
crwxrwxrwx  1 root render 226, 129 Feb 19 11:06 renderD129
crwxrwxrwx  1 root render 226, 130 Feb 19 11:06 renderD130
crwxrwxrwx  1 root render 226, 131 Feb 19 11:06 renderD131

Autodetect is likely picking up the "render" files instead of the "card" files. We are using:

[root@lyra14 ~]# cat /etc/slurm/gres.conf 
Name=gpu File=/dev/dri/card0 Cores=64-127
Name=gpu File=/dev/dri/card1 Cores=64-127
Name=gpu File=/dev/dri/card2 Cores=64-127
Name=gpu File=/dev/dri/card3 Cores=64-127

But I'll admit that on this early system we are doing whole-node allocations, and we haven't dug deep into the GPU binding and affinity work.
Comment 6 Michael Hinton 2021-03-05 11:05:23 MST
Ok, I see the issue. Check out https://github.com/SchedMD/slurm/blob/slurm-20-02-6-1/src/plugins/gpu/rsmi/gpu_rsmi.c#L1092-L1111.

The RSMI GPU plugin treats the /dev/dri/renderDXXX device files as the device files for the GPUs. I think this works correctly with cgroups, but Slurm also has an inherent limitation: when it sets ROCR_VISIBLE_DEVICES, it assumes that the minor number of the device file is equivalent to the GPU ID (since that is usually the case with NVIDIA GPUs).

That assumption is clearly wrong here (e.g. GPU 0 has minor number 128). This issue is not limited to users of AutoDetect: as long as the device file used for a GPU has a minor number different from its GPU ID, ROCR_VISIBLE_DEVICES will be set incorrectly.
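For illustration, here is a minimal standalone sketch (hypothetical, not Slurm code) that prints the minor number of a device file, i.e. the value the plugin effectively ends up exporting. On the node above it prints 128 for /dev/dri/renderD128:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */

int main(int argc, char **argv)
{
    /* Default to the first DRM render node; pass another path to check it. */
    const char *path = (argc > 1) ? argv[1] : "/dev/dri/renderD128";
    struct stat st;

    if (stat(path, &st) != 0) {
        perror(path);
        return 1;
    }
    /* For a character device, st_rdev holds the major/minor numbers. */
    printf("%s: major=%u minor=%u\n", path,
           (unsigned) major(st.st_rdev), (unsigned) minor(st.st_rdev));
    return 0;
}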

I already have an internal ticket open to address this issue, but it's helpful to know that all AMD GPUs are affected, so the impact is bigger than I thought.

A quick hack to Slurm that may work is to simply subtract 128 from the minor number when setting ROCR_VISIBLE_DEVICES if it's greater than 127. I'll see if I can put together a patch that does that.
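A sketch of what that hack might look like (hypothetical; the helper name is made up and this is not the actual patch). DRM render nodes are allocated minor numbers starting at 128, so anything above 127 is shifted back down:

/* Hypothetical helper, not the actual Slurm patch. /dev/dri/cardN files
 * have minor numbers starting at 0, while /dev/dri/renderDNNN files
 * start at 128, so a minor above 127 is assumed to be a render node
 * and is mapped back to a zero-based GPU index. */
static unsigned int rsmi_minor_to_gpu_index(unsigned int minor_num)
{
    if (minor_num > 127)
        return minor_num - 128;
    return minor_num;
}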

-Michael
Comment 8 Matt Ezell 2021-03-05 11:12:16 MST
Maybe we need clarification from AMD on which file should be used for the cgroups - it may be both.

Per https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html it seems that access to the 'video' group is required, which on our system corresponds to the 'card' entries, not the 'renderD' entries.

I agree that subtracting 128 makes sense for the renderD files.
Comment 9 Michael Hinton 2021-03-05 11:21:25 MST
In 20.11, we added an index field to gres_device_t in order to support GPUs with multiple device files; see https://github.com/SchedMD/slurm/commit/5fbb2ca90a. My guess is that the intended use case was AMD GPUs with multiple device files like this, but I'm not sure yet; it's possible that the RSMI plugin simply was not updated to use it. But I think you may be able to list both device files in gres.conf on separate lines, and cgroups will work accordingly, as sketched below.
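As a hypothetical, untested illustration of that idea for the node above (note that each File= line ordinarily defines a separate GRES device, so the GPU count Slurm reports would need to be verified):

Name=gpu File=/dev/dri/card0 Cores=64-127
Name=gpu File=/dev/dri/renderD128 Cores=64-127
Name=gpu File=/dev/dri/card1 Cores=64-127
Name=gpu File=/dev/dri/renderD129 Cores=64-127
Name=gpu File=/dev/dri/card2 Cores=64-127
Name=gpu File=/dev/dri/renderD130 Cores=64-127
Name=gpu File=/dev/dri/card3 Cores=64-127
Name=gpu File=/dev/dri/renderD131 Cores=64-127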
Comment 15 Michael Hinton 2021-08-04 18:25:08 MDT
Hello,

This issue should now be fixed in 21.08.0rc1 with commit https://github.com/SchedMD/slurm/commit/0ebfd37834.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 10933 ***