Instead of the GPU index, Slurm sets index + 128. For example, for GPU index 0:

SLURM_STEP_GPUS=128
CUDA_VISIBLE_DEVICES=128
GPU_DEVICE_ORDINAL=128
ROCR_VISIBLE_DEVICES=128

while these should all be 0. For index=1 it is 129. GPU autodetection is enabled. Tested on CentOS 7 and RHEL 8. When the DCGM (NVIDIA) plugin is enabled, this issue does not happen.
Hi Taras,

Could you please give us more information? Could you set SlurmdDebug=debug2, restart your slurmd, and post the output of the slurmd log? That should show what AutoDetect is detecting. I expect that you will see some warnings about the minor numbers not matching the GPU IDs. Could you attach your slurm.conf and gres.conf? Could you also attach the output of `lstopo-no-graphics` and `nvidia-smi topo -m` on the GPU node?

If the minor numbers do not match the GPU IDs, you will get incorrect GPU task binding (i.e. CUDA_VISIBLE_DEVICES will be messed up) and maybe even incorrect GPU allocation. Slurm implicitly expects the minor numbers to match the GPU IDs reported by NVML/nvidia, which is more of an inherent limitation than a bug. It appears that AMD nodes with NVIDIA GPUs enumerate the GPUs in reverse order, so the minor numbers don't match the GPU IDs, and I want to confirm whether that is the case for you.

Thanks,
-Michael
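P.S. If you want to check the minor-number/GPU-ID correspondence yourself before collecting logs, comparing the output of these two commands on the GPU node is usually enough (assuming the standard NVIDIA device nodes are present):

ls -l /dev/nvidia[0-9]*
nvidia-smi -L

The second number in the device-number column of the ls output is the minor number; it should line up with the GPU index that nvidia-smi reports (/dev/nvidia0 with minor 0 as GPU 0, and so on).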
Hi Michael,

Unfortunately I temporarily don't have access to an AMD GPU cluster, but I can say that I have seen this on two different clusters with different types of AMD GPUs.

Taras
We aren't currently using autodetection, but here's what /dev/dri looks like:

[root@lyra14 ~]# ls -al /dev/dri/
total 0
drwxr-xr-x   3 root root        240 Feb 19 11:06 .
drwxr-xr-x  20 root root       3160 Feb 24 09:42 ..
drwxrwxrwx   2 root root        220 Feb 19 11:06 by-path
crwxrwxrwx   1 root video  226,   0 Feb 19 11:06 card0
crwxrwxrwx   1 root video  226,   1 Feb 19 11:06 card1
crwxrwxrwx   1 root video  226,   2 Feb 19 11:06 card2
crwxrwxrwx   1 root video  226,   3 Feb 19 11:06 card3
crwxrwxrwx   1 root root   226,   4 Feb 19 11:06 card4
crwxrwxrwx   1 root render 226, 128 Feb 19 11:06 renderD128
crwxrwxrwx   1 root render 226, 129 Feb 19 11:06 renderD129
crwxrwxrwx   1 root render 226, 130 Feb 19 11:06 renderD130
crwxrwxrwx   1 root render 226, 131 Feb 19 11:06 renderD131

Autodetect is likely picking up the "render" files instead of the "card" files. We are using:

[root@lyra14 ~]# cat /etc/slurm/gres.conf
Name=gpu File=/dev/dri/card0 Cores=64-127
Name=gpu File=/dev/dri/card1 Cores=64-127
Name=gpu File=/dev/dri/card2 Cores=64-127
Name=gpu File=/dev/dri/card3 Cores=64-127

But I'll admit that on this early system we are doing whole-node allocations, and we haven't dug deep into the GPU binding and affinity work.
Ok, I see the issue. Check out https://github.com/SchedMD/slurm/blob/slurm-20-02-6-1/src/plugins/gpu/rsmi/gpu_rsmi.c#L1092-L1111. The RSMI GPU plugin treats the /dev/dri/renderDXXX device files as the device files for the GPU. I think this works correctly with cgroups, but Slurm also has an inherent limitation in that it assumes the minor number of the device file is equivalent to the GPU ID when it sets ROCR_VISIBLE_DEVICES (since that is usually the case with NVIDIA GPUs). For these AMD GPUs that assumption is clearly wrong (e.g. GPU 0 has minor number 128).

This issue is not limited to users of AutoDetect - as long as the device file used for the GPU has a minor number different from its GPU ID, you will see ROCR_VISIBLE_DEVICES set incorrectly. I already have an internal ticket open to address this issue, but it's helpful to know that all AMD GPUs are affected, so the impact is bigger than I thought.

A quick hack to Slurm that may work is to simply subtract 128 from the minor number when setting ROCR_VISIBLE_DEVICES if it's > 127. I'll look to see if I can get a patch that does that.

-Michael
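P.S. To illustrate the idea, here is a rough standalone sketch of that mapping. This is not a patch against gpu_rsmi.c, and the helper name is made up for the example; it only shows the "subtract 128 from the render-node minor number" logic:

/* Standalone sketch of the "subtract 128" mapping described above. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Map a /dev/dri device file to the GPU index its minor number implies:
 * render nodes (renderD128, renderD129, ...) start at minor 128, while
 * card nodes (card0, card1, ...) start at minor 0. Returns -1 on error. */
static int gpu_index_from_devfile(const char *path)
{
	struct stat st;
	unsigned int m;

	if (stat(path, &st) || !S_ISCHR(st.st_mode))
		return -1;

	m = minor(st.st_rdev);
	return (m > 127) ? (int)(m - 128) : (int)m;
}

int main(void)
{
	/* Expect 0 for renderD128 and 1 for renderD129 on the node above. */
	printf("%d\n", gpu_index_from_devfile("/dev/dri/renderD128"));
	printf("%d\n", gpu_index_from_devfile("/dev/dri/renderD129"));
	return 0;
}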
Maybe we need clarification from AMD on which file should be used for the cgroups - it may be both. Per https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html it seems that access to the 'video' group is required, which on our system corresponds to the 'card' entries, not the 'renderD' entries. I agree that subtracting 128 makes sense for the renderD files.
In 20.11, we added an index field to gres_device_t in order to support GPUs with multiple device files. See https://github.com/SchedMD/slurm/commit/5fbb2ca90a. My guess is that the intended use case was for AMD GPUs with multiple device files like this, but I'm not sure yet. It's possible that the RSMI plugin simply was not updated to support this. But I think that you may be able to mark both device files in gres.conf on separate lines and cgroups will work accordingly.
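Untested on my end, but against the gres.conf posted above that would look something like this (I'm not yet sure whether each card/render pair ends up counted as one GPU or two):

Name=gpu File=/dev/dri/card0 Cores=64-127
Name=gpu File=/dev/dri/renderD128 Cores=64-127
Name=gpu File=/dev/dri/card1 Cores=64-127
Name=gpu File=/dev/dri/renderD129 Cores=64-127
Name=gpu File=/dev/dri/card2 Cores=64-127
Name=gpu File=/dev/dri/renderD130 Cores=64-127
Name=gpu File=/dev/dri/card3 Cores=64-127
Name=gpu File=/dev/dri/renderD131 Cores=64-127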
Hello,

This issue should now be fixed in 21.08.0rc1 with commit https://github.com/SchedMD/slurm/commit/0ebfd37834.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 10933 ***