## Issue

I am setting up a new slurmd instance on Oracle Linux 8 with the following packages installed:

- slurm-slurmd-21.08.8-1.sdl8.5 (from Springdale Computational Core 8)
- nvidia-driver-NVML-520.61.05-1.el8 (from cuda-rhel8-x86_64, automatically pulled in as a dependency)

However, slurmd fails to start with the following error message:

`fatal: We were configured with nvml functionality, but that lib wasn't found on the system.`

## Investigation

The first thing that came to mind on seeing this error was to make sure nvidia-driver-NVML is installed correctly. And indeed, the NVML library is definitely installed and is even linked into slurmd's plugin, as shown by `ldd /usr/lib64/slurm/gpu_nvml.so`:

```
linux-vdso.so.1 (0x00007fff6fd6c000)
libnvidia-ml.so.1 => /lib64/libnvidia-ml.so.1 (0x00007f55e14a5000)
libm.so.6 => /lib64/libm.so.6 (0x00007f55e1123000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f55e0f0b000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f55e0ceb000)
libc.so.6 => /lib64/libc.so.6 (0x00007f55e0925000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f55e0721000)
/lib64/ld-linux-x86-64.so.2 (0x00007f55e2648000)
```

Listing the files installed by `nvidia-driver-NVML` shows the following:

```
$ rpm -ql nvidia-driver-NVML
/usr/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.520.61.05
```

This left me lost as to what was happening and why slurmd was not using the libnvidia-ml.so.1 it is linked against. I then turned to slurm's source code for ideas. gpu_nvml.c looked clean and not relevant to my issue.
After further digging into the slurmd source code, I found the following segment that looked suspicious:

```C
if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))
	info("We were configured with nvml functionality, but that lib wasn't found on the system.");
```

Location: https://github.com/SchedMD/slurm/blob/662587dc1dec3c21803793e516acb470b2878a61/src/interfaces/gpu.c#L100-L101

This was the hint I needed: slurmd's GPU detection searches for libnvidia-ml.so, regardless of which library gpu_nvml.so is actually linked against. That shared library file is not provided by nvidia-driver-NVML, which produces the error message above and stops slurmd from running. I am running 21.08.8, but since this code is present in every release of slurm up to the latest git build, it applies to all versions.

## Fixes

To get this file from the package manager, I would have to install nvidia-driver-devel, which pulls in another load of dependencies that are a waste of disk space in my case, since I am running this slurmd on a machine without a GPU.

For a quick and lightweight fix, I manually ran `ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so`; this satisfies the check shown above, and slurmd works.

However, I think the better way to fix this issue is to modify the code above to search for the specific versioned .so that gpu_nvml.so is linked against. This minimizes the package dependency, and if NVIDIA ever bumps the version of the .so file, we have a better chance of catching the problem in the library existence check (".so.1 not found: missing driver package, or .so version mismatch?") instead of letting gpu_nvml.so try to use an incompatible libnvidia-ml.so and core dump, which is much harder to debug.

It could, however, also be a problem on NVIDIA's side, for failing to pack the unversioned libnvidia-ml.so symbolic link into nvidia-driver-NVML in the first place.
In the meantime, I will take a look at other distros to see whether the missing unversioned .so symlink is the default behavior or unique to NVIDIA's RHEL packaging.
Update: the hardware mismatch was resolved by disabling hyperthreading. The node is still in an invalid state and the NVML issue persists.