| Summary: | slurmd failed to find nvml shared library in RHEL-Family | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | S. Zhang <szhang> |
| Component: | slurmd | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 21.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
## Issue

I am setting up a new slurmd instance on Oracle Linux 8 with the following packages installed:

- slurm-slurmd-21.08.8-1.sdl8.5 (from Springdale Computational Core 8)
- nvidia-driver-NVML-520.61.05-1.el8 (from cuda-rhel8-x86_64, automatically pulled in as a dependency)

However, slurmd fails to start with the following error message:

`fatal: We were configured with nvml functionality, but that lib wasn't found on the system.`

## Investigation

The first thing to check when such an error occurs is whether nvidia-driver-NVML is installed correctly. The NVML library is definitely installed, and is even linked into slurmd's plugin, as shown by `ldd /usr/lib64/slurm/gpu_nvml.so`:

```
linux-vdso.so.1 (0x00007fff6fd6c000)
libnvidia-ml.so.1 => /lib64/libnvidia-ml.so.1 (0x00007f55e14a5000)
libm.so.6 => /lib64/libm.so.6 (0x00007f55e1123000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f55e0f0b000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f55e0ceb000)
libc.so.6 => /lib64/libc.so.6 (0x00007f55e0925000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f55e0721000)
/lib64/ld-linux-x86-64.so.2 (0x00007f55e2648000)
```

Listing the files installed by `nvidia-driver-NVML` confirms the library is present:

```
$ rpm -ql nvidia-driver-NVML
/usr/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.520.61.05
```

This left me lost as to why slurmd was not using the libnvidia-ml.so.1 it is linked against, so I turned to slurm's source code for ideas. gpu_nvml.c looked clean and not relevant to my issue.
After further digging into the slurmd source code, I found the following suspicious segment:

```C
if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))
	info("We were configured with nvml functionality, but that lib wasn't found on the system.");
```

Location: https://github.com/SchedMD/slurm/blob/662587dc1dec3c21803793e516acb470b2878a61/src/interfaces/gpu.c#L100-L101

This was the hint: slurmd's GPU detection searches for the unversioned libnvidia-ml.so, no matter what gpu_nvml.so is actually linked against. That file is not provided by nvidia-driver-NVML, which produces the error message above and stops slurmd from running. This code is present in every release of slurm; I am running 21.08.8, but the problem applies to all versions up to the latest git build.

## Fixes

To get the unversioned file from the package manager, I would have to install nvidia-driver-devel, which pulls in another load of dependencies, a waste of disk space in my case, since this slurmd runs on a machine without a GPU.

As a quick and lightweight fix, I manually ran `ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so`, which satisfies the check shown above, and slurmd starts.

The better fix, however, would be to modify the code above to search for the specific versioned .so name that gpu_nvml.so is linked against. That minimizes the package dependency, and if NVIDIA bumps the version of the .so file, we have a better chance of catching the error in the library-existence check (".so.1 not found: missing dependency package, or .so version mismatch?") instead of letting gpu_nvml.so attempt to use an incompatible version of libnvidia-ml.so and core dump, which is much harder to debug. It could, of course, also be an NVIDIA-side problem: they fail to package the unversioned libnvidia-ml.so symbolic link into nvidia-driver-NVML in the first place.
In the meantime, I will take a look at other distros to see whether the missing .so symbolic link is the default behavior everywhere, or unique to NVIDIA's RHEL-family releases.