Ticket 15505

Summary: slurmd failed to find nvml shared library in RHEL-Family
Product: Slurm Reporter: S. Zhang <szhang>
Component: slurmd    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
Site: -Other-

Description S. Zhang 2022-11-28 19:31:08 MST
Issue

I am setting up a new slurmd instance on Oracle Linux 8 with the following packages installed:
- slurm-slurmd-21.08.8-1.sdl8.5 (from Springdale Computational Core 8)
- nvidia-driver-NVML-520.61.05-1.el8 (from cuda-rhel8-x86_64, automatically pulled in as a dependency)

However, slurmd fails to start with the following error message:
`fatal: We were configured with nvml functionality, but that lib wasn't found on the system.`

Investigation

My first thought when such an error occurs is to make sure that nvidia-driver-NVML is installed correctly. The NVML library is indeed installed, and slurmd's plugin is even linked against it, as shown by `ldd /usr/lib64/slurm/gpu_nvml.so`:

```
        linux-vdso.so.1 (0x00007fff6fd6c000)
        libnvidia-ml.so.1 => /lib64/libnvidia-ml.so.1 (0x00007f55e14a5000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f55e1123000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f55e0f0b000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f55e0ceb000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f55e0925000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f55e0721000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f55e2648000)
```

A listing of the files installed by `nvidia-driver-NVML` also confirms that the library is in place:

```
$ rpm -ql nvidia-driver-NVML
/usr/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.520.61.05
```

At this point I was lost as to why slurmd was not using the libnvidia-ml.so.1 it is linked against, so I turned to Slurm's source code for ideas. gpu_nvml.c looked clean and not relevant to my issue.

Digging further into the slurmd source, I found the following segment that looked suspicious:

```C
if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))
	fatal("We were configured with nvml functionality, but that lib wasn't found on the system.");
```
Location: https://github.com/SchedMD/slurm/blob/662587dc1dec3c21803793e516acb470b2878a61/src/interfaces/gpu.c#L100-L101

This gave me the hint that slurmd's GPU detection dlopen()s the unversioned libnvidia-ml.so, regardless of which library gpu_nvml.so is actually linked against. That unversioned file is not provided by nvidia-driver-NVML, which produces the error message above and stops slurmd from running.

Although I am running 21.08.8, this code is present in every release of Slurm up to the latest git build, so the issue applies to all versions.

Fixes

To obtain this file from the package manager I would have to install nvidia-driver-devel, which pulls in another load of dependencies that are wasted disk space in my case, since this slurmd runs on a machine without a GPU.

For a quick and lightweight fix, I manually ran `ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so`, which satisfied the check shown above, and slurmd started.

However, I think the better fix is to modify the code above to search for the specific versioned name of libnvidia-ml.so that gpu_nvml.so is linked against. This minimizes the package dependency. It also means that if NVIDIA bumps the library's soname, the existence check has a better chance of catching the error (is .so.1 missing because the driver package is absent, or because of a soname mismatch?), instead of letting gpu_nvml.so load an incompatible libnvidia-ml.so and core dump, which is much harder to debug.

It could, however, also be a problem on NVIDIA's side, for failing to package the unversioned libnvidia-ml.so symlink into nvidia-driver-NVML in the first place. In the meantime I will take a look at other distros to see whether the missing symlink is the default behavior or unique to NVIDIA's RHEL releases.