Ticket 15505 - slurmd failed to find nvml shared library in RHEL-Family
Summary: slurmd failed to find nvml shared library in RHEL-Family
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.8
Hardware: Linux
OS: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-11-28 19:31 MST by S. Zhang
Modified: 2025-07-07 20:40 MDT

See Also:
Site: -Other-
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description S. Zhang 2022-11-28 19:31:08 MST
Issue

I am setting up a new slurmd instance on Oracle Linux 8, with the following packages installed:
- slurm-slurmd-21.08.8-1.sdl8.5 (From springdale computational core 8)
- nvidia-driver-NVML-520.61.05-1.el8 (From cuda-rhel8-x86_64, automatically pulled in as a dependency)

However, slurmd failed to start with the following error message:
`fatal: We were configured with nvml functionality, but that lib wasn't found on the system.`

Investigation

My first idea when I saw this error was to make sure that nvidia-driver-NVML was installed correctly. And indeed, the NVML library is definitely installed and is even linked into slurmd's shared object, as shown by `ldd /usr/lib64/slurm/gpu_nvml.so`:

```
        linux-vdso.so.1 (0x00007fff6fd6c000)
        libnvidia-ml.so.1 => /lib64/libnvidia-ml.so.1 (0x00007f55e14a5000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f55e1123000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f55e0f0b000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f55e0ceb000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f55e0925000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f55e0721000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f55e2648000)
```

Also, listing the files installed by `nvidia-driver-NVML` shows that the following files were installed:

```
$ rpm -ql nvidia-driver-NVML
/usr/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.520.61.05
```

This left me puzzled about what was happening and why slurmd was not using the libnvidia-ml.so.1 it is linked against. I then turned to Slurm's source code for ideas. gpu_nvml.c looked clean and not relevant to my issue.

After further digging into the slurmd source code, I found the following suspicious-looking segment:

```C
if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))
	info("We were configured with nvml functionality, but that lib wasn't found on the system.");
```
Location: https://github.com/SchedMD/slurm/blob/662587dc1dec3c21803793e516acb470b2878a61/src/interfaces/gpu.c#L100-L101

This gave me the hint that slurmd's GPU detection searches for libnvidia-ml.so regardless of what gpu_nvml.so is linked against. However, this shared library file is not provided by nvidia-driver-NVML. This yields the error message above and stops slurmd from running.

Although I am running 21.08.8, this code is present in every Slurm release up to the latest git build, so the issue applies to all versions.

Fixes

To get this file from the package manager, I would have to install nvidia-driver-devel, which pulls in another load of dependencies that are a waste of disk space in my case, since this slurmd runs on a machine without a GPU.

For a quick and lightweight fix, I manually ran `ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so`, which satisfied the check shown above, and slurmd worked.

However, I think the better way to fix this is to modify the code above to search for the specific versioned .so of libnvidia-ml that gpu_nvml.so is linked against. This minimizes the package dependency, and if NVIDIA bumps the .so version, we have a better chance of catching the error in the library-existence check (is the .so.1 library missing, is a dependency package not installed, or is there a .so version mismatch?) instead of letting gpu_nvml.so try to use an incompatible version of libnvidia-ml.so and core-dump, which is harder to debug.

It could, however, also be an NVIDIA-side problem: they failed to package the unversioned libnvidia-ml.so symbolic link into nvidia-driver-NVML in the first place. In the meantime, I will take a look at other distros to see whether the missing .so symlink is the default behavior or unique to NVIDIA's RHEL release.
Comment 1 Lillian Kelting 2025-07-07 20:40:05 MDT
Hardware mismatch was resolved by disabling hyperthreading. The node is still in an invalid state and the NVML issue persists.