Ticket 15505

Summary: slurmd failed to find nvml shared library in RHEL-Family
Product: Slurm Reporter: S. Zhang <szhang>
Component: slurmd    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
Site: -Other-

Description S. Zhang 2022-11-28 19:31:08 MST
Issue

I am setting up a new slurmd instance on Oracle Linux 8 with the following packages installed:
- slurm-slurmd-21.08.8-1.sdl8.5 (from Springdale Computational Core 8)
- nvidia-driver-NVML-520.61.05-1.el8 (from cuda-rhel8-x86_64, automatically pulled in as a dependency)

However, slurmd fails to start with the following error message:
`fatal: We were configured with nvml functionality, but that lib wasn't found on the system.`

Investigation

My first thought when such an error occurs is to make sure that nvidia-driver-NVML is installed correctly. The NVML library is indeed installed, and slurmd's plugin is even linked against it, as shown by `ldd /usr/lib64/slurm/gpu_nvml.so`:

```
        linux-vdso.so.1 (0x00007fff6fd6c000)
        libnvidia-ml.so.1 => /lib64/libnvidia-ml.so.1 (0x00007f55e14a5000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f55e1123000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f55e0f0b000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f55e0ceb000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f55e0925000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f55e0721000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f55e2648000)
```

A listing of the files installed by `nvidia-driver-NVML` also confirms that the library is in place:

```
$ rpm -ql nvidia-driver-NVML
/usr/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.520.61.05
```

At this point I was lost as to why slurmd was not using the libnvidia-ml.so.1 it is linked against, so I turned to Slurm's source code for ideas. gpu_nvml.c looked clean and not relevant to my issue.

Digging further into the slurmd source, I found the following segment that looked suspicious:

```C
if (!dlopen("libnvidia-ml.so", RTLD_NOW | RTLD_GLOBAL))
	fatal("We were configured with nvml functionality, but that lib wasn't found on the system.");
```
Location: https://github.com/SchedMD/slurm/blob/662587dc1dec3c21803793e516acb470b2878a61/src/interfaces/gpu.c#L100-L101

This gave me the hint that slurmd's GPU detection dlopen()s the unversioned libnvidia-ml.so, regardless of which library gpu_nvml.so is actually linked against. That unversioned file is not provided by nvidia-driver-NVML, which produces the error message above and stops slurmd from running.

Although I am running 21.08.8, this code is present in every release of Slurm up to the latest git build, so the issue applies to all versions.

Fixes

To obtain this file from the package manager I would have to install nvidia-driver-devel, which pulls in another load of dependencies that are wasted disk space in my case, since this slurmd runs on a machine without a GPU.

For a quick and lightweight fix, I manually ran `ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so`, which satisfied the check shown above, and slurmd started.

However, I think the better fix is to modify the code above to search for the specific versioned name of libnvidia-ml.so that gpu_nvml.so is linked against. This minimizes the package dependency. It also means that if NVIDIA bumps the library's soname, the existence check has a better chance of catching the error (is .so.1 missing because the driver package is absent, or because of a soname mismatch?), instead of letting gpu_nvml.so load an incompatible libnvidia-ml.so and core dump, which is much harder to debug.

It could, however, also be a problem on NVIDIA's side, for failing to package the unversioned libnvidia-ml.so symlink into nvidia-driver-NVML in the first place. In the meantime I will take a look at other distros to see whether the missing symlink is the default behavior or unique to NVIDIA's RHEL releases.