Ticket 18205 - 23.11 - Debian packaging misses nvml plugin
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 23.11.x
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Tim McMullan
Reported: 2023-11-15 23:13 MST by Markus Kötter
Modified: 2024-07-31 07:07 MDT
Site: CISPA


Description Markus Kötter 2023-11-15 23:13:50 MST
The NVIDIA nvml plugin is not packaged in the current state of the Debian packaging.
In Debian's own packaging, the nvml plugin is in the contrib branch of the repo.

It's the libnvidia-ml-dev dependency 
https://salsa.debian.org/hpc-team/slurm-wlm/-/blob/contrib/debian/control#L37
and the package definitions
https://salsa.debian.org/hpc-team/slurm-wlm/-/blob/contrib/debian/control#L43-71

I'd appreciate it if this were added.
Comment 2 Marcin Stolarek 2023-11-16 01:37:43 MST
Markus,

Since this is a feature request (for a future version), I don't think we can consider it Severity 1 under our commercial support definitions [1].

>Severity 1 — Major Impact
> 
>There is a continued system outage that affects a large number of end users. The 
>system is down or unusable due to Slurm problem(s) and no procedural workaround 
>exists.

cheers,
Marcin
[1] https://www.schedmd.com/support.php
Comment 5 Tim McMullan 2023-11-17 11:58:00 MST
Markus,

While there is no slurm-smd-nvml-plugin package, the nvml plugin will be included in the slurm-smd package if the CUDA libraries are installed on the build system.  The same is true of the AMD and Intel GPU plugins.

It's set up this way to allow a little more flexibility in which version of CUDA/ROCm/oneAPI is built into the packages, since it is fairly common to install those from external sources.

> dpkg -c ./slurm-smd_23.11.0-0rc2_amd64.deb | grep nvml

should show whether or not the built package includes the nvml plugin.
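
The same check can be extended to all three GPU plugins mentioned above. This is only a minimal sketch: the helper name `list_gpu_plugins` is hypothetical, and the plugin file names `gpu_nvml.so`, `gpu_rsmi.so` and `gpu_oneapi.so` are assumptions about how the plugins are named inside the package.

```shell
#!/bin/sh
# Report which GPU plugins appear in a package file listing.
# Reads `dpkg -c` output on stdin.
# ASSUMPTION: the plugins are shipped as gpu_nvml.so, gpu_rsmi.so, gpu_oneapi.so.
list_gpu_plugins() {
    listing=$(cat)
    for plugin in gpu_nvml gpu_rsmi gpu_oneapi; do
        # Match the plugin file name at the end of a listing line.
        if printf '%s\n' "$listing" | grep -q "/${plugin}\.so\$"; then
            echo "$plugin: present"
        else
            echo "$plugin: missing"
        fi
    done
}

# Usage (package file name taken from the command above):
#   dpkg -c ./slurm-smd_23.11.0-0rc2_amd64.deb | list_gpu_plugins
```

A plugin reported as missing would suggest that the corresponding vendor libraries were not found on the build system.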

Let me know if this is not the case for you.
Thanks,
--Tim
Comment 6 Tim McMullan 2023-11-22 07:39:00 MST
Hi, I just wanted to check in and see if you were able to confirm that the nvml plugin is indeed included in the generic slurm-smd package.

Thanks,
--Tim
Comment 7 Markus Kötter 2023-11-22 08:00:38 MST
(In reply to Tim McMullan from comment #6)
> to confirm that the
> nvml plugin is indeed included in the generic slurm-smd package.


Yes, it is. 
As the nvml plugin is built without linking and relies on dlopen, I assume building it by default, including the build dependency, is fine, as it does not require a specific library version/dependency at runtime.
It just fails to load and cannot be used when the library is not installed, but it does not create a dependency in dpkg.
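
One way to check the linking assumption on a built plugin is to look at its ELF dynamic section. A minimal sketch, assuming `readelf` from binutils is available; the helper name `nvml_link_mode` is hypothetical and the plugin path in the usage line is an assumption that varies by build.

```shell
#!/bin/sh
# Report whether a plugin declares libnvidia-ml as a link-time dependency.
# Reads `readelf -d` output on stdin.
nvml_link_mode() {
    if grep -q 'NEEDED.*libnvidia-ml'; then
        echo "linked: libnvidia-ml is a NEEDED entry"
    else
        echo "not linked: libnvidia-ml would have to be loaded at runtime (e.g. dlopen)"
    fi
}

# Usage (ASSUMPTION: plugin path, adjust to where the package installs it):
#   readelf -d /usr/lib/x86_64-linux-gnu/slurm/gpu_nvml.so | nvml_link_mode
```

If the library shows up as a NEEDED entry, dpkg-shlibdeps would normally turn it into a package dependency; if it does not, the dlopen reading above holds for that build.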
Comment 10 Tim McMullan 2023-12-01 06:45:47 MST
Hey, I just wanted to give you an update on this.

We are discussing internally what adding these kinds of packages would look like (nvml, rsmi, and oneapi), but we haven't yet found an elegant solution that doesn't require you to have all three set up on your system to build the packages.  The package build is very much all-or-nothing, and we seem to need a pre-build script that alters the rules and control files to make this happen the way we would like it to.

(In reply to Markus Kötter from comment #7)
> As the nvml plugin is built without linking and relies on dlopen, I assume
> building it by default, including the build dependency, is fine, as it does
> not require a specific library version/dependency at runtime.

Unfortunately, there are some version requirements, as the CUDA API breaks compatibility on occasion (CUDA 11 and 12 don't seem to mix well). This complicates things somewhat, since a generic control file would require a list of about a dozen acceptable packages.

We are continuing to look into ways to make this a little cleaner, though I'll also note that the slurm-smd packages are not meant to perfectly mirror what Debian has done with the slurm-wlm packages.

(In reply to Markus Kötter from comment #7)
> It just fails to load when used without the library installed and can not be used, but does not create a dependency in dpkg.
Is the problem here that some nodes have NVIDIA cards and some don't, and since you are using the same set of packages everywhere, slurmd doesn't start on the nodes without them?

Thanks!
--Tim
Comment 11 Markus Kötter 2023-12-03 14:16:49 MST
(In reply to Tim McMullan from comment #10)
> We are discussing internally what adding these kinds of packages would look
> like (nvml, rsmi, and oneapi), but we haven't yet found an elegant solution
> that doesn't require you to have all three set up on your system to build
> the packages. 

I guess it is safe to say the package is built inside a container anyway, most likely using gbp or dpkg-buildpackage in a CI pipeline.

So I would not even care about build-time requirements.
I'd build everything and split it into per-module packages by requirement.
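
As a rough illustration of such a CI step, a minimal sketch: it assumes a Debian-based container, that `libnvidia-ml-dev` is installable from the configured repositories, and that it runs from the top of the Slurm source tree. The `run` and `build_slurm_debs` helpers and the `DRY_RUN` variable are hypothetical names for this example only.

```shell
#!/bin/sh
# Sketch of a container build step. Set DRY_RUN=1 to print the commands only.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_slurm_debs() {
    # ASSUMPTION: libnvidia-ml-dev (the dependency named in the description)
    # is available from the container's apt sources.
    run apt-get install -y devscripts equivs libnvidia-ml-dev &&
    # Install the remaining build dependencies listed in debian/control.
    run mk-build-deps -i debian/control &&
    # Build unsigned binary packages.
    run dpkg-buildpackage -us -uc -b
}

# Usage, from the top of the source tree:
#   build_slurm_debs
```

With the build dependency always present in the container image, the build itself no longer depends on what the developer's own machine has installed.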
Comment 13 Tim McMullan 2024-07-31 07:07:44 MDT
Hi Markus,

I just wanted to let you know we are still working on this.  I've got an idea for the implementation that I'm working on here.

Thanks!
--Tim