The NVIDIA nvml plugin is not packaged in the current state of the Debian packaging. In Debian's own package, the nvml plugin is in the contrib branch of the repo. It needs the libnvidia-ml-dev dependency (https://salsa.debian.org/hpc-team/slurm-wlm/-/blob/contrib/debian/control#L37) and the package definitions (https://salsa.debian.org/hpc-team/slurm-wlm/-/blob/contrib/debian/control#L43-71). I'd appreciate it if this were added.
Forgot to mention the install files: https://salsa.debian.org/hpc-team/slurm-wlm/-/blob/contrib/debian/slurm-wlm-nvml-plugin-dev.install and https://salsa.debian.org/hpc-team/slurm-wlm/-/blob/contrib/debian/slurm-wlm-nvml-plugin.install
Markus,
As this is a feature request (for a future version), I don't think we can consider it Severity 1 per our commercial support definitions[1]:
> Severity 1 - Major Impact
>
> There is a continued system outage that affects a large number of end users. The system is down or unusable due to Slurm problem(s) and no procedural workaround exists.
cheers,
Marcin
[1] https://www.schedmd.com/support.php
Markus,
While there is no slurm-smd-nvml-plugin package, the nvml plugin will be included in the slurm-smd package if the CUDA libraries are installed on the build system. The same is true of the AMD and Intel GPU plugins. It's set up this way to allow a little more flexibility in which version of CUDA/ROCm/oneAPI is built into the packages, since it is fairly common to install those from external sources.
> dpkg -c ./slurm-smd_23.11.0-0rc2_amd64.deb | grep nvml
should show whether or not the built package includes the nvml plugin. Let me know if this is not the case for you.
Thanks,
--Tim
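The check above can also be scripted, for example in a CI job that builds the packages. This is only a minimal sketch: `find_plugin_entries` and `list_deb` are hypothetical helper names, and the parsing assumes the usual tar-style `dpkg -c` listing where the path is the sixth whitespace-separated field. Like the grep, it matches any path containing `nvml`.

```python
import subprocess


def find_plugin_entries(listing: str, needle: str = "nvml") -> list[str]:
    """Return the paths from `dpkg -c` output that contain `needle`."""
    matches = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Assumed layout: mode owner/group size date time path [-> target]
        path = fields[5] if len(fields) >= 6 else fields[-1]
        if needle in path:
            matches.append(path)
    return matches


def list_deb(deb_path: str) -> str:
    """Run `dpkg -c` on a package file and return its listing as text."""
    result = subprocess.run(
        ["dpkg", "-c", deb_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `find_plugin_entries(list_deb("./slurm-smd_23.11.0-0rc2_amd64.deb"))` returns an empty list when the package was built without the nvml plugin.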
Hi, I just wanted to check in and see if you were able to confirm that the nvml plugin is indeed included in the generic slurm-smd package. Thanks, --Tim
(In reply to Tim McMullan from comment #6)
> to confirm that the nvml plugin is indeed included in the generic slurm-smd package.
Yes, it is.
As the nvml plugin is built without linking against the library and relies on dlopen, I assume building it by default, including the build dependency, is fine, as it does not require a specific library version/dependency at runtime. Without the library installed, the plugin just fails to load and cannot be used, but it does not create a dependency in dpkg.
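To illustrate the dlopen behaviour described above: a library opened at runtime is only needed at the moment it is loaded, so a missing library shows up as a load failure and not as a package dependency. The following is a minimal sketch using Python's ctypes (which goes through dlopen on Linux). It is not Slurm's code; `try_load` is a hypothetical helper, and the soname `libnvidia-ml.so.1` is an assumption.

```python
import ctypes
from typing import Optional


def try_load(soname: str) -> Optional[ctypes.CDLL]:
    """Try to open a shared library at runtime; return None if unavailable."""
    try:
        return ctypes.CDLL(soname)
    except OSError:
        # Raised when the dynamic loader cannot find or open the library.
        return None


# Assumed soname of the NVIDIA management library; adjust for your system.
nvml = try_load("libnvidia-ml.so.1")
if nvml is None:
    print("NVML library not installed; GPU autodetection would be unavailable")
else:
    print("NVML library loaded at runtime")
```

Nothing here records a dependency at build or install time, which is why dpkg does not see one.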
Hey,
I just wanted to give you an update on this. We are discussing internally what adding these kinds of packages would look like (nvml, rsmi, and oneapi), but we haven't yet found an elegant solution that doesn't require you to have all three set up on your system to build the packages. The package build is very much all or nothing, and we seem to need a pre-build script that alters the rules and control files to make this happen the way we would like.
(In reply to Markus Kötter from comment #7)
> As the nvml is build without linking and relies on dlopen, I assume building it by default, including the build-dependency is fine, as it does not require a specific library version/dependency at runtime.
Unfortunately, there are some version requirements, as the CUDA API breaks compatibility on occasion (CUDA 11 and 12 don't seem to mix well). This complicates things somewhat, since a generic control file would require a list of about a dozen acceptable packages.
We are continuing to look into ways to make this a little cleaner, though I'll also note that the slurm-smd packages are not meant to perfectly mirror what Debian has done with the slurm-wlm packages.
(In reply to Markus Kötter from comment #7)
> It just fails to load when used without the library installed and can not be used, but does not create a dependency in dpkg.
Is the problem here that some nodes have NVIDIA cards and some don't, and because you are using the same set of packages everywhere, slurmd doesn't start on the nodes without them?
Thanks!
--Tim
(In reply to Tim McMullan from comment #10)
> We are discussing internally what adding these kinds of packages would look like (nvml, rsmi, and oneapi), but we haven't yet found an elegant solution that doesn't require you to have all three set up on your system to build the packages.
I guess it is safe to say the package is built inside a container anyway, most likely using gbp or dpkg-buildpackage in a CI pipeline, so I would not even care about build-time requirements. I'd build all of them and split them into module packages by requirement.
Hi Markus, I just wanted to let you know we are still working on this. I've got an idea for the implementation that I'm working on here. Thanks! --Tim