Summary: | Force nvml configuration even when nvml is not available at compile time | ||
---|---|---|---|
Product: | Slurm | Reporter: | Gennaro Oliva <gennaro.oliva> |
Component: | Build System and Packaging | Assignee: | Tim Wickberg <tim> |
Status: | OPEN --- | QA Contact: | |
Severity: | C - Contributions | ||
Priority: | --- | ||
Version: | 23.02.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | -Other- | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Attachments: |
Force nvml configuration without autodetection
slurmd and slurmctld log files when the plugin is present or missing |
Hey Gennaro -

The one issue I have with removing this is that, if AutoDetect=nvml was set, but the extra package you're splitting the gpu_nvml.so off into hasn't been installed, we'd end up blowing up in a slightly weird spot during the plugin init. That's what these conditionals are trying to protect against... under the assumption that everything came from a single build, not the split-builds you're looking at handling due to the licensing issue.

I'd suggest a stat() get put in to check against the gpu_nvml.so's existence if any change is going to be made here. We'd also want the equivalent changes for oneapi/rsmi for consistency, even though I understand packaging for those is likely not as frequent a request for you at this point.

thanks,
- Tim

Hi Tim,

thank you very much for your comments.

(In reply to Tim Wickberg from comment #2)
> The one issue I have with removing this is that, if AutoDetect=nvml was set,
> but the extra package you're splitting the gpu_nvml.so off into hasn't been
> installed, we'd end up blowing up in a slightly weird spot during the plugin
> init. That's what these conditionals are trying to protect against... under
> the assumption that everything came from a single build, not the
> split-builds you're looking at handling due to the licensing issue.

As far as I understand from looking at the code, AutoDetect=nvml only retrieves information from the GPU in order to check it against what is provided in the configuration file. When the plugin is not available, Slurm just fails to create the plugin context. I don't see a problem as long as the user gets notified, but I'm surely missing something.

> I'd suggest a stat() get put in to check against the gpu_nvml.so's existence
> if any change is going to be made here.

Where do you suggest putting the check for the presence of the plugin? Changes can be made anywhere in the code; we don't necessarily have to change only this file.

> We'd also want the equivalent
> changes for oneapi/rsmi for consistency, even though I understand packaging
> for those is likely not as frequent a request for you at this point.

RSMI can be included in the main release: it is free software.

I'm attaching debug3 output for slurmctld and slurmd when the plugin is present or missing.

I really appreciate the time you spend on this issue.
Thank you

Created attachment 28697 [details]
slurmd and slurmctld log files when the plugin is present or missing
|
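For readers following the stat() suggestion in the exchange above, here is a minimal sketch of what such a check might look like. It is not taken from the attached patch or from src/common/gpu.c: the helper names plugin_file_exists() and select_gpu_plugin(), the plugin_dir parameter, the log message, and the fallback to "gpu/generic" are all assumptions made for illustration. The idea is only to confirm that gpu_nvml.so is present before selecting the nvml plugin, so that a missing split-off package produces a clear message instead of a failure later during plugin init.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not part of Slurm): returns true if
 * <plugin_dir>/<plugin_file> exists and is a regular file.
 * stat() follows symlinks, so a symlinked .so also counts.
 */
static bool plugin_file_exists(const char *plugin_dir, const char *plugin_file)
{
	char path[4096];
	struct stat st;
	int n = snprintf(path, sizeof(path), "%s/%s", plugin_dir, plugin_file);

	/* Treat an encoding error or a truncated path as "not found". */
	if (n < 0 || (size_t) n >= sizeof(path))
		return false;

	return (stat(path, &st) == 0) && S_ISREG(st.st_mode);
}

/*
 * Hypothetical selector (not part of Slurm): choose "gpu/nvml" only
 * when autodetection was requested AND the plugin file is installed;
 * otherwise report the problem and fall back to "gpu/generic".
 */
static const char *select_gpu_plugin(bool autodetect_nvml,
				     const char *plugin_dir)
{
	if (autodetect_nvml) {
		if (plugin_file_exists(plugin_dir, "gpu_nvml.so"))
			return "gpu/nvml";
		fprintf(stderr,
			"AutoDetect=nvml requested, but gpu_nvml.so was not found in %s\n",
			plugin_dir);
	}
	return "gpu/generic";
}
```

In a split-package setup, a check of this kind would run against the configured plugin directory at the point where the GPU plugin type is chosen. Whether it belongs in gpu.c or elsewhere is exactly the open question raised in the reply above.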
Created attachment 28650 [details]
Force nvml configuration without autodetection

Hi there,

I have to build Slurm in two separate environments for licensing reasons; the only difference between the two is the availability of libnvml. The plan is to build the main Slurm in the "free" environment and the gpu_nvml.so plugin in the "non-free" environment, and then allow the free build to use the nvml plugin. To this end, I want to remove the HAVE_NVML clause in src/common/gpu.c, as in the attached patch.

Do you see any issues?