Ticket 19066

Summary: Slurm loaded the nvidia_uvm kernel module, which made it impossible to upgrade NVIDIA GPU drivers online
Product: Slurm Reporter: andi cao <caoshiwei>
Component: Limits Assignee: Jacob Jenson <jacob>
Status: OPEN --- QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: lyeager
Version: 23.11.3   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description andi cao 2024-02-21 23:33:14 MST
After upgrading to Slurm 23.11.3, I tried to upgrade the GPU driver on a compute node online, and the upgrade failed with a message that the nvidia_uvm module was in use. After investigating, I found that the slurmd service had loaded the kernel module. Could Slurm be improved so that it does not keep the nvidia_uvm module loaded for long periods, and instead loads it only when necessary?

Thank you.
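
For anyone reproducing this, one way to confirm that nvidia_uvm is still held is to read its reference count from /proc/modules. The following is a minimal Python sketch and not part of Slurm. The helper name module_use_count is hypothetical, and the sketch assumes the usual /proc/modules layout, where each line starts with the module name, its size, and its reference count.

```python
from typing import Optional


def module_use_count(modules_text: str, name: str = "nvidia_uvm") -> Optional[int]:
    """Return the reference count of `name` from /proc/modules-style text.

    Returns None when the module does not appear in the text (not loaded).
    Hypothetical helper for illustration only.
    """
    for line in modules_text.splitlines():
        fields = line.split()
        # Assumed layout: <name> <size> <refcount> <dependents> <state> <address>
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2])
    return None
```

On a compute node this could be called as module_use_count(open("/proc/modules").read()). A count above zero would be consistent with the "module is in use" error seen during the driver upgrade, although it does not by itself show which process holds the module.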