Ticket 25180 - Node configuration requires GPU count even with autodetect plugins
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 25.11.5
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
Reported: 2026-05-07 03:25 MDT by VUB HPC
Modified: 2026-05-22 07:22 MDT
Site: VUB


Description VUB HPC 2026-05-07 03:25:16 MDT
We have nodes with 8 GPU devices. Whenever any of those GPUs becomes faulty or fully broken, we want to keep the node online with the remaining working GPUs.

We are using a similar approach to what is discussed in tickets #22871 and #22384:

* we manually disable/drain the GPU with nvidia-smi
* update the 'Gres' attribute in the node specification in "slurm.conf" to match the new device count
* no changes needed to "gres.conf" because we use the NVML autodetection
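
The "slurm.conf" step above can be scripted. The following is a minimal sketch, not part of our actual tooling: it assumes the node line carries a single `gpu` entry in its `Gres=` field, optionally with a type such as `gpu:a100:8`. The function name `set_gpu_count` and the node name `gpu01` are hypothetical.

```python
import re

# Matches the gpu entry inside a Gres= field, keeping everything up to the
# count as "prefix". Handles an optional GPU type (gpu:a100:8) and other
# GRES entries listed before the gpu one (Gres=mps:800,gpu:8).
_GPU_GRES = re.compile(
    r"(?P<prefix>\bGres=(?:[^\s,]+,)*gpu:(?:[^\s:,]+:)?)(?P<count>\d+)",
    re.IGNORECASE,
)


def set_gpu_count(node_line: str, new_count: int) -> str:
    """Return node_line with the GPU count in its Gres= field replaced."""
    if new_count < 0:
        raise ValueError("GPU count cannot be negative")
    updated, hits = _GPU_GRES.subn(
        lambda m: f"{m.group('prefix')}{new_count}", node_line, count=1
    )
    if hits == 0:
        raise ValueError("no gpu entry found in a Gres= field")
    return updated


if __name__ == "__main__":
    # Hypothetical node definition with one GPU taken out of service.
    line = "NodeName=gpu01 CPUs=64 Gres=gpu:8 State=UNKNOWN"
    print(set_gpu_count(line, 7))
```

This only rewrites the text of the node line; distributing the changed "slurm.conf" and getting the daemons to pick it up is still a separate step.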

This works, but we would much prefer not to edit "slurm.conf" by hand. Since the autodetection in "gres.conf" already reports the number of GPUs, Slurm could take that count at face value and avoid any hardcoded setting in "slurm.conf". Can this be fixed?

Thanks and kind regards,

Alex Domingo
VUB
Comment 2 Scott Hilton 2026-05-13 11:53:48 MDT
Alex,

I don't see a better way to do this than your current workflow without adding the concept of draining or downing a GPU, as you mentioned in ticket 25181.

I will talk to the team about these feature requests.

-Scott
Comment 3 VUB HPC 2026-05-22 07:22:03 MDT
Hi Scott,

> I don't see a better way to do this than your current workflow without adding the concept of draining or downing a GPU, as you mentioned in ticket 25181.

It would already be an improvement if Slurm could read the number of GPUs from the NVML autodetection, instead of us having to manually write that count in "slurm.conf".

That might be faster/easier to implement than a new feature to drain/down single GPU devices.
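
To illustrate the kind of count we mean, here is a minimal site-side sketch that derives it from the output of `nvidia-smi -L`. It says nothing about how Slurm's NVML plugin works internally. It assumes each usable GPU appears as one line starting with `GPU <index>:`, and whether a drained or failed device drops out of that listing is also an assumption that would need checking on the node. The function names are hypothetical.

```python
import re
import subprocess

# One line per physical GPU in `nvidia-smi -L` output, e.g.
# "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-...)". Indented MIG device
# lines do not start with "GPU" and are therefore not counted.
_GPU_LINE = re.compile(r"^GPU \d+:", re.MULTILINE)


def count_gpus(listing: str) -> int:
    """Count GPU entries in the text printed by `nvidia-smi -L`."""
    return len(_GPU_LINE.findall(listing))


def visible_gpu_count() -> int:
    """Run `nvidia-smi -L` on the local node and count the listed GPUs."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], check=True, capture_output=True, text=True
    )
    return count_gpus(result.stdout)


if __name__ == "__main__":
    # Parse a canned listing so the sketch also runs on a machine without GPUs.
    sample = (
        "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aaaa)\n"
        "GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-bbbb)\n"
    )
    print(count_gpus(sample))
```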

Thanks for forwarding this to the team!

Alex Domingo
VUB-HPC