Ticket 25180

Summary:	Node configuration requires GPU count even with autodetect plugins
Product:	Slurm	Reporter:	VUB HPC <hpcadmin>
Component:	GPU	Assignee:	Scott Hilton <scott>
Status:	OPEN ---	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	brian.gregory
Version:	25.11.5
Hardware:	Linux
OS:	Linux
Site:	VUB	Slinky Site:	---

Description VUB HPC 2026-05-07 03:25:16 MDT

We have nodes with 8 GPU devices. Whenever any of those GPUs becomes faulty or fully broken, we want to keep the node online with the remaining working GPUs.

We are using a similar approach to what is discussed in tickets #22871 and #22384:

* we manually disable/drain the GPU with nvidia-smi
* update the 'Gres' attribute in the node specification in "slurm.conf" to match the new device count
* no changes needed to "gres.conf" because we use the NVML autodetection
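
For illustration, the "slurm.conf" step above could be scripted roughly as follows. This is only a sketch: `set_gpu_count` is a hypothetical helper (not part of Slurm), and it assumes the node line carries a plain `Gres=gpu:<count>` or `Gres=gpu:<type>:<count>` attribute.

```python
import re

# Hypothetical helper, not a Slurm tool. Assumes the GPU entry is the first
# item of the Gres= attribute, e.g. "Gres=gpu:8" or "Gres=gpu:a100:8".
_GRES_GPU = re.compile(r"(\bGres=gpu(?::[A-Za-z]\w*)?:)\d+", re.IGNORECASE)


def set_gpu_count(line: str, count: int) -> str:
    """Return the slurm.conf node line with its GPU count replaced by `count`."""
    if count < 0:
        raise ValueError("GPU count must be non-negative")
    # A callable replacement avoids ambiguity between the group reference
    # and the digits of the new count.
    new_line, hits = _GRES_GPU.subn(lambda m: f"{m.group(1)}{count}", line, count=1)
    if hits == 0:
        raise ValueError("no Gres=gpu:<count> attribute found in line")
    return new_line


if __name__ == "__main__":
    # Example node line; the node name and CPU count are made up.
    print(set_gpu_count("NodeName=node01 CPUs=64 Gres=gpu:8 State=UNKNOWN", 7))
```

The new configuration still has to be distributed to the nodes and reloaded afterwards; the sketch only covers the text edit.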

This works, but we would very much like to avoid changing "slurm.conf" by hand. Since we already have autodetection in "gres.conf" and it returns the number of GPUs, Slurm could take that GPU count at face value, avoiding any hardcoded setting in "slurm.conf". Can this be fixed?

Thanks and kind regards,

Alex Domingo
VUB

Comment 2 Scott Hilton 2026-05-13 11:53:48 MDT

Alex,

I don't see a better way to do this than your current workflow without adding the concept of draining or downing a GPU, as you mentioned in ticket 25181.

I will talk to the team about these feature requests.

-Scott

Comment 3 VUB HPC 2026-05-22 07:22:03 MDT

Hi Scott,

> I don't see a better way to do this than your current workflow without adding the concept of draining or downing a GPU, as you mentioned in ticket 25181.

It would already be an improvement if Slurm could read the number of GPUs from the NVML autodetection, instead of us having to manually write that count in "slurm.conf".

That might be faster/easier to implement than a new feature to drain/down single GPU devices.
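
As a site-side stopgap until something like this exists, the count could be derived on the node from the `nvidia-smi -L` listing. A minimal sketch, assuming a disabled GPU no longer appears in that listing; `count_gpus` is a hypothetical helper and the sample text below is made up:

```python
import re

# One physical GPU per line such as:
#   GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
# Indented sub-device lines (e.g. MIG instances) do not match and are skipped.
_GPU_LINE = re.compile(r"^GPU \d+: .*\(UUID: GPU-[0-9a-fA-F-]+\)\s*$")


def count_gpus(listing: str) -> int:
    """Count the GPUs reported in the text output of `nvidia-smi -L`."""
    return sum(1 for line in listing.splitlines() if _GPU_LINE.match(line))


if __name__ == "__main__":
    # Made-up sample output for two devices.
    sample = (
        "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-11111111-2222-3333-4444-555555555555)\n"
        "GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-66666666-7777-8888-9999-aaaaaaaaaaaa)\n"
    )
    print(count_gpus(sample))
```

On a real node the listing would come from running `nvidia-smi -L` and capturing its standard output; the resulting number is what we currently copy into "slurm.conf" by hand.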

Thanks for forwarding this to the team!

Alex Domingo
VUB-HPC