Ticket 25181

Summary:	Option to avoid jobs starting on selected GPU devices
Product:	Slurm	Reporter:	VUB HPC <hpcadmin>
Component:	GPU	Assignee:	Tim Wickberg <tim>
Status:	OPEN ---	QA Contact:
Severity:	5 - Enhancement
Priority:	---	CC:	brian.gregory, djacobsen, kilian, scott
Version:	25.11.5
Hardware:	Linux
OS:	Linux
Site:	VUB	Slinky Site:	---

Description VUB HPC 2026-05-07 03:41:24 MDT

This feature request follows up on the discussion in tickets #22871 and #22384.

We would like to be able to flag a specific GPU device in a node as "down" so that jobs do not start on that device, but the node continues to accept jobs on the other GPUs.

Such a feature would improve the handling of faulty GPUs on nodes with many devices, since we would no longer have to drain the whole node while we wait for replacement parts.

Specifically, we imagine such a feature working with the existing support for constraining access to GRES devices with cgroups. Ideally, with "ConstrainDevices" active, there would be no need to change "gres.conf" or "slurm.conf" to put a GPU in a "down" state; running some (new) scontrol command would be enough.

Is this something that could be implemented in the near future? Or is another way of achieving a similar result already planned?

Thanks and kind regards,

Alex Domingo
VUB

Comment 1 Tim Wickberg 2026-05-13 14:20:43 MDT

Hey Alex -

We're discussing something of this nature for the 26.11 release, but can't promise a specific implementation just yet. The broad idea would be to add independent status tracking for each GRES device - you'd then have options to mark devices down individually while keeping them in the configuration.

Right now the only way to readily achieve this is to alter the gres.conf/slurm.conf definition for the node, which I know isn't ideal.
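
As a rough sketch of that workaround, assuming a hypothetical node "gpu01" with four GPUs defined by explicit device files (no AutoDetect) and /dev/nvidia2 as the faulty device, the change might look something like this. The node name, device paths and counts are illustrative only and need adapting to the actual site configuration:

```
# slurm.conf (hypothetical node; was Gres=gpu:4)
NodeName=gpu01 Gres=gpu:3   # remaining node parameters unchanged

# gres.conf (was: NodeName=gpu01 Name=gpu File=/dev/nvidia[0-3])
# /dev/nvidia2 is assumed to be the faulty device and is left out
NodeName=gpu01 Name=gpu File=/dev/nvidia[0-1]
NodeName=gpu01 Name=gpu File=/dev/nvidia3
```

The daemons then need to pick up the new definition in whatever way the installed Slurm version requires for a GRES change, and the edit has to be reverted once the GPU is replaced.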

- Tim

Comment 2 VUB HPC 2026-05-22 07:27:25 MDT

Hi Tim,

Happy to hear that there are plans in motion for this feature. Targeting the 26.11 release would suit us as well.

Thanks for the update.

Alex Domingo
VUB-HPC