Ticket 25181

Summary: Option to avoid jobs starting on selected GPU devices
Product: Slurm
Reporter: VUB HPC <hpcadmin>
Component: GPU
Assignee: Tim Wickberg <tim>
Status: OPEN ---
QA Contact:
Severity: 5 - Enhancement
Priority: ---
CC: brian.gregory, djacobsen, kilian, scott
Version: 25.11.5
Hardware: Linux
OS: Linux
Site: VUB
Slinky Site: ---

Description VUB HPC 2026-05-07 03:41:24 MDT
This feature request follows up on the discussion in tickets #22871 and #22384.

We would like to be able to flag a specific GPU device in a node as "down" so that jobs do not start on that device, but the node continues to accept jobs on the other GPUs.

Such a feature would improve the handling of faulty GPUs on nodes with many devices, since we would no longer have to drain the full node while we wait for replacement parts.

Specifically, we imagine such a feature working with the existing support for constraining access to GRES devices with cgroups. Ideally, with "ConstrainDevices" active, putting a GPU in a "down" state would require no changes to "gres.conf" or "slurm.conf", only the execution of some (new) scontrol command.

Is this something that could be implemented in the near future? Or is another way of achieving a similar result already planned?

Thanks and kind regards,

Alex Domingo
VUB

Comment 1 Tim Wickberg 2026-05-13 14:20:43 MDT
Hey Alex -

We're discussing something of this nature for the 26.11 release, but can't promise a specific implementation just yet. The broad idea would be to add independent status tracking for each GRES device, so you'd have the option to mark devices down individually while keeping them in the configuration.

Right now the only way to readily achieve this is to alter the gres.conf/slurm.conf definition for the node, which I know isn't ideal.
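
As a rough illustration of that workaround, the fragment below sketches how a node definition might be edited to hide one faulty GPU. The node name "gpunode01", the count of four GPUs, the /dev/nvidia* device paths, and the choice of device 1 as the faulty one are all made up for the example; other node parameters are omitted, and a site using AutoDetect would need a different approach.

```conf
# slurm.conf (hypothetical node with 4 GPUs, device 1 faulty)
# before: NodeName=gpunode01 Gres=gpu:4
NodeName=gpunode01 Gres=gpu:3

# gres.conf
# before: NodeName=gpunode01 Name=gpu File=/dev/nvidia[0-3]
# after: list only the healthy device files, skipping /dev/nvidia1
NodeName=gpunode01 Name=gpu File=/dev/nvidia0
NodeName=gpunode01 Name=gpu File=/dev/nvidia[2-3]
```

The changed GRES count has to be picked up by slurmctld and slurmd before it takes effect, and the edit has to be reverted once the device is replaced.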

- Tim

Comment 2 VUB HPC 2026-05-22 07:27:25 MDT
Hi Tim,

Happy to hear that there are plans in motion for this feature. Targeting the 26.11 release would be convenient for us as well.

Thanks for the update.

Alex Domingo
VUB-HPC