Ticket 25181 - Option to avoid jobs starting on selected GPU devices
Summary: Option to avoid jobs starting on selected GPU devices
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 25.11.5
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Tim Wickberg
Reported: 2026-05-07 03:41 MDT by VUB HPC
Modified: 2026-05-22 07:27 MDT
CC List: 4 users

Site: VUB


Description VUB HPC 2026-05-07 03:41:24 MDT
This feature request follows up on the discussion in tickets #22871 and #22384.

We would like to be able to flag a specific GPU device in a node as "down" so that jobs do not start on that device, but the node continues to accept jobs on the other GPUs.

Such a feature would make it easier to handle faulty GPUs on nodes with many devices, because we would no longer have to drain the full node while we wait for replacement parts.

Specifically, we imagine such a feature working with the existing support for constraining access to GRES devices with cgroups. Ideally, with "ConstrainDevices" active, putting a GPU in a "down" state would require no changes to "gres.conf" or "slurm.conf", only the execution of some (new) scontrol command.

Is this something that could be implemented in the near future? Or is another way of achieving a similar result already planned?

Thanks and kind regards,

Alex Domingo
VUB
Comment 1 Tim Wickberg 2026-05-13 14:20:43 MDT
Hey Alex -

We're discussing something of this nature for the 26.11 release, but can't promise a specific implementation just yet. The broad idea would be to add independent status tracking for each GRES device - you'd then have the option to mark devices down individually while keeping them in the configuration.

Right now the only way to readily achieve this is to alter the gres.conf/slurm.conf definition for the node, which I know isn't ideal.

- Tim
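
For reference, the gres.conf/slurm.conf workaround mentioned in Comment 1 could look roughly like the sketch below. It only generates the two edited node definitions with the faulty device left out. The node name (node01), GPU type (a100), device path prefix (/dev/nvidia) and GPU count are illustrative assumptions, not taken from this ticket, and the sketch assumes devices are listed explicitly with File= rather than discovered through AutoDetect. The slurm.conf line is a fragment: a real NodeName entry carries further parameters.

```python
def compress_ranges(indices):
    """Collapse integers into a bracket expression such as '0-1,3'."""
    parts = []
    start = prev = None
    for i in sorted(set(indices)):
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = i
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def workaround_lines(node, gpu_type, total, faulty):
    """Return (gres.conf line, slurm.conf fragment) with faulty GPUs left out.

    All arguments are hypothetical example values; adapt them to the site.
    """
    faulty = set(faulty)
    healthy = [i for i in range(total) if i not in faulty]
    if not healthy:
        raise ValueError("no healthy GPUs left; drain the node instead")
    spec = compress_ranges(healthy)
    # A single remaining device needs no bracket expression.
    if len(healthy) == 1:
        file_spec = f"/dev/nvidia{spec}"
    else:
        file_spec = f"/dev/nvidia[{spec}]"
    gres_conf = f"NodeName={node} Name=gpu Type={gpu_type} File={file_spec}"
    slurm_conf = f"NodeName={node} Gres=gpu:{gpu_type}:{len(healthy)}"
    return gres_conf, slurm_conf


if __name__ == "__main__":
    # Example: GPU 2 of 4 is faulty on the assumed node "node01".
    for line in workaround_lines("node01", "a100", 4, [2]):
        print(line)
```

For a 4-GPU node with device 2 faulty, this prints a gres.conf line ending in File=/dev/nvidia[0-1,3] and a slurm.conf fragment with Gres=gpu:a100:3. The changed definitions still have to be deployed and picked up by slurmctld and slurmd, which is the inconvenience the requested per-device "down" state would avoid.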
Comment 2 VUB HPC 2026-05-22 07:27:25 MDT
Hi Tim,

Happy to hear that there are plans in motion for this feature. Targeting the 26.11 release would suit us as well.

Thanks for the update.

Alex Domingo
VUB-HPC