This feature request follows the discussion in tickets #22871 and #22384. We would like to be able to flag a specific GPU device in a node as "down" so that jobs do not start on that device while the node continues to accept jobs on its other GPUs. Such a feature would make it easier to handle faulty GPUs on boxes with many devices, because we would not have to drain the full node while we wait for replacement parts. Specifically, we imagine such a feature working with the existing support for constraining access to GRES devices with cgroups. Ideally, with "ConstrainDevices" active, there would be no need to change "gres.conf" or "slurm.conf" to put a GPU in a "down" state; running some (new) scontrol command would be enough. Is this something that could be implemented in the near future? Or is another way to achieve a similar result already planned? Thanks and kind regards, Alex Domingo VUB
Hey Alex - We're discussing something of this nature for the 26.11 release, but can't promise a specific implementation just yet. The broad idea would be to add independent status tracking for each GRES device, so you could mark devices down individually while keeping them in the configuration. Right now the only ready way to achieve this is to alter the gres.conf/slurm.conf definition for the node, which I know isn't ideal. - Tim
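Illustrative note: the per-device status tracking idea described above can be sketched as a small Python model. This is only a sketch under stated assumptions, not Slurm code and not the planned implementation; the names `DeviceState`, `Node`, `set_state`, and `allocate` are hypothetical and do not correspond to any existing Slurm API or scontrol command.

```python
from dataclasses import dataclass, field
from enum import Enum


class DeviceState(Enum):
    """Hypothetical per-device states; not Slurm's actual state names."""
    UP = "up"
    DOWN = "down"


@dataclass
class Node:
    """Hypothetical node model in which every GPU index has its own state."""
    name: str
    gpu_count: int
    states: dict = field(default_factory=dict)
    allocated: set = field(default_factory=set)

    def __post_init__(self):
        # Every configured device starts as UP and stays in the configuration.
        self.states = {i: DeviceState.UP for i in range(self.gpu_count)}

    def set_state(self, index, state):
        """Mark a single device up or down without touching the others."""
        if index not in self.states:
            raise ValueError(f"node {self.name} has no GPU {index}")
        self.states[index] = state

    def allocate(self, count):
        """Return `count` free UP devices, or None if not enough are available."""
        free = [
            i for i in sorted(self.states)
            if self.states[i] is DeviceState.UP and i not in self.allocated
        ]
        if len(free) < count:
            return None
        chosen = free[:count]
        self.allocated.update(chosen)
        return chosen
```

In this model, marking GPU 2 down on a four-GPU node with `node.set_state(2, DeviceState.DOWN)` leaves the node able to serve a three-GPU request from devices 0, 1, and 3, and the device becomes allocatable again once it is set back to `DeviceState.UP`.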
Hi Tim, Happy to hear that there are plans in motion for this feature. A target of the 26.11 release would be convenient for us as well. Thanks for the update. Alex Domingo VUB-HPC