We have nodes with 8 GPU devices. Whenever any of those GPUs becomes faulty or fully broken, we want to keep the node online with the remaining working GPUs.

We are using a similar approach to what is discussed in tickets #22871 and #22384:

* we manually disable/drain the GPU with nvidia-smi
* update the 'Gres' attribute in the node specification in "slurm.conf" to match the new device count
* no changes needed to "gres.conf" because we use the NVML autodetection

This works, but we would very much like to not have to manually change "slurm.conf". Since we already have autodetection in "gres.conf" and that returns the number of GPUs, Slurm could take that GPU count at face value avoiding any hardcoded setting in "slurm.conf".

Can this be fixed?

Thanks and kind regards,
Alex Domingo
VUB
Alex,

I don't see a better way to do this than your current workflow without adding the concept of draining or downing a GPU, as you mentioned in ticket 25181. I will talk to the team about these feature requests.

-Scott
Hi Scott,

> I don't see a better way to do this than your current workflow without adding the concept of draining or downing a GPU, as you mentioned in ticket 25181.

It would already be an improvement if Slurm could read the number of GPUs from the NVML autodetection, instead of us having to manually write that count down in "slurm.conf". That might be faster/easier to implement than a new feature to drain/down single GPU devices.

Thanks for forwarding this to the team!

Alex Domingo
VUB-HPC
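For illustration only, the manual "slurm.conf" step in the workflow above could be scripted along these lines. This is a minimal sketch, not something taken from the ticket: the function name set_gpu_count, the node names, and the exact layout of the NodeName lines (one node per line, a single Gres=gpu:<count> or Gres=gpu:<type>:<count> entry) are assumptions. Node ranges such as node[01-08] are not handled.

```python
import re


def set_gpu_count(conf_text: str, node: str, count: int) -> str:
    """Return conf_text with the GPU count in `node`'s Gres entry set to `count`.

    Assumes (hypothetically) one node per NodeName line and a Gres entry
    written as Gres=gpu:<count> or Gres=gpu:<type>:<count>.
    """
    out = []
    for line in conf_text.splitlines(keepends=True):
        # Only touch the NodeName line of the requested node.
        if re.match(rf"\s*NodeName={re.escape(node)}(\s|$)", line):
            # Replace the trailing count, keeping an optional GPU type.
            line = re.sub(
                r"(Gres=gpu(?::[A-Za-z][\w.-]*)?:)\d+",
                rf"\g<1>{count}",
                line,
            )
        out.append(line)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical node definitions, not taken from a real cluster.
    example = (
        "NodeName=node01 CPUs=64 Gres=gpu:8\n"
        "NodeName=node02 CPUs=64 Gres=gpu:a100:8\n"
    )
    print(set_gpu_count(example, "node01", 7), end="")
```

The new count is passed in explicitly here; it would have to come from whatever reports the remaining healthy devices on the node, and Slurm still has to re-read the configuration afterwards. This only automates the current workaround and does not replace the requested feature.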