Ticket 7560

Summary: Enhance support for AMD GPUs and APIs
Product: Slurm
Reporter: Tim Wickberg <tim>
Component: GPU
Assignee: Tim Wickberg <tim>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 5 - Enhancement
Priority: ---
CC: bertsch2, day36, ezellma
Version: 20.02.x
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7714
Site: CRAY
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: Nazare
Coreweave sites: ---
Cray Sites: Cray Internal
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: 20.11
DevPrio: 1 - Paid
Emory-Cloud Sites: ---

Description Tim Wickberg 2019-08-12 22:17:45 MDT
Add support for ROCR_VISIBLE_DEVICES environment variable manipulation, similar to that of CUDA_VISIBLE_DEVICES.
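
For illustration only, here is a rough sketch of what "environment variable manipulation" means in this request: before a task starts, the launcher exports the list of GPU indices the task is allowed to see. The helper name env_with_visible_gpus is hypothetical, and this is not Slurm's actual implementation (Slurm is written in C). It assumes both variables accept a comma-separated list of device indices.

```python
import os


def env_with_visible_gpus(device_ids, base_env=None):
    """Return a copy of the environment restricted to the given GPU indices.

    Hypothetical helper: mirrors the same index list into the NVIDIA and
    AMD variables so either runtime only sees the allocated devices.
    """
    env = dict(os.environ if base_env is None else base_env)
    visible = ",".join(str(i) for i in device_ids)
    env["CUDA_VISIBLE_DEVICES"] = visible   # read by the CUDA runtime
    env["ROCR_VISIBLE_DEVICES"] = visible   # read by the ROCm runtime
    return env


if __name__ == "__main__":
    # Example: a task that was allocated GPUs 0 and 2.
    task_env = env_with_visible_gpus([0, 2])
    print(task_env["ROCR_VISIBLE_DEVICES"])
```

The returned dictionary could be passed as the env argument when spawning the task, so the parent process's own environment is left untouched.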

Add support equivalent to that of the Nvidia NVML / MPS libraries assuming sufficient API availability.
Comment 1 Tim Wickberg 2021-05-24 14:47:26 MDT
Just tidying up. I'm marking this as complete - the gpu/rsmi plugin has been available since the 20.02 release last year and is working as intended.

*** This ticket has been marked as a duplicate of ticket 7714 ***
Comment 2 Tim Wickberg 2022-01-24 10:35:40 MST
Opening this ticket up publicly, and adding a couple of documentation links:

AMD's ROCm SMI library is what the Slurm gpu/rsmi plugin depends on for device info:

https://github.com/RadeonOpenCompute/rocm_smi_lib

The rsmi.h header itself is the best description of the API they've defined:

https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/master/include/rocm_smi/rocm_smi.h
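
As a rough feel for the kind of device query the header describes, the sketch below loads the ROCm SMI shared library through ctypes and asks how many devices it monitors. It uses rsmi_init, rsmi_num_monitor_devices and rsmi_shut_down from rocm_smi.h. The library file name librocm_smi64.so and the helper name count_rsmi_devices are assumptions, and this is not how the gpu/rsmi plugin itself is written.

```python
import ctypes


def count_rsmi_devices(lib_name="librocm_smi64.so"):
    """Return the number of devices ROCm SMI reports, or None if unavailable.

    Hypothetical helper; lib_name is an assumed default and may differ
    between ROCm installations.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None  # library not installed or not on the loader path

    lib.rsmi_init.argtypes = [ctypes.c_uint64]
    lib.rsmi_init.restype = ctypes.c_int
    lib.rsmi_num_monitor_devices.argtypes = [ctypes.POINTER(ctypes.c_uint32)]
    lib.rsmi_num_monitor_devices.restype = ctypes.c_int
    lib.rsmi_shut_down.argtypes = []
    lib.rsmi_shut_down.restype = ctypes.c_int

    # A status of 0 is RSMI_STATUS_SUCCESS.
    if lib.rsmi_init(0) != 0:
        return None
    try:
        count = ctypes.c_uint32(0)
        if lib.rsmi_num_monitor_devices(ctypes.byref(count)) != 0:
            return None
        return count.value
    finally:
        lib.rsmi_shut_down()


if __name__ == "__main__":
    print(count_rsmi_devices())
```

On a machine without ROCm installed the function simply returns None.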