Ticket 12098

Summary: AMD 'Brand' missing from rocm-smi 4.1+
Product: Slurm
Reporter: Danny Auble <da>
Component: GPU
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: harish.kasiviswanathan, Tianxinmike.Li
Version: 20.11.8
Hardware: Linux
OS: Linux
Site: SchedMD
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: rocm 4.5.0
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---

Description Danny Auble 2021-07-21 16:33:00 MDT
Mike, in recent versions of rocm (I think it started with 4.1?) we have stopped getting the name and device brand.

For instance, my 4.0 install gives me this...

/opt/rocm-4.0.0/bin/rocm-smi --showproductname


======================= ROCm System Management Interface =======================
================================= Product Info =================================
GPU[0]          : Card series:          Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
GPU[0]          : Card model:           RX 5700 XT RAW II
GPU[0]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             R_195_
================================================================================
============================= End of ROCm SMI Log ==============================

But with 4.1+ (I have tried up to 4.2.0) I get this...

/opt/rocm-4.1.0/bin/rocm-smi --showproductname


======================= ROCm System Management Interface =======================
================================= Product Info =================================
GPU[0]          : Card model:           RX 5700 XT RAW II
GPU[0]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             R_195_
================================================================================
============================= End of ROCm SMI Log ==============================
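
To make the difference between the two outputs easy to check on other nodes, here is a minimal Python sketch that parses `rocm-smi --showproductname` text and lists which product fields are absent per GPU. The function names, the regular expression, and the list of expected fields are assumptions for illustration only, based on the output pasted above; this is not part of rocm-smi or Slurm.

```python
import re

# Matches lines such as:
#   GPU[0]          : Card series:          Navi 10 [...]
FIELD_RE = re.compile(r"^GPU\[(\d+)\]\s*:\s*([^:]+):\s*(.*)$")

# Fields seen in the rocm-4.0.0 output above (assumed to be the full set).
EXPECTED_FIELDS = ("Card series", "Card model", "Card vendor", "Card SKU")

ROCM_4_0_OUTPUT = """\
================================= Product Info =================================
GPU[0]          : Card series:          Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
GPU[0]          : Card model:           RX 5700 XT RAW II
GPU[0]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             R_195_
================================================================================
"""

ROCM_4_1_OUTPUT = """\
================================= Product Info =================================
GPU[0]          : Card model:           RX 5700 XT RAW II
GPU[0]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             R_195_
================================================================================
"""


def parse_product_info(text):
    """Return {gpu_index: {field_name: value}} from --showproductname output."""
    gpus = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            index, field, value = match.groups()
            gpus.setdefault(int(index), {})[field.strip()] = value.strip()
    return gpus


def missing_fields(text, expected=EXPECTED_FIELDS):
    """Return {gpu_index: [expected fields that are absent or empty]}."""
    return {
        index: [name for name in expected if not fields.get(name)]
        for index, fields in parse_product_info(text).items()
    }


if __name__ == "__main__":
    print("4.0:", missing_fields(ROCM_4_0_OUTPUT))
    print("4.1:", missing_fields(ROCM_4_1_OUTPUT))
```

On a live node the same functions could be fed the captured stdout of the command instead of the pasted samples.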

From rocminfo on my node I get

...
  Name:                    gfx1010                            
  Uuid:                    GPU-XX                             
  Marketing Name:          Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
...

I expect slurmd to report that name when it starts, but I get...

[Jul 21 16:14:58.990427 2607576 slurmd       0x7f3f60bd8780] debug:  gpu/rsmi: _get_system_gpu_list_rsmi: AMD Graphics Driver Version: 5.11.0-22-generic
[Jul 21 16:14:58.990441 2607576 slurmd       0x7f3f60bd8780] debug:  gpu/rsmi: _get_system_gpu_list_rsmi: RSMI Library Version: 0
[Jul 21 16:14:58.990450 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi: Device count: 1
[Jul 21 16:14:58.990537 2607576 slurmd       0x7f3f60bd8780] error: RSMI: Failed to get Unique ID of the GPU: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
[Jul 21 16:14:58.990549 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi: GPU index 0:
[Jul 21 16:14:58.990558 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi:     Name: 
[Jul 21 16:14:58.990565 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi:     Brand/Type: 
[Jul 21 16:14:58.990574 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi:     UUID: 0
[Jul 21 16:14:58.990581 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi:     PCI Domain/Bus/Device/Function: 0:12:0.0
[Jul 21 16:14:58.990598 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi:     Device File (minor number): /dev/dri/renderD128

As you can see Name and Brand are blank.
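
A quick way to spot this condition in a slurmd log is sketched below in Python. It assumes the debug2 line layout shown in the log excerpt above (a `GPU index N:` line followed by `Name:` and `Brand/Type:` lines from `_get_system_gpu_list_rsmi`); the helper name and regular expressions are hypothetical and the layout may differ in other Slurm versions.

```python
import re

# Patterns assume the gpu/rsmi debug2 output format quoted in this ticket.
INDEX_RE = re.compile(r"_get_system_gpu_list_rsmi: GPU index (\d+):")
FIELD_RE = re.compile(r"_get_system_gpu_list_rsmi:\s+(Name|Brand/Type):\s*(.*)$")

SLURMD_LOG = """\
[Jul 21 16:14:58.990549 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi: GPU index 0:
[Jul 21 16:14:58.990558 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi:     Name:
[Jul 21 16:14:58.990565 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi:     Brand/Type:
[Jul 21 16:14:58.990574 2607576 slurmd       0x7f3f60bd8780] debug2: gpu/rsmi: _get_system_gpu_list_rsmi:     UUID: 0
"""


def blank_gpu_fields(log_text):
    """Return {gpu_index: [fields logged with an empty value]}."""
    blanks = {}
    current = None  # GPU index of the block currently being read
    for line in log_text.splitlines():
        index_match = INDEX_RE.search(line)
        if index_match:
            current = int(index_match.group(1))
            blanks.setdefault(current, [])
            continue
        field_match = FIELD_RE.search(line)
        if field_match and current is not None:
            name, value = field_match.groups()
            if not value.strip():
                blanks[current].append(name)
    return blanks


if __name__ == "__main__":
    print(blank_gpu_fields(SLURMD_LOG))
```

A GPU whose list is empty had both fields populated; any GPU listing `Brand/Type` would have no type available from autodetection.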

Any idea how to get this to work again? It is problematic when a user wants a GPU 'type' in Slurm and uses AutoDetect, since Brand is the key piece that is missing here.

Thanks for looking at this! If you aren't the right person to talk to, please help me find who is.
Comment 1 Mike Li 2021-07-22 07:49:31 MDT
Added Harish
Comment 2 Mike Li 2021-07-27 15:28:30 MDT
I can reproduce the issue and have forwarded it to the rocm-smi developer.
Comment 3 Danny Auble 2021-07-27 15:44:07 MDT
Thanks a bunch Mike!
Comment 4 Danny Auble 2021-11-18 09:40:27 MST
I can confirm this is fixed in rocm-4.5.0:

/opt/rocm-4.5.0/bin/rocm-smi --showproductname


======================= ROCm System Management Interface =======================
================================= Product Info =================================
GPU[0]          : Card series:          Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
GPU[0]          : Card model:           RX 5700 XT RAW II
GPU[0]          : Card vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             R_195_
================================================================================
============================= End of ROCm SMI Log ==============================

Thanks for fixing it!