Ticket 11056

Summary: Invalid gres when using autodetect nvml Type:quadro_rtx_6000
Product: Slurm Reporter: Jenny Williams <jennyw>
Component: GPU Assignee: Jacob Jenson <jacob>
Status: RESOLVED FIXED QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 20.11.3   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: RHEL
Machine Name: CLE Version:
Version Fixed: 20.11.3 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Jenny Williams 2021-03-10 14:08:05 MST
GPUs are detected by NVML autodetection but are then ignored.

slurmd: debug2: gpu/nvml: _nvml_shutdown: Successfully shut down NVML
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: debug:  Gres GPU plugin: Normalizing gres.conf with system GPUs
slurmd: debug2: gres/gpu: _normalize_gres_conf: gres_list_conf:
slurmd: gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
slurmd:     GRES[gpu] Type:quadro_rtx_6000 Count:1 Cores(256):0-127  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
slurmd:     GRES[gpu] Type:quadro_rtx_6000 Count:1 Cores(256):0-127  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
slurmd:     GRES[gpu] Type:quadro_rtx_6000 Count:1 Cores(256):0-127  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
slurmd:     GRES[gpu] Type:quadro_rtx_6000 Count:1 Cores(256):0-127  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
slurmd: debug:  Gres GPU plugin: Final normalized gres.conf list is empty
Comment 1 Jenny Williams 2021-03-10 14:27:20 MST
Ah, this is the behavior when the GPUs are not also listed in the node definition in slurm.conf. Posted too soon. Thanks.
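
For anyone hitting the same warning, a minimal sketch of the pairing described in Comment 1 might look like the fragment below. The GPU type and count are taken from the log above. The node name "gpunode01" is hypothetical, and the rest of the node definition (CPUs, memory, and so on) is left out rather than guessed.

```
# gres.conf on the GPU node: let slurmd discover the devices through NVML
AutoDetect=nvml

# slurm.conf: the node definition must also declare the GPUs, otherwise
# the autodetected devices are ignored as in the log above.
# "gpunode01" is a placeholder; remaining node parameters are omitted.
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:quadro_rtx_6000:4
```

After a change like this, slurmd on the node and slurmctld would normally need to be restarted or reconfigured before the GPUs show up as schedulable.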