Ticket 17637

Summary: How to disable GPU statistics collection
Product: Slurm
Reporter: Josko Plazonic <plazonic>
Component: GPU
Assignee: Marshall Garey <marshall>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 23.02.4   
Hardware: Linux   
OS: Linux   
Site: Princeton (PICSciE)

Description Josko Plazonic 2023-09-06 15:07:37 MDT
Hello,

I noticed that we get errors like these in logs:
[50162878.batch] error: NVML: Failed to get Compute running procs(7): Insufficient Size
on our MIG instances. This seems to happen because Slurm attempts to collect GPU utilization via NVML, which does not work on MIG instances (by design).
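
To gauge how often this occurs, the affected job steps can be pulled out of a log with a short sketch like the one below. The line shape is assumed from the single example above, and the names `NVML_ERROR` and `parse_nvml_error` are hypothetical; real slurmd logs may carry a timestamp prefix, which is why the pattern is searched for rather than anchored at the start of the line.

```python
import re

# Assumed shape: "[<jobid>.<step>] error: NVML: <message>", based only on the
# one log line quoted in this ticket. A leading timestamp is tolerated because
# search() is used instead of match().
NVML_ERROR = re.compile(r"\[(?P<step>[^\]]+)\] error: NVML: (?P<message>.+)$")


def parse_nvml_error(line):
    """Return (step, message) for an NVML error line, or None if it does not match."""
    found = NVML_ERROR.search(line.rstrip("\n"))
    if found is None:
        return None
    return found["step"], found["message"]


sample = "[50162878.batch] error: NVML: Failed to get Compute running procs(7): Insufficient Size"
print(parse_nvml_error(sample))
```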

That is a bug in its own right (Slurm should skip MIG instances rather than report errors), but we would like to turn this feature off completely: we do not need it, and it might conflict with how we already collect these stats and with what we are planning to do.

The long story is that we have a separate GPU stats collector using NVML, and we are trying to switch it to a DCGM-based collector. It is not a particularly good idea to have multiple processes collecting stats, especially NVML plus DCGM. Maybe NVIDIA has fixed this recently, but we used to see GPU crashes in such situations. Since we do not use or need Slurm's GPU stats, it is safest to disable them.

Is there a way to do that, i.e. to configure Slurm not to collect GPU memory and utilization?

I'd think that one could achieve that by not adding gres/gpumem,gres/gpuutil to AccountingStorageTRES, but we do not add them and they got added on their own. I.e., we set

AccountingStorageTRES=cpu,mem,energy,node,Gres/gpu

which results in:

# scontrol show config | grep AccountingStorageTRES
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpumem,gres/gpuutil
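
For reference, the entries that were added implicitly can be listed by diffing the two values. This is only a sketch: the helper name `implicit_tres` is hypothetical, and the comparison is made case-insensitive on the assumption that `Gres/gpu` in slurm.conf and `gres/gpu` in the scontrol output refer to the same TRES.

```python
def implicit_tres(configured, effective):
    """Return TRES names present in `effective` but absent from `configured`."""
    # Case-insensitive comparison: "Gres/gpu" is reported back as "gres/gpu".
    wanted = {name.strip().lower() for name in configured.split(",") if name.strip()}
    return [
        name.strip()
        for name in effective.split(",")
        if name.strip() and name.strip().lower() not in wanted
    ]


# Values copied from this ticket.
configured = "cpu,mem,energy,node,Gres/gpu"
effective = "cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpumem,gres/gpuutil"

print(implicit_tres(configured, effective))
# ['billing', 'fs/disk', 'vmem', 'pages', 'gres/gpumem', 'gres/gpuutil']
```

The two GPU-related entries, gres/gpumem and gres/gpuutil, are the ones this ticket is about.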

Can you please let us know how we can disable this feature? Thanks.
Comment 2 Marshall Garey 2023-09-06 15:59:37 MDT
In bug 17102, we are adding an option to slurm.conf to disable GPU accounting. This will be part of the Slurm 23.11 release.

I'm marking this as a duplicate of bug 17102.

*** This ticket has been marked as a duplicate of ticket 17102 ***