Ticket 16688

Summary: Visualize GPU data from Elasticsearch
Product: Slurm
Reporter: Petros.Zolotas
Component: Other
Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: benjamin.witham
Version: 21.08.8
Hardware: Linux
OS: Linux
Site: Pfizer
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---

Description Petros.Zolotas 2023-05-10 09:03:42 MDT
Hello,

We use the Slurm Elasticsearch plugin to visualize metrics in Grafana.

Before our latest update from v20 to v21, we had an implementation that got the data into Grafana via Elasticsearch. It broke after the update, probably due to changes in the Elasticsearch plugin fields.

We are wondering whether there is an already known query or solution we can use to get GPU-hours accounting data into Grafana.

We are also interested in monitoring pending jobs in general via Grafana if possible.

Thanks
Petros
Comment 2 Alejandro Sanchez 2023-05-12 07:47:32 MDT
Hi Petros,

The plugin exposes counts of allocated TRES (including gres/gpu) via tres_req and tres_alloc fields. The job record (which is the data structure handed by design to the plugin) doesn't contain information about GPU utilization. That information is gathered via JobAcctGather plugins and sent to the Slurm accounting database, and is retrievable via sacct TRESUsage* fields, but it's not available from the Job Completion plugins.
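As a rough illustration of how GPU hours could be derived from the allocation counts mentioned above, here is a minimal Python sketch. It assumes the tres_alloc value arrives as a comma-separated "name=count" string (the form sacct prints for AllocTRES) and that the document also carries the job's elapsed time in seconds in a field called "elapsed"; both the string format and the field name should be checked against your own index. The helper names parse_tres and gpu_hours are hypothetical. Note this yields allocated GPU hours, not utilization.

```python
def parse_tres(tres):
    """Split a TRES string such as 'cpu=8,mem=32G,gres/gpu=2' into a dict."""
    fields = {}
    for item in tres.split(","):
        name, sep, value = item.partition("=")
        if sep:  # skip empty or malformed items
            fields[name.strip()] = value.strip()
    return fields


def gpu_hours(tres_alloc, elapsed_seconds):
    """Allocated GPU hours: generic GPU count times elapsed wall time."""
    # Use only the generic 'gres/gpu' entry. Typed entries such as
    # 'gres/gpu:a100' describe the same devices and would double count.
    gpus = int(parse_tres(tres_alloc).get("gres/gpu", "0"))
    return gpus * elapsed_seconds / 3600.0


# Hypothetical document, shaped like an assumed Elasticsearch hit.
doc = {
    "tres_alloc": "cpu=8,mem=32G,node=1,billing=8,gres/gpu=2,gres/gpu:a100=2",
    "elapsed": 5400,
}
print(gpu_hours(doc["tres_alloc"], doc["elapsed"]))  # 3.0
```

The same arithmetic could probably be done inside Elasticsearch or Grafana instead, but doing it in a small script makes the assumptions about the field format easy to verify first.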

At the same time, the Job Completion plugin (as the name indicates) doesn't receive information about jobs until they are finished, thus pending jobs are not available to this plugin.

So you'll need to query pending-job information and GPU utilization from the Slurm database via sacct, or via REST queries to slurmrestd, and send them somewhere Grafana can display them. I'm aware there are external connector tools like Open XDMoD or the Prometheus Slurm exporter that people use, but I haven't personally tried any of them, so I'm not in a position to recommend them. An alternative would be to script local tools to accomplish what you want.
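For the "script local tools" route, one possible starting point for pending jobs is sketched below in Python. It is only an illustration under assumptions: it uses squeue rather than sacct or slurmrestd, assumes squeue is on the PATH of the host running the script, and leaves shipping the counts to a datastore Grafana can read up to you. The function names are hypothetical.

```python
import subprocess
from collections import Counter

# %i = job ID, %u = user, %P = partition, %r = pending reason
SQUEUE_CMD = ["squeue", "--states=PENDING", "--noheader", "--format=%i|%u|%P|%r"]


def fetch_pending_lines():
    """Run squeue and return one output line per pending job."""
    result = subprocess.run(SQUEUE_CMD, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def count_pending_by_partition(lines):
    """Count pending jobs per partition field, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) != 4:
            continue
        _job_id, _user, partition, _reason = parts
        # A job submitted to several partitions shows them comma-separated;
        # this sketch counts that combined value as its own key.
        counts[partition] += 1
    return counts
```

Calling count_pending_by_partition(fetch_pending_lines()) on a timer (cron, for example) and pushing the resulting counts into whatever backend your Grafana instance already reads would give a simple pending-jobs panel.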

Please let me know if there's anything else we can help with here.

Thanks.
Comment 3 Alejandro Sanchez 2023-06-06 04:23:49 MDT
Hi,

I'm gonna go ahead and mark the bug as resolved. Please reopen it if there's anything else, or open a separate bug for any different issues.

Thanks.