Ticket 24227 - Doubts Regarding Definition of Metrics
Status: RESOLVED MOVED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: - Unsupported Older Versions
Hardware: Linux (OS: Linux)
: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2025-12-01 03:17 MST by Manuel Giménez de Castro Marciani
Modified: 2025-12-03 03:30 MST
Site: -Other-


Description Manuel Giménez de Castro Marciani 2025-12-01 03:17:04 MST
I am using the metrics collected by Slurm (version 23.02.7, jobacct_gather/linux) for a performance analysis of an application.

I have thoroughly read the documentation on the metrics (https://slurm.schedmd.com/sacct.html) but still find the Ave* metrics confusing, specifically AveRSS and AveDiskWrite.

AveDiskWrite is defined as "Average number of bytes written by all tasks in job." So if a workload reports an AveDiskWrite of x and I double that workload, I should observe 2x, and that is indeed what I observe. But if I then also double the resources while keeping the doubled workload, I observe x again, not 2x.

So my suspicion is that the metric is the total number of bytes written over the job's runtime, divided by the number of nodes.
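
The arithmetic behind this suspicion can be sketched as follows. This is only an illustration of the hypothesis, not Slurm's actual aggregation code: the function name, the byte counts, and the node counts are all made up, and the divisor is taken to be the node count as suspected above (the same arithmetic holds for any divisor that doubles with the resources, such as a task count).

```python
def hypothesised_ave_disk_write(total_bytes_written: float, divisor: int) -> float:
    """Hypothesis only: total bytes written over the job, divided by a resource count."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return total_bytes_written / divisor


# Hypothetical numbers, chosen only to make the scaling visible.
base_total = 4_000_000  # total bytes written by the original workload
base_nodes = 4          # nodes used by the original job

baseline = hypothesised_ave_disk_write(base_total, base_nodes)              # x
double_work = hypothesised_ave_disk_write(2 * base_total, base_nodes)       # 2x
double_both = hypothesised_ave_disk_write(2 * base_total, 2 * base_nodes)   # x again

print(baseline, double_work, double_both)
```

Under this hypothesis, doubling only the workload doubles the reported value, while doubling the workload and the resources together returns it to x, which matches the AveDiskWrite observations described above.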

But AveRSS, defined as "Average resident set size of all tasks in job," behaves the way I had expected AveDiskWrite to behave: it scales with the workload irrespective of the resources available.
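
One way to compare the two metrics side by side is to pull them per job step together with the node and task counts. The following is a minimal sketch, assuming `sacct` is on the PATH and using the field names listed in the sacct documentation (JobID, NNodes, NTasks, AveRSS, AveDiskWrite); the helper names `parse_sacct` and `query_job` are hypothetical.

```python
import subprocess

# Field names as listed in the sacct documentation.
FIELDS = ["JobID", "NNodes", "NTasks", "AveRSS", "AveDiskWrite"]


def parse_sacct(text: str) -> list[dict[str, str]]:
    """Parse `sacct --parsable2 --noheader` output into one dict per job step."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, line.split("|"))))
    return rows


def query_job(job_id: str) -> list[dict[str, str]]:
    """Run sacct for one job and return its steps; values are kept as raw strings."""
    cmd = [
        "sacct", "-j", job_id,
        "--format=" + ",".join(FIELDS),
        "--parsable2",   # pipe-delimited output without a trailing delimiter
        "--noheader",    # omit the header line
        "--noconvert",   # do not convert units, so runs can be compared directly
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_sacct(result.stdout)
```

Calling `query_job` on the baseline job and on the job with doubled resources would show whether each Ave* value tracks NNodes, NTasks, or neither.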

I would be thankful if you could clarify this behavior, and even more grateful if you could point me to where in the code these metrics are aggregated and processed before being stored in the database.

Thanks!
Comment 3 Manuel Giménez de Castro Marciani 2025-12-03 03:30:33 MST
The support team at my HPC center contacted me and said that I should talk with them.