Ticket 18725 - Strange values for `CPUFrequency` in `sacct` profile
Summary: Strange values for `CPUFrequency` in `sacct` profile
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Profiling
Version: 23.11.1
Hardware: Linux Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2024-01-19 02:08 MST by Ruben Laso
Modified: 2024-01-19 02:08 MST (History)
Site: -Other-

Attachments
sh5 output, slurm.conf & acct_gather.conf (13.76 KB, text/plain)
2024-01-19 02:08 MST, Ruben Laso

Description Ruben Laso 2024-01-19 02:08:47 MST
Created attachment 34176
sh5 output, slurm.conf & acct_gather.conf

When `task` profiling is enabled for `srun` or `sbatch` jobs, the reported `CPUFrequency` values seem to be wrong.

The attached profiling output was produced by this command:

srun -N 1 -n 1 -c 64 --profile=energy,task --acctg-freq=energy=1,task=1 lu.C.x

The CPU in the compute node (AMD EPYC 7551) operates at 2.0 to 3.0 GHz.
However, the values in the `h5` file retrieved with `sh5util` oscillate between 39 kHz and 18 MHz.
Furthermore, this is the output of the `sacct` command:

JobID           JobName  AllocCPUS NNo    Elapsed Consumed AveCPUFreq
------------ ---------- ---------- --- ---------- -------- ----------
18               lu.C.x         64   1   00:01:12   22.36K
18.extern        extern         64   1   00:01:12   22.36K         2G
18.0             lu.C.x         64   1   00:01:12   22.36K       296K

The `AveCPUFreq` value for the `.extern` step (2G) is reasonable, but the value for the `.0` step (296K) is not.
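For reference, this is how we compared the `sacct` values against the CPU's 2.0 to 3.0 GHz operating range. It is only a minimal sketch: it assumes the K/M/G suffixes printed by `sacct` are decimal multipliers applied to a value in Hz, and the helper names `to_hz` and `is_plausible` are our own, not part of Slurm.

```python
# Sketch: convert sacct-style suffixed frequencies to Hz and check them
# against the expected operating range of the CPU.
# Assumption: K/M/G are decimal multipliers applied to a value in Hz.
SUFFIXES = {"K": 1e3, "M": 1e6, "G": 1e9}


def to_hz(value: str) -> float:
    """Convert a string such as '296K' or '2G' to a frequency in Hz."""
    value = value.strip()
    if value and value[-1].upper() in SUFFIXES:
        return float(value[:-1]) * SUFFIXES[value[-1].upper()]
    return float(value)


def is_plausible(value: str, lo_hz: float = 2.0e9, hi_hz: float = 3.0e9) -> bool:
    """True if the frequency lies inside the expected range (inclusive)."""
    return lo_hz <= to_hz(value) <= hi_hz


# AveCPUFreq values taken from the sacct output above.
for step, freq in {"18.extern": "2G", "18.0": "296K"}.items():
    print(step, to_hz(freq), is_plausible(freq))
```

Under that assumption, the `.extern` step falls inside the range and the `.0` step is roughly four orders of magnitude below it.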

The values reported by `/proc/cpuinfo` during the execution are in a reasonable range, around 2.5 GHz.
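The `/proc/cpuinfo` readings can be collected with something like the following. This is a sketch rather than the exact script we ran: it assumes the x86 Linux layout, where each core has a `cpu MHz : <value>` line, and the function names are illustrative.

```python
import re

# Matches lines such as "cpu MHz\t\t: 2500.000" (x86 Linux layout, assumed).
CPU_MHZ = re.compile(r"^cpu MHz\s*:\s*([0-9.]+)\s*$", re.MULTILINE)


def core_frequencies_mhz(cpuinfo_text):
    """Return the per-core frequencies in MHz found in /proc/cpuinfo text."""
    return [float(m) for m in CPU_MHZ.findall(cpuinfo_text)]


def mean_ghz(cpuinfo_text):
    """Return the mean core frequency in GHz, or None if no entries exist."""
    freqs = core_frequencies_mhz(cpuinfo_text)
    if not freqs:
        return None
    return sum(freqs) / len(freqs) / 1000.0


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print(mean_ghz(fh.read()))
    except OSError:
        # Not on Linux, or /proc is not mounted.
        print("could not read /proc/cpuinfo")
```

Sampling this on the compute node once per second while the job runs gives values that can be compared directly with the per-sample `CPUFrequency` entries in the `h5` file.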

The `slurm.conf` and `acct_gather.conf` files are attached alongside the profiling output.

Are we doing something wrong when interpreting the `h5` file or when extracting the info?
Is there a problem with the CPU frequencies in Slurm profiling?