Ticket 18725

Summary: Strange values for `CPUFrequency` in `sacct` profile
Product: Slurm Reporter: Ruben Laso <laso>
Component: Profiling Assignee: Jacob Jenson <jacob>
Status: OPEN --- QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 23.11.1   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: sh5 output, slurm.conf & acct_gather.conf

Description Ruben Laso 2024-01-19 02:08:47 MST
Created attachment 34176
sh5 output, slurm.conf & acct_gather.conf

When `task` profiling is enabled for `srun` or `sbatch` jobs, the reported `CPUFrequency` values appear to be wrong.

The attached profiling output was produced by this command:

srun -N 1 -n 1 -c 64 --profile=energy,task --acctg-freq=energy=1,task=1 lu.C.x

The CPU in the compute node (AMD EPYC 7551) operates at 2.0 to 3.0 GHz.
However, the `CPUFrequency` values in the `h5` file extracted with `sh5util` appear to oscillate between 39 kHz and 18 MHz.
Furthermore, this is the output of the `sacct` command:

JobID           JobName  AllocCPUS NNo    Elapsed Consumed AveCPUFreq
------------ ---------- ---------- --- ---------- -------- ----------
18               lu.C.x         64   1   00:01:12   22.36K
18.extern        extern         64   1   00:01:12   22.36K         2G
18.0             lu.C.x         64   1   00:01:12   22.36K       296K

The `AveCPUFreq` value for the `.extern` step (2G) is reasonable, but the value for step `.0` (296K) is not.
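
To make the comparison concrete, here is a minimal sketch that checks the two `AveCPUFreq` values above against the 2.0 to 3.0 GHz operating range. It assumes the `K`/`M`/`G` suffixes printed by `sacct` are decimal multiples of Hz, which is how the values are read in this ticket. The helper names `parse_freq` and `is_plausible` are hypothetical and not part of Slurm.

```python
# Assumption: the suffixed values shown by sacct are decimal multiples of Hz.
MULTIPLIERS = {"K": 1e3, "M": 1e6, "G": 1e9}


def parse_freq(text):
    """Convert a string such as '296K' or '2G' to a float in Hz."""
    text = text.strip()
    if text and text[-1].upper() in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[text[-1].upper()]
    return float(text)


def is_plausible(text, low_hz=2.0e9, high_hz=3.0e9):
    """Return True if the value lies within the CPU's stated operating range."""
    return low_hz <= parse_freq(text) <= high_hz


# AveCPUFreq values copied from the sacct output in this ticket.
for step, value in [("18.extern", "2G"), ("18.0", "296K")]:
    status = "plausible" if is_plausible(value) else "out of range"
    print(step, value, status)
```

Under that assumption, the `.extern` step falls inside the range and step `.0` is several orders of magnitude below it.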

The values reported by `/proc/cpuinfo` during the execution are in a reasonable range, around 2.5 GHz.
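
For reference, a cross-check of this kind can be sketched as follows. It assumes the x86 Linux `/proc/cpuinfo` layout, in which each logical CPU has a `cpu MHz` line. The function names are hypothetical, and the sample text is illustrative rather than captured from the node in this ticket.

```python
import re


def cpu_mhz_from_cpuinfo(text):
    """Return the 'cpu MHz' value of every logical CPU listed in cpuinfo text."""
    matches = re.findall(r"^cpu MHz\s*:\s*([\d.]+)", text, flags=re.MULTILINE)
    return [float(m) for m in matches]


def read_cpu_mhz(path="/proc/cpuinfo"):
    """Read the current per-CPU frequencies (in MHz) from the running system."""
    with open(path) as f:
        return cpu_mhz_from_cpuinfo(f.read())


# Illustrative sample only; these numbers are made up for the example.
sample = (
    "processor\t: 0\n"
    "cpu MHz\t\t: 2500.000\n"
    "processor\t: 1\n"
    "cpu MHz\t\t: 2495.123\n"
)
values = cpu_mhz_from_cpuinfo(sample)
print("mean MHz:", sum(values) / len(values))
```

Calling `read_cpu_mhz()` repeatedly on the compute node while the job runs would give samples to compare against the profiled `CPUFrequency` series.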

The `slurm.conf` and `acct_gather.conf` files are attached alongside the profiling output.

Are we doing something wrong when interpreting the `h5` file or when extracting the info?
Is there a problem with the CPU frequencies in Slurm profiling?