Summary: | Strange values for `CPUFrequency` in `sacct` profile | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ruben Laso <laso> |
Component: | Profiling | Assignee: | Jacob Jenson <jacob> |
Status: | OPEN --- | QA Contact: | |
Severity: | 6 - No support contract | ||
Priority: | --- | ||
Version: | 23.11.1 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | -Other- | Alineos Sites: | --- |
Attachments: | sh5 output, slurm.conf & acct_gather.conf |

Created attachment 34176
sh5 output, slurm.conf & acct_gather.conf

When `task` profiling is enabled for `srun` or `sbatch` commands, the reported `CPUFrequency` values appear to be wrong. The attached profiling output was produced by this command:

    srun -N 1 -n 1 -c 64 --profile=energy,task --acctg-freq=energy=1,task=1 lu.C.x

The CPU in the compute node (AMD EPYC 7551) operates at 2.0 to 3.0 GHz. However, the values in the `h5` file obtained with `sh5util` seem to oscillate between 39 kHz and 18 MHz.

Furthermore, this is the output of the `sacct` command:

    JobID           JobName  AllocCPUS NNo    Elapsed Consumed AveCPUFreq
    ------------ ---------- ---------- --- ---------- -------- ----------
    18               lu.C.x         64   1   00:01:12   22.36K
    18.extern        extern         64   1   00:01:12   22.36K         2G
    18.0             lu.C.x         64   1   00:01:12   22.36K       296K

The `AveCPUFreq` value for the `.extern` step (2G) is reasonable, but the value for the `.0` step (296K) is not. The values reported by `/proc/cpuinfo` during the execution are in a reasonable range, around 2.5 GHz.

The `slurm.conf` and `acct_gather.conf` files are attached alongside the profiling output.

Are we doing something wrong when interpreting the `h5` file or when extracting the information? Or is there a problem with how Slurm profiling records CPU frequencies?
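
For anyone who wants to reproduce the `/proc/cpuinfo` cross-check mentioned above, here is a minimal sketch of how the per-core readings could be summarized on the compute node while the job runs. It assumes an x86 Linux node whose `/proc/cpuinfo` exposes `cpu MHz` lines. The helper names `cpu_mhz_values` and `summarize` are hypothetical and not part of Slurm.

```python
import re
from statistics import mean

# Matches lines such as "cpu MHz\t\t: 2500.000" in /proc/cpuinfo text.
# Assumption: the node is x86 Linux, where this field is present.
_CPU_MHZ = re.compile(r"^cpu MHz\s*:\s*([0-9.]+)\s*$", re.MULTILINE)


def cpu_mhz_values(cpuinfo_text):
    """Return the per-core 'cpu MHz' readings found in the given text."""
    return [float(value) for value in _CPU_MHZ.findall(cpuinfo_text)]


def summarize(cpuinfo_text):
    """Return core count and min/mean/max frequency in MHz."""
    values = cpu_mhz_values(cpuinfo_text)
    if not values:
        raise ValueError("no 'cpu MHz' lines found")
    return {
        "cores": len(values),
        "min_mhz": min(values),
        "mean_mhz": mean(values),
        "max_mhz": max(values),
    }
```

On the node, `summarize(open("/proc/cpuinfo").read())` gives a snapshot that can be compared by eye with the `CPUFrequency` samples in the profile. It is only a point-in-time reading, so it is a sanity check on the order of magnitude rather than a replacement for the profiled average.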