Ticket 19978 - jobcomp not including energy
Summary: jobcomp not including energy
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 23.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Benjamin Witham
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-05-23 18:52 MDT by Matt Ezell
Modified: 2024-07-17 18:39 MDT
CC List: 1 user

See Also:
Site: ORNL-OLCF
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmd.conf (4.93 KB, text/plain)
2024-05-23 18:53 MDT, Matt Ezell

Description Matt Ezell 2024-05-23 18:52:31 MDT
Per Bug 18274, jobcomp should include energy, but it doesn't seem to be working for us. Our Kafka messages contain data like the following:

"tres_req_raw":"1=8,2=4096000,4=8,5=8",
"tres_req":"cpu=8,mem=4000G,node=8,billing=8",
"tres_alloc_raw":"1=896,3=18446744073709551614,4=8,5=896",
"tres_alloc":"cpu=896,node=8,billing=896"

I think 3=18446744073709551614 in tres_alloc_raw means energy=NO_VAL64.
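
For reference, the raw string can be decoded mechanically. The sketch below is an illustration, not Slurm code: it assumes the TRES ID mapping implied by the paired tres_req_raw/tres_req fields above (1=cpu, 2=mem, 3=energy, 4=node, 5=billing) and that 18446744073709551614 is Slurm's NO_VAL64 sentinel; the real mapping can be confirmed with `sacctmgr show tres`.

# Sketch: decode a tres_*_raw string and flag entries left at NO_VAL64.
NO_VAL64 = 0xFFFFFFFFFFFFFFFE  # 18446744073709551614, i.e. "value never set"
TRES_NAMES = {1: "cpu", 2: "mem", 3: "energy", 4: "node", 5: "billing"}  # assumed mapping

def decode_tres_raw(raw: str) -> dict:
    """Turn '1=896,3=18446744073709551614,...' into {'cpu': 896, 'energy': None, ...}."""
    decoded = {}
    for pair in raw.split(","):
        tres_id, value = pair.split("=")
        name = TRES_NAMES.get(int(tres_id), f"tres_{tres_id}")
        decoded[name] = None if int(value) == NO_VAL64 else int(value)
    return decoded

print(decode_tres_raw("1=896,3=18446744073709551614,4=8,5=896"))
# {'cpu': 896, 'energy': None, 'node': 8, 'billing': 896}

The energy entry decoding to None is consistent with the formatted tres_alloc string above omitting energy= entirely.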

The database has the following info:
[root@slurm1.borg ~]# sacct -j 129242 -o jobid,alloctres%42
JobID                                         AllocTRES 
------------ ------------------------------------------ 
129242       billing=896,cpu=896,energy=24703018,node=8 
129242.batch                        cpu=63,mem=0,node=1 
129242.exte+                 billing=896,cpu=896,node=8 
129242.0                           cpu=448,mem=0,node=8 

Do we have a misconfiguration, or has this been broken since it was added in 23.11.1?
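
As a rough way to cross-check individual jobs, the jobcomp payload can be compared against what slurmdbd stored. This is only a sketch, not Slurm code: the "jobid" and "tres_alloc" field names in the Kafka JSON are assumed from the message excerpt above, and only documented sacct options are used.

import json
import subprocess

def sacct_alloc_tres(jobid: str) -> dict:
    """AllocTRES for the job allocation, straight from the accounting database."""
    out = subprocess.run(
        ["sacct", "-j", jobid, "-X", "-n", "-P", "-o", "alloctres"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # e.g. "billing=896,cpu=896,energy=24703018,node=8"
    return dict(kv.split("=", 1) for kv in out.split(",") if kv)

def check_jobcomp_record(message: str) -> None:
    """message: one jobcomp JSON payload as read from the Kafka topic."""
    rec = json.loads(message)
    jobid = str(rec["jobid"])  # field name assumed
    db_tres = sacct_alloc_tres(jobid)
    jc_tres = dict(kv.split("=", 1) for kv in rec["tres_alloc"].split(",") if kv)
    if "energy" in db_tres and "energy" not in jc_tres:
        print(f"job {jobid}: sacct has energy={db_tres['energy']}, jobcomp record does not")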
Comment 1 Matt Ezell 2024-05-23 18:53:25 MDT
Created attachment 36685
slurmd.conf
Comment 2 Benjamin Witham 2024-05-28 10:07:35 MDT
Hello Matt,

I can reproduce this issue from my end. I'm looking into the cause of the issue currently. Just a few quick questions for you.

Are you never seeing energy printed, or do only some jobs include it?
In your best estimate, what is the average length of one of your jobs?
Comment 3 Matt Ezell 2024-05-28 10:37:42 MDT
(In reply to Benjamin Witham from comment #2)
> Are you never seeing energy printed, or do only some jobs include it?
> In your best estimate, what is the average length of one of your jobs?

Reading from the Kafka stream, I'm only ever seeing these 3 fields in tres_alloc: "cpu=3584,node=32,billing=3584", and tres_alloc_raw always has 3=18446744073709551614.

This is a test system, so walltimes vary quite a bit. Some jobs are sub-minute (and I would understand if those didn't report energy if the plugin wasn't able to gather enough samples), but some are many hours (up to our default 12 hour walltime).

sacct seems to show energy for all jobs, even ones that ran sub-minute.

[root@slurm1.borg ~]# sacct -o jobid,alloctres%40,elapsed,start,end -X -S 2024-05-26
JobID                                       AllocTRES    Elapsed               Start                 End 
------------ ---------------------------------------- ---------- ------------------- ------------------- 
129488+0     billing=56,cpu=112,energy=16952728,node+   07:35:14 2024-05-25T20:47:24 2024-05-26T04:22:38 
129488+1     billing=448,cpu=448,energy=69644742,nod+   07:35:14 2024-05-25T20:47:24 2024-05-26T04:22:38 
129497       billing=112,cpu=112,energy=3083790,node+   00:23:19 2024-05-26T00:04:11 2024-05-26T00:27:30 
129498        billing=112,cpu=112,energy=93030,node=1   00:00:44 2024-05-26T00:27:34 2024-05-26T00:28:18
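
To confirm programmatically that the database side records energy regardless of job length, the same sacct query can be scanned for jobs whose AllocTRES lacks energy. A minimal sketch using only documented sacct options (the start date is the example value from above):

import subprocess

out = subprocess.run(
    ["sacct", "-X", "-n", "-P", "-S", "2024-05-26", "-o", "jobid,elapsed,alloctres"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    jobid, elapsed, alloctres = line.split("|", 2)
    print(f"{jobid:<12} elapsed={elapsed:<10} energy_in_db={'energy=' in alloctres}")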
Comment 4 Benjamin Witham 2024-07-17 18:39:06 MDT
Hello Matt, 

I apologize for the delayed response. I can reproduce this issue, and I'm looking into it currently. I'll keep you updated.