Per Bug 18274, it seems that jobcomp should include energy. It doesn't seem to be working for us. Our Kafka messages output data like the following:

"tres_req_raw":"1=8,2=4096000,4=8,5=8",
"tres_req":"cpu=8,mem=4000G,node=8,billing=8",
"tres_alloc_raw":"1=896,3=18446744073709551614,4=8,5=896",
"tres_alloc":"cpu=896,node=8,billing=896"

I think that 3=18446744073709551614 means energy=NO_VAL64 in tres_alloc_raw.

The database has the following info:

[root@slurm1.borg ~]# sacct -j 129242 -o jobid,alloctres%42
JobID        AllocTRES
------------ ------------------------------------------
129242       billing=896,cpu=896,energy=24703018,node=8
129242.batch cpu=63,mem=0,node=1
129242.exte+ billing=896,cpu=896,node=8
129242.0     cpu=448,mem=0,node=8

Do we have a misconfiguration, or has this broken since it was added for 23.11.1?
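To check that reading of tres_alloc_raw against other messages, here is a minimal Python sketch that decodes a raw TRES string and flags any entry equal to NO_VAL64. It assumes NO_VAL64 is 0xfffffffffffffffe (which is 18446744073709551614) and that ID 3 is energy, as guessed above; the other IDs are taken from the tres_req_raw/tres_req pair in the message. The names decode_tres_raw and TRES_NAMES are made up for this example and are not part of Slurm.

```python
# Sentinel assumed for "no value" in 64-bit fields: 0xfffffffffffffffe.
NO_VAL64 = 0xFFFFFFFFFFFFFFFE

# ID-to-name mapping inferred from the Kafka message in this report
# (1=cpu, 2=mem, 4=node, 5=billing); 3=energy is the reporter's reading.
TRES_NAMES = {1: "cpu", 2: "mem", 3: "energy", 4: "node", 5: "billing"}


def decode_tres_raw(raw):
    """Return {name: count}, with None marking an entry equal to NO_VAL64."""
    decoded = {}
    for item in raw.split(","):
        if not item:
            continue  # tolerate an empty string or a trailing comma
        tres_id, _, count = item.partition("=")
        # Unknown IDs are kept under a placeholder name rather than dropped.
        name = TRES_NAMES.get(int(tres_id), f"tres_{tres_id}")
        value = int(count)
        decoded[name] = None if value == NO_VAL64 else value
    return decoded


if __name__ == "__main__":
    print(decode_tres_raw("1=896,3=18446744073709551614,4=8,5=896"))
```

With the tres_alloc_raw value from the message above, the energy entry decodes to None, which matches the missing energy field in tres_alloc.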
Created attachment 36685 slurmd.conf
Hello Matt, I can reproduce this issue from my end. I'm looking into the cause of the issue currently. Just a few quick questions for you. Are you never seeing the energy be printed? Do only some jobs print energy? In your best estimate, what is the average length of one of your jobs?
(In reply to Benjamin Witham from comment #2)
> Are you never seeing the energy be printed? Do only some jobs print energy?
> In your best estimate, what is the average length of one of your jobs?

Reading from the Kafka stream, I'm only ever seeing these 3 fields in tres_alloc: "cpu=3584,node=32,billing=3584", and tres_alloc_raw always has 3=18446744073709551614.

This is a test system, so walltimes vary quite a bit. Some jobs are sub-minute (and I would understand if those didn't report energy if the plugin wasn't able to gather enough samples), but some are many hours (up to our default 12 hour walltime).

sacct seems to show energy for all jobs, even ones that ran sub-minute.

[root@slurm1.borg ~]# sacct -o jobid,alloctres%40,elapsed,start,end -X -S 2024-05-26
JobID        AllocTRES                                Elapsed    Start               End
------------ ---------------------------------------- ---------- ------------------- -------------------
129488+0     billing=56,cpu=112,energy=16952728,node+ 07:35:14   2024-05-25T20:47:24 2024-05-26T04:22:38
129488+1     billing=448,cpu=448,energy=69644742,nod+ 07:35:14   2024-05-25T20:47:24 2024-05-26T04:22:38
129497       billing=112,cpu=112,energy=3083790,node+ 00:23:19   2024-05-26T00:04:11 2024-05-26T00:27:30
129498       billing=112,cpu=112,energy=93030,node=1  00:00:44   2024-05-26T00:27:34 2024-05-26T00:28:18
Hello Matt, I apologize for the delayed response. I can reproduce this issue, and I'm looking into it currently. I'll keep you updated.