| Summary: | Energy accounting provides invalid values when multiple tasks are used | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Yiannis Georgiou <yiannis.georgiou> |
| Component: | Accounting | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | natmari6117 |
| Version: | 2.5.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Universität Dresden (Germany) | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | Other | Machine Name: | other |
| CLE Version: | --- | Version Fixed: | 1.1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | energy accounting multi task fix; checks for new energy values | | |
Created attachment 271 [details]
checks for new energy values
Yiannis, while I was unable to reproduce this problem, I think this patch has merit; however, it doesn't totally fix the issue. Your patch could still result in the energy being 0 in some cases. I have changed it to check for a non-zero value from the jobacct before zeroing it out, in case we already got the info from a previous task. I patched 2.5 as well.
Created attachment 269 [details]
energy accounting multi task fix

If more than one task is used with srun, the Slurm energy accounting results reported by sacct are too high. Here is a script to reproduce the problem:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -o job.out
#SBATCH -e job.err
#SBATCH -p mpi2
#SBATCH --exclusive
#SBATCH --time 60
#SBATCH --profile=Energy
#SBATCH --acctg-freq=Energy=3

sleeptime=60
for ntasks in 1 2 4 16
do
    for repeat in {1..3}
    do
        name="s${sleeptime}@${ntasks}"
        srun -N1 -n${ntasks} -J${name} sleep ${sleeptime}
    done
done
```

And these are the results we may get:

```
$ SACCT_FORMAT="jobid,jobname,ntasks,start,end,ConsumedEnergy,nodelist" sacct -j 35248
JobID         JobName    NTasks Start               End                 ConsumedEnergy NodeList
------------ ---------- ------- ------------------- ------------------- -------------- ------------
35248        job-sandy+         2013-05-31T14:30:58 2013-05-31T14:43:09                taurusi1218
35248.batch  batch            1 2013-05-31T14:30:58 2013-05-31T14:43:09          49025 taurusi1218
35248.0      s60@1            1 2013-05-31T14:30:59 2013-05-31T14:31:59           3844 taurusi1218
35248.1      s60@1            1 2013-05-31T14:31:59 2013-05-31T14:33:00           3911 taurusi1218
35248.2      s60@1            1 2013-05-31T14:33:00 2013-05-31T14:34:00           3829 taurusi1218
35248.3      s60@2            2 2013-05-31T14:34:00 2013-05-31T14:35:01           3863 taurusi1218
35248.4      s60@2            2 2013-05-31T14:35:01 2013-05-31T14:36:02           3860 taurusi1218
35248.5      s60@2            2 2013-05-31T14:36:02 2013-05-31T14:37:02           3857 taurusi1218
35248.6      s60@4            4 2013-05-31T14:37:02 2013-05-31T14:38:03           7600 taurusi1218
35248.7      s60@4            4 2013-05-31T14:38:03 2013-05-31T14:39:04           7822 taurusi1218
35248.8      s60@4            4 2013-05-31T14:39:04 2013-05-31T14:40:04           7636 taurusi1218
35248.9      s60@16          16 2013-05-31T14:40:04 2013-05-31T14:41:06          11907 taurusi1218
35248.10     s60@16          16 2013-05-31T14:41:06 2013-05-31T14:42:07          12117 taurusi1218
35248.11     s60@16          16 2013-05-31T14:42:07 2013-05-31T14:43:08           7756 taurusi1218
```

Example with load:

```
$ SACCT_FORMAT="jobid,jobname,ntasks,start,end,ConsumedEnergy,nodelist" sacct -j 35276
JobID         JobName    NTasks Start               End                 ConsumedEnergy NodeList
------------ ---------- ------- ------------------- ------------------- -------------- ------------
35276        bash               2013-05-31T15:04:47 2013-05-31T15:22:19                taurusi1041
35276.0      firestart+       1 2013-05-31T15:04:55 2013-05-31T15:05:56          21677 taurusi1041
35276.1      firestart+       2 2013-05-31T15:07:09 2013-05-31T15:08:10          42598 taurusi1041
35276.2      firestart+       4 2013-05-31T15:09:43 2013-05-31T15:10:44          21008 taurusi1041
35276.3      firestart+       8 2013-05-31T15:12:31 2013-05-31T15:13:32          66126 taurusi1041
35276.4      firestart+      16 2013-05-31T15:15:20 2013-05-31T15:16:22          65763 taurusi1041
```

I've found that the problem comes from the last poll we make in jobacct_gather when the last task is removed. The last task removed may be different from the task on which the final poll is made. Since we don't care which task we use to calculate the node's energy consumption, we need to be certain that the value is zero before we make the last aggregation. The attached patch fixes the problem; let me know if you agree.