Ticket 9296

Summary: CPU usage information for job allocation is reported as zero when -X option is used
Product: Slurm
Reporter: Jim Long <jlong1s>
Component: Accounting
Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
QA Contact: Ben Roberts <ben>
Severity: 3 - Medium Impact
Priority: ---
CC: da
Version: 20.02.3
Hardware: Linux
OS: Linux
Site: NCSA

Description Jim Long 2020-06-29 12:21:31 MDT
I have a simple test script which does a bit of memory allocation just to
generate some user and system CPU usage -

#!/bin/bash
#SBATCH -N 3
#SBATCH -n 40
#SBATCH -J mytest
#SBATCH -A aaa
#SBATCH -t 100
#SBATCH --mem=200G
hostname
sleep 2
srun -n3 ./memalloc
srun -n3 ./memalloc
srun -n3 ./memalloc

After the script is run, I use sacct to check the CPU usage stats -

[jlong@iforgehn2 testing]$ sacct -j 175 --format jobid,state,user,NNodes,partition,TotalCPU,SystemCPU,UserCPU

       JobID      State      User   NNodes  Partition   TotalCPU  SystemCPU    UserCPU
------------ ---------- --------- -------- ---------- ---------- ---------- ----------
175           COMPLETED     jlong        3     normal  00:52.695  00:43.278  00:09.417
175.batch     COMPLETED                  1             00:00.098  00:00.056  00:00.042
175.0         COMPLETED                  3             00:17.540  00:14.363  00:03.177
175.1         COMPLETED                  3             00:17.527  00:14.452  00:03.075
175.2         COMPLETED                  3             00:17.527  00:14.407  00:03.120


The usage for the job allocation gets reported as the total for all of the job steps as expected.

However, when I add the -X argument to see just the job allocation stats, the CPU usage stats suddenly get reported as zero -

[jlong@iforgehn2 testing]$ sacct -j 175 -X   --format jobid,state,user,NNodes,partition,TotalCPU,SystemCPU,UserCPU

       JobID      State      User   NNodes  Partition   TotalCPU  SystemCPU    UserCPU
------------ ---------- --------- -------- ---------- ---------- ---------- ----------
175           COMPLETED     jlong        3     normal   00:00:00   00:00:00   00:00:00
Comment 1 Jeff DeGraw 2020-07-01 11:08:19 MDT
Jim,

Thanks for bringing this to our attention. I'm looking into it now and will update you with any progress.

- Jeff
Comment 2 Jeff DeGraw 2020-07-01 12:15:04 MDT
Jim,

From the sacct man page:
> -X, --allocations
>     Only show statistics relevant to the job allocation itself, not taking steps into consideration.

Allocations/jobs don't actually run anything on the node, steps do.  You can't get any cpu utilization stats without steps. This is functioning as intended.
Comment 3 Jim Long 2020-07-01 14:23:16 MDT
I'm a little confused here then.

Why do I get CPU usage numbers on the first non-header line when the -X
argument is not used? Isn't that reflecting the stats for the job/allocation?

How/why is that different than what is reported with -X?


I guess what I am asking is - why doesn't the first non-header line match whether or not the -X argument is used?
Comment 4 Jeff DeGraw 2020-07-01 15:00:57 MDT
Jim,

(In reply to Jim Long from comment #3)
> I guess what I am asking is - why doesn't the first non-header line match
> whether or not the -X argument is used?

Yes, it is a bit confusing. It has to do with the way the code queries the database. After it gets all of the steps for a job, it aggregates their information and reports that as the total for the job. With the -X option, no steps are pulled from the database, so it has nothing to aggregate.

> -X, --allocations
>     Only show statistics relevant to the job allocation itself, not taking steps into consideration.
I thought I was taking "not taking steps into consideration" too literally by assuming it would entirely ignore stats from steps, but that's exactly what the intention is.

This code can be found in the file src/sacct/options.c:get_data() if you want to see it. Does that all make sense?
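
As a rough illustration of the aggregation described above (this is not the actual code in src/sacct/options.c), the job-level figure can be thought of as a sum over the step records. The helper names below are hypothetical, and the time parsing assumes the [DD-[HH:]]MM:SS[.mmm] layout seen in the sacct output earlier in this ticket.

```python
from decimal import Decimal


def cpu_time_to_ms(text):
    """Convert a sacct-style time string to integer milliseconds.

    Assumed layouts: 'MM:SS.mmm', 'HH:MM:SS', or 'D-HH:MM:SS'.
    """
    days, _, clock = text.rpartition("-")   # days is '' when no 'D-' prefix
    parts = clock.split(":")
    seconds = Decimal(parts[-1])            # may carry a fractional part
    minutes = int(parts[-2]) if len(parts) >= 2 else 0
    hours = int(parts[-3]) if len(parts) >= 3 else 0
    whole = ((int(days or 0) * 24 + hours) * 60 + minutes) * 60
    return int((whole + seconds) * 1000)


def job_total_ms(step_times):
    """Sum per-step CPU times; with no steps fetched the total is zero."""
    return sum(cpu_time_to_ms(t) for t in step_times)


# SystemCPU values for 175.batch, 175.0, 175.1 and 175.2 from the description
steps = ["00:00.056", "00:14.363", "00:14.452", "00:14.407"]
print(job_total_ms(steps))   # total when the steps are fetched
print(job_total_ms([]))      # total when no steps are fetched, as with -X
```

With the SystemCPU values from job 175, the sum is 43278 ms, which is the 00:43.278 shown on the job line. An empty step list gives 0, which mirrors the zeroed output seen with -X.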


- Jeff
Comment 5 Jim Long 2020-07-01 15:25:15 MDT
It makes sense, but it is not very useful.  There is no way to get total CPU stats without
displaying all of the steps too.   Perhaps there should be an aggregate option.

From a billing perspective I'm probably not interested in the steps, but might be interested in total resource usage.
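
One possible workaround, sketched under assumptions: run sacct without -X so the steps are fetched and aggregated, then keep only the job line in a post-processing step. The function below is hypothetical. It assumes the text comes from something like `sacct -j <jobid> --noheader --parsable2 --format jobid,totalcpu,systemcpu,usercpu`, and that step rows are the ones whose JobID contains a dot (175.batch, 175.0, ...).

```python
def allocation_rows(parsable_output):
    """Return only the job/allocation rows from pipe-delimited sacct output.

    Rows whose first field (JobID) contains '.' are treated as steps
    and dropped; everything else is kept as a list of fields.
    """
    rows = []
    for line in parsable_output.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        fields = line.split("|")
        if "." not in fields[0]:
            rows.append(fields)
    return rows


# Sample text built from the values in the description of this ticket
sample = "\n".join([
    "175|00:52.695|00:43.278|00:09.417",
    "175.batch|00:00.098|00:00.056|00:00.042",
    "175.0|00:17.540|00:14.363|00:03.177",
    "175.1|00:17.527|00:14.452|00:03.075",
    "175.2|00:17.527|00:14.407|00:03.120",
])
for row in allocation_rows(sample):
    print(row)
```

This keeps one line per job with the aggregated CPU totals, which is closer to what a billing report would want than the -X output.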
Comment 11 Jeff DeGraw 2020-07-14 10:00:35 MDT
Jim,

We have documented this special case and it will appear on the man page for future releases. I will go ahead and close this now. Don't hesitate to reach out if you have any questions.

- Jeff