Ticket 5372

Summary: Depending upon the query parameters sacct can report endtime incorrectly for running jobs
Product: Slurm Reporter: Greg Wickham <greg.wickham>
Component: AccountingAssignee: Broderick Gardner <broderick>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: KAUST Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 18.08
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Greg Wickham 2018-07-02 00:16:28 MDT
Whilst pulling usage information using sacct it appears that depending upon the query parameters the reported end time varies for the running job:

- no start or end time specified (for the query) - reported data is correct (job is running: end time of the job is unknown).

$ sacct -j 11369774 --duplicates --truncate --allocations --noheader --parsable2  -N dgpu702-16  --format=JobIDRaw,Elapsed,NodeList,Start,End
11369774|05:13:46|dgpu702-16|2018-07-02T03:57:55|Unknown

- specifying a start time then the end time is shown as the current time (not correct: job is still running)

$ sacct -j 11369774 --duplicates --truncate --allocations --noheader --parsable2  -N dgpu702-16  --format=JobIDRaw,Elapsed,NodeList,Start,End --starttime=2018-07-01T00:00:00
11369774|05:14:10|dgpu702-16|2018-07-02T03:57:55|2018-07-02T09:12:05

- specifying a start time and an end time then then end time is shown as the end time as defined by the query (not correct: job is still running - reported end time exceeds run time of job)

$ sacct -j 11369774 --duplicates --truncate --allocations --noheader --parsable2  -N dgpu702-16  --format=JobIDRaw,Elapsed,NodeList,Start,End --starttime=2018-07-01T00:00:00 --endtime=2018-08-01T00:00:00 
11369774|29-20:02:05|dgpu702-16|2018-07-02T03:57:55|2018-08-01T00:00:00
Comment 2 Broderick Gardner 2018-07-02 15:51:06 MDT
I will look into reproducing this and get back to you.
Comment 3 Broderick Gardner 2018-07-03 16:16:05 MDT
This appears to be the intended behavior of the --truncate flag for the third query you sent, with starttime and endtime set.  

--truncate
    Truncate time. So if a job started before --starttime the start time would be truncated to --starttime.  The same for end time and --endtime.

In your second query, the --endtime isn't set, so I am looking into the behavior under those conditions.
Comment 9 Broderick Gardner 2018-07-18 14:41:41 MDT
As an update for you, we have a patch in review that fixes this issue. 

We found that the issue lies in the combination of the --truncate and --nodelist(-N) flags; the --nodelist flag changes the internal filter endtime to the current time, then --truncate sets the endtime of the job to that time.  
The patch will also change the behavior of the elapsed time field; you didn't mention it, but it is also inconsistent when using the --truncate flag.  Now, it will be the time since start when there is no end time and end time minus start time when there is.  

I'll let you know when the patch is committed.
Comment 22 Broderick Gardner 2018-08-24 15:14:35 MDT
A fix has been added to Slurm-18.08. To summarize, we have determined that the truncate flag should never return an unknown time because it is used to report running times in a certain window of time. So --endtime now defaults to now, and --endtime in the future is set to now; you can't report running time in the future. Pending jobs have both the start and end time truncated to --endtime, since they did not run during the window. 

You haven't clarified what you wanted with sacct, but it looks like you were misunderstanding the usage of the truncate flag. In any case, I'm closing this bug now.