Ticket 5717

Summary: sacct truncate gives incorrect results in two scenarios for currently running jobs
Product: Slurm
Reporter: Doug Jacobsen <dmjacobsen>
Component: Database
Assignee: Albert Gil <albert.gil>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: albert.gil, broderick, yhe
Version: 17.11.9   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6686
https://bugs.schedmd.com/show_bug.cgi?id=6697
https://bugs.schedmd.com/show_bug.cgi?id=6755
Site: NERSC
Version Fixed: 18.08

Description Doug Jacobsen 2018-09-11 17:43:40 MDT
Hello,

We've found a couple of issues with sacct --truncate.


Issue #1
For a currently running job, when the query window includes the currently running portion of the job, sacct fills in the end time with the end of the window, even though it should be "Unknown", and it also apportions the cputime incorrectly (probably two faces of the same issue).


Issue #2
For a currently running job, when the query window includes the currently running portion of the job and the Running state is selected (-s R), the job is dropped from the output.



ctl2:~ # sacct -j 14455644 -X --format=job,start,end,timelimit,cputime,state
       JobID               Start                 End  Timelimit    CPUTime      State
------------ ------------------- ------------------- ---------- ---------- ----------
14455644     2018-09-10T06:47:41             Unknown 2-00:00:00 44-20:56:32    RUNNING
ctl2:~ # sacct -j 14455644 -X --format=job,start,end,timelimit,cputime,state -T --start=2018-09-10 --end=2018-09-11
       JobID               Start                 End  Timelimit    CPUTime      State
------------ ------------------- ------------------- ---------- ---------- ----------
14455644     2018-09-10T06:47:41 2018-09-11T00:00:00 2-00:00:00 22-22:34:08    RUNNING
ctl2:~ # sacct -j 14455644 -X --format=job,start,end,timelimit,cputime,state -T --start=2018-09-11 --end=2018-09-12
       JobID               Start                 End  Timelimit    CPUTime      State
------------ ------------------- ------------------- ---------- ---------- ----------
14455644     2018-09-11T00:00:00 2018-09-12T00:00:00 2-00:00:00 32-00:00:00    RUNNING
ctl2:~ #
ctl2:~ #
ctl2:~ # sacct -j 14455644 -X --format=job,start,end,timelimit,cputime,state -T --start=2018-09-10 --end=2018-09-11 -s R
       JobID               Start                 End  Timelimit    CPUTime      State
------------ ------------------- ------------------- ---------- ---------- ----------
14455644     2018-09-10T06:47:41 2018-09-11T00:00:00 2-00:00:00 22-22:34:08    RUNNING
ctl2:~ # sacct -j 14455644 -X --format=job,start,end,timelimit,cputime,state -T --start=2018-09-11 --end=2018-09-12 -s R
       JobID               Start                 End  Timelimit    CPUTime      State
------------ ------------------- ------------------- ---------- ---------- ----------
ctl2:~ #

Thank you,
Doug
Comment 2 Broderick Gardner 2018-09-12 14:03:57 MDT
Prior to version 18.08, the truncate flag is broken and inconsistent with running jobs (no end time) and pending jobs. Bug 5372 was submitted about this same issue, at least your first issue. The fix also involved a change in behavior, so it was not included in 17.11.  

Here is the commit: 5a6a26f6438d43d49bc5df8e2d78fc71ebf72d9d
Due to other changes to the way flags are passed here, it won't apply cleanly to 17.11. 

For 18.08, it was decided that the results when truncating should never return "Unknown". So, running jobs without an end time are truncated to the end of the window. This will never return more running time than actually occurred because the window end time (--endtime) is also truncated to the current time if it is in the future or unspecified. 

The motivation is that the truncate flag should make sacct return running time within the specified time window. Let me know if you have questions or concerns about this change. 
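
The 18.08 truncation rule described above can be sketched as follows. This is only an illustration of the clamping logic, not Slurm's code. The function names are hypothetical, and two inputs are inferred from the output in the description rather than stated there: 32 CPUs (from the 32-00:00:00 CPUTime reported for a 24-hour window) and a query time of 2018-09-11T16:26:57 (from the untruncated CPUTime of 44-20:56:32 divided by 32 CPUs).

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"


def truncated_cpu_seconds(job_start, job_end, win_start, win_end, now, ncpus):
    """CPU seconds of a job that fall inside the window (hypothetical helper).

    job_end is None for a running job ("Unknown" in sacct output).
    win_end is None when --endtime is not specified.
    """
    # The window end is clamped to the current time if it is in the
    # future or unspecified.
    effective_win_end = now if win_end is None else min(win_end, now)
    # A running job has no end time, so it is truncated to the window end.
    if job_end is None:
        effective_job_end = effective_win_end
    else:
        effective_job_end = min(job_end, effective_win_end)
    effective_job_start = max(job_start, win_start)
    elapsed = int((effective_job_end - effective_job_start).total_seconds())
    return max(0, elapsed) * ncpus


def fmt_cputime(seconds):
    """Format seconds as D-HH:MM:SS, the layout used in the output above."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


job_start = datetime.strptime("2018-09-10T06:47:41", FMT)
now = datetime.strptime("2018-09-11T16:26:57", FMT)  # inferred, see lead-in
ncpus = 32                                           # inferred, see lead-in

day1 = truncated_cpu_seconds(
    job_start, None,
    datetime(2018, 9, 10), datetime(2018, 9, 11), now, ncpus)
day2 = truncated_cpu_seconds(
    job_start, None,
    datetime(2018, 9, 11), datetime(2018, 9, 12), now, ncpus)

print(fmt_cputime(day1))         # 22-22:34:08
print(fmt_cputime(day2))         # 21-22:22:24
print(fmt_cputime(day1 + day2))  # 44-20:56:32
```

Under these assumptions the two daily windows add up to the untruncated CPUTime, whereas the 17.11 output in the description reports 32-00:00:00 for the second window, which is more running time than had actually occurred.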

I am investigating your second issue. I have confirmed it in both 17.11 and 18.08.
Comment 3 Broderick Gardner 2018-11-05 10:27:01 MST
Update:

I found the cause of issue 2. It is specific to running jobs because in the database time_end is 0 for unknown. The query looks for jobs where the -S time is between time_start and time_end, or time_start is between -S and -E (where -S and -E are the sacct start and end args). With time_end = 0 the first condition can never hold, so a running job that started before -S matches neither condition. I am working on a fix, as well as a proper specification of the correct behavior.
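
The selection condition described above can be modelled in a few lines to show why the running job disappears. This is a minimal sketch, not the actual database query: the function names are hypothetical, timestamps are plain epoch seconds, and the second predicate is only one possible way to treat time_end = 0 as "still running", not the fix that was committed.

```python
from datetime import datetime, timezone


def ts(year, month, day, hour=0, minute=0, second=0):
    """Epoch seconds for a UTC timestamp (the time zone is an assumption)."""
    return int(datetime(year, month, day, hour, minute, second,
                        tzinfo=timezone.utc).timestamp())


def matches_as_described(time_start, time_end, s, e):
    """Condition as paraphrased above: -S inside the job, or job start inside the window."""
    s_inside_job = time_start <= s <= time_end
    job_start_inside_window = s <= time_start <= e
    return s_inside_job or job_start_inside_window


def matches_open_ended(time_start, time_end, s, e):
    """Hypothetical variant: time_end == 0 means the job has not ended yet."""
    ended_before_window = time_end != 0 and time_end < s
    return time_start <= e and not ended_before_window


job_start = ts(2018, 9, 10, 6, 47, 41)
job_end = 0  # running job: end time unknown

# Window starting before the job start: the job is selected.
print(matches_as_described(job_start, job_end, ts(2018, 9, 10), ts(2018, 9, 11)))  # True
# Window starting after the job start: the job is dropped.
print(matches_as_described(job_start, job_end, ts(2018, 9, 11), ts(2018, 9, 12)))  # False
# The open-ended variant keeps the running job in both windows.
print(matches_open_ended(job_start, job_end, ts(2018, 9, 11), ts(2018, 9, 12)))    # True
```

The two windows correspond to the two `-s R` queries in the description: the first returns the job and the second returns nothing.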
Comment 5 Doug Jacobsen 2018-11-07 17:39:52 MST
Thank you for the update!
Comment 8 Albert Gil 2019-01-15 08:45:28 MST
Hi Doug, 

I'm taking this bug (issue 2) from Broderick.
We have already reproduced it, and it only happens if all of these conditions are met:
- The --state is specified
  - So far only reproduced/tested for Running jobs
- The --starttime is specified and set to a time after the actual Start
  - If we set it to a time before the actual Start it works fine
- The --endtime is also specified
  - Its value does not seem to be relevant

The --truncate flag is not relevant.
The expected behavior is for the job to be reported, but it is not.

To avoid collisions with bug 5372 we are working on a patch for the 18.08 version.

Broderick also worked on the internal db queries, so I'm taking it from there.

Albert
Comment 9 Doug Jacobsen 2019-01-15 09:06:23 MST
Sounds good, thank you!
Comment 52 Albert Gil 2019-02-20 01:15:13 MST
Hi Doug,

Starting from this bug, we've been looking deeper into other corner cases when using -S and -E with -j and/or -s.
We've fixed other cases, added several tests, and improved the documentation and verbose messages to avoid confusion (the default time window deserves a section in the sacct man page).

It's already committed on 18.08:
https://github.com/SchedMD/slurm/commit/cc153e036686ea82fefaa173b6fab6c844fa2f09
https://github.com/SchedMD/slurm/commit/ade9101e95925b01005b465c5451e243578c2fd8
https://github.com/SchedMD/slurm/commit/2c709953d291669e5142e67e42826bb57082c372
https://github.com/SchedMD/slurm/commit/9d6d0c6e29336506189f2145ef2c6a6648e5d03e
https://github.com/SchedMD/slurm/commit/a62891fdc7a21785e11107c5d27d2be0dcfdd990

We think that we can close this bug as fixed.
But please feel free to reopen it if you have any feedback,
Albert
Comment 53 Albert Gil 2019-02-27 02:35:58 MST
*** Ticket 6188 has been marked as a duplicate of this ticket. ***