Summary: | sacct truncate gives incorrect results in two scenarios for currently running jobs | ||
---|---|---|---|
Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
Component: | Database | Assignee: | Albert Gil <albert.gil> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | albert.gil, broderick, yhe |
Version: | 17.11.9 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6686 https://bugs.schedmd.com/show_bug.cgi?id=6697 https://bugs.schedmd.com/show_bug.cgi?id=6755 | ||
Site: | NERSC | Slinky Site: | --- |
CLE Version: | | Version Fixed: | 18.08 |
Description
Doug Jacobsen
2018-09-11 17:43:40 MDT
Prior to version 18.08, the truncate flag is broken and inconsistent with running jobs (no end time) and pending jobs. Bug 5372 was submitted about this same issue, at least your first issue. The fix also involved a change in behavior, so it was not included in 17.11. Here is the commit:

5a6a26f6438d43d49bc5df8e2d78fc71ebf72d9d

Due to other changes to the way flags are passed here, it won't apply cleanly to 17.11.

For 18.08, it was decided that the results when truncating should never return "Unknown". So, running jobs without an end time are truncated to the end of the window. This will never return more running time than actually occurred because the window end time (--endtime) is also truncated to the current time if it is in the future or unspecified. The motivation is that the truncate flag should make sacct return running time within the specified time window. Let me know if you have questions or concerns about this change.

I am investigating your second issue. I have confirmed it in both 17.11 and 18.08.

Update: I found the cause of issue 2. It is specific to running jobs because in the database time_end is 0 for unknown. The query looks for jobs where -S time is between time_start and time_end or time_start is between -S and -E (where -S and -E are the sacct start and end args). I am working on a fix, as well as a proper specification of the correct behavior.

Thank you for the update!

Hi Doug,

I'm taking this bug (issue 2) from Broderick. We have already replicated and it only happens if all these conditions are met:

- The --state is specified
  - At this moment only replicated/tested for Running jobs
- The --starttime is specified and set to a time after the actual Start
  - If we set it to a time before the actual Start it works fine
- The --endtime is also specified
  - Value seems to be not relevant

The --truncate flag is not relevant. The expected behavior is the job to be reported, but it's not.
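As a rough illustration of the two issues discussed above, here is a minimal Python sketch. It is not Slurm code: the function names, the inclusive comparisons, and the use of plain integer timestamps are assumptions made for illustration. The "open-ended" variant is one possible way to handle time_end = 0 and is not necessarily what the actual fix does.

```python
def truncate_job_times(job_start, job_end, win_start, win_end, now):
    """Clamp a job's times to the window, following the 18.08 rule above.

    job_end == 0 stands for "unknown" (job still running), as in the database.
    win_end is None when --endtime is unspecified.
    """
    # The window end is itself cut to the current time if it is in the
    # future or unspecified, so no unearned running time is reported.
    if win_end is None or win_end > now:
        win_end = now
    # A running job has no end time, so it is truncated to the window end.
    end = win_end if job_end == 0 else min(job_end, win_end)
    start = max(job_start, win_start)
    return start, end


def matches_as_described(time_start, time_end, s, e):
    """Selection predicate as described in the ticket (hypothetical form)."""
    return (time_start <= s <= time_end) or (s <= time_start <= e)


def matches_open_ended(time_start, time_end, s, e, now):
    """Assumed alternative: treat time_end == 0 as "still running at now"."""
    effective_end = now if time_end == 0 else time_end
    return time_start <= e and effective_end >= s


# Running job that started at t=100; window is -S 200, -E 300; now is 250.
print(matches_as_described(100, 0, 200, 300))     # job is missed
print(matches_open_ended(100, 0, 200, 300, 250))  # job is selected
print(truncate_job_times(100, 0, 200, 300, 250))  # clamped start and end
```

With time_end stored as 0, the first clause of the described predicate can never hold for a start time later than the job's start, and the second clause fails because the job started before -S, which matches the conditions listed above.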
To avoid collisions with bug 5372 we are working on a patch for the 18.08 version. Broderick also worked on the internal db queries, so I'm taking it from there.

Albert

Sounds good, thank you!

Hi Doug,

Starting from this bug we've been looking deeper into other corner cases when using -S and -E with -j and/or -s. We've fixed other cases, added several tests, and improved the documentation and verbose messages to avoid confusion (the default time window deserves a section in the sacct manpage).

It's already committed on 18.08:

https://github.com/SchedMD/slurm/commit/cc153e036686ea82fefaa173b6fab6c844fa2f09
https://github.com/SchedMD/slurm/commit/ade9101e95925b01005b465c5451e243578c2fd8
https://github.com/SchedMD/slurm/commit/2c709953d291669e5142e67e42826bb57082c372
https://github.com/SchedMD/slurm/commit/9d6d0c6e29336506189f2145ef2c6a6648e5d03e
https://github.com/SchedMD/slurm/commit/a62891fdc7a21785e11107c5d27d2be0dcfdd990

We think that we can close this bug as fixed. But please, feel free to reopen it if you have any feedback,

Albert

*** Ticket 6188 has been marked as a duplicate of this ticket. ***