Ticket 21573 - documentation for sacct --endtime says no value == now but that's not true
Summary: documentation for sacct --endtime says no value == now but that's not true
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 24.05.4
Hardware: Linux Linux
: 6 - No support contract
Assignee: Jacob Jenson
 
Reported: 2024-12-04 18:26 MST by Jo Rhett
Modified: 2025-01-03 14:38 MST
Site: -Other-
Description Jo Rhett 2024-12-04 18:26:07 MST
Your manual says that I can query for a given state within a timeframe. However, this very clearly does not work as documented:

$ sacct --allusers --nodelist=tempest03 --starttime=now-1hour
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
289690             bash  emulation    general         10  COMPLETED      0:0
289690.exte+     extern               general         10  COMPLETED      0:0
289690.0           bash               general         10  COMPLETED      0:0
289726             bash  emulation    general         10  COMPLETED      0:0
289726.exte+     extern               general         10  COMPLETED      0:0
289726.0           bash               general         10  COMPLETED      0:0
289822             bash  emulation    general         40 CANCELLED+      0:0
289822.exte+     extern               general         40  COMPLETED      0:0
289822.0           bash               general         40 CANCELLED+      0:9
290040             bash  emulation    general         10  COMPLETED      0:0

Exact same query with states returns no results:

$ sacct --allusers --nodelist=tempest03 --starttime=now-1hour --state=cd
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

$ sacct --allusers --nodelist=tempest03 --starttime=now-1hour --state=ca
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

$ sacct --allusers --nodelist=tempest03 --starttime=now-1hour --state=r
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
289690             bash  emulation    general         10  COMPLETED      0:0
289690.exte+     extern               general         10  COMPLETED      0:0
289690.0           bash               general         10  COMPLETED      0:0

This isn't a lack of data. If I use starttime=1201, or omit nodelist, I get tens of thousands of records, but when I select a state I get nothing:

$ sacct --allusers --starttime=1201 --state=ca
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

$ sacct --allusers --starttime=1201 --state=oom
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

$ sacct --allusers --starttime=1201 --state=cd
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

However, there are thousands of records that match:

$ sacct --allusers --starttime=1201 |grep CANCELLED |wc -l
2363
Comment 1 Jo Rhett 2024-12-04 19:11:05 MST
Aha, it seems that the confusing part is this:

https://slurm.schedmd.com/sacct.html#OPT_state
> Selects jobs based on their state during the time period given. Unless otherwise specified, the start and end time will be the current time when the --state option is specified

Further, there is a note at the end of the docs that says:

> NOTE: When specifying states and no start time is given the default start time is 'now'. This is only when -j is not used. If -j is used the start time will default to 'Epoch'. In both cases if no end time is given it will default to 'now'. 

Read again: "In both cases if no end time is given it will default to NOW"

It most certainly does not.

It seems that the end time will always be the time specified for the start time, which is not what anyone would expect given the statements above.
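
One possible reading of the `--state=r` result in the description, where job 289690 is listed even though its State column shows COMPLETED: if the window collapses to the single instant `now-1hour`, only jobs that were running at that instant can match. The sketch below illustrates that idea under a simplifying assumption (a job matches when its run interval overlaps the window). The function name `was_running_during` and all job timings are hypothetical; nothing here is taken from the sacct source or from this site's accounting data.

```python
from datetime import datetime, timedelta


def was_running_during(job_start: datetime, job_end: datetime,
                       window_start: datetime, window_end: datetime) -> bool:
    """Simplified assumption: a job matches --state=r if its run interval
    overlaps the query window (endpoints inclusive)."""
    return job_start <= window_end and job_end >= window_start


now = datetime(2024, 12, 4, 18, 26)
instant = now - timedelta(hours=1)  # start and end both at now-1hour

# Hypothetical timings, not taken from the ticket's accounting data.
long_job = (now - timedelta(hours=2), now - timedelta(minutes=30))
short_job = (now - timedelta(minutes=40), now - timedelta(minutes=10))

print(was_running_during(*long_job, instant, instant))   # job spans the instant
print(was_running_during(*short_job, instant, instant))  # job started after it
print(was_running_during(*short_job, instant, now))      # full one-hour window
```

Under this model a zero-length window misses any job that started after the given start time, no matter what state it ended in.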
Comment 3 Ben Roberts 2024-12-05 11:50:03 MST
Hi Jo,

Thanks for the details about what you are looking at.  I can see how that note can be confusing.  It does reference looking at the "DEFAULT TIME WINDOW" section of the documentation for more details.  I think there is a clearer explanation of the behavior there.
https://slurm.schedmd.com/sacct.html#SECTION_DEFAULT-TIME-WINDOW

Specifically there is a section that talks about the case you're looking at:
  WITHOUT --jobs AND WITH --state specified:
  --starttime defaults to Now.
  --endtime defaults to --starttime and to Now if --starttime is not specified.
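
The rule quoted above can be written out as a small function. This is a minimal sketch of the documented defaults for the "WITHOUT --jobs AND WITH --state" case only, not sacct's actual implementation; the name `state_query_window` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def state_query_window(
    now: datetime,
    starttime: Optional[datetime] = None,
    endtime: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Model the documented defaults for --state without --jobs.

    Hypothetical helper: it mirrors the DEFAULT TIME WINDOW text quoted
    above, not the sacct source code.
    """
    # --endtime defaults to --starttime, and to Now if --starttime is not specified.
    if endtime is None:
        endtime = starttime if starttime is not None else now
    # --starttime defaults to Now.
    if starttime is None:
        starttime = now
    return starttime, endtime


now = datetime(2024, 12, 4, 18, 26)
one_hour_ago = now - timedelta(hours=1)

# --starttime=now-1hour --state=cd: the window collapses to a single instant.
print(state_query_window(now, starttime=one_hour_ago))
# --starttime=now-1hour --endtime=now --state=cd: a real one-hour window.
print(state_query_window(now, starttime=one_hour_ago, endtime=now))
```

In the first call both ends of the window are `now-1hour`, which matches the behavior reported in the description.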


I would also point out that you can add '-v' to the sacct command to see the time window that is being used.
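
For scripts that wrap sacct, one way to avoid relying on the defaults is to always pass `--endtime` explicitly and to add `-v` while debugging. The following is a sketch under that assumption; the helper `build_sacct_state_query` is hypothetical and only builds the argument list from options that already appear in this ticket plus `--endtime`. It does not run sacct.

```python
import shlex
from typing import List, Optional


def build_sacct_state_query(
    state: str,
    starttime: str,
    endtime: str = "now",
    nodelist: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Build an sacct argument list that always passes --endtime explicitly."""
    argv = ["sacct", "--allusers"]
    if nodelist:
        argv.append(f"--nodelist={nodelist}")
    argv += [f"--starttime={starttime}", f"--endtime={endtime}", f"--state={state}"]
    if verbose:
        # -v makes sacct report the time window it is actually using.
        argv.append("-v")
    return argv


cmd = build_sacct_state_query("cd", "now-1hour", nodelist="tempest03", verbose=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a host where sacct is installed.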

We want to make the documentation as usable as possible.  Would it have been more helpful if that Note in the --state section were removed and there were just a reference to see the "DEFAULT TIME WINDOW" section?

Thanks,
Ben
Comment 4 Jo Rhett 2024-12-05 12:22:59 MST
(In reply to Ben Roberts from comment #3)
> note can be confusing.  It does reference looking at the "DEFAULT TIME
> WINDOW" section of the documentation for more details.

I don't think it's reasonable to say "this value will be X, see (reference) for more details" and then have the reader find out it's not really X at all. To no one I know does "more details" mean "invalidates the information given here". I don't know if this is a translation problem, but English speakers expect "more" to mean "expands upon".

> We want to make the documentation as usable as possible.  Would it have been
> more helpful if that Note in the --state section were removed and there were
> just a reference to see the "DEFAULT TIME WINDOW" section?

I'm not certain what the best answer is, but I can say that I have found numerous places where the documentation is redundant and disagrees with itself, and I think it needs some serious attention overall.

FWIW I think the behavior described in the argument description is reasonable and would also meet the expectations of most people. Further, a review of Stack Overflow finds your own team members telling people to use sacct with those expectations. So this behavior changed at some point in the last few years to something that defies your team's previous advice.

I think the behavior as implemented (and described in the DEFAULT-TIME-WINDOW section) defies logic or reason and violates the expectations of most if not all people. It simply does not follow that if I query with a start time, I want the end time to be the same. The behavior as described on the argument is sane, and I consider the current implementation to be a bug that will confuse every single person who uses it.
Comment 8 Ben Roberts 2024-12-11 09:16:03 MST
Hi Jo,

Apologies for the delayed response.  I wanted to have a discussion internally about whether it made sense on our side to change the default behavior in the situation you bring up.  In the end we decided to leave the default behavior as it is since it has been that way for a long time and any changes to that default are likely to affect scripts that sites may have had in place for an equally long amount of time.  

The current behavior also acts as a safety mechanism to prevent sites from accidentally running a query that is so large that it negatively impacts the database performance.  If a user specifies a start time that is several years back and forgets to put an end time, then the database would try to get all the job records in that window of time, which could be in the millions.  The same request with the current default end time would limit the query to a much more manageable number of jobs.  Users can still request a huge number of jobs, but they at least have to think about what they're doing now.

I will work on updating the documentation to make it clearer that there are different situations that affect the default values of the start and end times.  

Thanks for your understanding.
Ben