From reading your manual it says that I can query for a given state within a timeframe... however this very clearly does not work as documented:

    $ sacct --allusers --nodelist=tempest03 --starttime=now-1hour
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    289690             bash  emulation    general         10  COMPLETED      0:0
    289690.exte+     extern               general         10  COMPLETED      0:0
    289690.0           bash               general         10  COMPLETED      0:0
    289726             bash  emulation    general         10  COMPLETED      0:0
    289726.exte+     extern               general         10  COMPLETED      0:0
    289726.0           bash               general         10  COMPLETED      0:0
    289822             bash  emulation    general         40 CANCELLED+      0:0
    289822.exte+     extern               general         40  COMPLETED      0:0
    289822.0           bash               general         40 CANCELLED+      0:9
    290040             bash  emulation    general         10  COMPLETED      0:0

Exact same query with states returns no results:

    $ sacct --allusers --nodelist=tempest03 --starttime=now-1hour --state=cd
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------

    $ sacct --allusers --nodelist=tempest03 --starttime=now-1hour --state=ca
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------

    $ sacct --allusers --nodelist=tempest03 --starttime=now-1hour --state=r
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    289690             bash  emulation    general         10  COMPLETED      0:0
    289690.exte+     extern               general         10  COMPLETED      0:0
    289690.0           bash               general         10  COMPLETED      0:0

This isn't a lack of data.
If I use starttime=1201 or without nodelist I get tens of thousands of records, but when I select a state I get nothing:

    $ sacct --allusers --starttime=1201 --state=ca
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------

    $ sacct --allusers --starttime=1201 --state=oom
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------

    $ sacct --allusers --starttime=1201 --state=cd
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------

However there are thousands of records that match:

    $ sacct --allusers --starttime=1201 |grep CANCELLED |wc -l
    2363
Aha, it seems that the confusing part is this: https://slurm.schedmd.com/sacct.html#OPT_state

> Selects jobs based on their state during the time period given. Unless otherwise specified, the start and end time will be the current time when the --state option is specified

Further, there is a note at the end of the docs that says:

> NOTE: When specifying states and no start time is given the default start time is 'now'. This is only when -j is not used. If -j is used the start time will default to 'Epoch'. In both cases if no end time is given it will default to 'now'.

Read again: "In both cases if no end time is given it will default to NOW"

It most certainly does not. It seems that the end time will always be the time specified for the start time, which is not what anyone would expect given the statements above.
Hi Jo,

Thanks for the details about what you are looking at. I can see how that note can be confusing. It does reference looking at the "DEFAULT TIME WINDOW" section of the documentation for more details. I think there is a clearer explanation of the behavior there.

https://slurm.schedmd.com/sacct.html#SECTION_DEFAULT-TIME-WINDOW

Specifically there is a section that talks about the case you're looking at:

    WITHOUT --jobs AND WITH --state specified:
    --starttime defaults to Now.
    --endtime defaults to --starttime and to Now if --starttime is not specified.

I would also point out that you can add '-v' to the sacct command to see the time window that is being used.

We want to make the documentation as usable as possible. Would it have been more helpful if that Note in the --state section were removed and there were just a reference to see the "DEFAULT TIME WINDOW" section?

Thanks,
Ben
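Given the defaults quoted above, the window only collapses to a single instant when --endtime is left out. A minimal sketch of one way to avoid that, assuming a hypothetical wrapper called sacct_cmd (not part of Slurm) and assuming that "now" is the end of the window the caller wants. It only prints the command it would run, so it can be inspected before anything touches the database:

```shell
# Hypothetical helper (not part of Slurm): print a sacct command line that
# always carries an explicit end time, so the --state default
# ("--endtime defaults to --starttime") never applies.
sacct_cmd() {
    have_end=0
    for arg in "$@"; do
        case "$arg" in
            # Long form, long form with value, short form, short form with value
            --endtime|--endtime=*|-E|-E?*) have_end=1 ;;
        esac
    done
    if [ "$have_end" -eq 0 ]; then
        # Assumption: "now" is the end of the window the caller wants.
        set -- "$@" --endtime=now
    fi
    # Echo instead of executing; adding -v makes sacct report the window it uses.
    echo "sacct $*"
}

sacct_cmd --allusers --nodelist=tempest03 --starttime=now-1hour --state=cd
```

If the caller already supplies an end time, the arguments pass through unchanged, so an existing script that relies on an explicit window keeps its behavior.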
(In reply to Ben Roberts from comment #3)
> note can be confusing. It does reference looking at the "DEFAULT TIME
> WINDOW" section of the documentation for more details.

I don't think it's reasonable to say "this value will be X, see (reference) for more details" and for the person to find out it's not really X at all. "More details" does not, to any person I know, mean "invalidates the information given here". I don't know if this is a translation problem, but English speakers expect "more" to mean "expands upon".

> We want to make the documentation as usable as possible. Would it have been
> more helpful if that Note in the --state section were removed and there were
> just a reference to see the "DEFAULT TIME WINDOW" section?

I'm not certain what the best answer is, but I can say that I have found numerous places where the documentation is redundant and disagrees with itself, and I think it overall needs some serious attention.

FWIW I think the behavior described in the argument description is reasonable and would also meet the expectations of most people. Further, a review of Stack Overflow finds your own team members telling people to use sacct with those expectations. So this behavior changed at some point in the last few years to something that defies your team's previous advice.

I think the behavior as implemented (and described in the DEFAULT-TIME-WINDOW section) defies logic or reason and violates the expectations of most if not all people. It simply does not follow that if I query with a start time, I want the end time to be the same. The behavior as described on the argument is a sane behavior, and I consider the current implementation to be a bug that will cause confusion for every single person who uses it.
Hi Jo,

Apologies for the delayed response. I wanted to have a discussion internally about whether it made sense on our side to change the default behavior in the situation you bring up. In the end we decided to leave the default behavior as it is, since it has been that way for a long time and any changes to that default are likely to affect scripts that sites may have had in place for an equally long amount of time.

The current behavior also acts as a safety mechanism to prevent sites from accidentally running a query that is so large that it negatively impacts the database performance. If a user specifies a start time that is several years back and forgets to put an end time, then the database would try to get all the job records in that window of time, which could be in the millions. The same request with the current default end time would limit the query to a much more manageable number of jobs. Users can still request a huge number of jobs, but they at least have to think about what they're doing now.

I will work on updating the documentation to make it clearer that there are different situations that affect the default values of the start and end times.

Thanks for your understanding.
Ben