| Summary: | Number of queued "ready-to-run" jobs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | hpc-cs-hd |
| Component: | Scheduling | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 18.08.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Cineca | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
hpc-cs-hd
2019-03-28 03:20:07 MDT
Greetings Alessandro Marani,
Can you clarify what you mean by this statement?
> how can we calculate the number of queued jobs that are "ready-to-run" at every timestamp?
Are you looking for the average run time of your jobs so that you can plan on a time to submit more batches of jobs?
I think it would be helpful for you to define the "timestamp" mentioned above.
You can use "sacct -e" to see a list of options that can be used to query data.
For example:
sacct -j 6397 --format=jobid,elapsed,elapsedraw,timelimit,timelimitraw
sacct also has start and end flags that let you constrain the time window of the query.
-S, --starttime
Select eligible jobs in any state after the specified time. Default is 00:00:00 of the current day, unless the '-s' or '-j' options are used. If the '-s' option is used, then the default is 'now'. If states are given with the '-s' option then only jobs in this state at this time will be returned. If the '-j' option is used, then the default time is Unix Epoch 0. See the DEFAULT TIME WINDOW for more details.
Valid time formats are...
-E end_time, --endtime=end_time
Select eligible jobs in any state before the specified time. If states are given with the -s option return jobs in this state before this period. See the DEFAULT TIME WINDOW for more details.
Valid time formats are...
For example:
sacct -S 2019-03-26 -E 2019-03-28 --format=jobid,elapsed,elapsedraw,timelimit,timelimitraw
Dear Jason,

What we mean by "ready-to-run" is the state of a job when it is free from any restriction (dependency, QoS limitation, ...) and can be moved to the "run" state, i.e. it can be executed as soon as the resources are available.

In depth, when we produce our statistics, we check the traffic history of a queue with:

sacct -anXP -S start_date -E end_date --format "jobid,partition,reqnodes,allocnodes,reqcpus,alloccpus,submit,eligible,start,end"

where "submit" is the time when the job was submitted, "eligible" is the time when the job was considered free from any restriction and could be switched to the "run" state, "start" is the time when the job switches to the run state, and "end" is when the job completes or fails. Under these rules, we calculate:

eligible - submit = time spent by the scheduler evaluating all job restrictions (dependency on another job, hold state, and so on);
start - eligible = time spent by the scheduler switching the job, which is free from any dependencies, from the pending state to the run state.

And we consider:

start - submit = queued time of a job;
start - eligible = pending time of a job.

But we think that "eligible" is not really the moment when the job is ready to be switched from the pending state to the run state. Indeed, we saw that many "eligible" values change during the "life" of a job in the queue. For example, a user releases a job from the "job held by user" state and, after a few minutes, puts the job on hold again: when the job is released once more, the eligible time is updated to the last change of state. So the eligible time is probably not a good choice for understanding when a job is ready to be switched to the "run" state.

We need this information because we have to understand when a job is free from any dependency and the only reasons that keep it from switching from "pending" to "run" are the unavailability of resources and/or a lower priority.
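The timing arithmetic described here can be sketched as a short script. It assumes pipe-separated records in the field order of the quoted `sacct -anXP ... --format "jobid,partition,reqnodes,allocnodes,reqcpus,alloccpus,submit,eligible,start,end"` command; the sample records are fabricated for illustration.

```python
from datetime import datetime

# Sample pipe-separated records in the field order produced by:
#   sacct -anXP -S start -E end \
#     --format "jobid,partition,reqnodes,allocnodes,reqcpus,alloccpus,submit,eligible,start,end"
# The job data below is made up for illustration.
SAMPLE = """\
101|prod|1|1|4|4|2019-03-26T10:00:00|2019-03-26T10:00:05|2019-03-26T10:12:00|2019-03-26T11:00:00
102|prod|2|2|8|8|2019-03-26T10:05:00|2019-03-26T10:30:00|2019-03-26T10:45:00|2019-03-26T12:00:00
"""

def parse_ts(stamp):
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")

def wait_components(line):
    fields = line.split("|")
    submit, eligible, start = (parse_ts(t) for t in fields[6:9])
    return {
        "jobid": fields[0],
        # eligible - submit: time spent evaluating restrictions (dependencies, holds, ...)
        "restriction_time": (eligible - submit).total_seconds(),
        # start - eligible: time a restriction-free job waited for resources/priority
        "pending_time": (start - eligible).total_seconds(),
        # start - submit: total queued time
        "queued_time": (start - submit).total_seconds(),
    }

for record in SAMPLE.splitlines():
    print(wait_components(record))
```

As the ticket goes on to discuss, this breakdown is only as trustworthy as the single Eligible timestamp sacct stores per job.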
Thanks in advance,
Alessandro

Hi Alessandro,

Jason has been busy with other work and asked me to take a look at this. What you're describing, the time a job is marked as "Eligible" not always being an accurate reflection of the state of the job, is true in cases like the one you describe. The job has a timestamp set for the Eligible field when it is first free of any restrictions that would prevent it from running. If other things later make the job unable to run again, that will affect scheduling for the cluster but won't be reflected in the sacct output.

I am a little confused by your description that when a user switches a job between held and eligible states, the Eligible time reflects the time of the last state change. I did some testing like this, and what I'm seeing is that the eligible time is set the first time the hold is released; subsequent changes to the hold state of the job don't update the timestamp. This behavior lines up with the code I'm looking at. If you are able to reproduce the behavior you're describing consistently, I'd be interested in seeing some example output from sbatch, scontrol show job, sacct, etc. so I can see the steps you're taking.

To your primary question though: if your environment has jobs that change from an eligible state to ineligible and back again, then the data currently gathered for the database isn't going to give you the information you're looking for. The sacct command you've been running is adequate for this data only under the assumption that eligible jobs remain eligible. If you need more detailed data about how many jobs are eligible at any given time, it would be a feature request. Would you be interested in sponsoring a feature like this?

Thanks,
Ben

Dear Ben,

Thanks for your reply. We expect to complete some tests next week, and we will let you know the results as soon as possible.
We are also thinking about several new features for job statistics to propose to you which, if implemented in future releases, may be helpful for us. We will share the list with you after the completion of our tests.

Best regards,
Alessandro

Hi Alessandro,

I wanted to check in with you and see if you were able to do the tests you had planned. Let me know if you still need help with this ticket.

Thanks,
Ben

Hi Ben,

Yeah, sorry, we're still working on the tests. Hope we can provide you the results soon.

Regards,
Alessandro

Dear Ben,

Sorry for making you wait. We completed our tests, and we noticed that:

1. Starting from the last minor release of Slurm, when a job is put in the hold state by the user and then released, the AGE factor of the priority formula is reset to 0. This did not seem to be the case in previous releases: when trying the experiment on a system with the previous release, the AGE factor appeared to keep accumulating priority even while in the hold state (the accumulation became visible only when the hold was released).

2. When a job is submitted with the BeginTime option, it seems to get, from the very beginning, all the AGE it would accumulate while waiting for its begin time. This usually results in the job starting immediately after its begin time passes, draining resources for it even though (in our opinion) the scheduler shouldn't consider it in any way before then.

3. When a job is submitted with the --hold option, the eligible time is set to Unknown, but if it is released and put on hold again later, the eligible time is set to the time when it was first released.

Our considerations about these tests are as follows. We think it is correct to reset AGE to 0 when a pending job is moved to the hold state; this can help us answer our question. However, an even preferable behaviour (for us) would be to "freeze" the value of AGE when the job is put on hold and restart the accumulation from there when it is released.
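The age-accrual behaviours under discussion (reset on hold, freeze while held, or keep accruing even while held) can be contrasted with a small toy model. This is not Slurm code, just an illustration of the three policies over a fabricated job timeline.

```python
# Toy model (not Slurm code) contrasting three ways a job's age-based
# priority could evolve across hold/release events:
#   "reset"  - accumulated age is wiped by a hold and restarts from zero
#              (the default behaviour observed in the tests above)
#   "freeze" - age stops while held and resumes where it left off
#              (the behaviour proposed in this ticket)
#   "always" - age keeps accruing even while held
def accrued_age(timeline, policy):
    """timeline: list of (minutes_elapsed, state) segments,
    with state in {"eligible", "held"}."""
    age = 0
    for minutes, state in timeline:
        if state == "eligible":
            age += minutes
        elif state == "held":
            if policy == "always":
                age += minutes      # keeps growing while held
            elif policy == "reset":
                age = 0             # hold wipes out accumulated age
            # "freeze": age is left unchanged while held
    return age

# 30 min eligible, then 60 min held, then 30 min eligible again:
timeline = [(30, "eligible"), (60, "held"), (30, "eligible")]
for policy in ("reset", "freeze", "always"):
    print(policy, accrued_age(timeline, policy))  # reset 30, freeze 60, always 120
```

On this timeline the three policies give the job 30, 60, and 120 minutes of age respectively, which is exactly the difference the ticket is debating.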
From a statistical point of view, do you think it would be possible to save the AGE factor in the Slurm database used by sacct? Also, is it true that the behaviour of the AGE factor in this regard changed after the last minor release? Is there a changelog for minor releases somewhere? On your website we found a log only for the last major release from last August.

About the use of the --begin option: if our supposition holds true, it behaves like a reservation, which is not what we want. We would prefer a job that doesn't start to accumulate age until its begin time arrives (which, in our view, should be the moment when the scheduler can start to take it into consideration, not the moment when the job must start), or at least a job that accumulates age gradually over time even in the pre-begin state instead of having it all up front, so that the scheduler treats it like any other job in this regard and doesn't start to drain resources for it before the begin time has arrived. Is it possible to disable the begin directive in the Slurm configuration? In which file? If not, do you think you could make some changes so that it can be enabled/disabled?

In all fairness, we always tried the begin option with an account that has the maximum (or close to the maximum) amount of fair-share, and this may affect the job's priority when it becomes eligible. We still have to test it with an account with a low amount of fair-share; in that case we expect the job to not necessarily start immediately after its begin time expires. Do you think our expectations match what we will actually get?

And, optionally, do you think it would be possible to save in the database all the timestamps related to every change of the reason for a pending job? This way we would have the full history of what happens during the waiting time of a job.

Thank you for your attention.
Regards,
Alessandro

Hi Alessandro,

That's ok, I'm glad you were able to run through some tests.

1. You're right that by default putting a hold on a job will reset the age counter, so it starts over when the hold is released. There is a PriorityFlags value called ACCRUE_ALWAYS you can set to have jobs always accrue age-based priority. I didn't see any recent changes to this flag other than this one, which makes sprio show jobs before they are eligible when the flag is present: https://github.com/SchedMD/slurm/commit/8782db29e7f44887d6b52ec265b038cdc05bc086 It's possible that this flag was set on your older system and not on the new one you're testing, which would account for the different behavior. We do have a NEWS file that contains a summary of all the changes made for each release of the code. You can find a copy of the NEWS file in the source when you download it, or on github here: https://github.com/SchedMD/slurm/blob/master/NEWS

2. This behavior should also be tied to the ACCRUE_ALWAYS flag. With ACCRUE_ALWAYS set, you would see the age-based priority begin to accrue from the time the job was submitted. Without this flag, it shouldn't start until the job reaches the begin time and becomes eligible. This seems to contradict what you're seeing in point 1, though: there the behavior indicates ACCRUE_ALWAYS is not set on your new system, while here it sounds like it is. There's no option to disable the --begin flag for users, but you could do something with a submit plugin that removes the flag if users added it. This leaves things fairly open for you to customize the behavior to your needs. There is more information on submit plugins here: https://slurm.schedmd.com/job_submit_plugins.html

3. You're right that events like the Eligible time are recorded for jobs the first time they happen.
This can lead to confusion in cases like the one you're describing, where jobs are held and released multiple times and sacct only shows the first eligible time. I can discuss this with my colleagues to see if they're aware of similar requests or if there are any plans to store data like this.

Thanks,
Ben

Hi Ben,

Thank you for your answer and all the interesting information provided. At first glance, it doesn't look like the ACCRUE_ALWAYS flag is currently defined on any of our systems, but we will discuss with our system administrators to see if there were any changes in the configuration. We will then discuss its adoption, make some tests, and come back to you with our results. We are also looking forward to the possibility of having more details about the lifetime of a pending job. Please keep us updated on that as well!

Thank you and regards,
Alessandro

Hi Alessandro,

I'll wait to hear about your experience with the ACCRUE_ALWAYS flag. I looked into whether there are any current plans to store information about multiple state changes for jobs, and the answer was 'no'. It's possible that something like this could be added, but it wouldn't be a trivial change. For new features like this we require someone to sponsor the development effort. Is that something you would be interested in sponsoring?

Thanks,
Ben

Hi Alessandro,

I wanted to see how things have been working with the ACCRUE_ALWAYS flag set. Has this addressed the problem you were having?

Thanks,
Ben

Dear Ben,

Sorry for the late answer. There were many discussions among us, and we came to the following conclusions.

About ACCRUE_ALWAYS: in the end we didn't try it. We were stopped by its own description: "If set, priority age factor will be increased despite job dependencies or holds." This is unfortunately not what we want, because in our idea the AGE parameter should freeze while a job is in the hold state and resume from there when it becomes eligible again.
If this is not possible, we prefer the current configuration, where AGE is reset and a freed job restarts from zero.

About the begin time, we take the suggestion to try disabling the directive via a submit plugin. However, not many users take advantage of it, so the priority of this implementation isn't high.

Finally, about the statistics for eligibility and the need for a picture of the eligible jobs at a particular timestamp: we decided to work around it manually by adding to our system crontab a script that prints the output of sacct periodically. So there is no further need for you to implement such a demanding change.

All this considered, we think you can close this bug. Many thanks for your assistance and your suggestions; this exchange of ideas has been very helpful for us!

Regards,
Alessandro

Hi Alessandro,

I'm glad that you were able to come to a decision about how to get the information you need and that the discussion was helpful. I'll close this ticket. Please let us know if we can help in the future.

Thanks,
Ben
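The periodic-snapshot workaround Cineca settled on can be sketched roughly as follows. This assumes each cron snapshot line holds "jobid|reason" for pending jobs (e.g. collected with something like `squeue -h -t PD -o "%i|%r"`); jobs whose only blocker is resources or priority are counted as "ready-to-run". The reason strings below are common squeue values, but sites may see others, and the sample snapshot is fabricated.

```python
# Minimal sketch of counting "ready-to-run" jobs from one periodic snapshot.
# Input lines are assumed to be "jobid|reason" for pending jobs, e.g. from:
#   squeue -h -t PD -o "%i|%r"
# A job is treated as ready-to-run when its only blocker is free resources
# or priority; the reason names here are typical squeue values (assumption).
READY_REASONS = {"Resources", "Priority", "None"}

def count_ready_to_run(snapshot_lines):
    ready = 0
    for line in snapshot_lines:
        _jobid, _, reason = line.partition("|")
        if reason in READY_REASONS:
            ready += 1
    return ready

snapshot = [
    "201|Priority",       # waiting only on priority      -> ready
    "202|Resources",      # waiting only on free nodes    -> ready
    "203|Dependency",     # blocked by another job        -> not ready
    "204|JobHeldUser",    # held by the user              -> not ready
]
print(count_ready_to_run(snapshot))  # -> 2
```

Running one count per cron snapshot yields the per-timestamp "ready-to-run" series the ticket originally asked for, without needing any new data in the Slurm accounting database.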