Last evening we concluded our Scheduled Maintenance for this month. At that time, we released our maintenance reservation and jobs started running - that's good. The queue of jobs can be quite high. For example, at the moment:

Job states ts: Wed Sep 02 14:16:41 2020 (1599056201)
Jobs pending: 451
Jobs running: 249

This can vary quite a bit over time (as you know). However, what's not so good is that the jobs far down in the queue never get any attention. Hence, I believe the true reason they are pending is not being updated, which gives users the incorrect impression that something is wrong. For example, a pending job ranked 168 down has the following reason:

Reason=ReqNodeNotAvail,_Reserved_for_maintenance

There are no reservations in place and all nodes are available, so the reason should really be (Priority). I realize this is likely an issue with the scheduler not reaching down low enough in the queue to consider those jobs for scheduling. However, there should be a mechanism in place to at least update the reason so that it is not misleading. Is there a setting I'm not using that could correct this behavior? I'll upload my slurm.conf momentarily.

Thanks,
Tony
Created attachment 15692 [details] hera slurm.conf
Tony,

It looks like the job reason wasn't updated when you removed the reservation - I'll take a look at the code to check how we can improve that.

Another thing is that those jobs are not reached by the backfill scheduler. Could you please change your SchedulerParameters to be:
>SchedulerParameters=kill_invalid_depend,bf_max_job_test=500,bf_interval=60
and call `scontrol reconfigure` and `sdiag -r`. After a few minutes, let's say 10, please execute `sdiag` (without any option) and share the results with me.

cheers,
Marcin
(In reply to Marcin Stolarek from comment #2)

Marcin,

Thanks for the quick reply!

> Another thing is that those jobs are not reached by the backfill scheduler.
> Could you please change your SchedulerParameters to be:
> >SchedulerParameters=kill_invalid_depend,bf_max_job_test=500,bf_interval=60
>
> and call `scontrol reconfigure` and `sdiag -r`.
>
> After a few minutes, let say 10, please execute `sdiag` (without any option)
> and share the results with me.

I suspected the backfill scheduler depth would be the answer to my query. However, since this is a production system, I am hesitant to make any changes, as I don't know what effect they will have on overall scheduling and system performance. Would you provide more guidance on this? Perhaps some pros and cons that I can include in my justification for a change request that management can review? I'd appreciate any insight you may have.

Best,
Tony
OK, let's just share `sdiag` first.

In your current config the backfill depth is limited by the default value of bf_max_job_test=100. Another relevant parameter is bf_interval, which defines the frequency of backfill iterations (and, by default, the time limit each iteration can take). Increasing it generally makes the system more responsive to other RPCs, but it means that jobs won't be backfilled more frequently than the configured value. We'll probably have to add bf_continue as well to reach the whole queue - this switch allows backfill to continue working through the queue after releasing locks.

You may find a detailed explanation of those under the SchedulerParameters section[1] of man slurm.conf.

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters
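For reference, a slurm.conf fragment combining the settings discussed in this ticket might look like the following. These values are the ones proposed in this thread for this particular system, not universal recommendations (bf_window=3600 assumes the site's 30-hour maximum time limit mentioned later in the ticket):

```
# Backfill tuning discussed in this ticket - adjust to your workload.
# bf_max_job_test: how many jobs backfill considers per cycle
# bf_interval:     seconds between backfill cycles
# bf_continue:     keep walking the queue after yielding locks
# bf_window:       minutes of future to plan for (2x the longest time limit)
SchedulerParameters=kill_invalid_depend,bf_max_job_test=500,bf_interval=60,bf_continue,bf_window=3600
```

After editing, the change is applied with `scontrol reconfigure`, and `sdiag -r` resets the scheduler counters so subsequent `sdiag` output reflects only the new settings.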
Created attachment 15697 [details] sdiag output - about 15 min since sdiag -r
Tony,

Based on the sdiag output, as a general recommendation I'd suggest increasing bf_max_job_test to 500 and adding bf_continue to your scheduler parameters. However, after some additional code analysis I see this won't change the pending jobs' reason. From what I see, those jobs are still evaluated by both the backfill and main schedulers, but the reason is not updated. I'll have to check the details to see how we can improve that. I'll keep you posted on the progress.

cheers,
Marcin
PS. As a workaround, you can call `scontrol release JOBID` on those jobs. After that, they will be displayed with the "(None)" reason, which will be updated in the next scheduler cycle.
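For many stuck jobs at once, the workaround above can be scripted. The sketch below is illustrative, not an official tool: it assumes `squeue` and `scontrol` are on PATH, and uses `squeue -h -t PD -o "%i %r"` (job ID and reason) to find pending jobs whose reason still starts with "ReqNodeNotAvail":

```python
# Sketch of the scontrol-release workaround: release every pending job whose
# reason still begins with "ReqNodeNotAvail". The helper functions are pure
# so they can be inspected/tested without a Slurm cluster.
import subprocess

STALE_REASON = "ReqNodeNotAvail"

def stale_job_ids(squeue_lines, reason=STALE_REASON):
    """Pick job IDs whose reason column starts with `reason`.

    Expects lines like "12345 ReqNodeNotAvail,_Reserved_for_maintenance"
    as produced by: squeue -h -t PD -o "%i %r"
    """
    ids = []
    for line in squeue_lines:
        parts = line.split(None, 1)  # split into (jobid, reason text)
        if len(parts) == 2 and parts[1].startswith(reason):
            ids.append(parts[0])
    return ids

def release_cmd(job_id):
    """Build the scontrol invocation for one job."""
    return ["scontrol", "release", str(job_id)]

def main():
    # Query pending jobs with their reasons, then release the stale ones.
    out = subprocess.run(
        ["squeue", "-h", "-t", "PD", "-o", "%i %r"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    for jid in stale_job_ids(out):
        subprocess.run(release_cmd(jid), check=True)

# On a Slurm login node you would call main(); it is not invoked here.
```

Releasing a job does not bypass scheduling; it only clears the stale reason so the next scheduler cycle can set a current one, matching the behavior described in the comment above.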
(In reply to Marcin Stolarek from comment #7)
> PS. As a workaround, you can call `scontrol release JOBID` on those jobs.
> After that, they will get displayed with "(None)" reason which will be
> updated in next scheduler cycle.

Marcin,

Thanks for the update. It appears the queue got short enough overnight for the backfill to reach all the jobs. Thanks for the workaround; I will try that next time this happens.

It seems the default bf_interval is 30 min. This seems awfully high. While "Higher values result in less overhead..." - and I understand that - how long does it typically take to run that part of the scheduling code? I realize that would depend on the number of jobs it had to process, but is the time on the order of seconds or minutes? I'm considering bringing that value down to 5 min from the default. Would that be excessive? What could I monitor to determine how effective or impactful that change is?

sdiag seems to provide a lot of good info, but I suspect that's not all. Is there a way to get a more comprehensive set of stats? While I could do an `sdiag -r` followed by an `sdiag` say 5 min later to get near-real-time data, is there a better way to do this in order to analyze scheduler behavior? What is the impact of doing that?

Best,
Tony
Tony,

>It seems the default bf_interval is 30 min.
No, it's 30 seconds.

>how long does it typically take to run that part of the scheduling code.
This really depends on too many factors to give any meaningful answer.

>What could I monitor to determine how effective or impactful that change is?
Last/Mean/Max cycle. It's the number of microseconds spent in backfill; this time doesn't include the time when backfill stops itself to relinquish locks and allow other operations to be processed. That behavior is controlled by the bf_yield_interval=# and bf_yield_sleep=# parameters. By default backfill stops every 2 seconds for 0.5 second, which means that for a bf_interval of 30 seconds the cycle time (as shown by sdiag) can be about 30*(1-1/2*0.5) ~ 22.5 seconds at most. If you see backfill reaching this value, it means that it stopped further processing of jobs because of the time limit - in other words, it didn't go deeper into the queue because of a low bf_interval.

>sdiag seems to provide a lot of good info. I suspect that's not all. Is there a way
>to get a more comprehensive set of stats? While I could do an sdiag -r followed by
>a sdiag say 5 min later, to get near-real time data, is there a better way to do
>this in order to analyze scheduler behavior? What is the impact of doing that?
Calling sdiag has a minimal impact. Some sites execute it periodically and store the returned values in a database with visualization frontends like ELK or TICK. This info can also be retrieved from slurmrestd[1]. There are also many open-source projects you can find easily using the "slurm dashboard" keywords. I'm not mentioning anything specifically since I don't have enough experience with those tools to recommend one. An equivalent tool for slurmdbd is `sacctmgr show stats`.

Looking at your config one more time, I've noticed that your longest allowed time limit is 30 hours. You should adjust bf_window to twice that value (it's in minutes), so you need bf_window=3600.

cheers,
Marcin

[1]https://slurm.schedmd.com/SLUG19/REST_API.pdf
Tony,

We improved the handling of the "ReqNodeNotAvail, Reserved for maintenance" state reason: it is now cleared automatically when the reservation ends, so it should switch to "None" and get set by the next scheduler iteration. Changed by 3d6902ebe9d75[1], which will be released in Slurm 20.02.6.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/3d6902ebe9d757f752dd57a1cca69db08d4b956a
*** Ticket 10331 has been marked as a duplicate of this ticket. ***