| Summary: | Reason not being refreshed to be reflective of true reason | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Anthony DelSorbo <anthony.delsorbo> |
| Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | issp2020support, sts |
| Version: | 20.02.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| URL: | We | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=10069 | ||
| Site: | NOAA | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | NESCC | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 20.02.6, 20.11pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | hera slurm.conf; sdiag output - about 15 min since `sdiag -r` | | |
Description
Anthony DelSorbo
2020-09-02 08:26:28 MDT
Created attachment 15692
hera slurm.conf

Comment 2
Marcin Stolarek
Tony,
It looks like the job reason wasn't updated when you removed the reservation - I'll take a look at the code to see how we can improve that.
Another thing is that those jobs are not reached by the backfill scheduler. Could you please change your SchedulerParameters to be:
>SchedulerParameters=kill_invalid_depend,bf_max_job_test=500,bf_interval=60
and call `scontrol reconfigure` and `sdiag -r`.
After a few minutes, let's say 10, please execute `sdiag` (without any option) and share the results with me.
cheers,
Marcin
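
Marcin's suggested change can be sketched as a short shell session. The parameter values are exactly those from this ticket; the sleep duration is only an illustration of the "after a few minutes, let's say 10" step, and the commented commands must be run on the slurmctld host:

```shell
# Line to place in slurm.conf (replacing the current SchedulerParameters):
NEW_PARAMS="SchedulerParameters=kill_invalid_depend,bf_max_job_test=500,bf_interval=60"
echo "$NEW_PARAMS"

# Then, on the slurmctld host:
# scontrol reconfigure      # pick up the slurm.conf change without a restart
# sdiag -r                  # reset scheduler statistics to a clean baseline
# sleep 600 && sdiag        # read the accumulated stats after ~10 minutes
```

Resetting with `sdiag -r` first matters: it makes the subsequent `sdiag` output reflect only the window after the change, rather than counters accumulated since slurmctld started.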
(In reply to Marcin Stolarek from comment #2)

Marcin,

Thanks for the quick reply!

> Another thing is that those jobs are not reached by the backfill scheduler.
> Could you please change your SchedulerParameters to be:
> > SchedulerParameters=kill_invalid_depend,bf_max_job_test=500,bf_interval=60
> and call `scontrol reconfigure` and `sdiag -r`.
> After a few minutes, let's say 10, please execute `sdiag` (without any option)
> and share the results with me.

I suspected the backfill scheduler depth would be the answer to my query. However, since this is a production system, I am hesitant to make any changes, since I don't know what effect they will have on overall scheduling and system performance. Would you provide more guidance on this? Perhaps some pros and cons that I can include in my justification for a change request that management can review? I'd appreciate any insight you may have.

Best,
Tony.

OK, let's just share `sdiag` first. In your current config the backfill depth is limited by the default value of bf_max_job_test=100. Another parameter like that is bf_interval, which defines the frequency (and, by default, a time limit) of each backfill iteration. Increasing it generally makes the system more responsive to other RPCs, but means that jobs won't be backfilled more frequently than the defined value. We'll probably have to add bf_continue as well to reach the whole queue - this switch allows backfill to continue working through the queue after releasing locks. You may find a detailed explanation of those under the SchedulerParameters section[1] of man slurm.conf.

cheers,
Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters

Created attachment 15697
sdiag output - about 15 min since sdiag -r
Tony,

Based on the sdiag output I'd suggest increasing bf_max_job_test to 500 and adding bf_continue to your scheduler parameters as a general recommendation. However, after some additional code analysis I see that it won't change the pending jobs' reason. From what I see, those jobs are still evaluated by both the backfill and main schedulers, but the reason is not updated. I'll have to check the details to see how we can improve that. I'll keep you posted on the progress.

cheers,
Marcin

PS. As a workaround, you can call `scontrol release JOBID` on those jobs. After that, they will be displayed with the "(None)" reason, which will be updated in the next scheduler cycle.

(In reply to Marcin Stolarek from comment #7)
> PS. As a workaround, you can call `scontrol release JOBID` on those jobs.
> After that, they will get displayed with "(None)" reason which will be
> updated in next scheduler cycle.

Marcin,

Thanks for the update. It appears the queue got short enough overnight for the backfill to reach all the jobs. Thanks for the workaround; I will try that the next time this happens.

It seems the default bf_interval is 30 min. That seems awfully high. While "Higher values result in less overhead..." - and I understand that - how long does it typically take to run that part of the scheduling code? I realize that depends on the number of jobs it has to process, but is the time on the order of seconds or minutes? I'm considering bringing that value down to 5 min from the default. Would that be excessive? And what could I monitor to determine how effective or impactful that change is?

sdiag seems to provide a lot of good info, but I suspect that's not all. Is there a way to get a more comprehensive set of stats? While I could do an `sdiag -r` followed by an `sdiag`, say, 5 min later to get near-real-time data, is there a better way to do this in order to analyze scheduler behavior? What is the impact of doing that?

Best,
Tony.

Tony,

> It seems the default bf_interval is 30 min.

No, it's 30 seconds.
> how long does it typically take to run that part of the scheduling code?

That really depends on too many factors to give any meaningful answer.

> What could I monitor to determine how effective or impactful that change is?

Last/Mean/Max cycle. It's the number of microseconds spent in backfill; this time doesn't include the periods when backfill stops itself to relinquish locks and allow other operations to be processed. That behavior is controlled by the bf_yield_interval=# and bf_yield_sleep=# parameters. By default, backfill stops every 2 seconds for 0.5 second, which means that for a bf_interval of 30 seconds the cycle time (as shown by sdiag) tops out around 30*(1 - 0.5/2) ≈ 22.5 seconds. If you see backfill reaching this value, it means it stopped further processing of jobs because of the time limit; in other words, it didn't go deeper into the queue because bf_interval was too low.

> sdiag seems to provide a lot of good info. I suspect that's not all. Is there
> a way to get a more comprehensive set of stats? While I could do an sdiag -r
> followed by an sdiag say 5 min later, to get near-real time data, is there a
> better way to do this in order to analyze scheduler behavior? What is the
> impact of doing that?

Calling sdiag has minimal impact. Some sites execute it periodically and store the returned values in a database with visualization frontends like ELK or TICK. This info can also be retrieved from slurmrestd[1]. There are also many open-source projects you can find easily using the "slurm dashboard" keywords; I'm not mentioning any of them specifically, since I don't have enough experience with those tools to recommend one. The equivalent tool for slurmdbd is `sacctmgr show stats`.

Looking at your config one more time, I've noticed that your longest allowed time limit is 30 hours.
You should adjust bf_window to twice that value (it's in minutes), so you need bf_window=3600.

cheers,
Marcin

[1] https://slurm.schedmd.com/SLUG19/REST_API.pdf

Tony,

We improved the handling of the "ReqNodeNotAvail, Reserved for maintenance" state reason: it is now cleared automatically when the reservation ends, so it should switch to "None" and then get set by the next scheduler iteration. Changed by 3d6902ebe9d75[1], which will be released in Slurm 20.02.6.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/3d6902ebe9d757f752dd57a1cca69db08d4b956a

*** Ticket 10331 has been marked as a duplicate of this ticket. ***
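
The two numbers quoted in this thread can be checked with quick shell arithmetic. This is only a back-of-the-envelope sketch using the defaults Marcin states (bf_yield_interval=2 s, bf_yield_sleep=0.5 s) and the site's 30-hour maximum time limit:

```shell
# Upper bound on the backfill cycle time sdiag can report for bf_interval=30s:
# each 2s of work is followed by 0.5s of sleep, so roughly 0.5/2 of the
# interval is given up to lock yields.
awk 'BEGIN { printf "%.1f\n", 30 * (1 - 0.5 / 2) }'   # prints 22.5

# bf_window recommendation: twice the longest time limit (30 hours), in minutes.
echo $(( 2 * 30 * 60 ))                               # prints 3600
```

If a site's longest partition time limit changes, the same doubling rule gives the new bf_window value in minutes.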