| Summary: | Backfill slowness | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Cox <ryan_cox> |
| Component: | Scheduling | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 15.08.12 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | BYU - Brigham Young University | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm.conf, sdiag, slurmctld.log.bz2, squeue, scontrol show job -o, topology.conf | | |
Created attachment 3483 [details] sdiag
Created attachment 3484 [details] slurmctld.log.bz2
Created attachment 3485 [details] squeue
Created attachment 3486 [details] scontrol show job -o
Created attachment 3487 [details] topology.conf
I should add that I've been playing around with the backfill parameters a lot in an attempt to speed things up. We used to run with a bf_interval of 90, but we found that it often wouldn't get through more than a few jobs. For some reason it seems to speed up over time, so the bulk of the backfill seemed to occur after that 90 seconds if we allowed it.

I really have no explanation for this one. I grabbed this, but never sent a reply? That's hugely out of line for me. Are you still back on the 15.08 branch, or have you found a chance to move forward? There should be some performance gains in 17.02, although I understand if you're hesitant to move to that.

If you haven't seen it, I think Doug Jacobsen did an excellent job of walking people through how NERSC approaches some of their priority and scheduler tuning. The presentation is here: https://slurm.schedmd.com/SLUG16/NERSC.pdf and may provide some insights. I find his mapping of priority to units of time rather inspiring.

My usual starting points for tuning are:
- bf_continue
- bf_window=(enough minutes to cover the highest MaxTime on the cluster)
- bf_resolution=(usually at least 600)

but I can see you have all those bases covered.

bf_min_prio_reserve may actually suit you well depending on your queue depth, although you'd need to jump to a 16.05 / 17.02 release to get it. The idea behind it is to only test whether the lower-priority jobs can launch immediately, and not bother trying to slot them into the backfill map otherwise. That has *huge* performance gains for them, and lets them keep their systems 95%+ occupied.

You'd mentioned that the longer the bf_interval, the more jobs tend to be backfilled - that matches up pretty closely with NERSC's experience, and is what led to Doug asking for that option.

The jobs that can immediately backfill tend to be lower priority, and the backfill scheduler wastes a lot of time making plans about when and where the other jobs may be able to fit in, so it doesn't get to evaluate those small jobs as frequently since they're buried way down in the queue.

Let me know if any of that helps, or if you'd like us to dig a bit further into what you'd attached.

- Tim
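The starting points listed above would land in slurm.conf roughly as follows. This is a sketch, not a recommendation from this ticket: the concrete values for bf_window, bf_resolution, and the bf_min_prio_reserve threshold are illustrative assumptions that would need tuning per cluster.

```
# slurm.conf fragment (illustrative values only)
SchedulerType=sched/backfill
# bf_window should cover the longest MaxTime on the cluster, in minutes
# (e.g. 4320 = 3 days); bf_resolution of 600 seconds coarsens the
# backfill time map so fewer distinct slots need to be evaluated.
SchedulerParameters=bf_continue,bf_interval=90,bf_window=4320,bf_resolution=600

# On 16.05+ only: jobs below this priority threshold are merely tested
# for an immediate start rather than planned into the backfill map.
# The threshold value here is a placeholder.
#SchedulerParameters=bf_continue,bf_interval=90,bf_window=4320,bf_resolution=600,bf_min_prio_reserve=1000000
```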
Created attachment 3482 [details] slurm.conf

We've been having lots of backfill slowness for a while now. The main scheduler seems to be fast, but backfill can be painfully slow at times, only getting through one or two dozen jobs in two minutes. This has been going on for a while, but I haven't gotten around to diagnosing it. I'm attaching some logs from today as well as squeue and scontrol show job -o output. I'm not really sure where to start.

One thing I thought of today is that our topology.conf file doesn't connect all nodes to each other. Could that affect things somehow? It seems like single-node jobs should still be fast to process in backfill.

I ran through at least one iteration with the Backfill and BackfillMap debug flags, but the rest are with just +Backfill. Unfortunately the logs I got with BackfillMap show things moving much more quickly than usual, of course, but it would be good to hear what kind of information you're looking for anyway.
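For reference, a topology.conf in which every node ultimately shares a common ancestor switch looks roughly like this. The switch and node names below are hypothetical, not BYU's actual layout; the point is that each leaf switch's nodes are joined under a root entry, so no node is left disconnected from the rest of the tree.

```
# topology.conf sketch (hypothetical names)
# Leaf switches with their directly attached nodes
SwitchName=leaf1 Nodes=node[001-032]
SwitchName=leaf2 Nodes=node[033-064]
# Root switch joining the leaves, giving all nodes a common ancestor
SwitchName=root Switches=leaf[1-2]
```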