| Summary: | Lower priority jobs skip the queue | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Akmal Madzlan <akmalm> |
| Component: | Scheduling | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | phils |
| Version: | 14.11.10 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurmctld log | ||
Can you describe the problem in a bit more detail?

It's not clear to me what may be happening - from the description and snippets of log it looks like those jobs are simply backfilling as intended ahead of other jobs on the system?

Can you attach output from sinfo, squeue, and your slurm.conf file? sdiag doesn't tell me much here, and reading through the log without the slurm.conf and squeue status to refer to is difficult.

cheers,
- Tim

(In reply to Tim Wickberg from comment #1)
> Can you describe the problem in a bit more detail?
>
> It's not clear to me what may be happening - from the description and
> snippets of log it looks like those jobs are simply backfilling as intended
> ahead of other jobs on the system?

Well, it may simply be backfill behaviour. Indeed, I have long believed that backfill is totally broken as designed. I will explain.

My understanding is that the backfill scheduler is basically continuously walking the queue, trying every task to see if it can be scheduled yet. If a node becomes available while it's halfway through the queue, it will schedule the next task -- even if it's nowhere near the top priority. This is, I think we can all agree, broken behaviour for a queueing system.

You might then argue "Then you don't want backfill! Stop using it!"

The problem with that argument is that without backfill, a single task at the top of the queue that can't be scheduled -- e.g. because it requires a licence or feature or amount of memory not currently available -- will prevent anything else from being scheduled. Even if there are 100s of available nodes, and 100s of tasks that _can_ run, the high-priority task waiting for a specific resource blocks them all.

So with a heterogeneous cluster, where some tasks can only run on certain nodes, we need backfill (or something like it).

Normally I'm willing to put up with backfill's nonsense, because it tends to be fairly rare.
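(Phil's argument here -- that without backfill a single blocked head-of-queue job starves everything behind it -- can be illustrated with a toy model. All job names, priorities, and resources below are invented for illustration; this is not Slurm's implementation.)

```python
# Toy queue model (illustrative only, not Slurm code).
# Jobs are listed in descending priority order; each needs one unit of a
# named resource to start.
jobs = [
    {"name": "licensed_job", "priority": 600, "needs": "fancy_license"},
    {"name": "normal_job_1", "priority": 500, "needs": "cpu_node"},
    {"name": "normal_job_2", "priority": 400, "needs": "cpu_node"},
]
available = {"cpu_node": 2, "fancy_license": 0}

def strict_fifo(queue, free):
    """Stop at the first job that cannot start: nothing behind it runs."""
    started = []
    for job in queue:
        if free.get(job["needs"], 0) < 1:
            break  # head of queue is blocked -> whole queue is blocked
        free[job["needs"]] -= 1
        started.append(job["name"])
    return started

def backfill_pass(queue, free):
    """Skip blocked jobs and keep scanning for anything that can start."""
    started = []
    for job in queue:
        if free.get(job["needs"], 0) >= 1:
            free[job["needs"]] -= 1
            started.append(job["name"])
    return started

print(strict_fifo(jobs, dict(available)))    # [] -- the license job blocks everything
print(backfill_pass(jobs, dict(available)))  # ['normal_job_1', 'normal_job_2']
```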
The problem that Akmal described in comment 0, however, was that it was happening A LOT. Way more than I've ever seen. Fully 50% of my nodes were consumed by low-priority tasks!

And it's not like those 50% of nodes all came available at once, and were scheduled in one giant backfill orgasm. It was one at a time, the same backfill queue-jumping behaviour replayed in single file, over and over.

Furthermore, I would expect the backfill queue-jumping to be largely random: i.e. at whatever point in the queue it's reached when the node comes available, that's the task that gets scheduled. But that's not what we see in Akmal's data. Despite there being dozens of jobs in the queue, it was always choosing consecutive tasks from the same job to jump the queue.

This suggested to me that something worse than its normal misbehaviour was at play.

> Can you attach output from sinfo, squeue, and your slurm.conf file? sdiag
> doesn't tell me much here, and reading through the log without the
> slurm.conf and squeue status to refer to is difficult.

Normally yes; I would have this all in my copious scrollback, but something crashed my machine yesterday.

I can tell you that all partitions were full (no idle nodes), and that as nodes finished these 5-10 minute tasks, they were either scheduling from a priority-600 job (correctly) or from this priority-190 job (incorrectly).

I'm willing to accept that there's not enough information at this point to pursue this with fervor, and simply note it as a data point for the future. I'd also like to embark on a longer conversation about why backfill's design doesn't work for us, but that's probably not best done in bugzilla.

(In reply to Phil Schwan from comment #2)
> My understanding is that the backfill scheduler is basically continuously
> walking the queue, trying every task to see if it can be scheduled yet.
> If a node becomes available while it's halfway through the queue, it will
> schedule the next task -- even if it's nowhere near the top priority. This
> is, I think we can all agree, broken behaviour for a queueing system.

That's not how backfill works, at least not with Slurm. (I can't speak for other schedulers, but if they do such a thing I would agree that it is broken behavior.)

There's a set of locks held while running through the backfill scheduling pass. Node and job changes made mid-pass are not acted on by the backfill scheduler - there's no "jumping" the queue if a node was returned in the middle of schedule planning.

The bf_continue setting can lead to something vaguely related to what you suggest, but that would only be for high-priority jobs submitted while the backfill scheduler is going through the schedule plan, and would be unlikely in most circumstances.

The backfill scheduler specifically will never schedule a lower priority job if that job would impact the scheduled start time of a higher priority job.

My guess is that you may be running either with very high default time limits (or, worse, no time limit), and this is leading to the backfill scheduler opportunistically running anything with a relatively short time limit. With high or no limits, it may be that the expected start time for most of your jobs is significantly far out in the future.

> You might then argue "Then you don't want backfill! Stop using it!"
>
> The problem with that argument is that without backfill, a single task at
> the top of the queue that can't be scheduled -- e.g. because it requires a
> licence or feature or amount of memory not currently available -- will
> prevent anything else from being scheduled. Even if there are 100s of
> available nodes, and 100s of tasks that _can_ run, the high-priority task
> waiting for a specific resource blocks them all.
> So with a heterogeneous cluster, where some tasks can only run on certain
> nodes, we need backfill (or something like it).
>
> Normally I'm willing to put up with backfill's nonsense, because it tends to
> be fairly rare. The problem that Akmal described in comment 0, however, was
> that it was happening A LOT. Way more than I've ever seen. Fully 50% of my
> nodes were consumed by low-priority tasks!
>
> And it's not like those 50% of nodes all came available at once, and were
> scheduled in one giant backfill orgasm. It was one at a time, the same
> backfill queue-jumping behaviour replayed in single file, over and over.
>
> Furthermore, I would expect the backfill queue-jumping to be largely random:
> i.e. at whatever point in the queue it's reached when the node comes
> available, that's the task that gets scheduled. But that's not what we see
> in Akmal's data. Despite there being dozens of jobs in the queue, it was
> always choosing consecutive tasks from the same job to jump the queue.
>
> This suggested to me that something worse than its normal misbehaviour was
> at play.

Again, this would be best described by some other mix of settings. The "queue-jumping" behavior you describe does not exist in Slurm.

> > Can you attach output from sinfo, squeue, and your slurm.conf file? sdiag
> > doesn't tell me much here, and reading through the log without the
> > slurm.conf and squeue status to refer to is difficult.
>
> Normally yes; I would have this all in my copious scrollback, but something
> crashed my machine yesterday.
>
> I can tell you that all partitions were full (no idle nodes), and that as
> nodes finished these 5-10 minute tasks, they were either scheduling from a
> priority-600 job (correctly) or from this priority-190 job (incorrectly).
>
> I'm willing to accept that there's not enough information at this point to
> pursue this with fervor, and simply note it as a data point for the future.
> I'd also like to embark on a longer conversation about why backfill's design
> doesn't work for us, but that's probably not best done in bugzilla.

If you can provide better data to look through I'm happy to verify the behavior you're seeing. Otherwise I'll go ahead and mark this as resolved/infogiven for now - you can always reopen or submit a new bug later with further details.

If you'd like to discuss some potential alternative scheduling logic here is fine, or you're welcome to email me directly (tim@schedmd.com).

- Tim

Thanks, Tim, for your detailed reply.

(In reply to Tim Wickberg from comment #3)
> The bf_continue setting can lead to something vaguely related to what you
> suggest, but that would only be for high-priority jobs submitted while the
> backfill scheduler is going through the schedule plan, and would be unlikely
> in most circumstances.

Good to know; that's definitely not this situation.

> The backfill scheduler specifically will never schedule a lower priority job
> if that job would impact the scheduled start time of a higher priority job.
>
> My guess is that you may be running either with very high default time
> limits (or worse, no time limit), and this is leading to the backfill
> scheduler opportunistically running anything with a relatively short time
> limit. With high, or no limits, it may be that the expected start time for
> most of your jobs is significantly far out in the future.

Our time limits are 1 hour for all jobs. We can't estimate how long a task will actually take, if for no other reason than because we have a heterogeneous cluster -- i.e. runtime strongly depends on which node gets assigned! As we're unwilling to terminate tasks after a fixed arbitrary time, it means that everything ends up set to 1h. Would this lead to the behaviour that I describe?

I had it happen again just now -- literally as I was writing this -- albeit in a much more limited form than in comment 0.
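(As an aside: the guarantee Tim states above -- a lower-priority job backfills only if it cannot delay a higher-priority job's reserved start -- can be sketched as a toy reservation check. It also shows why time limits matter so much: with a reservation far in the future, almost anything fits in front of it. All numbers below are invented; this is not Slurm's algorithm.)

```python
# Toy conservative-backfill check (illustrative only, not Slurm code).
# A node frees at t=0 (minutes). The highest-priority pending job cannot
# start until its required resource frees at `reserved_start`. A lower-
# priority job may backfill onto the idle node only if its time limit
# guarantees it finishes before the reservation begins.

def can_backfill(candidate_limit, reserved_start):
    """True if running the candidate now cannot delay the reservation."""
    return candidate_limit <= reserved_start

# If the top job's reservation is 24h out (e.g. long or absent default
# time limits elsewhere in the plan), even a 4h job backfills freely:
print(can_backfill(candidate_limit=4 * 60, reserved_start=24 * 60))  # True

# If the reservation is only 30 minutes out, a 1h-limit job must wait:
print(can_backfill(candidate_limit=60, reserved_start=30))           # False

# ...but a 10-minute job still slots in without delaying anyone:
print(can_backfill(candidate_limit=10, reserved_start=30))           # True
```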
Fortunately, the situation was fairly static at the time, with relatively few jobs on the system, and no new ones being submitted.

The three nodes onto which tasks end up incorrectly scheduled (pnod0813, pnod0815, pnod0817) are covered by two partitions (teamswanGpu and idleGpu). teamswanGpu has a higher priority (21) than idleGpu (11). There are three relevant jobs in this example:

100_job - priority 100 on partitions teamswanGpu,idleGpu
150_job - priority 150 on idleGpu
50_job - priority 50 on idleGpu

In this situation, with teamswanGpu being the partition of higher priority, I would expect three tasks from the 100_job to be scheduled when those three nodes came available.

The teamswanGpu job:

> JobId=17249010 ArrayJobId=17249010 ArrayTaskId=2190 JobName=100_job
> Priority=100 Nice=0 Account=user QOS=normal
> JobState=PENDING Reason=Resources Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
> SubmitTime=2015-12-16T16:01:37 EligibleTime=2015-12-16T16:01:38
> StartTime=2016-12-20T08:28:54 EndTime=2015-12-21T09:16:04
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> Partition=teamswan,teamswanGpu,idle,tomo AllocNode:Sid=pud178:29605

The high-priority idleGpu job:

> JobId=17305346 ArrayJobId=17305346 ArrayTaskId=2822 JobName=150_job
> Priority=150 Nice=0 Account=user QOS=normal
> JobState=PENDING Reason=Resources Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
> SubmitTime=2015-12-19T11:33:42 EligibleTime=2015-12-19T11:33:45
> StartTime=2016-12-20T08:14:36 EndTime=2015-12-21T09:14:36
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> Partition=idleGpu,teamcottGpu AllocNode:Sid=pud80:1348

The low-priority idleGpu job:

> JobId=17115775 ArrayJobId=17115775 ArrayTaskId=104 JobName=50_job
> Priority=50 Nice=0 Account=user QOS=normal
> JobState=PENDING Reason=Resources Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
> SubmitTime=2015-12-09T18:57:19 EligibleTime=2015-12-09T18:57:19
> StartTime=Unknown EndTime=2015-12-21T09:03:43
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> Partition=idleGpu AllocNode:Sid=pud155:2818

All were submitted more than a day ago. When those three nodes came free, these three tasks were all scheduled out-of-order (and even within job 17115775, if it were operating in proper priority order, why would it not take 3 consecutive tasks? 93-101 remained pending.)

> idleGpu 50 50_job user R 0:08 1 pnod0813 17115775_92
> idleGpu 50 50_job user R 0:08 1 pnod0815 17115775_102
> idleGpu 50 50_job user R 0:08 1 pnod0817 17115775_103

When I requeued them, the nodes were then assigned to tasks from the 150_job:

> idleGpu 150 150_job user R 25:31 1 pnod0813 17305346_2036
> idleGpu 150 150_job user R 25:31 1 pnod0815 17305346_2037
> idleGpu 150 150_job user R 25:31 1 pnod0817 17305346_2038

From slurmctld.log:

> [2015-12-21T08:03:43.014] backfill: Started JobId=17115775_92 (17305337) on pnod0813
> [2015-12-21T08:03:43.017] backfill: Started JobId=17115775_102 (17317739) on pnod0815
> [2015-12-21T08:03:43.024] backfill: Started JobId=17115775_103 (17317740) on pnod0817
> ...
> [2015-12-21T08:04:02.313] Requeuing JobID=17115775_103(17317740) State=0x0 NodeCnt=0
> [2015-12-21T08:04:02.314] Requeuing JobID=17115775_92(17305337) State=0x0 NodeCnt=0
> [2015-12-21T08:04:02.314] Requeuing JobID=17115775_102(17317739) State=0x0 NodeCnt=0
> ...
> [2015-12-21T08:04:33.073] backfill: Started JobId=17305346_2036 (17317749) on pnod0813
> [2015-12-21T08:04:33.078] backfill: Started JobId=17305346_2037 (17317750) on pnod0815
> [2015-12-21T08:04:33.084] backfill: Started JobId=17305346_2038 (17317751) on pnod0817

I'm very happy to send sinfo, squeue, and slurm.conf, but as they contain some potentially-sensitive information about users and jobs, I'd prefer to send them outside of the public bugzilla. Is there anything else I should gather while the event is fresh?

There are two DebugFlags that may shed some light - "backfill" and "backfillmap". Note that these will generate a considerable amount of output; you'd likely only want them on briefly using "scontrol setdebugflags +backfill" (then -backfill to disable).

If you see this happen again, output from `sinfo`, `squeue`, `scontrol show nodes`, `scontrol show jobs` plus the logs for slurmctld would be helpful. Emailing them to me is fine.

I'm checking back in on open bugs - has this recurred in the past few weeks?

It definitely has, but so unpredictably that I'm not sure yet how to capture it in the logs (short of enabling backfill{,map} full-time, which I'm not sure is practical...)
I may go ahead and enable them the next time our cluster is a little quieter. It's chockablock right now.
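(The capture procedure Tim outlines above -- briefly toggle the backfill DebugFlags, snapshot the scheduler state, then toggle them off -- could be wrapped in a small script. The command list comes straight from the thread; the dry-run fallback is my own addition so the sketch can be previewed on a machine without Slurm installed.)

```python
import shutil
import subprocess

# Commands suggested in the thread: enable the backfill DebugFlags briefly,
# snapshot scheduler state, then disable the flags again.
CAPTURE_COMMANDS = [
    ["scontrol", "setdebugflags", "+backfill,backfillmap"],
    ["sinfo"],
    ["squeue"],
    ["scontrol", "show", "nodes"],
    ["scontrol", "show", "jobs"],
    ["scontrol", "setdebugflags", "-backfill,backfillmap"],
]

def capture_backfill_state(dry_run=None):
    """Run the capture commands, or just list them when Slurm is absent."""
    if dry_run is None:
        dry_run = shutil.which("scontrol") is None  # no Slurm -> dry run
    transcript = []
    for cmd in CAPTURE_COMMANDS:
        transcript.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=False)
    return transcript

if __name__ == "__main__":
    for line in capture_backfill_state():
        print(line)
```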
If you have "group" limits by association or QOS (e.g. total number of running jobs by group), a heuristic in Slurm's backfill scheduler could have been responsible for this. The relevant patch is here:

https://github.com/SchedMD/slurm/commit/3ee1632f96e943227365429db4e58d5467bc1a2a.patch

(In reply to Moe Jette from comment #9)
> If you have "group" limits by association or QOS (e.g. total number of
> running jobs by group), a heuristic in Slurm's backfill scheduler could
> have been responsible for this.

A good one to fix, but nope, we're not doing any group limits. I'll spend some time this week trying to reproduce it on one of our quieter clusters where I can turn the logging all the way up.

Any luck reproducing this? I'm going to go ahead and mark this as infogiven for now. Please re-open when/if this reoccurs.

Yep, fair enough. I wasn't able to reproduce on a different, more-idle cluster, so I suspect I'll just have to do it on our big one and cope with the massive log output.
Created attachment 2496 [details]
slurmctld log

teamfraser,team 500 999_migprep_dz4_filter phils PD 0:00 1 (Resources) 17119018_[3077-3492]

* It's eligible to run on many partitions (specifically including teamrottoGpu and idleGpu, which will be relevant in a moment):
  Partition=teamfraser,teamfraserGpu,idle,idleGpu,teamrottoGpu,teamregentGpu,tomo
* It was submitted 45 minutes ago, so there should be no other tasks, on any of those partitions, started in the last 45 minutes. And there aren't. Except for all of these:

idleGpu 190 wa512Pr_KDM phils R 30:01 1 pnod0804 17100520_700
teamrottoGpu 190 wa512Pr_KDM phils R 26:41 1 pnod0787 17100520_701
idleGpu 190 wa512Pr_KDM phils R 26:41 1 clus290 17100520_702
idleGpu 190 wa512Pr_KDM phils R 23:21 1 pnod0822 17100520_703
idleGpu 190 wa512Pr_KDM phils R 20:00 1 pnod0805 17100520_704
idleGpu 190 wa512Pr_KDM phils R 16:40 1 clus361 17100520_705
idleGpu 190 wa512Pr_KDM phils R 15:49 1 pnod0871 17100520_706
idleGpu 190 wa512Pr_KDM phils R 14:59 1 pnod0799 17100520_707
teamrottoGpu 190 wa512Pr_KDM phils R 12:29 1 clus145 17100520_708
idleGpu 190 wa512Pr_KDM phils R 12:29 1 pnod0798 17100520_709
teamrottoGpu 190 wa512Pr_KDM phils R 9:59 1 pnod0753 17100520_710
teamrottoGpu 190 wa512Pr_KDM phils R 9:09 1 pnod0751 17100520_711
idleGpu 190 wa512Pr_KDM phils R 9:09 1 clus282 17100520_712
idleGpu 190 wa512Pr_KDM phils R 8:19 1 clus281 17100520_713
teamrottoGpu 190 wa512Pr_KDM phils R 7:28 1 pnod0779 17100520_714
idleGpu 190 wa512Pr_KDM phils R 7:28 1 clus627 17100520_715
idleGpu 190 wa512Pr_KDM phils R 6:38 1 clus359 17100520_716
idleGpu 190 wa512Pr_KDM phils R 4:08 1 pnod0792 17100520_717
teamrottoGpu 190 wa512Pr_KDM phils R 1:38 1 clus271 17100520_718
teamrottoGpu 190 wa512Pr_KDM phils R 10:01 1 pnod0764 17100520_719
idleGpu 190 wa512Pr_KDM phils R 5:51 1 clus158 17100520_720
idleGpu 190 wa512Pr_KDM phils R 30:52 1 pnod0827 17100520_738
idleGpu 190 wa512Pr_KDM phils R 20:00 1 pnod0821 17100520_739
idleGpu 190 wa512Pr_KDM phils R 14:59 1 pnod0806 17100520_740
idleGpu 190 wa512Pr_KDM phils R 12:29 1 pnod0809 17100520_741

I keep requeueing them, and they keep slowly coming back on. Furthermore, it's only those tasks. Most of those jobs were scheduled by backfill.

Here's the sdiag output in case it helps:

[root@pud22 ~]# sdiag
*******************************************************
sdiag output at Thu Dec 10 09:13:30 2015
Data since      Thu Dec 10 08:00:00 2015
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 105
Jobs started:   367
Jobs completed: 361
Jobs canceled:  7
Jobs failed:    0

Main schedule statistics (microseconds):
	Last cycle:   29442
	Max cycle:    85066
	Total cycles: 423
	Mean cycle:   29755
	Mean depth cycle:  25
	Cycles per minute: 5
	Last queue length: 18

Backfilling stats
	Total backfilled jobs (since last slurm start): 16224
	Total backfilled jobs (since last stats cycle start): 42
	Total cycles: 85
	Last cycle when: Thu Dec 10 09:13:23 2015
	Last cycle: 27023
	Max cycle:  105529
	Mean cycle: 60974
	Last depth cycle: 19
	Last depth cycle (try sched): 19
	Depth Mean: 32
	Depth Mean (try depth): 25
	Last queue length: 18
	Queue length mean: 32

[root@pud22 ~]# scontrol show config | grep SchedulerPa
SchedulerParameters = bf_continue,bf_max_job_test=1000,bf_interval=50,bf_resolution=600,defer,max_depend_depth=3,sched_interval=20,batch_sched_delay=3,bf_max_job_part=100,nohold_on_prolog_fail
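(For reference, the SchedulerParameters value above is a comma-separated list of bare flags and key=value pairs. A small helper -- hypothetical, just for reading configs like this one -- makes the backfill-related knobs easy to pick out:)

```python
def parse_scheduler_parameters(raw):
    """Split a Slurm SchedulerParameters string into flags and key=value pairs."""
    params = {}
    for item in raw.split(","):
        key, sep, value = item.partition("=")
        params[key] = value if sep else True  # bare tokens are boolean flags
    return params

# The string reported by `scontrol show config` in this bug:
raw = ("bf_continue,bf_max_job_test=1000,bf_interval=50,bf_resolution=600,"
       "defer,max_depend_depth=3,sched_interval=20,batch_sched_delay=3,"
       "bf_max_job_part=100,nohold_on_prolog_fail")

params = parse_scheduler_parameters(raw)
print(params["bf_continue"])      # True -- the flag Tim mentioned earlier
print(params["bf_max_job_test"])  # '1000'
print(sorted(k for k in params if k.startswith("bf_")))
```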