I had two users submit similar jobs using the --exclude option to make sure that they’re trying to run on different nodes. We’re still seeing the problem where only one user’s jobs will run at a time, leaving the other nodes idle. One user’s jobs run for a while; then, when fairshare kicks in, the first user’s running jobs finish and the second user’s jobs start. The first user’s remaining jobs stay pending because of priority, while the nodes they didn’t exclude remain idle. The second user’s jobs then run for a while until fairshare kicks in again, and the process repeats itself. The two users alternate between running and pending until eventually all jobs from both users are complete.
Hi Amzie, I'm not certain that I fully understand the issue, but is it possible that both users request or need the same nodes, or at least some of them? Note that if even one of the nodes in the nodelist of the running jobs is requested or needed by the other user's queued jobs, then the behavior that you are seeing is expected, right? For example, if the cluster has 3 nodes and each job of each user asks for 2 nodes, fairshare is expected to work as you described, swapping the jobs of each user in the queue, but 1 node is always going to be idle. Do you think that you could be facing this? In fact, I would say that this is not related to Fairshare, but to Backfill. Fairshare is a way to update the Priority of the jobs in the queue, and from what you say it looks like it's working fine. On the other hand, Backfill is a way to avoid idle resources by running lower-priority jobs on them, but only when doing so won't delay jobs with higher Priority. Do you think that the jobs in the queue can be run on the idle resources without delaying jobs with higher Priority? Did I understand you correctly? Albert
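The 3-node example above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Slurm's scheduler: `Job`, `place_jobs`, and the node names are hypothetical, jobs are placed greedily in priority order, and run times (which real backfill takes into account) are ignored. The second case assumes exclusions that still leave one node shared between the two users.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    num_nodes: int      # how many nodes the job asks for
    allowed: frozenset  # nodes still usable after the job's exclusions


def place_jobs(nodes, queue):
    """Toy placement: walk the queue (highest priority first) and start
    every job that fits on the free nodes it is allowed to use."""
    free = set(nodes)
    running, pending = {}, []
    for job in queue:
        candidates = sorted(free & job.allowed)
        if len(candidates) >= job.num_nodes:
            chosen = candidates[:job.num_nodes]
            running[job.name] = chosen
            free -= set(chosen)
        else:
            pending.append(job.name)
    return running, pending, sorted(free)


# Case 1: 3 nodes, every job needs 2 of them, no exclusions.
three = frozenset({"n1", "n2", "n3"})
queue1 = [Job("A1", 2, three), Job("B1", 2, three)]
running1, pending1, idle1 = place_jobs(three, queue1)

# Case 2 (assumed exclusions): user A may use n1-n2, user B may use n2-n4
# and needs 3 nodes, so the single shared node n2 blocks B while A runs.
four = frozenset({"n1", "n2", "n3", "n4"})
queue2 = [
    Job("A1", 2, frozenset({"n1", "n2"})),
    Job("B1", 3, frozenset({"n2", "n3", "n4"})),
]
running2, pending2, idle2 = place_jobs(four, queue2)

print(running1, pending1, idle1)
print(running2, pending2, idle2)
```

In both cases only one user's job runs while some nodes stay idle, because the pending job cannot fit on the nodes that are left; no amount of backfilling can start it until the running job releases the shared node.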
Hi Amzie, did comment #1 answer your question?
Comment 1 solved the question. Thank you.
Nice to hear that it helped. Closing as infogiven.