Ticket 6668 - unbalanced fairshare
Summary: unbalanced fairshare
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 18.08.3
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-03-11 07:48 MDT by Amzie
Modified: 2019-03-19 07:54 MDT

See Also:
Site: Raytheon Missile, Space and Airborne


Description Amzie 2019-03-11 07:48:48 MDT
I had two users submit similar jobs using the --exclude option to make sure that they’re trying to run on different nodes.  We’re still seeing the problem where only one user’s jobs will run at a time, leaving the other nodes idle.  One user’s jobs will run for a while; then, when fairshare kicks in, the first user’s running jobs finish and the second user’s jobs start.  The first user’s remaining jobs stay pending because of priority, while the nodes they didn’t exclude remain idle.  The second user’s jobs then run for a while until fairshare kicks in again, and the process repeats.  The two users keep trading places, one running and the other pending, until eventually all jobs from both users are complete.
Comment 1 Albert Gil 2019-03-11 08:56:05 MDT
Hi Amzie,

I'm not certain I fully understand the issue, but is it possible that both users request or need the same nodes, or at least some of them?

Note that if even one of the nodes in the node list of the running jobs is requested or needed by the other user's queued jobs, then the behavior that you are seeing is expected, right?

For example, if the cluster has 3 nodes and each job of each user asks for 2 nodes, fairshare is expected to work as you described, swapping the jobs of each user in the queue, but 1 node will always be idle.
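
The arithmetic behind that 3-node example can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes that each job gets whole nodes exclusively and that every job asks for the same node count.

```python
def max_concurrent_jobs(total_nodes: int, nodes_per_job: int) -> tuple[int, int]:
    """Return (jobs that can run at once, nodes left idle).

    Assumption: jobs take whole nodes exclusively and all request
    the same number of nodes.
    """
    running = total_nodes // nodes_per_job
    idle = total_nodes - running * nodes_per_job
    return running, idle


# 3-node cluster, every job asks for 2 nodes
running, idle = max_concurrent_jobs(3, 2)
print(f"{running} job(s) running, {idle} node(s) idle")
```

Under these assumptions only one job fits at a time regardless of which user owns it, so the remaining node stays idle while priority decides whose job gets the two nodes next.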

Do you think that you could be facing this?

In fact, I would say that this is not related to Fairshare, but to Backfill.
Fairshare is a way to update the Priority of the jobs in the queue, and from what you say it looks like it's working fine.
On the other hand, Backfill is a way to avoid idle resources by running lower-priority jobs on them, but only when doing so won't delay jobs with higher Priority.

Do you think that the jobs in the queue can be run on the idle resources without delaying jobs with higher Priority?
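
As a rough illustration of that backfill condition, here is a minimal Python sketch. It is a simplified model, not Slurm's actual backfill implementation: the function and parameter names are hypothetical, and it assumes a single higher-priority job with a known set of reserved nodes and a known planned start time.

```python
def can_backfill(candidate_nodes: set, candidate_end: float,
                 idle_nodes: set, reserved_nodes: set,
                 reserved_start: float) -> bool:
    """Decide whether a lower-priority job may start now (simplified model).

    The job may start only if all the nodes it needs are idle and it
    either avoids the nodes reserved for the higher-priority job or
    finishes before that job is planned to start.
    """
    if not candidate_nodes <= idle_nodes:
        return False  # some needed node is busy
    overlaps = bool(candidate_nodes & reserved_nodes)
    return (not overlaps) or candidate_end <= reserved_start


# Higher-priority job is planned on n1 and n2 at t=100; only n3 is idle.
print(can_backfill({"n3"}, 50, {"n3"}, {"n1", "n2"}, 100))
# A job that needs n2 and would run past t=100 would delay that job.
print(can_backfill({"n2", "n3"}, 150, {"n2", "n3"}, {"n1", "n2"}, 100))
```

In this model a pending job that needs any node held or reserved for a higher-priority job cannot start early unless it is short enough to finish first, which is one way idle nodes and pending jobs can coexist.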


Did I understand you correctly?

Albert
Comment 2 Albert Gil 2019-03-19 06:10:31 MDT
Hi Amzie,
Did comment #1 answer your question?
Comment 3 Amzie 2019-03-19 07:40:57 MDT
Comment 1 solved the question
Thank you

Comment 4 Albert Gil 2019-03-19 07:54:17 MDT
Nice to hear that it helped.
Closing as infogiven.