6668 – unbalanced fairshare

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmctld
Version:	18.08.3
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Albert Gil
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2019-03-11 07:48 MDT by Amzie
Modified:	2019-03-19 07:54 MDT
CC List:	1 user

See Also:
Site:	Raytheon Missile, Space and Airborne

Description Amzie 2019-03-11 07:48:48 MDT

I had two users submit similar jobs using the --exclude option to make sure that they're trying to run on different nodes. We're still seeing the problem where only one user's jobs run at a time, leaving the other nodes idle. One user's jobs run for a while. Then, when fairshare kicks in, the first user's running jobs finish and the second user's jobs start. The first user's remaining jobs stay pending because of priority, while the nodes they didn't exclude remain idle. The second user's jobs then run for a while until fairshare kicks in again, and the process repeats itself. The two users keep trading places, one running and one pending, until eventually all jobs from both users are complete.

Comment 1 Albert Gil 2019-03-11 08:56:05 MDT

Hi Amzie,

I'm not certain that I fully understand the issue, but is it possible that both users request or need the same nodes, or at least some of them?

Note that if even one of the nodes in the nodelist of the running jobs is requested or needed by the other user's queued jobs, then the behavior that you are seeing is expected, right?

For example, if the cluster has 3 nodes and each job of each user asks for 2 nodes, fairshare is expected to work as you described, alternating between the two users' jobs in the queue, while 1 node always stays idle.
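The 3-node example can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not how slurmctld is implemented: the function name, the node names n1..n3, and the job names are all hypothetical, and jobs are started strictly in priority order with no backfill.

```python
def place_in_priority_order(all_nodes, jobs):
    """Toy model (assumption): start jobs strictly in priority order, no backfill.

    jobs: list of (name, nodes_needed, excluded_nodes), highest priority first.
    Returns (started, idle_nodes).
    """
    free = set(all_nodes)
    started = {}
    for name, needed, excluded in jobs:
        # Nodes this job may use: currently free and not in its exclude list.
        usable = sorted(free - set(excluded))
        if len(usable) < needed:
            # The job cannot start, and jobs behind it wait as well.
            break
        chosen = usable[:needed]
        started[name] = chosen
        free -= set(chosen)
    return started, sorted(free)


# Hypothetical 3-node cluster, two users each asking for 2 nodes per job.
nodes = ["n1", "n2", "n3"]
jobs = [("userA_job", 2, []), ("userB_job", 2, [])]
started, idle = place_in_priority_order(nodes, jobs)
print(started)  # {'userA_job': ['n1', 'n2']}
print(idle)     # ['n3']
```

Only one 2-node job fits at a time, so whichever user currently has the higher priority runs and one node stays idle, regardless of how fairshare reorders the queue.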

Do you think that you could be facing this?

In fact, I would say that this is not related to Fairshare, but to Backfill.
Fairshare is a way to update the Priority of the jobs in the queue, and from what you say it looks like it's working fine.
On the other hand, Backfill is a way to avoid idle resources by running jobs with lower Priority on them, but only when doing so won't delay jobs with higher Priority.
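The Backfill condition described above can be sketched as a small check. This is a simplified model, not Slurm's actual backfill scheduler: the function `can_backfill`, its parameters, and the idea of a single reserved start time and node set for the highest-priority pending job are all assumptions made for illustration.

```python
def can_backfill(now, idle_nodes, candidate, reserved_start, reserved_nodes):
    """Toy backfill test (assumption, not Slurm's real algorithm).

    candidate: dict with 'nodes' (count), 'time_limit', 'excluded' (set).
    reserved_start: expected start time of the highest-priority pending job.
    reserved_nodes: nodes that job is expected to use.
    """
    usable = set(idle_nodes) - set(candidate["excluded"])
    # Idle nodes that the higher-priority job does not need can be used freely.
    unreserved = usable - set(reserved_nodes)
    if len(unreserved) >= candidate["nodes"]:
        return True
    # Otherwise reserved nodes are needed, so the candidate must finish
    # before the higher-priority job is expected to start.
    if len(usable) >= candidate["nodes"]:
        return now + candidate["time_limit"] <= reserved_start
    return False


# Hypothetical case: only n3 is idle, and the top pending job will need n1 and n3.
short_job = {"nodes": 1, "time_limit": 30, "excluded": set()}
long_job = {"nodes": 1, "time_limit": 120, "excluded": set()}
wide_job = {"nodes": 2, "time_limit": 30, "excluded": set()}
print(can_backfill(0, {"n3"}, short_job, reserved_start=60, reserved_nodes={"n1", "n3"}))  # True
print(can_backfill(0, {"n3"}, long_job, reserved_start=60, reserved_nodes={"n1", "n3"}))   # False
print(can_backfill(0, {"n3"}, wide_job, reserved_start=60, reserved_nodes={"n1", "n3"}))   # False
```

In this model a pending job that needs more nodes than are idle, or that would run past the expected start of a higher-priority job, stays pending even though nodes are idle.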

Do you think that the jobs in the queue could run on the idle resources without delaying jobs with higher Priority?


Did I understand you correctly?

Albert

Comment 2 Albert Gil 2019-03-19 06:10:31 MDT

Hi Amzie,
Did comment #1 solve your question?

Comment 3 Amzie 2019-03-19 07:40:57 MDT

Comment 1 solved the question
Thank you

Comment 4 Albert Gil 2019-03-19 07:54:17 MDT

Nice to hear that it helped.
Closing as infogiven.