Ticket 9729 - topology/tree + select/cons_tres node selection not working as documented
Summary: topology/tree + select/cons_tres node selection not working as documented
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.x
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-09-02 19:36 MDT by Felix Abecassis
Modified: 2021-07-21 15:41 MDT
CC List: 2 users

See Also:
Site: NVIDIA (PSLA)
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Felix Abecassis 2020-09-02 19:36:11 MDT
Closely related to https://bugs.schedmd.com/show_bug.cgi?id=9624

From https://slurm.schedmd.com/topology.html:
The basic algorithm is to identify the lowest level switch in the hierarchy that can satisfy a job's request and then allocate resources on its underlying leaf switches using a best-fit algorithm.

From https://slurm.schedmd.com/slurm.conf.html:
Given network topology information, Slurm allocates all of a job's resources onto a single leaf of the network (if possible) using a best-fit algorithm. Otherwise it will allocate a job's resources onto multiple leaf switches so as to minimize the use of higher-level switches. 


But experimenting with a simple topology/tree configuration, it doesn't seem to work this way. Instead, it's just minimizing the number of leaf switches, and that can lead to a suboptimal node allocation.

Consider this simple 3-level (leaf-spine-core) tree topology.conf file:
SwitchName=leaf01  Nodes=node-1,node-2
SwitchName=leaf02  Nodes=node-3
SwitchName=leaf03  Nodes=node-4
SwitchName=spine01 Switches=leaf01,leaf02,leaf03

SwitchName=leaf04  Nodes=node-5,node-6
SwitchName=spine02 Switches=leaf04

SwitchName=core Switches=spine01,spine02
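The documented behavior can be sketched as follows. This is an illustrative model only, not Slurm's actual cons_tres code; the switch names and levels mirror the topology.conf above.

```python
# Illustrative model of the *documented* algorithm: find the lowest-level
# switch whose subtree contains enough nodes to satisfy the request.

LEAVES = {
    "leaf01": ["node-1", "node-2"],
    "leaf02": ["node-3"],
    "leaf03": ["node-4"],
    "leaf04": ["node-5", "node-6"],
}
PARENTS = {
    "spine01": ["leaf01", "leaf02", "leaf03"],
    "spine02": ["leaf04"],
    "core": ["spine01", "spine02"],
}

def nodes_under(switch):
    """All nodes reachable from a switch, at any depth."""
    if switch in LEAVES:
        return list(LEAVES[switch])
    return [n for child in PARENTS[switch] for n in nodes_under(child)]

def lowest_satisfying_switch(n_nodes):
    """Lowest-level switch whose subtree has at least n_nodes nodes."""
    # Level 0 = leaf switches, then spines, then the core switch.
    levels = [list(LEAVES), ["spine01", "spine02"], ["core"]]
    for level in levels:
        for sw in level:
            if len(nodes_under(sw)) >= n_nodes:
                return sw
    return None

print(lowest_satisfying_switch(4))  # spine01: nodes 1-4 all share one spine
```

Under this model a 4-node job would be satisfied entirely under spine01, never touching the core switch.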


Trying to schedule a 4-node job yields the following:
$ srun -N4 -l bash -c 'echo $SLURM_STEP_NODELIST $SLURMD_NODENAME' | sort
0: node-[1-2,5-6] node-1
1: node-[1-2,5-6] node-2
2: node-[1-2,5-6] node-5
3: node-[1-2,5-6] node-6
As you can see, those 4 nodes are connected through the core switch. This means more hops during communication.

Hence, it does not follow the documentation, which states:
> identify the lowest level switch in the hierarchy that can satisfy a job's request
If that were the case, it would have selected switch spine01 as the lowest-level switch.

It's not minimizing the total number of switches either, since this allocation involves 5 switches: leaf01, spine01, core, leaf04, spine02 (2 of them leaf switches).

An allocation under spine01 would use only 4 switches (leaf01, leaf02, leaf03, spine01), albeit 3 leaf switches.

The code comment indeed directly contradicts the documentation: https://github.com/SchedMD/slurm/blob/master/src/plugins/select/cons_tres/job_test.c#L2073
> Allocate resources to job using a minimal leaf switch count 


Looking further at the code, it is not trying to find the lowest-level switch that can satisfy the job, but instead starts from the *highest*-level switch that can reach all available nodes:
https://github.com/SchedMD/slurm/blob/90181a5207a0fe9013e827f7f91d1cbd18848335/src/plugins/select/cons_tres/job_test.c#L2220-L2223
https://github.com/SchedMD/slurm/blob/90181a5207a0fe9013e827f7f91d1cbd18848335/src/plugins/select/cons_tres/job_test.c#L2250-L2258

In our topology example above, the top-level switch will be the "core" switch, spanning the whole cluster.

Afterwards, it tries to minimize the number of leaf switches in a greedy fashion, adding the largest leaf switches first, regardless of the overall topology:
https://github.com/SchedMD/slurm/blob/90181a5207a0fe9013e827f7f91d1cbd18848335/src/plugins/select/cons_tres/job_test.c#L2574
That's why we end up with 2 leaf switches in the example above.
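The greedy behavior described above can be sketched like this. Again this is an illustrative model, not the cons_tres implementation: it simply picks the leaf switches with the most available nodes first, ignoring where they sit in the tree.

```python
# Sketch of the *observed* greedy behavior: fill the job from the largest
# leaf switches first, minimizing leaf-switch count but ignoring spines.

LEAF_NODES = {
    "leaf01": ["node-1", "node-2"],
    "leaf02": ["node-3"],
    "leaf03": ["node-4"],
    "leaf04": ["node-5", "node-6"],
}

def greedy_pick(n_nodes):
    """Pick n_nodes nodes, visiting leaf switches largest-first."""
    picked = []
    for leaf in sorted(LEAF_NODES, key=lambda l: -len(LEAF_NODES[l])):
        for node in LEAF_NODES[leaf]:
            if len(picked) < n_nodes:
                picked.append(node)
        if len(picked) >= n_nodes:
            break
    return sorted(picked)

print(greedy_pick(4))  # ['node-1', 'node-2', 'node-5', 'node-6']
```

For a 4-node job this reproduces the allocation seen in the srun output above: leaf01 plus leaf04, two leaf switches that only meet at the core, rather than the four nodes available under spine01.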
Comment 2 Dominik Bartkiewicz 2020-09-07 04:34:42 MDT
Hi

You are right. This documentation is mostly relevant to cons_res.
As you know, we have a couple of open bugs related to topology/tree.
After we solve them, we will prepare proper documentation with a description of behavior and limitations.

Dominik
Comment 21 Dominik Bartkiewicz 2021-07-15 05:37:49 MDT
Hi

In 21.08, cons_tres with topology/tree contains these patches:

https://github.com/SchedMD/slurm/commit/0a5d02a8a9134
https://github.com/SchedMD/slurm/commit/ceaeeda68ed01
https://github.com/SchedMD/slurm/commit/91b89a5c1d0e8
https://github.com/SchedMD/slurm/commit/c3aceaf91b02b

Now it should work as documented.

Dominik
Comment 22 Felix Abecassis 2021-07-21 15:41:58 MDT
Thank you, it seems to work as expected on a local setup with multiple slurmd!

We still need to test on a production cluster, but I will close this bug for now. Will reopen if I see an issue.