Ticket 9729 - topology/tree + select/cons_tres node selection not working as documented
Summary: topology/tree + select/cons_tres node selection not working as documented
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.x
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-09-02 19:36 MDT by Felix Abecassis
Modified: 2021-07-21 15:41 MDT
CC List: 2 users

See Also:
Site: NVIDIA (PSLA)
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Felix Abecassis 2020-09-02 19:36:11 MDT
Closely related to https://bugs.schedmd.com/show_bug.cgi?id=9624

From https://slurm.schedmd.com/topology.html:
The basic algorithm is to identify the lowest level switch in the hierarchy that can satisfy a job's request and then allocate resources on its underlying leaf switches using a best-fit algorithm.

From https://slurm.schedmd.com/slurm.conf.html:
Given network topology information, Slurm allocates all of a job's resources onto a single leaf of the network (if possible) using a best-fit algorithm. Otherwise it will allocate a job's resources onto multiple leaf switches so as to minimize the use of higher-level switches. 


But experimenting with a simple topology/tree configuration, it doesn't seem to work this way. Instead, it's just minimizing the number of leaf switches, and that can lead to a suboptimal node allocation.

Consider this simple 3-level (leaf-spine-core) tree topology.conf file:
SwitchName=leaf01  Nodes=node-1,node-2
SwitchName=leaf02  Nodes=node-3
SwitchName=leaf03  Nodes=node-4
SwitchName=spine01 Switches=leaf01,leaf02,leaf03

SwitchName=leaf04  Nodes=node-5,node-6
SwitchName=spine02 Switches=leaf04

SwitchName=core Switches=spine01,spine02
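The documented behavior can be sketched as follows. This is an illustrative model only, not Slurm's actual cons_tres code; the switch names and levels mirror the topology.conf above.

```python
# Illustrative model of the *documented* algorithm: find the lowest-level
# switch whose subtree contains enough nodes to satisfy the request.

LEAVES = {
    "leaf01": ["node-1", "node-2"],
    "leaf02": ["node-3"],
    "leaf03": ["node-4"],
    "leaf04": ["node-5", "node-6"],
}
PARENTS = {
    "spine01": ["leaf01", "leaf02", "leaf03"],
    "spine02": ["leaf04"],
    "core": ["spine01", "spine02"],
}

def nodes_under(switch):
    """All nodes reachable from a switch, at any depth."""
    if switch in LEAVES:
        return list(LEAVES[switch])
    return [n for child in PARENTS[switch] for n in nodes_under(child)]

def lowest_satisfying_switch(n_nodes):
    """Lowest-level switch whose subtree has at least n_nodes nodes."""
    # Level 0 = leaf switches, then spines, then the core switch.
    levels = [list(LEAVES), ["spine01", "spine02"], ["core"]]
    for level in levels:
        for sw in level:
            if len(nodes_under(sw)) >= n_nodes:
                return sw
    return None

print(lowest_satisfying_switch(4))  # spine01: nodes 1-4 all share one spine
```

Under this model a 4-node job would be satisfied entirely under spine01, never touching the core switch.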


Trying to schedule a 4-node job yields the following:
$ srun -N4 -l bash -c 'echo $SLURM_STEP_NODELIST $SLURMD_NODENAME' | sort
0: node-[1-2,5-6] node-1
1: node-[1-2,5-6] node-2
2: node-[1-2,5-6] node-5
3: node-[1-2,5-6] node-6
As you can see, those 4 nodes are connected through the core switch. This means more hops during communication.

Hence, it does not follow the documentation, which states:
> identify the lowest level switch in the hierarchy that can satisfy a job's request
If that were the case, it would have selected switch spine01 as the lowest-level switch.

It's not minimizing the total number of switches either, since this allocation involves 5 switches: leaf01, spine01, core, leaf04, spine02 (2 of them leaf switches).

An allocation under spine01 would use only 4 switches (leaf01, leaf02, leaf03, spine01), albeit 3 leaf switches.

The code comment indeed directly contradicts the documentation: https://github.com/SchedMD/slurm/blob/master/src/plugins/select/cons_tres/job_test.c#L2073
> Allocate resources to job using a minimal leaf switch count 


Looking further at the code, it is not trying to find the lowest-level switch that can satisfy the job, but instead starts from the *highest*-level switch that can reach all available nodes:
https://github.com/SchedMD/slurm/blob/90181a5207a0fe9013e827f7f91d1cbd18848335/src/plugins/select/cons_tres/job_test.c#L2220-L2223
https://github.com/SchedMD/slurm/blob/90181a5207a0fe9013e827f7f91d1cbd18848335/src/plugins/select/cons_tres/job_test.c#L2250-L2258

In our topology example above, the top-level switch will be the "core" switch, spanning the whole cluster.

Afterwards, it tries to minimize the number of leaf switches in a greedy fashion, adding the largest leaf switches first, regardless of the overall topology:
https://github.com/SchedMD/slurm/blob/90181a5207a0fe9013e827f7f91d1cbd18848335/src/plugins/select/cons_tres/job_test.c#L2574
That's why we end up with 2 leaf switches in the example above.
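The greedy behavior described above can be sketched like this. Again this is an illustrative model, not the cons_tres implementation: it simply picks the leaf switches with the most available nodes first, ignoring where they sit in the tree.

```python
# Sketch of the *observed* greedy behavior: fill the job from the largest
# leaf switches first, minimizing leaf-switch count but ignoring spines.

LEAF_NODES = {
    "leaf01": ["node-1", "node-2"],
    "leaf02": ["node-3"],
    "leaf03": ["node-4"],
    "leaf04": ["node-5", "node-6"],
}

def greedy_pick(n_nodes):
    """Pick n_nodes nodes, visiting leaf switches largest-first."""
    picked = []
    for leaf in sorted(LEAF_NODES, key=lambda l: -len(LEAF_NODES[l])):
        for node in LEAF_NODES[leaf]:
            if len(picked) < n_nodes:
                picked.append(node)
        if len(picked) >= n_nodes:
            break
    return sorted(picked)

print(greedy_pick(4))  # ['node-1', 'node-2', 'node-5', 'node-6']
```

For a 4-node job this reproduces the allocation seen in the srun output above: leaf01 plus leaf04, two leaf switches that only meet at the core, rather than the four nodes available under spine01.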
Comment 2 Dominik Bartkiewicz 2020-09-07 04:34:42 MDT
Hi

You are right. This documentation is mostly relevant to cons_res.
As you know, we have a couple of open bugs related to topology/tree.
After we solve them, we will prepare proper documentation with a description of behavior and limitations.

Dominik
Comment 21 Dominik Bartkiewicz 2021-07-15 05:37:49 MDT
Hi

In 21.08, cons_tres with topology/tree contains these patches:

https://github.com/SchedMD/slurm/commit/0a5d02a8a9134
https://github.com/SchedMD/slurm/commit/ceaeeda68ed01
https://github.com/SchedMD/slurm/commit/91b89a5c1d0e8
https://github.com/SchedMD/slurm/commit/c3aceaf91b02b

Now it should work as documented.

Dominik
Comment 22 Felix Abecassis 2021-07-21 15:41:58 MDT
Thank you, it seems to work as expected on a local setup with multiple slurmd!

We still need to test on a production cluster, but I will close this bug for now. Will reopen if I see an issue.