Closely related to https://bugs.schedmd.com/show_bug.cgi?id=9624

From https://slurm.schedmd.com/topology.html:

> The basic algorithm is to identify the lowest level switch in the hierarchy that can satisfy a job's request and then allocate resources on its underlying leaf switches using a best-fit algorithm.

From https://slurm.schedmd.com/slurm.conf.html:

> Given network topology information, Slurm allocates all of a job's resources onto a single leaf of the network (if possible) using a best-fit algorithm. Otherwise it will allocate a job's resources onto multiple leaf switches so as to minimize the use of higher-level switches.

But experimenting with a simple topology/tree configuration, it doesn't seem to work this way. Instead, it just minimizes the number of leaf switches, and that can lead to a suboptimal node allocation.

Consider this simple 3-level (leaf-spine-core) tree topology.conf file:

```
SwitchName=leaf01 Nodes=node-1,node-2
SwitchName=leaf02 Nodes=node-3
SwitchName=leaf03 Nodes=node-4
SwitchName=spine01 Switches=leaf01,leaf02,leaf03
SwitchName=leaf04 Nodes=node-5,node-6
SwitchName=spine02 Switches=leaf04
SwitchName=core Switches=spine01,spine02
```

Trying to schedule a 4-node job yields the following:

```
$ srun -N4 -l bash -c 'echo $SLURM_STEP_NODELIST $SLURMD_NODENAME' | sort
0: node-[1-2,5-6] node-1
1: node-[1-2,5-6] node-2
2: node-[1-2,5-6] node-5
3: node-[1-2,5-6] node-6
```

As you can see, those 4 nodes are only connected through the core switch, which means more hops during communication. Hence, it's not following the documentation, which states:

> identify the lowest level switch in the hierarchy that can satisfy a job's request

If that were the case, it would have selected switch spine01 as the lowest-level switch. It's not minimizing the total number of switches either, since this allocation involves 5 switches: leaf01, spine01, core, leaf04, spine02 (2 leaf switches).
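To make the switch-count comparison concrete, here is a small Python sketch (purely illustrative, not Slurm code) that encodes the example topology above and counts every switch touched by a given allocation:

```python
# Illustrative sketch (not Slurm code): count the switches an
# allocation touches in the example topology from this report.

TOPOLOGY = {
    "leaf01": {"node-1", "node-2"},
    "leaf02": {"node-3"},
    "leaf03": {"node-4"},
    "leaf04": {"node-5", "node-6"},
}
PARENTS = {  # leaf -> spine, spine -> core
    "leaf01": "spine01", "leaf02": "spine01", "leaf03": "spine01",
    "leaf04": "spine02", "spine01": "core", "spine02": "core",
}

def switches_used(nodes):
    """Return every switch on the paths joining the given nodes."""
    used = {leaf for leaf, members in TOPOLOGY.items() if members & nodes}
    # Walk up until the subtrees merge: as long as the selected
    # switches have more than one distinct ancestor at the current
    # level, traffic must also cross their parent switches.
    frontier = set(used)
    while len(frontier) > 1:
        frontier = {PARENTS[s] for s in frontier}
        used |= frontier
    return used

slurm_pick = {"node-1", "node-2", "node-5", "node-6"}
spine01_pick = {"node-1", "node-2", "node-3", "node-4"}
print(sorted(switches_used(slurm_pick)))    # 5 switches, incl. core
print(sorted(switches_used(spine01_pick)))  # 4 switches, no core
```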
Whereas an allocation from spine01 would use only 4 switches, albeit 3 leaf switches: leaf01, leaf02, leaf03, spine01.

The code comment indeed directly contradicts the documentation: https://github.com/SchedMD/slurm/blob/master/src/plugins/select/cons_tres/job_test.c#L2073

> Allocate resources to job using a minimal leaf switch count

Looking further at the code, it's not trying to find the lowest switch that can satisfy the job, but instead the *highest* top-level switch that can reach all available nodes:

https://github.com/SchedMD/slurm/blob/90181a5207a0fe9013e827f7f91d1cbd18848335/src/plugins/select/cons_tres/job_test.c#L2220-L2223
https://github.com/SchedMD/slurm/blob/90181a5207a0fe9013e827f7f91d1cbd18848335/src/plugins/select/cons_tres/job_test.c#L2250-L2258

In our topology example above, that top-level switch is the "core" switch, spanning the whole cluster. Afterwards, it tries to minimize the number of leaf switches in a greedy fashion, adding the larger leaf switches first, regardless of the overall topology:

https://github.com/SchedMD/slurm/blob/90181a5207a0fe9013e827f7f91d1cbd18848335/src/plugins/select/cons_tres/job_test.c#L2574

That's why we end up with 2 leaf switches in the example above.
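The difference between the two strategies can be sketched in a few lines of Python. This is my simplified reading of the behavior, not actual Slurm code: the greedy variant starts from the top-level switch and adds the largest leaf switches first, while the documented behavior would pick the lowest switch whose subtree can satisfy the request.

```python
# Illustrative sketch of the two strategies (assumed behavior,
# not actual Slurm code), using the example topology above.

LEAVES = {
    "leaf01": ["node-1", "node-2"],
    "leaf02": ["node-3"],
    "leaf03": ["node-4"],
    "leaf04": ["node-5", "node-6"],
}
SUBTREES = {  # leaf switches reachable from each non-leaf switch
    "spine01": ["leaf01", "leaf02", "leaf03"],
    "spine02": ["leaf04"],
    "core": ["leaf01", "leaf02", "leaf03", "leaf04"],
}

def greedy_min_leaves(want):
    """Observed behavior: from the top-level switch, largest leaves first."""
    picked = []
    for leaf in sorted(SUBTREES["core"], key=lambda l: -len(LEAVES[l])):
        for node in LEAVES[leaf]:
            if len(picked) < want:
                picked.append(node)
    return sorted(picked)

def lowest_common_switch(want):
    """Documented behavior: lowest switch whose subtree satisfies the job."""
    best = min(
        (s for s, leaves in SUBTREES.items()
         if sum(len(LEAVES[l]) for l in leaves) >= want),
        key=lambda s: sum(len(LEAVES[l]) for l in SUBTREES[s]),
    )
    nodes = [n for l in SUBTREES[best] for n in LEAVES[l]]
    return sorted(nodes[:want])

print(greedy_min_leaves(4))     # ['node-1', 'node-2', 'node-5', 'node-6']
print(lowest_common_switch(4))  # ['node-1', 'node-2', 'node-3', 'node-4']
```

The greedy variant reproduces the node-[1-2,5-6] allocation seen in the srun output above (2 leaf switches, but traffic through core), while the documented strategy would stay under spine01.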
Hi,

You are right. This documentation is mostly relevant to cons_res. As you know, we have a couple of open bugs related to topology/tree. After we solve them, we will prepare proper documentation describing the behavior and limitations.

Dominik
Hi,

21.08 cons_tres topology/tree contains these patches:

https://github.com/SchedMD/slurm/commit/0a5d02a8a9134
https://github.com/SchedMD/slurm/commit/ceaeeda68ed01
https://github.com/SchedMD/slurm/commit/91b89a5c1d0e8
https://github.com/SchedMD/slurm/commit/c3aceaf91b02b

Now it should work as documented.

Dominik
Thank you, it seems to work as expected on a local setup with multiple slurmd daemons! We still need to test on a production cluster, but I will close this bug for now and reopen if I see an issue.