Ticket 4861 - Job allocations seem to span disjoint networks, despite using topology/tree
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
Reported: 2018-03-02 12:41 MST by Kilian Cavalotti
Modified: 2018-03-20 08:41 MDT
Site: Stanford
Version Fixed: 17.11.5


Description Kilian Cavalotti 2018-03-02 12:41:04 MST
Hi!

We're probably doing something wrong here, so bear with me, but it looks like some job allocations can span disjoint networks defined in topology.conf.

The documentation at https://slurm.schedmd.com/topology.html states that:
"""
compute nodes on switches that lack a common parent switch can be used, but no job will span leaf switches without a common parent. 
"""

Our topology.conf basically looks like this:

SwitchName=core1 Switches=s1,s2
SwitchName=s1 Nodes=n11,n12
SwitchName=s2 Nodes=n21,n22

SwitchName=core2 Switches=s3,s4
SwitchName=s3 Nodes=n31,n32
SwitchName=s4 Nodes=n41,n42

And this in slurm.conf:

TopologyPlugin=topology/tree
TopologyParam=TopoOptional


I have noticed some large job allocations (not using any --switches option) spanning core1 and core2 (for instance, a single job allocating both n11 and n41).
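
To make the observation concrete, here is a minimal Python sketch that checks whether a set of nodes shares a common top-level switch. It is not part of Slurm: the function names (parse_topology, root_switch, spans_disjoint_networks) are hypothetical, it assumes the Switches= entries refer to the leaf switches s1 to s4, and it only handles plain comma-separated names, not Slurm hostlist ranges such as n[11-12].

```python
# Simplified topology from this ticket. Assumption: the core switches list
# the leaf switches s1..s4 as their children.
TOPOLOGY = """\
SwitchName=core1 Switches=s1,s2
SwitchName=s1 Nodes=n11,n12
SwitchName=s2 Nodes=n21,n22

SwitchName=core2 Switches=s3,s4
SwitchName=s3 Nodes=n31,n32
SwitchName=s4 Nodes=n41,n42
"""


def parse_topology(text):
    """Return (parent_of, leaf_of).

    parent_of maps a switch to its parent switch.
    leaf_of maps a node to the leaf switch it is attached to.
    """
    parent_of, leaf_of = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Each line is a series of whitespace-separated Key=Value pairs.
        fields = dict(item.split("=", 1) for item in line.split())
        name = fields["SwitchName"]
        for child in filter(None, fields.get("Switches", "").split(",")):
            parent_of[child] = name
        for node in filter(None, fields.get("Nodes", "").split(",")):
            leaf_of[node] = name
    return parent_of, leaf_of


def root_switch(node, parent_of, leaf_of):
    """Walk up from the node's leaf switch to its top-level switch."""
    switch = leaf_of[node]
    while switch in parent_of:
        switch = parent_of[switch]
    return switch


def spans_disjoint_networks(nodes, parent_of, leaf_of):
    """True if the nodes do not all sit under one top-level switch."""
    roots = {root_switch(n, parent_of, leaf_of) for n in nodes}
    return len(roots) > 1


parent_of, leaf_of = parse_topology(TOPOLOGY)
print(spans_disjoint_networks(["n11", "n41"], parent_of, leaf_of))  # True
print(spans_disjoint_networks(["n11", "n21"], parent_of, leaf_of))  # False
```

Under these assumptions, n11 resolves to core1 and n41 to core2, so an allocation containing both has no common parent switch, which is the situation described above.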


Is that expected? 

Thanks!
-- 
Kilian
Comment 1 Dominik Bartkiewicz 2018-03-05 02:43:34 MST
Hi

Yes, that is expected when TopoOptional is used,
unless the job explicitly requests switches.

Dominik
Comment 2 Kilian Cavalotti 2018-03-05 10:04:57 MST
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #1)
> Yes, that is expected when TopoOptional is used,
> unless the job explicitly requests switches.

Ah I see, thanks!

Do you think it would be useful to add some clarification to the documentation?
The TopoOptional description in the slurm.conf man page doesn't say anything about disjoint networks. It would probably be worth mentioning that, with this option, jobs can be allocated across disjoint networks.

Same thing for topology.conf, with something like:
"no job will span leaf switches without a common parent (unless the TopologyParam=TopoOptional option is used)."

Thanks!
-- 
Kilian
Comment 5 Dominik Bartkiewicz 2018-03-20 04:56:02 MDT
Hi

As you suggested, we added this information to the documentation:
https://github.com/SchedMD/slurm/commit/2d09a777443ded4b1
It is in 17.11.5 and up.

Dominik
Comment 6 Kilian Cavalotti 2018-03-20 08:41:49 MDT
(In reply to Dominik Bartkiewicz from comment #5)
> Hi
> 
> As you suggested, we added this information to the documentation:
> https://github.com/SchedMD/slurm/commit/2d09a777443ded4b1
> It is in 17.11.5 and up.


Great, thanks!

Cheers,
-- 
Kilian