| Summary: | use of topology | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Meyer <dameyer> |
| Component: | Other | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 18.08.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Raytheon Missile, Space and Airborne | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
Description
Doug Meyer
2020-01-28 08:54:36 MST
Hi Doug,

(In reply to Doug Meyer from comment #0)
> We are introducing topology to our clusters. We have multiple clusters
> under job control by slurm, some with fabrics, some without. A few
> questions:

I am assuming that by "cluster" you mean a physical cluster, defined as a subset of nodes, and not Slurm's 'cluster' terminology.

> 1) If a cluster has a single fabric switch, does the cluster need to be
> documented in the slurm.conf for MPI jobs to be assigned?

I understand you meant topology.conf? All nodes must be defined in both slurm.conf and topology.conf. Note that the topology is used to optimize job allocations to minimize network contention and is not *directly* related to MPI. An optimized allocation will, however, let MPI processes communicate faster than an under-optimized one spanning more distant nodes.

If you are using topology.conf, then all nodes need to be defined in the file; otherwise they cannot be allocated and you will receive an error like:

    error: Unable to allocate resources: Requested node configuration is not available

> 2) If a cluster has no fabric, ref: Bug ID 7466, jobs will be scheduled but
> we can ignore the error messages, correct?

You can ignore the error messages if you have intentionally configured multiple switches and there is no way for some of them to communicate from one node to the other. For example:

    #Nodes within IB network
    SwitchName=fdrsw Nodes=gamba[1-20]
    SwitchName=edrsw Nodes=patata[1-30]
    SwitchName=ibsw0 Switches=fdrsw,edrsw
    #Nodes within GB network
    SwitchName=gbsw Nodes=llagosti[1-20]

You will receive the warning because there is no way to reach the IB network from the GB network, or the opposite, but if that is intentional, just ignore it.

> 3) Are the users required to change their srun once topology is enforced or
> does slurm go for best fit?

No, Slurm will go for best fit.
An exception would be if the srun requests more nodes than are reachable through any single switch; in the example above, 'srun -N70' would fail. Also, if the user specifically wants to restrict a job to a given number of switches, they can use the --switches flag.

> 4) If a cluster has a fabric but it is not declared in the topology.conf,
> will MPI jobs be assigned?

All nodes must be declared in topology.conf.

Does this resolve your questions? I'm sure you already did, but just in case I recommend reading:
https://slurm.schedmd.com/topology.html

Hi,

Yes, I should have said partition. There is no fabric connection between the different partitions.

Since all the non-fabric (single thread) nodes are on a flat network, can we shortcut and just list all the "other" nodes against a single line, as in your example, despite the fact that they are in different partitions?

Does the topology.conf need to be on each node just as the gres.conf does?

Thank you for the fast turn and clear answers.

Doug

(In reply to Doug Meyer from comment #2)
> Hi,
>
> Yes, I should have said partition. There is no fabric connection between
> the different partitions.
>
> Since all the non-fabric (single thread) nodes are on a flat network, can we
> shortcut and just list all the "other" nodes against a single line as you
> have in your example despite the fact they are in different partitions?

Yes indeed, you can do that. If partitions are disjoint, allocations won't happen between partitions even if all of them are under the same topology.conf switch.

Question: are nodes visible all-to-all in your configuration when speaking about inter-daemon communication (slurmctld<->slurmd, slurmd<->slurmd)? If they are not, you may be interested in the route/topology plugin (see man slurm.conf).

> Does the topology.conf need to be on each node just as the gres.conf does?

Yes, it needs to be kept in sync on each node.
slurmd may use this file when the route/topology plugin is set, and by default the file is read on slurmd startup to fill the configuration data structures.

> Thank you for the fast turn and clear answers.

You're welcome!

Doug, I am closing this bug since the requested info has been provided. If you have further questions or issues, feel free to mark this bug as OPEN again or just create a new bug.

Regards
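To tie the thread together, here is a minimal sketch of the configuration being discussed: enabling the tree topology plugin in slurm.conf and describing the fabric in topology.conf. The switch and node names are the illustrative ones from the example earlier in this thread, not the reporter's actual hosts, and RoutePlugin is shown commented out since it is only relevant when daemon-to-daemon traffic should follow the same tree.

```
# slurm.conf (excerpt) -- illustrative snippet
TopologyPlugin=topology/tree
# Uncomment to route slurmctld/slurmd communication along the tree:
#RoutePlugin=route/topology

# topology.conf -- must list every node defined in slurm.conf,
# and must be kept in sync on all nodes
SwitchName=fdrsw Nodes=gamba[1-20]
SwitchName=edrsw Nodes=patata[1-30]
SwitchName=ibsw0 Switches=fdrsw,edrsw
SwitchName=gbsw Nodes=llagosti[1-20]
```

With this layout, a 50-node request can be satisfied under ibsw0, 'srun -N70' fails because no single switch reaches 70 nodes, and a user who wants to confine a job to at most one leaf switch can add --switches=1 to their submission.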