Ticket 78

Summary: MPI parameters oob_tcp_if_exclude and oob_tcp_if_include are ignored by srun
Product: Slurm Reporter: Nancy <nancy.kritkausky>
Component: Other    Assignee: Moe Jette <jette>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: da, guillaume.papaure, Rod.Schultz
Version: 2.3.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8494
Site: CEA Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Nancy 2012-07-10 02:31:55 MDT
When launched with srun, bullxmpi uses the eth0 interface for oob connections even if it is excluded by the MCA parameters.
With srun, the parameters oob_tcp_if_exclude and oob_tcp_if_include are ignored, and many sockets are created on the default interface. These parameters are functional when the application is launched with mpirun.
Comment 1 Moe Jette 2012-07-10 03:52:35 MDT
From what I can tell, bullxmpi is a variant of Open MPI, so I would expect Slurm to be configured with the mpi/openmpi or mpi/none plugin (the two are essentially identical, and both do essentially nothing), correct? Since Slurm isn't creating these connections, I just need to know how a user specifies the oob_tcp_if_include or oob_tcp_if_exclude parameters and how that information should be passed along to the spawned tasks (e.g. by setting some environment variable). We'll need some advice from the bullxmpi experts.
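For reference, the plugin selection described above is controlled in slurm.conf; a minimal fragment might look like the following (a sketch, not this site's actual configuration):

```
# slurm.conf fragment (sketch): the MPI plugin srun uses by default.
# A job can override it with: srun --mpi=<type> ...
MpiDefault=none
```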
Comment 2 Nancy 2012-07-10 07:13:23 MDT
I will request some more information for you on how this works.  Bull is using a flavor of OpenMPI.  I have requested the version.
Comment 3 Nancy 2012-07-25 05:21:34 MDT
Here is a recent comment on this problem.  I have also attached the versions of bullxmpi they are running.

   The problem exists on CEA/T100 also. The Slurm version is 2.3.3 and bullxmpi-1.1.14.
But I don't think this is a recent problem; I noticed it in our first tests of srun a year ago (the versions were Slurm 2.2.x and bullxmpi 1.1.x).
With bullxmpi tracing enabled, we can see that the parameters are read and correctly handled (interfaces are rejected or included as requested), but this has no effect on the final result: eth0 is always used.
The same tests with salloc + mpirun work correctly.

This problem can be reproduced with a simple "hello world":
export OMPI_MCA_oob_tcp_if_exclude=eth0
srun -n 2 -N 2 ./hello

I have the RPMs, but they are very large; can I email them to you?
Comment 4 Moe Jette 2012-07-25 05:52:52 MDT
Please email the RPM for Bull's MPI directly to me.

SLURM is configured with MpiDefault=none, so all that srun is doing is launching the processes; it does nothing with the network other than set up the application's stdin/stdout/stderr over the communication network assigned to Slurm (typically Ethernet).

I would expect the application's libraries to interpret oob_tcp_if_include and oob_tcp_if_exclude and establish their own network connections for MPI. We will probably need to work with Bull's MPI developers to resolve this. Do you have any contact information for them?
Comment 5 Moe Jette 2012-09-27 11:02:10 MDT
I see quite a few places where bullxmpi sets and gets environment variables with a prefix of "OMPI_MCA" as shown below, but I see no signs of anything that would reference "OMPI_MCA_oob_tcp_if_exclude". I can confirm in my tests that srun does forward the environment variable to the spawned user tasks, but from that point on it is the responsibility of the MPI libraries to interpret those environment variables and open network connections. Perhaps Bull has an in-house MPI expert? I'd be happy to work with him, but am not really in a position to debug the MPI source code.

orte/mca/ess/env/ess_env_module.c:        nodelist = getenv("OMPI_MCA_orte_nodelist");
orte/mca/ess/lsf/ess_lsf_module.c:        nodelist = getenv("OMPI_MCA_orte_nodelist");
orte/mca/ess/slurmd/ess_slurmd_module.c:    putenv("OMPI_MCA_grpcomm=hier");
orte/mca/ess/slurmd/ess_slurmd_module.c:    putenv("OMPI_MCA_routed=direct");
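The forwarding behavior described above can be checked directly from a shell; a minimal sketch (on a cluster, `sh -c 'env'` would be replaced by `srun -n 1 env`):

```shell
# Sketch: exported OMPI_MCA_* variables appear in a child process's
# environment; srun forwards the job environment to spawned tasks the
# same way (on a cluster: srun -n 1 env | grep '^OMPI_MCA_').
export OMPI_MCA_oob_tcp_if_exclude=eth0
sh -c 'env' | grep '^OMPI_MCA_'
```

If the variable shows up in the child's environment, the remaining question is only whether the MPI library acts on it.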
Comment 6 Nancy 2012-09-28 01:36:16 MDT
Thanks for the analysis.  I will find out who on the Bull MPI team you can interface with, and we can go from there.
Nancy
Comment 7 Guillaume Papauré 2012-10-10 03:34:41 MDT
Hi, I'm the bullxmpi contact.
This MCA parameter is ignored when using srun because, in this case, the ranks find the IP addresses of their oob endpoints by calling gethostbyname (I think this is the only way they have to find them).
Is your /etc/hosts or DNS server configured with the IPs of the IB devices?
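This can be checked from a compute node: gethostbyname consults the system resolver (/etc/hosts, then DNS, per nsswitch.conf), so the address a rank gets back depends on which IP the node's hostname maps to. A minimal diagnostic sketch, assuming a standard Linux resolver:

```shell
# Sketch: show which address a gethostbyname-style lookup returns for
# this node. If it maps to the eth0 address rather than the IB address,
# the oob endpoints will land on eth0 regardless of the MCA filters.
getent hosts "$(hostname)"
# Compare against the interface addresses, e.g.:
# ip -o addr show
```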
Comment 8 Moe Jette 2012-10-17 04:19:07 MDT

*** This ticket has been marked as a duplicate of ticket 144 ***