Ticket 144

Summary: Improve network support with MPI
Product: Slurm Reporter: Moe Jette <jette>
Component: Other Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: da, guillaume.papaure, matthieu.hautreux, nancy.kritkausky
Version: 2.5.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8494
Site: CEA Slinky Site: ---
CLE Version: Version Fixed: 14.03.0pre6
Attachments: pmi1 and pmi2 mpich2-1.5 working with Slurm scalability report

Description Moe Jette 2012-10-17 04:16:50 MDT
This bug is a replacement for bugs 77 and 78, both of which identify problems in how MPI jobs launched with srun use network resources. The problem is that MPI lacks network details when spawned. One way to address this would be to develop a new SLURM switch plugin, similar in functionality to switch/nrt (for IBM systems), but supporting a generic network interface. A proposed design is as follows:

1. When the slurmd daemon starts, it gathers network interface information (device names, types, and addresses) and sends the information to slurmctld with its registration information.
2. The slurmctld maintains this network information about every node in the system.
3. When a job step is allocated resources, the network information available for the allocated nodes is sent to the slurmd daemons on compute nodes as part of the job credential.
4. Add a new SLURM API so that MPI can get the network information from the local slurmd daemon. This would be a data structure to simplify use, probably something like this:
  Node count
  For each node:
    Node name
    Network count
    For each network
      Device name
      Device type (IB, Eth, etc.)
      Address type (IP V4 or V6, a flag)
      Network address
We would also add a free data structure function.
5. Modify MPI to get this network information from SLURM.

Items 1, 2, and 3 can be done easily by building a variation of the existing switch/nrt plugin.
Item 4 requires a new API, but should also be simple to add.
Item 5 would require changes to Bull MPI. I would expect that to also be simple to add, but Guillaume Papaure can comment on that.
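The data structure proposed in item 4 could be expressed roughly as follows. This is a sketch in Python for illustration only; the type and function names are hypothetical, and a real implementation would be a C struct in the SLURM API together with the matching free function mentioned above.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical names: the ticket sketches the fields of the proposed
# API result, not an actual type or function name.
@dataclass
class NetworkInfo:
    device_name: str     # e.g. "eth0", "mlx4_0"
    device_type: str     # "IB", "Eth", etc.
    address_family: str  # "IP_V4" or "IP_V6" flag
    address: str

@dataclass
class NodeNetworks:
    node_name: str
    networks: List[NetworkInfo]

def example_step_info() -> List[NodeNetworks]:
    """One entry per allocated node, mirroring the proposed layout."""
    return [
        NodeNetworks("smd1", [
            NetworkInfo("eth0", "Eth", "IP_V4", "192.168.1.51"),
            NetworkInfo("ib0", "IB", "IP_V4", "10.0.0.51"),
        ]),
    ]
```

MPI would walk this structure once at startup to learn every adapter on every node of the step, then release it with the free function.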
Comment 1 Moe Jette 2012-10-17 04:18:27 MDT
*** Ticket 77 has been marked as a duplicate of this ticket. ***
Comment 2 Moe Jette 2012-10-17 04:19:07 MDT
*** Ticket 78 has been marked as a duplicate of this ticket. ***
Comment 3 Moe Jette 2012-10-17 10:22:01 MDT
Thank you Moe,

That is the problem that we discussed at Barcelona. I will have to ask
my colleagues working more specifically on that and try to reproduce
the problem. I was not sure that it was specifically a problem with
MPI but it may be the case.

Looking at the openmpi website, another way to use openmpi directly with
SLURM is through the PMI library. This could be a good workaround, or
help confirm this as a specific "srun --resv-port + openMPI" issue.
Nevertheless, it requires compiling openmpi (or bullxmpi) with
"--with-pmi=...pmi_slurm...". You could perhaps propose that to
Guillaume in the meantime. I would definitely be in favor of that, but
in the long term I am not sure that Bull would be satisfied, as
bullxmpi is not GPL and linking to SLURM's libpmi requires being GPL
licensed.

The relevant link:
http://www.open-mpi.org/community/lists/users/2012/01/18137.php

HTH
Matthieu
Comment 4 Moe Jette 2012-10-17 10:50:04 MDT
Matthieu raises a good point: linking with a SLURM library would force the GPL license onto bullxmpi. Other options include:

1. pass the data in one or more environment variables
2. have the slurmd or slurmstepd open a named socket from which bullxmpi can read the data
3. have the slurmd or slurmstepd write the information to a text file that bullxmpi can read; the file would be deleted upon job termination
Comment 5 Moe Jette 2012-10-22 06:00:15 MDT
I reproduced the problem and it is as described in the bug report, that
is to say, only a problem with srun --resv-ports execution of
openmpi/bullxmpi.
It seems that when launched without orted, communication between
processes is made directly (probably due to the
OMPI_MCA_routed=direct env variable forced in
orte/mca/ess/slurmd/ess_slurmd_module.c, as you noticed in your comment
on bug 78). As a result, I am no longer sure that using SLURM's libpmi
would really help, but it should be tried to see the behavior.
Indeed, libpmi would still involve multiple autonomous processes
per node that do not route their messages through the orted daemon. I am
wondering why this choice of direct message routing was made, as it
is clearly an issue for large-scale launches. I do not have enough
knowledge of openmpi to know whether the design and the logic could mimic
the behavior of the orted daemon by first routing the messages of each
node's processes through that node's local rank 0. Adding more network
info as proposed could help cope with bug 78, but not with bug 77,
which seems induced only by this routed=direct design limitation
(bug 78 originates in the gethostbyname call used to query the endpoint
for out-of-band communication, as described by Guillaume)

Regards,
Matthieu
Comment 6 Moe Jette 2012-10-22 06:08:30 MDT
The port reservation logic was added to SLURM and OpenMPI a couple of years ago to improve OpenMPI initialization time. SLURM is configured with a set of ports and allocates those ports to the job step. SLURM is not opening any files, but only managing a set of numbers representing ports and allocating those numbers (ports) to job steps for their use. Each job step will be allocated a unique set of ports (no two steps will be allocated the same port on the same node at the same time). The number of ports reserved for each job step will be the maximum number of tasks (ranks) to be launched for the job step on any single node. 
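The reservation scheme described above can be sketched as pure bookkeeping over port numbers. This is a simplified single-node model for illustration (real Slurm also reserves the same port numbers on every node of the step); the class and method names are hypothetical.

```python
# Sketch of port reservation: Slurm manages only port *numbers* from a
# configured range and never opens any sockets itself.
class PortReservation:
    def __init__(self, port_range):
        self.free = set(port_range)   # configured, unallocated ports
        self.steps = {}               # step_id -> reserved ports

    def alloc(self, step_id, max_tasks_per_node):
        """Reserve one port per task for the node with the most tasks."""
        if len(self.free) < max_tasks_per_node:
            raise RuntimeError("configured port range exhausted")
        ports = sorted(self.free)[:max_tasks_per_node]
        self.free.difference_update(ports)
        self.steps[step_id] = ports
        return ports

    def release(self, step_id):
        """Return a completed step's ports to the free pool."""
        self.free.update(self.steps.pop(step_id))
```

Since ports are removed from the free pool while a step holds them, no two concurrent steps can ever be handed the same port number.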

This really seems like an OpenMPI problem, but I would be happy to work with Guillaume on any SLURM changes that would address this problem.

Moe
Comment 7 Guillaume Papauré 2013-03-26 05:23:22 MDT
Created attachment 217 [details]
pmi1 and pmi2 mpich2-1.5 working with Slurm scalability report
Comment 8 Guillaume Papauré 2013-03-26 05:30:07 MDT
As we discussed with Danny Auble, Bull is looking into integrating PMI support in bullxMPI.
Slurm already provides a full pmi1 library. Our tests have shown that pmi1 has some scalability issues on large clusters.
We've also seen that these issues seem to be solved with pmi2 as implemented in mpich2 running with slurm (see attached curves).
Currently Slurm provides a pmi2 plugin with an API split at a different level than pmi1: half is already in the Slurm plugin, and the second half, providing the final pmi2 API, is in mpich2.
Unfortunately the current API provided by Slurm is not sufficient for us to work with pmi2.
Our request last week was to integrate the second half of pmi2 into Slurm in order to implement pmi2 support in Bullx/OpenMPI.
Do you still agree with this request?
The mpich team has documented the pmi2 API: http://wiki.mpich.org/mpich/index.php/PMI_v2_API
The mpich pmi2 implementation is available in mpich2-1.5 under the directory src/pmi/pmi2.

It seems that the second half of pmi2 is a network client; maybe this could also solve the GPL license problem, by delivering this part under the LGPL license?
Guillaume
Comment 9 Moe Jette 2013-03-26 09:55:15 MDT
Since the PMI2 client API links with the Slurm library and uses Slurm functions, it is not possible to simply change its license to LGPL. The client functions would need to be re-written without re-using any GPL functions. I would guess this could be completed in less than one week of work. I am not promising to do the work, but it does seem fairly important. I did not do the original development work for PMI2, and releasing this library under the GPL does not even make sense to me.
Comment 10 Moe Jette 2013-06-25 03:04:27 MDT
My note about licensing in comment #9 (26 March) is incorrect. In Slurm version 2.6, Slurm's mpi/pmi2 plugin sets an environment variable, PMI_FD. All of the application's PMI interactions occur over this socket using a well-defined protocol. The application does not need to link with any Slurm libraries. We do not believe Slurm's GPL license would have any impact upon the application's license.
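The PMI_FD mechanism can be sketched as follows. The "cmd=... key=val" line style follows the PMI v1 wire-protocol conventions documented by the MPICH project; treat the exact command shown here as illustrative rather than authoritative, and note that no Slurm library is linked at any point.

```python
import os
import socket

def pmi_command(cmd, **kv):
    """Format one key=value command line for the PMI socket (sketch)."""
    parts = [f"cmd={cmd}"] + [f"{k}={v}" for k, v in kv.items()]
    return " ".join(parts) + "\n"

def open_pmi_socket():
    """Wrap the file descriptor inherited via PMI_FD in a socket object."""
    fd = int(os.environ["PMI_FD"])
    return socket.socket(fileno=fd)
```

Because the application only inherits a file descriptor and speaks a documented protocol over it, the GPL linking concern from comment #3 does not arise.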
Comment 11 Moe Jette 2013-06-28 07:43:13 MDT
Here is a bit more information based upon our telecon this week.

The logic in src/plugins/switch/nrt can be a helpful example of how to manage a job step's network details. Many of the data structures are based upon the IBM library and header file, but the basic mode of operation is as follows.

1. When the slurmd starts, it gathers network information including adapter name (eth0, mlx4_0, etc.), type (infiniband, ethernet), IP address, and some IBM-specific information. This information is then sent to the slurmctld, which maintains a table of the network details for every node.
2. When a user submits a job step, they can specify network requirements (e.g. use 2 infiniband adapters).
3. The slurmctld satisfies the job step's network requirements (based upon the table, which is maintained in the switch plugin) and sends that information in the job step credential.
4. The slurmd on each node allocates the appropriate network resources to the job step at startup time.

On non-IBM systems, we would need to do things slightly differently:
For #1: Collect the network information using generic tools.
For #4: Make the information available via PMI job attributes. This might be set by the srun command or as part of the job launch logic rather than by each slurmd, since the information already passes through all of the slurmd daemons involved in launching the job step. There would need to be some more sophisticated logic for jobs that grow in size, but that covers the basics.
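The two non-IBM variations above could be sketched like this. Interface enumeration uses only portable calls (a real slurmd would also gather addresses, e.g. via getifaddrs(3)), and the attribute format is a hypothetical illustration of packing the table into one PMI job-attribute value.

```python
import socket

def list_interfaces():
    """Generic variant of step #1: interface names known to the kernel,
    e.g. ['lo', 'eth0']."""
    return [name for _, name in socket.if_nameindex()]

def network_attr(records):
    """Generic variant of step #4: pack (node, device, address) records
    into a single attribute string for the PMI key-value space."""
    return ";".join(f"{node}:{dev}:{addr}" for node, dev, addr in records)
```

An MPI library could then fetch the single attribute through PMI and split it back into per-node records, with no Slurm-specific API required.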
Comment 12 Moe Jette 2013-07-31 05:01:10 MDT
This comment is duplicated from bug 77. In the new code (version 13.12, in the master branch) we are getting network information to srun. The idea would be to make the information available using the PMI2 interface. 

Data now reaching srun. Configuration has:
SwitchType=switch/generic
DebugFlags=switch  # This is what generates all of the logs shown below

$ srun -N3 hostname
srun: switch_p_alloc_jobinfo() starting
srun: switch_p_unpack_jobinfo() starting
srun: node=smd1 name=eth0 family=IP_V4 addr=192.168.1.51
srun: node=smd1 name=eth1 family=IP_V4 addr=192.168.1.51
srun: node=smd1 name=ib0 family=IP_V4 addr=10.0.0.51
srun: node=smd1 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:feea:6478
srun: node=smd1 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:dbd9
srun: node=smd2 name=eth0 family=IP_V4 addr=192.168.1.52
srun: node=smd2 name=eth1 family=IP_V4 addr=192.168.1.52
srun: node=smd2 name=ib0 family=IP_V4 addr=10.0.0.52
srun: node=smd2 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:fee9:e703
srun: node=smd2 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:dc19
srun: node=smd3 name=eth0 family=IP_V4 addr=192.168.1.53
srun: node=smd3 name=eth1 family=IP_V4 addr=192.168.1.53
srun: node=smd3 name=ib0 family=IP_V4 addr=10.0.0.53
srun: node=smd3 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:feea:649c
srun: node=smd3 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:db65
srun: switch_p_pack_jobinfo() starting
srun: switch_p_pack_jobinfo() starting
srun: switch_p_pack_jobinfo() starting
smd1
smd2
smd3
Comment 13 Moe Jette 2013-11-15 03:56:30 MST
I had noted in the meeting minutes that Yiannis would provide you with the code changes for this bug. He is still testing and completing this code. He plans to send it to you by the end of next week (i.e. Nov 22).
Comment 14 Moe Jette 2014-02-06 04:56:08 MST
I am closing this based upon the work performed for bug 77. There may be additional work to be performed in BullMPI, but I believe that Slurm currently provides all of the information it can for BullMPI to optimize network utilization.