This bug is a replacement for bugs 77 and 78, both of which identify problems in how MPI jobs launched with srun use network resources. The problem is that MPI lacks network details when spawned. One way to address this is to develop a new SLURM switch plugin, similar in functionality to switch/nrt (for IBM systems), but supporting a generic network interface. A proposed design is as follows:

1. When the slurmd daemon starts, it gathers network interface information (device names, types, and addresses) and sends it to slurmctld with its registration information.
2. The slurmctld maintains this network information for every node in the system.
3. When a job step is allocated resources, the network information for the allocated nodes is sent to the slurmd daemons on the compute nodes as part of the job credential.
4. Add a new SLURM API so that MPI can get the network information from the local slurmd daemon. This would return a data structure to simplify use, probably something like this:
   Node count
   For each node:
     Node name
     Network count
     For each network:
       Device name
       Device type (IB, Eth, etc.)
       Address type (IPv4 or IPv6, a flag)
       Network address
   We would also add a function to free the data structure.
5. Modify MPI to get this network information from SLURM.

Items 1, 2, and 3 can be done easily by building a variation of the existing switch/nrt plugin. Item 4 requires a new API, but should also be simple to add. Item 5 would require changes to Bull MPI. I would expect that to also be simple, but Guillaume Papaure can comment on that.
*** Ticket 77 has been marked as a duplicate of this ticket. ***
*** Ticket 78 has been marked as a duplicate of this ticket. ***
Thank you Moe. That is the problem that we discussed in Barcelona. I will have to ask my colleagues working more specifically on that and try to reproduce the problem. I was not sure that it was specifically a problem with MPI, but it may be the case. Looking at the openmpi website, another way to use openmpi directly with SLURM is through the PMI library. This could be a good workaround, or help to identify this as a specific "srun --resv-ports + openMPI" issue. Nevertheless, it requires compiling openmpi (or bullxmpi) with "--with-pmi=...pmi_slurm...". You could perhaps propose that to Guillaume in the meantime. I would definitely be in favor of that, but on a long-term basis I am not sure that Bull would be satisfied with it, as bullxmpi is not GPL and linking to SLURM's libpmi requires being GPL licensed. The interesting link: http://www.open-mpi.org/community/lists/users/2012/01/18137.php HTH Matthieu
Matthieu raises a good point: linking with a SLURM library would force the GPL license on bullxmpi. Other options include:

1. More than one environment variable.
2. The slurmd or slurmstepd opens a named socket from which bullxmpi can read the data.
3. The slurmd or slurmstepd writes the information to a text file from which bullxmpi can read the data; the file would be deleted upon job termination.
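For option 3, the consumer side could be as simple as a line parser. The record layout below (node, device, address family, address) is purely hypothetical; the real format and file location would be decided during implementation.

```c
#include <stdio.h>

/* Hypothetical one-record-per-line format for option 3:
 *   <node> <device> <family> <address>
 * e.g. "smd1 ib0 ipv4 10.0.0.51" */
typedef struct {
	char node[64];
	char device[32];
	char family[8];
	char address[64];
} net_record_t;

/* Parse one line; returns 1 on success, 0 on malformed input. */
int parse_net_record(const char *line, net_record_t *rec)
{
	return sscanf(line, "%63s %31s %7s %63s",
		      rec->node, rec->device, rec->family, rec->address) == 4;
}
```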
I reproduced the problem, and it is as described in the bug report; that is to say, it is only a problem with srun --resv-ports execution of openmpi/bullxmpi. It seems that when launched without orted, the communication between processes is made directly (probably due to the OMPI_MCA_routed=direct env variable forced in orte/mca/ess/slurmd/ess_slurmd_module.c, as you noticed in your comment on bug 78). As a result, I am no longer sure that using SLURM's libpmi would really help, but it should be tried to see the behavior. Indeed, libpmi would still involve multiple autonomous processes per node not routing their messages through the orted daemon. I am wondering why this choice of direct message routing was made, as it is clearly an issue for large-scale launches. I do not have enough knowledge of openmpi to know whether the design and the logic could mimic the behavior of the orted daemon by first routing the messages of the processes of the different nodes locally through the local rank 0 of every node. Adding more network info as proposed could help to cope with bug 78 but not with bug 77, which seems induced by this routed=direct design limitation alone (the origin of bug 78 is the gethostbyname used to query the endpoint for oob, as described by Guillaume). Regards, Matthieu
The port reservation logic was added to SLURM and OpenMPI a couple of years ago to improve OpenMPI initialization time. SLURM is configured with a set of ports and allocates those ports to the job step. SLURM does not open any files; it only manages a set of numbers representing ports and allocates those numbers (ports) to job steps for their use. Each job step will be allocated a unique set of ports (no two steps will be allocated the same port on the same node at the same time). The number of ports reserved for each job step will be the maximum number of tasks (ranks) to be launched for the job step on any single node. This really seems like an OpenMPI problem, but I would be happy to work with Guillaume on any SLURM changes that would address it. Moe
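The bookkeeping described above can be modeled with a toy sketch: ports are just numbers drawn from a configured range, handed out in disjoint sets per step and returned when the step ends. The range, names, and sizes below are illustrative only, not Slurm's actual implementation.

```c
/* Toy model of port-reservation bookkeeping: hand out numbers from a
 * configured range so no two concurrent steps share a port. */
#define PORT_BASE  12000   /* assumed configured range start */
#define PORT_COUNT 1000    /* assumed range size */

static unsigned char port_in_use[PORT_COUNT];  /* 1 = reserved */

/* Reserve nports ports for a step (nports would be the maximum task count
 * on any single node). Returns nports on success, filling ports[], or 0 if
 * the range is exhausted (rolling back any partial reservation). */
int reserve_step_ports(int nports, int *ports)
{
	int found = 0;
	for (int i = 0; i < PORT_COUNT && found < nports; i++) {
		if (!port_in_use[i]) {
			port_in_use[i] = 1;
			ports[found++] = PORT_BASE + i;
		}
	}
	if (found < nports) {
		for (int j = 0; j < found; j++)
			port_in_use[ports[j] - PORT_BASE] = 0;
		return 0;
	}
	return found;
}

/* Release a step's ports when the step completes. */
void release_step_ports(int nports, const int *ports)
{
	for (int i = 0; i < nports; i++)
		port_in_use[ports[i] - PORT_BASE] = 0;
}
```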
Created attachment 217 [details] pmi1 and pmi2 mpich2-1.5 working with Slurm scalability report
As we discussed with Danny Auble, Bull is looking into integrating PMI support in bullxMPI. Slurm already provides a full pmi1 library. Our tests have shown that pmi1 has some scalability issues on large clusters. We have also seen that these issues seem to be solved with pmi2 as implemented in mpich2 running with Slurm (see attached curves). Currently Slurm provides a pmi2 plugin with an API that is not at the same level as pmi1: one half is already in the Slurm plugin, and the second half, providing the final pmi2 API, is in mpich2. Unfortunately the current API provided by Slurm is not sufficient for us to work with pmi2. Our request last week was to integrate the second half of pmi2 into Slurm in order to implement pmi2 support in Bullx/OpenMPI. Do you still agree with this request? The mpich team has documented the pmi2 API: http://wiki.mpich.org/mpich/index.php/PMI_v2_API The mpich pmi2 implementation is available in mpich2-1.5 under the directory src/pmi/pmi2. It seems that the second half of pmi2 is a network client; maybe this could also solve the GPL license problem by delivering this part under the LGPL license? Guillaume
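The core of what an MPI library needs from the PMI2 key-value space is a put/fence/get exchange for wiring up endpoints. The sketch below is NOT the real PMI2 client (see the mpich wiki linked above for the actual API); it is a single-process, in-memory stand-in, with all kvs_-prefixed names invented here, just to illustrate the exchange pattern.

```c
#include <string.h>
#include <stdio.h>

/* Minimal in-memory stand-in for a PMI-style key-value space. */
#define KVS_MAX 64
static struct { char key[64]; char val[128]; } kvs[KVS_MAX];
static int kvs_n;

/* Publish this rank's data (e.g. its endpoint address). */
int kvs_put(const char *key, const char *val)
{
	if (kvs_n >= KVS_MAX)
		return -1;
	snprintf(kvs[kvs_n].key, sizeof kvs[kvs_n].key, "%s", key);
	snprintf(kvs[kvs_n].val, sizeof kvs[kvs_n].val, "%s", val);
	kvs_n++;
	return 0;
}

/* In the real protocol, a fence is a collective barrier after which all
 * prior puts are globally visible; in this one-process stub it is a no-op. */
void kvs_fence(void) {}

/* Look up another rank's published data. */
const char *kvs_get(const char *key)
{
	for (int i = 0; i < kvs_n; i++)
		if (strcmp(kvs[i].key, key) == 0)
			return kvs[i].val;
	return NULL;
}
```

In use, each rank would put its own endpoint (e.g. "rank0-addr" mapping to an address/port pair), fence, then get its peers' endpoints to open connections.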
Since the PMI2 client API links with the Slurm library and uses Slurm functions, it is not possible to simply change its license to LGPL. The client functions would need to be re-written without re-using any GPL functions. I would guess this could be completed with less than one week of work. I am not promising to do the work, but it does seem fairly important. I did not do the original development work for PMI2, but releasing this library under the GPL does not even make sense to me.
My note about licensing in comment #9 (26 March) is incorrect. In Slurm version 2.6, Slurm's mpi/pmi2 plugin sets an environment variable, PMI_FD. All of the application's PMI interactions occur over this socket using a well-defined protocol. The application does not need to link with any Slurm libraries. We do not believe Slurm's GPL license would have any impact upon the application's license.
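On the application side, picking up that socket amounts to reading PMI_FD as an already-open file descriptor number and then speaking the PMI2 wire protocol over it. A minimal sketch of just the lookup step (the helper name is ours, not a Slurm or PMI symbol):

```c
#include <stdlib.h>

/* Locate the pmi2 socket: the Slurm mpi/pmi2 plugin exports PMI_FD holding
 * the number of an already-open descriptor, so the application links no
 * Slurm code. Returns the fd, or -1 if not launched under the plugin. */
int get_pmi_fd(void)
{
	const char *s = getenv("PMI_FD");
	char *end;
	long fd;

	if (!s || !*s)
		return -1;
	fd = strtol(s, &end, 10);
	if (*end != '\0' || fd < 0)
		return -1;
	return (int)fd;
}
```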
Here is a bit more information based upon our telecon this week. The logic in src/plugins/switch/nrt can be a helpful example of how to manage a job step's network details. Many of the data structures are based upon the IBM library and header file, but this is the basic mode of operation:

1. When the slurmd starts, it gathers network information including adapter name (eth0, mlx4_0, etc.), type (infiniband, ethernet), IP address, and some IBM-specific information. This information is then sent to the slurmctld, which maintains a table of the network details for every node.
2. When a user submits a job step, he can specify the network requirements (e.g. use 2 infiniband adapters).
3. The slurmctld satisfies the job step's network requirements (based upon its table, which is maintained in the switch plugin) and sends that information in the job step credential.
4. The slurmd on each node allocates the appropriate network resources to the job step at startup time.

On non-IBM systems, we would need to do things slightly differently:
For #1: collect network information using generic tools.
For #4: make the information available via PMI job attributes. This might be set by the srun command or as part of the job launch logic rather than by each slurmd, since the information already passes through all of the slurmd daemons involved in launching the job step.

There would need to be some more sophisticated logic for jobs that grow in size, but that covers the basics.
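For the "generic tools" in step #1, one portable option is getifaddrs(3). A sketch of what the slurmd side of a generic switch plugin might gather (the function name and output format are ours, chosen to resemble the srun log lines shown later in this ticket):

```c
#include <ifaddrs.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>

/* Enumerate local adapters with getifaddrs(3), printing one line per
 * IPv4/IPv6 address. Returns the number of addresses found, or -1. */
int list_interfaces(void)
{
	struct ifaddrs *ifa0, *ifa;
	char buf[INET6_ADDRSTRLEN];
	int count = 0;

	if (getifaddrs(&ifa0) != 0)
		return -1;
	for (ifa = ifa0; ifa; ifa = ifa->ifa_next) {
		int fam;
		if (!ifa->ifa_addr)
			continue;
		fam = ifa->ifa_addr->sa_family;
		if (fam == AF_INET) {
			struct sockaddr_in *sin =
				(struct sockaddr_in *)ifa->ifa_addr;
			inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof buf);
		} else if (fam == AF_INET6) {
			struct sockaddr_in6 *sin6 =
				(struct sockaddr_in6 *)ifa->ifa_addr;
			inet_ntop(AF_INET6, &sin6->sin6_addr, buf, sizeof buf);
		} else {
			continue;	/* skip AF_PACKET and friends */
		}
		printf("name=%s family=%s addr=%s\n", ifa->ifa_name,
		       fam == AF_INET ? "IP_V4" : "IP_V6", buf);
		count++;
	}
	freeifaddrs(ifa0);
	return count;
}
```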
This comment is duplicated from bug 77. In the new code (version 13.12, in the master branch) we are getting network information to srun. The idea would be to make the information available using the PMI2 interface. Data is now reaching srun. Configuration has:

SwitchType=switch/generic
DebugFlags=switch

# This is what generates all of the logs shown below
$ srun -N3 hostname
srun: switch_p_alloc_jobinfo() starting
srun: switch_p_unpack_jobinfo() starting
srun: node=smd1 name=eth0 family=IP_V4 addr=192.168.1.51
srun: node=smd1 name=eth1 family=IP_V4 addr=192.168.1.51
srun: node=smd1 name=ib0 family=IP_V4 addr=10.0.0.51
srun: node=smd1 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:feea:6478
srun: node=smd1 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:dbd9
srun: node=smd2 name=eth0 family=IP_V4 addr=192.168.1.52
srun: node=smd2 name=eth1 family=IP_V4 addr=192.168.1.52
srun: node=smd2 name=ib0 family=IP_V4 addr=10.0.0.52
srun: node=smd2 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:fee9:e703
srun: node=smd2 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:dc19
srun: node=smd3 name=eth0 family=IP_V4 addr=192.168.1.53
srun: node=smd3 name=eth1 family=IP_V4 addr=192.168.1.53
srun: node=smd3 name=ib0 family=IP_V4 addr=10.0.0.53
srun: node=smd3 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:feea:649c
srun: node=smd3 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:db65
srun: switch_p_pack_jobinfo() starting
srun: switch_p_pack_jobinfo() starting
srun: switch_p_pack_jobinfo() starting
smd1
smd2
smd3
I had noted in the meeting minutes that Yiannis would provide you with the code changes for this bug. He is still testing and completing this code. He plans to send it to you by the end of next week (e.g. Nov 22).
I am closing this based upon the work performed for bug 77. There may be additional work to be performed in BullMPI, but I believe that Slurm currently provides all of the information that it can for BullMPI to optimize network utilization.