Launching an MPI executable with srun is slower than with mpirun and results in many more sockets being used. When launching an 8000-core IMB benchmark with srun, one node creates more than 8000 sockets and the others have on average 80 sockets open. With mpirun, all nodes have on average only 40 sockets open. Moreover, the benchmark takes on average 25 seconds to start with srun instead of 15 seconds with mpirun.
What MPI implementation is being used (SLURM supports many) and what version of that MPI? Why is starting slower (and probably executing faster) a severity 2 problem?
Sorry about the severity level, I guess I was not paying attention. There was not much information in the report. I have asked for the version number and will report back as soon as I know.
If we are going to support bullxmpi then we will probably require a copy to install on our cluster or access to some bull system with the appropriate software. We have a small x86 cluster with InfiniBand in house. Do you have any suggestions on how to proceed?
Yes, I will get a copy of the mpi rpm they are running and hopefully their slurm.conf file and we can go from there.
Created attachment 97 [details] scontrol show config result
Nancy, Were you able to get a Bull MPI RPM for us to use or would it be possible to get access to a Bull test cluster? Moe
Hi Moe, I do have the rpms, but they are too big to add as an attachment. Can I try emailing them to you? Nancy
Sure, try sending by email.
I have sent an email to guillaume.papaure@ext.bull.net hoping that he can investigate this.
I have not been able to reproduce this so far. Does the srun process have all the open file descriptors, or the spawned task(s)? What is the srun --mpi option and the value of MpiDefault in slurm.conf? Could you run "lsof -p <pid>" on the process with all of the open file descriptors?
Hi, I'm pretty sure that this bug is much the same as http://bugs.schedmd.com/show_bug.cgi?id=78. At MPI_Init, bullxmpi has to do out-of-band communications:
- when launched with mpirun, an orted is the parent of all processes on a node. Processes get their information from this orted, and each orted gets its information from mpirun: this is the "routed" algorithm.
- when launched with srun, processes have to communicate directly with each other (srun is their parent): this is the "direct" algorithm. This is done through sockets, which is why --resv-ports is mandatory when using srun with BullxMPI.
The solution would be, in the srun case, for slurm to give us more information, especially about the network devices available on each node of the allocation. Guillaume
SLURM does not keep much network information except for the network that it uses for communications (typically ethernet, but no details about the infiniband). What can we do to make this work better? We could possibly develop a SLURM "switch" plugin for the system. The "switch/nrt" plugin for IBM systems has the slurmd daemon get network information when the daemon starts (device file names, addresses, etc.), transfers that information to the slurmctld daemon, which then includes the information in the job step startup message. We might do the same thing and include the details in environment variables for mpi to use.
Currently bullxmpi already retrieves Slurm information from the SLURM_* environment variables, so I think your idea is a good one for both slurm and bullxmpi. If you agree, I can propose a syntax for this future environment variable.
I agree, although I am not certain who will implement this or when.
Here is a proposed syntax for a 2-node allocation: SLURM_NODELIST_IFCONFIG=node55[eth0(inet)='60.2.62.6',eth0(inet6)='fe80::a00:38ff:fe37:79ca/64',ib0(inet)='60.64.2.6',ib0(inet6)='fe80::a00:3800:137:e705/64',ib0:0(inet)='160.64.2.6',lo(inet)='127.0.0.1',lo(inet6)='::1/128'];node56[eth0(inet)='60.2.62.7',eth0(inet6)='fe80::a00:38ff:fe35:72c0/64',ib0(inet)='60.64.2.7',ib0(inet6)='fe80::a00:3800:135:72c3/64',ib0:0(inet)='160.64.2.7',lo(inet)='127.0.0.1',lo(inet6)='::1/128'] I understand that the IBM switch plugin may have more information than this; if you think it could be useful for us, maybe you can extend the syntax.
The information provided on an IBM system is identical to this, but the information is made available with function calls to an IBM daemon rather than an environment variable. I am also concerned about the environment variable being very long and slow to parse for large systems. What do you think about new calls to the local slurmd to get this information?
You're right, there are scaling issues with environment variables. A single environment variable is limited to 128 kB; with my sample syntax, that limit is roughly reached on a 5000-node system. So, like you, I'll have to find time and someone to discuss the protocol with. Maybe we can open a new bug for this new feature, since it is already common to multiple bug reports?
*** This ticket has been marked as a duplicate of ticket 144 ***
I am re-opening this ticket based upon the telecon of 6 June. Let me explain how the switch/nrt plugin works. When the slurmd starts, it collects network information for its node: IP address, device type (ethernet, Infiniband, etc.), and switch window states and counts (IBM-specific information). This information is sent from the slurmd daemon to the slurmctld daemon as part of node registration. When a Slurm job step allocation occurs, it selects network resources for that job step and sends the information as part of the job step launch specification. The slurmd then allocates the appropriate network resources before task spawning and de-allocates those resources when the step ends. What we might do for MPI is:
1. Collect similar network information when the slurmd starts and send it to slurmstepd (similar to the switch/nrt logic today).
2. Include the network information as part of the job step launch specification (also similar to the switch/nrt logic today).
3. We don't need to actually allocate or de-allocate network resources for the job, but we want to make the information available to the job via some RPC. We want to make the information available directly from the slurmd rather than the slurmstepd for performance reasons (no bottleneck).
It would be pretty simple to add something for this: new function calls in the Slurm library (get the information, and free the returned memory), but that would need to be under the GPL license. We could instead add an scontrol command that returns the same string you describe; the scontrol output would not be under GPL, but it would be slightly more difficult to use than a structure. What do you think?
(In reply to Guillaume Papauré from comment #15)
> Here is a proposal syntax for a 2 nodes allocation:
> SLURM_NODELIST_IFCONFIG=node55[eth0(inet)='60.2.62.6',eth0(inet6)='fe80::a00:38ff:fe37:79ca/64',ib0(inet)='60.64.2.6',ib0(inet6)='fe80::a00:3800:137:e705/64',ib0:0(inet)='160.64.2.6',lo(inet)='127.0.0.1',lo(inet6)='::1/128'];
> node56[eth0(inet)='60.2.62.7',eth0(inet6)='fe80::a00:38ff:fe35:72c0/64',ib0(inet)='60.64.2.7',ib0(inet6)='fe80::a00:3800:135:72c3/64',ib0:0(inet)='160.64.2.7',lo(inet)='127.0.0.1',lo(inet6)='::1/128']
>
> I have understood that the IBM swith plugin may have more information that
> these ones, if you think they can be usefull for us maybe you can extend the
> syntax.
I have begun work on a new plugin called "switch/generic" to collect information of the type proposed by Guillaume above. I expect it to be complete in time for the next major release. I'm hoping that Guillaume or someone else at Bull can handle the PMI and MPI side of things if I can get the data out to the srun command. A sample of the data currently collected is shown below (as collected by getifaddrs(), filtering out loopbacks):
slurmd: switch/generic name=eth0 ip_version=IP_V4 address=192.168.1.51
slurmd: switch/generic name=eth1 ip_version=IP_V4 address=192.168.1.51
slurmd: switch/generic name=ib0 ip_version=IP_V4 address=10.0.0.51
slurmd: switch/generic name=eth0 ip_version=IP_V6 address=fe80::d267:e5ff:feea:6478
slurmd: switch/generic name=ib0 ip_version=IP_V6 address=fe80::202:c903:4f:dbd9
Data now reaching srun. Configuration has:
SwitchType=switch/generic
DebugFlags=switch
$ srun -N3 hostname
srun: switch_p_alloc_jobinfo() starting
srun: switch_p_unpack_jobinfo() starting
srun: node=smd1 name=eth0 family=IP_V4 addr=192.168.1.51
srun: node=smd1 name=eth1 family=IP_V4 addr=192.168.1.51
srun: node=smd1 name=ib0 family=IP_V4 addr=10.0.0.51
srun: node=smd1 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:feea:6478
srun: node=smd1 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:dbd9
srun: node=smd2 name=eth0 family=IP_V4 addr=192.168.1.52
srun: node=smd2 name=eth1 family=IP_V4 addr=192.168.1.52
srun: node=smd2 name=ib0 family=IP_V4 addr=10.0.0.52
srun: node=smd2 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:fee9:e703
srun: node=smd2 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:dc19
srun: node=smd3 name=eth0 family=IP_V4 addr=192.168.1.53
srun: node=smd3 name=eth1 family=IP_V4 addr=192.168.1.53
srun: node=smd3 name=ib0 family=IP_V4 addr=10.0.0.53
srun: node=smd3 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:feea:649c
srun: node=smd3 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:db65
srun: switch_p_pack_jobinfo() starting
srun: switch_p_pack_jobinfo() starting
srun: switch_p_pack_jobinfo() starting
smd1
smd2
smd3
Created attachment 593 [details] PMI2 attributes for network info (from switch/generic) Here is a proposed patch to PMI2 to add "job attributes" for network addresses (from switch/generic) and ports. This is needed for a modex-less initialisation in MPI using only local PMI2 queries.
Hi Piotr, I worked with the patch you provided. I modified the mpich2 code to invoke PMI2_Info_GetJobAttr() using the PMI_netinfo_of_task_$taskid key, and the code seems to work fine. However, I have the following questions/observations.
1) Do we need all the ifconfig information to be propagated to slurmctld and srun in the first place? It seems to me we only need the local information in every stepd, meaning: (prometeo,(eth0,IP_V4,192.168.1.78),(eth0,IP_V6,fe80::8e89:a5ff:fec6:dd4e)), which would be returned by PMI2_Info_GetJobAttr().
2) If 1) is true, then in pmi2.c:_handle_info_getjobattr() we can implement the system call necessary to get the ifconfig information and send it back to the MPI library. This would simplify the implementation and avoid calling slurmctld in info.c:job_attr_get_netinfo()/slurm_job_step_layout_get(), as you do today to get the nodeid, which is going to be a scalability problem for big parallel jobs as you can imagine.
3) If we do 2), then MPI calls PMI2_Info_GetJobAttr(), gets the (key, value) information, and puts and fences them just like it does today, so the functionality is unchanged.
4) Could you share your PMI2 test code with us? It would be great to have a test program which exercises all PMI2 calls. We could then add it to our battery of unit tests.
David
The idea is to give up using the switch/generic plugin and implement the network information collection inside the pmi2.c module. Keep the framework of your patch, and inside the job_attr_get_netinfo() call, get the ifconfig information. David
Hi Piotr, there is another issue with the patch. When the JOB_ATTR_RESV_PORTS key is specified, you look for the job's reserved ports.
1) These ports may not be specified by the user on the srun command line. If they are not, the code will core dump: snprintf(attr, PMI2_MAX_VALLEN, "%s", stepmsg->job_steps[0].resv_ports);
2) There is probably no need to call the slurm API to get these values since they are already in the job's environment: SLURM_STEP_RESV_PORTS.
3) The Slurm documentation (http://slurm.schedmd.com/mpi_guide.html) states that MpiParams=ports is only needed for OpenMPI. I am working with MPICH2, which is the only version I have that supports PMI2, so resv_ports is always NULL.
David
Hi David, Thanks a lot for the feedback! To respond to the first message, about the address information:
1) Concerning the ifconfig information propagation, the reason we want it is that we need not only info which is local to the node, but also global info. Our purpose is to avoid exchanging network information globally at the time MPI initializes itself. Instead, we want to have this information available locally inside slurm in advance. We want to replace global calls (using KVS put/get) with local calls to Job Attributes. Maybe this is not possible in Slurm, or I don't understand something…?
2) slurm_job_step_layout_get() is indeed a performance problem; I did not realise that this function does remote calls. Is there a better way of doing task_id -> node_id?
4) I attached a test file that does raw PMI2 API calls and compares job attribute results with the expected results obtained with the key-value store: pmi2_test_addresses.c
For the resv-ports info:
1') Thanks for spotting this.
2') I added this info because in bullx MPI / open MPI we query the environment, and we go through 2 different routes depending on whether we use Slurm environment variables or pmi2.
3') When using PMI2, one can use the Key-Value Store to exchange addresses+ports: this is what already exists in MPICH2, OpenMPI and bullx MPI. In the case where we get addresses through Job Attributes, we still don't know on which ports the remote MPI tasks are listening, so we use resv-ports to fix them (OpenMPI without PMI2 already does this to avoid exchanging this info).
Created attachment 603 [details] test case for job attributes
Created attachment 606 [details] pmi2 server code
Hi, I am appending the proposed diffs for the pmi2 server code and also a simple pmi2 program to test the calls. As discussed, all PMI2 calls are now 'local', meaning they go to the slurmstepd, except PMI2_Fence(), which goes to srun; from srun the collected data are redistributed to each slurmstepd. This communication uses the fan-out algorithm on the way to srun and from srun to the stepds. Let me know how it goes. David
Created attachment 607 [details] pmi2 api test program
Hi, Thanks for your explanation. The scheme you propose is in fact what we are currently using with PMI2 (except the ifconfig step is in MPI, not in Slurm). I'm appending a simpler patch using switch/generic and making only local calls (an attribute gives us the mapping), and a program to test it by comparison with distant calls (put/fence/get).
Created attachment 608 [details] pmi2 patch v2 using switch/generic node addresses
Created attachment 609 [details] test case for patch pmi2+switch/generic v2
Done in commit 9d5e0753276277674. Close. David
Hi David, I have a few comments about the current PMI2 network attribute:
- you still have "PMI_netinfo_of_task_$task", whereas you don't seem to need the task number.
- there seems to be an ifname appended to the address for ipv6 addresses, of the form "fe80::a00:38ff:fe37:88ae%eth0".
I propose to use switch/generic in a more straightforward way, using node instead of task addresses with "PMI_netinfo_of_node_$node". This keeps things local, and we use the mapping on the MPI side.
Created attachment 617 [details] use addresses of nodes instead of tasks
Piotr, as discussed, we don't want to use the plugin as it will send data multiple times across the system: from slurmd -> slurmctld -> srun, then to slurmd again with the job allocation, and then from slurmstepd to MPI_Init and back to srun when pmi2 does its fence. We can get the very same information locally and avoid this data-passing overhead.
- I will remove the task id if you don't need it.
- The ifname is appended by the getifaddrs() system call; you can check the manpage. I tested it on several distributions, CentOS, RedHat and Ubuntu, and they all return it:
(zeus,(p2p1,IP_V4,192.168.1.181),(p2p1,IP_V6,fe80::a00:27ff:fe28:66d4%p2p1))
(smd1,(eth0,IP_V4,192.168.1.51),(eth1,IP_V4,192.168.1.51),(ib0,IP_V4,10.0.0.51),(eth0,IP_V6,fe80::d267:e5ff:feea:6478%eth0),(ib0,IP_V6,fe80::202:c903:4f:dbd9%ib0))
I think the name is part of the address.
David
To expand upon David's comment, use of the switch/generic plugin might make sense if the srun command stored the data directly as it came in from slurmctld and made it available to the PMI2 library as needed, which is what I had in mind. As I understand it, the original patch resulted in the full job's network configuration being moved from the slurmctld, to srun, to slurmd, to slurmstepd; then the PMI2 library would read its node's network information from the data structure and send it back up to slurmd and srun; then the data would all be moved back down to the application again after the fence, which appears to be a lot of redundant data movement. In the case of the switch/nrt plugin, there is job-allocation-specific information that the application cannot gather locally, because it is based upon scheduling decisions made by the slurmctld daemon (allocated switch windows).
I have removed the taskid from PMI_netinfo_of_task, note that I have removed the trailing underscore. Commit 6398cc167bf1322f14ebcffbc0411a5bf84a6f52. I have updated the unit test case program as well. David
Hi David, Should I understand from your comment that you are in fact opposed to any usage of the switch/generic plugin because it incurs too much communication at job startup time? Or rather to its usage in the patch in attachment 617 [details]? That patch only parses node_id and calls switch_g_get_jobinfo(), so I believe it does not imply any remote calls, nor any pmi fence, to get the information. Piotr
Hi Piotr, yes, we oppose the usage of the switch/generic plugin because it creates communication overhead at job startup and also at system startup. The information we need is already in the system without using the switch/generic plugin. David
(In reply to David Bigagli from comment #41)
> Hi Piotr,
> yes we oppose the usage of switch/generic plugin because it creates
> communication overhead at job startup and also at system startup.
> The information we need are in the system already without using the
> switch/generic plugin.
>
> David
The original PMI2 implementation resulted in a lot of unnecessary data movement. See comment 38 for details.