Launching an MPI executable with srun is slower than with mpirun and results in many more sockets being used. When launching an 8000-core IMB benchmark with srun, one node creates more than 8000 sockets and the others have on average 80 sockets open. With mpirun, all nodes have on average only 40 sockets open. Moreover, the benchmark takes on average 25 seconds to start with srun instead of 15 seconds with mpirun.
What MPI implementation is being used (SLURM supports many) and what version of that MPI? Why is starting slower (and probably executing faster) a severity 2 problem?
Sorry about the severity level, I guess I was not paying attention. There was not much information in the report. I have asked for the version number and will report back as soon as I know.
If we are going to support bullxmpi then we will probably require a copy to install on our cluster or access to some bull system with the appropriate software. We have a small x86 cluster with InfiniBand in house. Do you have any suggestions on how to proceed?
Yes, I will get a copy of the mpi rpm they are running and hopefully their slurm.conf file and we can go from there.
Created attachment 97 [details] scontrol show config result
Nancy, Were you able to get a Bull MPI RPM for us to use or would it be possible to get access to a Bull test cluster? Moe
Hi Moe, I do have the rpms, but they are too big to add as an attachment. Can I try emailing them to you? Nancy
Sure, try sending by email.
I have sent an email to guillaume.papaure@ext.bull.net hoping that he can investigate this.
I have not been able to reproduce this so far. Does the srun process have all the open file descriptors, or the spawned task(s)? What is the srun --mpi option and the value of MpiDefault in slurm.conf? Could you run "lsof -p <pid>" on the process with all of the open file descriptors?
Hi, I'm pretty sure that this bug is much the same as http://bugs.schedmd.com/show_bug.cgi?id=78. At MPI_Init, bullxmpi has to do out-of-band communications:
- when launched with mpirun, an orted is the parent of all processes on a node. Processes get their information from this orted, and each orted gets its information from mpirun: this is the "routed" algorithm.
- when launched with srun, processes have to communicate directly with each other (srun is their parent): this is the "direct" algorithm. This is done through sockets, which is why --resv-ports is mandatory when using srun with BullxMPI.
The solution would be, in the srun case, for slurm to give us more information, especially about the network devices available on each node of the allocation. Guillaume
SLURM does not keep much network information except for the network that it uses for communications (typically ethernet, but no details about the infiniband). What can we do to make this work better? We could possibly develop a SLURM "switch" plugin for the system. The "switch/nrt" plugin for IBM systems has the slurmd daemon get network information when the daemon starts (device file names, addresses, etc.), transfers that information to the slurmctld daemon, which then includes the information in the job step startup message. We might do the same thing and include the details in environment variables for mpi to use.
Currently bullxmpi already retrieves Slurm information from the SLURM_* environment variables, so I think your idea is a good one for both slurm and bullxmpi. If you agree, I can propose a syntax for this future environment variable.
I agree, although I am not certain who will implement this or when.
Here is a proposed syntax for a 2-node allocation: SLURM_NODELIST_IFCONFIG=node55[eth0(inet)='60.2.62.6',eth0(inet6)='fe80::a00:38ff:fe37:79ca/64',ib0(inet)='60.64.2.6',ib0(inet6)='fe80::a00:3800:137:e705/64',ib0:0(inet)='160.64.2.6',lo(inet)='127.0.0.1',lo(inet6)='::1/128'];node56[eth0(inet)='60.2.62.7',eth0(inet6)='fe80::a00:38ff:fe35:72c0/64',ib0(inet)='60.64.2.7',ib0(inet6)='fe80::a00:3800:135:72c3/64',ib0:0(inet)='160.64.2.7',lo(inet)='127.0.0.1',lo(inet6)='::1/128'] I understand that the IBM switch plugin may have more information than this; if you think it could be useful for us, maybe you can extend the syntax.
The information provided on an IBM system is identical to this, but the information is made available with function calls to an IBM daemon rather than an environment variable. I am also concerned about the environment variable being very long and slow to parse for large systems. What do you think about new calls to the local slurmd to get this information?
You're right, there are scaling issues with environment variables. A single environment variable is limited to 128 kB; with my sample syntax, that limit is roughly reached on a 5000-node system. So, like you, I'll have to find time and someone to discuss the protocol with. Maybe we can open a new bug for this new feature, since it is already common to multiple bug reports?
*** This ticket has been marked as a duplicate of ticket 144 ***
I am re-opening this ticket based upon the telecon of 6 June. Let me explain how the switch/nrt plugin works. When the slurmd starts, it collects network information for its node: IP address, device type (ethernet, Infiniband, etc.), and switch window states and counts (IBM-specific information). This information is sent from the slurmd daemon to the slurmctld daemon as part of node registration. When a Slurm job step allocation occurs, it selects network resources for that job step and sends the information as part of the job step launch specification. The slurmd then allocates the appropriate network resources before task spawning and de-allocates those resources when the step ends. What we might do for MPI is:
1. Collect similar network information when the slurmd starts and send it to slurmstepd (similar to the switch/nrt logic today).
2. Include the network information as part of the job step launch specification (also similar to the switch/nrt logic today).
3. We don't need to actually allocate or de-allocate network resources for the job, but we want to make the information available to the job via some RPC. We want to make the information available directly from the slurmd rather than the slurmstepd for performance reasons (no bottleneck).
It would be pretty simple to add something for this: new function calls in the Slurm library (get the information, and free the returned memory), but that would need to be under the GPL license. We could instead add an scontrol command that returns the same string you describe; the scontrol output would not be under GPL, but it would be slightly more difficult to use than a structure. What do you think?
(In reply to Guillaume Papauré from comment #15)
> Here is a proposal syntax for a 2 nodes allocation:
> SLURM_NODELIST_IFCONFIG=node55[eth0(inet)='60.2.62.6',eth0(inet6)='fe80::a00:38ff:fe37:79ca/64',ib0(inet)='60.64.2.6',ib0(inet6)='fe80::a00:3800:137:e705/64',ib0:0(inet)='160.64.2.6',lo(inet)='127.0.0.1',lo(inet6)='::1/128'];
> node56[eth0(inet)='60.2.62.7',eth0(inet6)='fe80::a00:38ff:fe35:72c0/64',ib0(inet)='60.64.2.7',ib0(inet6)='fe80::a00:3800:135:72c3/64',ib0:0(inet)='160.64.2.7',lo(inet)='127.0.0.1',lo(inet6)='::1/128']
>
> I have understood that the IBM swith plugin may have more information that
> these ones, if you think they can be usefull for us maybe you can extend the
> syntax.
I have begun work on a new plugin called "switch/generic" to collect information of the type proposed by Guillaume above. I expect it to be complete in time for the next major release. I'm hoping that Guillaume or someone else at Bull can handle the PMI and MPI side of things if I can get the data out to the srun command. A sample of the data currently collected is shown below (as collected by getifaddrs(), filtering out loopbacks):
slurmd: switch/generic name=eth0 ip_version=IP_V4 address=192.168.1.51
slurmd: switch/generic name=eth1 ip_version=IP_V4 address=192.168.1.51
slurmd: switch/generic name=ib0 ip_version=IP_V4 address=10.0.0.51
slurmd: switch/generic name=eth0 ip_version=IP_V6 address=fe80::d267:e5ff:feea:6478
slurmd: switch/generic name=ib0 ip_version=IP_V6 address=fe80::202:c903:4f:dbd9
Data now reaching srun. Configuration has:
SwitchType=switch/generic
DebugFlags=switch
$ srun -N3 hostname
srun: switch_p_alloc_jobinfo() starting
srun: switch_p_unpack_jobinfo() starting
srun: node=smd1 name=eth0 family=IP_V4 addr=192.168.1.51
srun: node=smd1 name=eth1 family=IP_V4 addr=192.168.1.51
srun: node=smd1 name=ib0 family=IP_V4 addr=10.0.0.51
srun: node=smd1 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:feea:6478
srun: node=smd1 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:dbd9
srun: node=smd2 name=eth0 family=IP_V4 addr=192.168.1.52
srun: node=smd2 name=eth1 family=IP_V4 addr=192.168.1.52
srun: node=smd2 name=ib0 family=IP_V4 addr=10.0.0.52
srun: node=smd2 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:fee9:e703
srun: node=smd2 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:dc19
srun: node=smd3 name=eth0 family=IP_V4 addr=192.168.1.53
srun: node=smd3 name=eth1 family=IP_V4 addr=192.168.1.53
srun: node=smd3 name=ib0 family=IP_V4 addr=10.0.0.53
srun: node=smd3 name=eth0 family=IP_V6 addr=fe80::d267:e5ff:feea:649c
srun: node=smd3 name=ib0 family=IP_V6 addr=fe80::202:c903:4f:db65
srun: switch_p_pack_jobinfo() starting
srun: switch_p_pack_jobinfo() starting
srun: switch_p_pack_jobinfo() starting
smd1
smd2
smd3
Created attachment 593 [details] PMI2 attributes for network info (from switch/generic) Here is a proposed patch to PMI2 to add "job attributes" for network addresses (from switch/generic) and ports. This is needed for a modex-less initialisation in MPI using only local PMI2 queries.
Hi Piotr, I worked with the patch you provided. I modified the mpich2 code to invoke PMI2_Info_GetJobAttr() using the PMI_netinfo_of_task_$taskid key, and the code seems to work fine. However, I have the following questions/observations.
1) Do we need all the ifconfig information to be propagated to slurmctld and srun in the first place? It seems to me we only need the local information in every stepd, meaning: (prometeo,(eth0,IP_V4,192.168.1.78),(eth0,IP_V6,fe80::8e89:a5ff:fec6:dd4e)), which would be returned by PMI2_Info_GetJobAttr().
2) If 1) is true, then in pmi2.c:_handle_info_getjobattr() we can implement the system call necessary to get the ifconfig information and send it back to the MPI library. This would simplify the implementation and avoid calling slurmctld in info.c:job_attr_get_netinfo()/slurm_job_step_layout_get(), as you do today to get the nodeid, which is going to be a scalability problem for big parallel jobs as you can imagine.
3) If we do 2), then MPI calls PMI2_Info_GetJobAttr(), gets the (key, value) information, and puts and fences them just like it does today, so the functionality is unchanged.
4) Could you share your PMI2 test code with us? It would be great to have a test program which exercises all PMI2 calls. We could then add it to our battery of unit tests.
David
The idea is to give up using the switch/generic plugin and implement the network information collection inside the pmi2.c module. Keep the framework of your patch, and inside the job_attr_get_netinfo() call, get the ifconfig information. David
Hi Piotr, there is another issue with the patch. When the JOB_ATTR_RESV_PORTS key is specified, you look for the job's reserved ports.
1) These ports may not be specified by the user on the srun command line. If they are not, the code will core dump: snprintf(attr, PMI2_MAX_VALLEN, "%s", stepmsg->job_steps[0].resv_ports);
2) There is probably no need to call the slurm API to get these values since they are already in the job's environment: SLURM_STEP_RESV_PORTS.
3) The Slurm documentation (http://slurm.schedmd.com/mpi_guide.html) states that MpiParams=ports is only needed for OpenMPI. I am working with MPICH2, which is the only version I have that supports PMI2, so resv_ports is always NULL.
David
Hi David, Thanks a lot for the feedback! To respond to the first message, about the address information:
1) Concerning the ifconfig information propagation, the reason we want it is that we need not only info which is local to the node, but also global info. Our purpose is to avoid exchanging network information globally at the time MPI initializes itself. Instead, we want to have this information available locally inside slurm in advance. We want to replace global calls (using KVS put/get) with local calls to Job Attributes. Maybe this is not possible in Slurm, or I don't understand something…?
2) slurm_job_step_layout_get() is indeed a performance problem; I did not realise that this function does remote calls. Is there a better way of doing task_id -> node_id?
4) I attached a test file that does raw PMI2 API calls and compares job attribute results with the expected results obtained with the key-value store: pmi2_test_addresses.c
For the resv-ports info:
1') Thanks for spotting this.
2') I added this info because in bullx MPI / open MPI we query the environment, and we go through 2 different routes depending on whether we use Slurm environment variables or pmi2.
3') When using PMI2, one can use the Key-Value Store to exchange addresses+ports: this is what already exists in MPICH2, OpenMPI and bullx MPI. In the case where we get addresses through Job Attributes, we still don't know on which ports the remote MPI tasks are listening, so we use resv-ports to fix them (OpenMPI without PMI2 already does this to avoid exchanging this info).
Created attachment 603 [details] test case for job attributes
Created attachment 606 [details] pmi2 server code
Hi, I am appending the proposed diffs for the pmi2 server code and also a simple pmi2 program to test the calls. As discussed, all PMI2 calls are now 'local', meaning they go to the slurmstepd, except PMI2_Fence(), which goes to srun; from srun the collected data are redistributed to each slurmstepd. This communication uses the fan-out algorithm on the way to srun and from srun to the stepds. Let me know how it goes. David
Created attachment 607 [details] pmi2 api test program
Hi, Thanks for your explanation. The scheme you propose is in fact what we are currently using with PMI2 (except the ifconfig step is in MPI, not in Slurm). I'm appending a simpler patch using switch/generic and making only local calls (an attribute gives us the mapping), and a program to test it by comparison with distant calls (put/fence/get).
Created attachment 608 [details] pmi2 patch v2 using switch/generic node addresses
Created attachment 609 [details] test case for patch pmi2+switch/generic v2
Done in commit 9d5e0753276277674. Close. David
Hi David, I have a few comments about the current PMI2 network attribute:
- you still have "PMI_netinfo_of_task_$task", whereas you don't seem to need the task number.
- there seems to be an ifname appended to the address for ipv6 addresses, of the form "fe80::a00:38ff:fe37:88ae%eth0".
I propose to use switch/generic in a more straightforward way, using node instead of task addresses with "PMI_netinfo_of_node_$node". This keeps things local, and we use the mapping on the MPI side.
Created attachment 617 [details] use addresses of nodes instead of tasks
Piotr, as discussed, we don't want to use the plugin as it will send data multiple times across the system: from slurmd -> slurmctld -> srun, then to slurmd again with the job allocation, and then from slurmstepd to MPI_Init and back to srun when pmi2 does its fence. We can get the very same information locally and avoid this data-passing overhead.
- I will remove the task id if you don't need it.
- The ifname is appended by the getifaddrs() system call; you can check the manpage. I tested it on several distributions, CentOS, RedHat and Ubuntu, and they all return it:
(zeus,(p2p1,IP_V4,192.168.1.181),(p2p1,IP_V6,fe80::a00:27ff:fe28:66d4%p2p1))
(smd1,(eth0,IP_V4,192.168.1.51),(eth1,IP_V4,192.168.1.51),(ib0,IP_V4,10.0.0.51),(eth0,IP_V6,fe80::d267:e5ff:feea:6478%eth0),(ib0,IP_V6,fe80::202:c903:4f:dbd9%ib0))
I think the name is part of the address.
David
To expand upon David's comment, use of the switch/generic plugin might make sense if the srun command stored the data directly as it came in from slurmctld and made it available to the PMI2 library as needed, which is what I had in mind. As I understand it, the original patch resulted in the full job's network configuration being moved from the slurmctld, to srun, to slurmd, to slurmstepd; then the PMI2 library would read its node's network information from the data structure and send it back up to slurmd and srun; then the data would all be moved back down to the application again after the fence, which appears to be a lot of redundant data movement. In the case of the switch/nrt plugin, there is job-allocation-specific information that the application cannot gather locally, because it is based upon scheduling decisions made by the slurmctld daemon (allocated switch windows).
I have removed the taskid from PMI_netinfo_of_task, note that I have removed the trailing underscore. Commit 6398cc167bf1322f14ebcffbc0411a5bf84a6f52. I have updated the unit test case program as well. David
Hi David, Should I understand from your comment that you are in fact opposed to any usage of the switch/generic plugin because it incurs too much communication at job startup time? Or rather to its usage in the patch in attachment 617 [details]? That patch only parses node_id and calls switch_g_get_jobinfo(), so I believe it does not imply any remote calls, nor any pmi fence, to get the information. Piotr
Hi Piotr, yes, we oppose the usage of the switch/generic plugin because it creates communication overhead at job startup and also at system startup. The information we need is already in the system without using the switch/generic plugin. David
(In reply to David Bigagli from comment #41)
> Hi Piotr,
> yes we oppose the usage of switch/generic plugin because it creates
> communication overhead at job startup and also at system startup.
> The information we need are in the system already without using the
> switch/generic plugin.
>
> David
The original PMI2 implementation resulted in a lot of unnecessary data movement. See comment 38 for details.