Ticket 459 - slow mpi_init with intel mpi and srun
Summary: slow mpi_init with intel mpi and srun
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 2.6.x
Hardware: Linux
Severity: 2 - High Impact
Assignee: David Bigagli
QA Contact:
URL:
Duplicates: 531
Depends on:
Blocks:
 
Reported: 2013-10-11 10:04 MDT by Yiannis Georgiou
Modified: 2014-04-09 05:33 MDT
CC List: 3 users

See Also:
Site: Meteo France
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 14.03.1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf file (2.75 KB, application/octet-stream)
2013-10-11 10:04 MDT, Yiannis Georgiou
Details
attachment-27434-0.html (2.67 KB, text/html)
2014-01-20 02:44 MST, David Bigagli
Details
intel pmi tests (25.79 KB, application/pdf)
2014-03-18 05:52 MDT, Yiannis Georgiou
Details

Description Yiannis Georgiou 2013-10-11 10:04:06 MDT
Created attachment 443 [details]
slurm.conf file

The issue has been reported on the MeteoFrance site with slurm version 2.6.0 and has been reproduced on an internal smaller cluster. The performance degradation appears starting at around 40 nodes. Here is the difference in execution times: 

Results with intel mpi & srun :

# grep real slurm-324554*
slurm-3245546.out:real 0m35.889s
slurm-3245547.out:real 0m39.664s

Result with bullx mpi & srun (bullxmpi is based on openmpi):

# grep real slurm-324591*
slurm-3245917.out:real 0m5.740s
slurm-3245918.out:real 0m4.093s

Results with intel mpi & mpirun :

grep real slurm-324642*
slurm-3246427.out:real 0m7.925s
slurm-3246428.out:real 0m7.821s

This is just a first report to see if you have seen any issue like this before.
I didn't have the chance to reproduce it myself because the testing cluster is currently full, but I will continue to investigate and come back to you when I have more info. 

In the meantime I wanted to ask :
1) The recommended way to use srun with Intel MPI is by using the pmi library. Is there any other way to use srun with IntelMPI?

2) Have you ever tested the pmi2 library with IntelMPI? I think that some development is needed on the IntelMPI side for this support. Did you have the chance to talk with Intel about that? Do you want us to handle this? Since pmi2 is far more scalable than pmi, I think we need to push them to support it as soon as possible.

let me know what you think
Thanks,
Yiannis

PS:

Here is the info needed to reproduce, with the slurm.conf attached:

# cat hw4.c
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int size, rank;
  char hostname[1024];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); /*-- between 0 and size-1 --*/
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  hostname[1023] = '\0';
  gethostname(hostname, 1023);
  //printf("Hello world, rank : %d, size : %d, hostname : %s\n", rank, size, hostname);
  if (rank == 0) printf("Hostname: %s\n", hostname);
  MPI_Finalize();
  return 0;
}

# cat env_bench
source /opt/intel/mpi-rt/4.0.3/bin/mpivars.sh
source /opt/intel/composer_xe_2013.3.163/bin/iccvars.sh intel64


# cat Makefile
# Makefile

# source /opt/intel/bin/iccvars.sh intel64
# source /opt/intel/mpi-rt/4.0.3/bin64/mpivars.sh


SRC_BB = hw4.c
OBJ_BB = hw4.o
BIN_BB = hw4


CC = icc
MPICC = mpicc
#CFLAGS = -I/opt/mpi/bullxmpi//1.2.4.1/include/ -L/opt/mpi/bullxmpi/1.2.4.1/lib/ -align -lmpi
CFLAGS = -I/opt/intel/mpi-rt/4.0.3/include64 -L/opt/intel/mpi-rt/4.0.3/lib64/ -lmpi
LDLIBS =

all: $(BIN_BB)


$(BIN_BB): $(OBJ_BB)
        $(MPICC) $(CFLAGS) $(OBJ_BB) -o $(BIN_BB) $(LDLIBS)

$(OBJ_BB): $(SRC_BB)
        $(MPICC) $(CFLAGS) -c $(SRC_BB) -o $(OBJ_BB)

clean:
        rm -f $(BIN_BB) $(OBJ_BB)



# cat Bull_run_srun.sh
#!/bin/bash
#SBATCH --exclusive
#SBATCH --time=00:05:00
#SBATCH --partition B710

set -x

source env_bench
for i in $(nodeset -e $SLURM_NODELIST); do
    for j in $(seq 1 24) ; do
        echo $i
    done
done > machinefile_$SLURM_JOBID

printf "NODE %d\n" $(nodeset -c $SLURM_NODELIST)
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

time srun --resv-ports ~/hw4
#time mpirun -machinefile ~/machinefile_$SLURM_JOBID ~/hw4

sed 's@^@Bull_run.sh: @' /tmp/slurmd/job*${SLURM_JOBID}/slurm_script
Comment 1 Danny Auble 2013-10-11 10:14:53 MDT
Yiannis, could you confirm they have things set up as described on the mpi page?

http://slurm.schedmd.com/mpi_guide.html#intel_srun

Also what version of Intel MPI are they using?
Comment 2 Yiannis Georgiou 2013-10-11 10:31:05 MDT
Danny,

from what I see things are set up correctly and they are using Intel version 4.1.1.036 but we have reproduced with 4.0.3 .

Yiannis
Comment 3 Moe Jette 2013-10-14 14:25:44 MDT
We should have an Intel MPI license on Tuesday to begin testing.
Comment 4 David Bigagli 2013-10-15 09:46:38 MDT
Hi Yiannis, we got the Intel MPI and I started investigating.
Indeed we see some slowdown in run times with srun compared to 
mpirun. We'll keep you posted.

 David
Comment 5 David Bigagli 2013-10-17 12:18:08 MDT
Hi Yiannis, we see the slowdown being caused by the interaction between
the MPI application and the libpmi. I timed the test code like this:

{
        struct timeval tv, tv2;   /* needs <sys/time.h> and <time.h>; num is the task number */

        gettimeofday(&tv, NULL);
        printf("start: %d %.15s.%d\n",
               num, ctime(&tv.tv_sec) + 4, (int)tv.tv_usec);

        MPI_Init(&argc, &argv);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (0)
                printf("hello, world (rank %d of %d)\n", rank, size);
        MPI_Finalize();

        gettimeofday(&tv2, NULL);
        printf("end:   %d %.15s.%d %f\n",
               num, ctime(&tv2.tv_sec) + 4, (int)tv2.tv_usec,
               ((tv2.tv_sec - tv.tv_sec) * 1000.0
                + (tv2.tv_usec - tv.tv_usec) / 1000.0));
}

There are several protocol messages between libpmi.so and srun.
What pmi library do you use in bullxmpi? Is there any difference
from the one we use in Slurm?

 David
Comment 6 David Bigagli 2013-10-21 07:18:17 MDT
Hi Yiannis, would it be possible for us to get a copy of bullx mpi for linux?
Does bullx mpi use pmi? I would assume so, since it is based on openmpi, which 
uses pmi in its integration with slurm.

Based on my tests the intel library does not appear to support pmi2.

David
Comment 7 Yiannis Georgiou 2013-10-22 04:50:33 MDT
Hi David,

bullxmpi can work with both pmi and pmi2, but we have made pmi2 the default method because the scalability is much improved. The bullxmpi tests shown in a previous message of this bug were made with pmi, though, in order to show the difference with IntelMPI and pmi. 
I'm not sure I can manage to give you the bullxmpi code, but I can run the tests you want on our side here and make comparisons if you have any patch you want tested.

Since IntelMPI does not support pmi2 yet, I think we need to ask them to support it as soon as possible. Perhaps we need to push from both sides to be more efficient!

Yiannis
Comment 8 David Bigagli 2013-10-22 05:00:15 MDT
Hi Yiannis,
            I am interested to see whether in your case most of the time is 
spent in MPI_Init() as well when using IntelMPI and srun, and then to measure 
the same using bullxmpi. I narrowed it down by measuring the time inside 
the hello.c program; I have posted it in bugzilla.

I did not mean to ask for the bullxmpi code, but for the library and the 
header file; however, if you can perform the test that is good too.

Who do we have to push at Intel?

Comment 9 David Bigagli 2013-11-04 09:57:30 MST
Hi Yiannis,
           did you have a chance to run the test?

David
Comment 10 David Bigagli 2013-11-14 05:46:46 MST
We tried the latest version of the Intel MPI library, version 5.0.0, but the
performance numbers have not changed.

 David
Comment 11 David Bigagli 2013-11-21 07:13:46 MST
*** Ticket 531 has been marked as a duplicate of this ticket. ***
Comment 12 Yiannis Georgiou 2014-01-20 00:20:18 MST
Hello David,

sorry for the long delay on this one. Here are the results of the tests you asked for.

The tests were made on 20 nodes with 240 cpus in total. 

Version                  Average MPI_Init (sec)
Intel srun (libpmi)  :      2.82 
Intel mpirun         :      0.28
OpenMPI srun (libpmi):      2.72
OpenMPI mpirun       :      1.64


So to answer your question: indeed the tests show that the degradation with libpmi happens with both Intel and OpenMPI. BullxMPI is compiled with pmi2 by default, so it should not be used in the comparison.

By the way, a colleague at BULL has made tests measuring the time of the Intel srun with libpmi on a larger cluster using different values for the PMI_TIME variable, and it seems that lowering this variable improves the time significantly: 

Nodes  Ntasks  PMI_TIME=500  PMI_TIME=10
   10      40        4.485      3.696477
   20      80        5.320      3.396676
  100     400       11.966      2.217788
  400    1600      111.757      9.523558
  900    3600      551.599     30.90825

We are starting to use this variable to work around the delays. 
The default value of PMI_TIME in slurm/src/api/slurm_pmi.c is 500. Do you think we should drop this to a lower value, or should we work on an optimization of the logic in the _delay_rpc function which makes use of PMI_TIME?

Thanks
Yiannis
Comment 13 David Bigagli 2014-01-20 02:44:32 MST
Hi, that is a good finding, let me run tests and investigate.

Comment 14 David Bigagli 2014-01-20 02:44:44 MST
Created attachment 594 [details]
attachment-27434-0.html
Comment 15 David Bigagli 2014-01-21 07:04:20 MST
Hi Yiannis,
          I ran a few tests but I am unable to confirm your numbers; in my environment changing the PMI_TIME values does not speed up MPI_Init().
This is most likely because of my environment: I run on a single node 
using multiple-slurmd. 

However, since it does work for you, I think you should definitely use it.
I see that with the default value (500) the rpcs are delayed by no more
than ~100 microseconds; using 10 the delays are around .1 microseconds.

My suggestion at this point is to tune this parameter rather than change
the code.

Is there any way we could get access to some of these large systems so we can test on them?

Thanks,
       David
Comment 16 David Bigagli 2014-02-20 03:47:57 MST
Hi Yiannis,
           another thing to try is the srun environment variable
SLURM_PMI_KVS_NO_DUP_KEYS, which tells the PMI that there are no duplicate keys,
so the code skips the check for duplicate keys, which is O(n^2) (sigh..).
export SLURM_PMI_KVS_NO_DUP_KEYS=yes should speed things up a little
if there are a lot of key,value pairs.


David
Comment 17 Yiannis Georgiou 2014-03-18 05:50:59 MDT
Hello David,

here are some new results from testing the different combinations of parameters.
Thanks to Hugo Meiland for performing the tests and thanks to the SARA admins for allowing us to run the tests on their cluster.

It looks like these are the best settings:

export PMI_TIME=1
export SLURM_PMI_KVS_NO_DUP_KEYS=yes

You can find the results in the attached pdf.

Yiannis
Comment 18 Yiannis Georgiou 2014-03-18 05:52:37 MDT
Created attachment 703 [details]
intel pmi tests
Comment 19 David Bigagli 2014-03-18 07:11:58 MDT
Thank you guys. A couple of questions:
What is the difference between the 2 sets of tests?
The original performance numbers were gathered with 40 nodes, and looking 
at these numbers it seems that we have improved the performance by 
10 seconds. Is that what you see?

Comment 20 Yiannis Georgiou 2014-03-19 05:23:53 MDT
David,

the difference between the 2 sets is the usage of:
export SLURM_PMI_KVS_NO_DUP_KEYS=yes

The first set of tests uses this variable and the second does not.

I think we should not compare these results with the initial results made on 40 nodes, because the hardware is different.
What counts is that we confirmed that by using the parameters

export PMI_TIME < 500
and
export SLURM_PMI_KVS_NO_DUP_KEYS=yes

we get better performance than with the default parameters, which are PMI_TIME=500 and SLURM_PMI_KVS_NO_DUP_KEYS=no.

Using the environment variables is fine for us, but given the above results will you consider changing the default values of the parameters?

thanks,
Yiannis



Comment 21 David Bigagli 2014-03-19 05:38:35 MDT
Yiannis, then there is a typo in the document, since it shows:

export SLURM_PMI_KVS_NO_DUP_KEYS=yes

for both cases; that's why I was confused.
It makes sense now. :-) 
The best speedup seems to be at 128 nodes, 3072 cores and pmitime = 1.

I think we can change the default behavior for SLURM_PMI_KVS_NO_DUP_KEYS, as there are no dup keys by default, so we can save on the strcmp(). I am less 
sure about the pmitime, as that parameter is there to prevent a wave of messages hitting srun, so lowering it could have side effects.


David
Comment 22 David Bigagli 2014-03-31 07:47:05 MDT
Hi Yiannis,
           I changed the default behavior to not check for duplicate keys.
The code is in 14.03, commit 88cafae91de4d5. 

Can we close this ticket now?

David
Comment 23 David Bigagli 2014-04-09 05:33:25 MDT
Fixed.

 David