Created attachment 443 [details] slurm.conf file

The issue has been reported at the MeteoFrance site with Slurm version 2.6.0 and has been reproduced on a smaller internal cluster. The performance degradation can appear starting at around 40 nodes. Here is the difference in execution times:

Results with Intel MPI & srun:
# grep real slurm-324554*
slurm-3245546.out:real 0m35.889s
slurm-3245547.out:real 0m39.664s

Results with bullx MPI & srun (bullxmpi is based on Open MPI):
# grep real slurm-324591*
slurm-3245917.out:real 0m5.740s
slurm-3245918.out:real 0m4.093s

Results with Intel MPI & mpirun:
# grep real slurm-324642*
slurm-3246427.out:real 0m7.925s
slurm-3246428.out:real 0m7.821s

This is just a first report, to see whether you have run into issues like this before. I have not had the chance to reproduce it myself because the testing cluster is currently full, but I will continue to investigate and get back to you when I have more information. In the meantime I wanted to ask:

1) The recommended way to use srun with Intel MPI is through the PMI library. Is there any other way to use srun with Intel MPI?

2) Have you ever tested the PMI2 library with Intel MPI? I think some development is needed on the Intel MPI side to support it. Did you have the chance to talk with Intel about that? Do you want us to deal with this? Since PMI2 is far more scalable than PMI, I think we need to push them to support it as fast as possible.
Let me know what you think.

Thanks,
Yiannis

PS: Here is the info needed to reproduce, with the slurm.conf attached:

# cat hw4.c
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* gethostname() */

int main(int argc, char **argv)
{
    int size, rank;
    char hostname[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* between 0 and size-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    hostname[1023] = '\0';
    gethostname(hostname, 1023);
    //printf("Hello world, rank : %d, size : %d, hostname : %s\n", rank, size, hostname);
    if (rank == 0)
        printf("Hostname: %s\n", hostname);
    MPI_Finalize();
    return 0;
}

# cat env_bench
source /opt/intel/mpi-rt/4.0.3/bin/mpivars.sh
source /opt/intel/composer_xe_2013.3.163/bin/iccvars.sh intel64

# cat Makefile
# Makefile
# source /opt/intel/bin/iccvars.sh intel64
# source /opt/intel/mpi-rt/4.0.3/bin64/mpivars.sh
SRC_BB = hw4.c
OBJ_BB = hw4.o
BIN_BB = hw4
CC = icc
MPICC = mpicc
#CFLAGS = -I/opt/mpi/bullxmpi//1.2.4.1/include/ -L/opt/mpi/bullxmpi/1.2.4.1/lib/ -align -lmpi
CFLAGS = -I/opt/intel/mpi-rt/4.0.3/include64 -L/opt/intel/mpi-rt/4.0.3/lib64/ -lmpi
LDLIBS =

all: $(BIN_BB)

BIN: $(BIN_BB)
	$(MPICC) $(CFLAGS) $(OBJ_BB) -o $(BIN_BB) $(LDLIBS)

OBJ: $(OBJ_BB)
	$(MPICC) $(CFLAGS) $(SRC_BB) -o $(OBJ_BB)

clean:
	rm -f $(BIN_BB) $(OBJ_BB)

# cat Bull_run_srun.sh
#!/bin/bash
#SBATCH --exclusive
#SBATCH --time=00:05:00
#SBATCH --partition B710
set -x
source env_bench
for i in $(nodeset -e $SLURM_NODELIST); do
    for j in $(seq 1 24); do
        echo $i
    done
done > machinefile_$SLURM_JOBID
printf "NODE %d\n" $(nodeset -c $SLURM_NODELIST)
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
time srun --resv-ports ~/hw4
#time mpirun -machinefile ~/machinefile_$SLURM_JOBID ~/hw4
sed 's@^@Bull_run.sh: @' /tmp/slurmd/job*${SLURM_JOBID}/slurm_script
Yiannis, could you confirm they have things set up as described on the MPI guide page? http://slurm.schedmd.com/mpi_guide.html#intel_srun Also, what version of Intel MPI are they using?
Danny, from what I see things are set up correctly, and they are using Intel MPI version 4.1.1.036, but we have also reproduced the issue with 4.0.3. Yiannis
We should have an Intel MPI license on Tuesday to begin testing.
Hi Yiannis, we got the Intel MPI license and I started investigating. Indeed we see some slowdown in run times with srun compared to mpirun. We'll keep you posted. David
Hi Yiannis,

we see the slowdown being caused by the interaction between the MPI application and libpmi. I timed the test code like this (declarations and includes added so the snippet compiles standalone):

#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct timeval tv, tv2;
    int rank, size;
    int num = 0;    /* test run number */

    gettimeofday(&tv, NULL);
    printf("start: %d %.15s.%d\n", num, ctime(&tv.tv_sec) + 4, (int)tv.tv_usec);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (0)
        printf("hello, world (rank %d of %d)\n", rank, size);
    MPI_Finalize();
    gettimeofday(&tv2, NULL);
    printf("end: %d %.15s.%d %f\n", num, ctime(&tv2.tv_sec) + 4, (int)tv2.tv_usec,
           ((tv2.tv_sec - tv.tv_sec) * 1000.0 + (tv2.tv_usec - tv.tv_usec) / 1000.0));
    return 0;
}

There are several protocol messages between libmpi.so and srun. What PMI library do you use in bullxmpi? Is there any difference from the one we use in Slurm?

David
Hi Yiannis, would it be possible for us to get a copy of bullx MPI for Linux? Does bullx MPI use PMI? I would assume so, since it is based on Open MPI, which uses PMI in its integration with Slurm. Based on my tests, the Intel library does not appear to support PMI2. David
Hi David,

bullxmpi can work with both PMI and PMI2, but we have made PMI2 the default method because the scalability is much better. The bullxmpi tests shown in a previous message of this bug were made with PMI, though, in order to show the difference from Intel MPI with PMI. I'm not sure I can manage to give you the bullxmpi code, but I can run whatever tests you want on our side and make comparisons if you have any patch you want to test.

Since Intel MPI does not support PMI2 yet, I think we need to ask them to support it as soon as possible. Perhaps we need to push from both sides to be more effective!

Yiannis
Hi Yiannis, I am interested to see whether in your case most of the time is spent in MPI_Init() as well when using Intel MPI with srun, and then to measure the same with bullxmpi. I narrowed it down by measuring the time inside the hello.c program; I have posted it in Bugzilla. I did not mean to ask for the bullxmpi code, but for the library and the header file; however, if you can perform the test, that is good too. Who do we have to push at Intel?
Hi Yiannis, did you have a chance to run the test? David
We tried the latest version of the Intel MPI library, version 5.0.0, but the performance numbers have not changed. David
*** Ticket 531 has been marked as a duplicate of this ticket. ***
Hello David,

sorry for the long delay on this one. Here are the results of the tests you asked for. The tests were made on 20 nodes with 240 CPUs in total.

Version                  Average MPI_Init (sec)
Intel srun (libpmi):     2.82
Intel mpirun:            0.28
OpenMPI srun (libpmi):   2.72
OpenMPI mpirun:          1.64

So to answer your question: the tests indeed show that the degradation with libpmi happens with both Intel MPI and Open MPI. BullxMPI is compiled with PMI2 by default, so it should not be used in the comparison.

By the way, a colleague at Bull has run tests measuring the time of Intel srun with libpmi on a larger cluster, using different values for the PMI_TIME variable, and it seems that lowering this variable improves the time significantly:

Nodes  Ntasks  PMI_TIME=500  PMI_TIME=10
10     40      4.485         3.696477
20     80      5.320         3.396676
100    400     11.966        2.217788
400    1600    111.757       9.523558
900    3600    551.599       30.90825

We are starting to use this variable to work around the delays. The default value of PMI_TIME in slurm/src/api/slurm_pmi.c is 500. Do you think we should drop this to a lower value, or should we instead optimize the logic in the _delay_rpc() function, which makes use of PMI_TIME?

Thanks
Yiannis
Hi, that is a good finding; let me run tests and investigate.
Hi Yiannis, I ran a few tests but I am unable to confirm your numbers: in my environment, changing the PMI_TIME value does not speed up MPI_Init(). This is most likely because of my setup; I run on a single node using multiple slurmds. However, since it does work for you, I think you should definitely use it. I see that with the default value (500) the RPCs are delayed by no more than ~100 microseconds, while with 10 the delays are around 0.1 microseconds. My suggestion at this point is to tune this parameter rather than change the code. Is there any way we could get access to some of these large systems you have, so we can test things? Thanks, David
Hi Yiannis, another thing to try is the srun environment variable SLURM_PMI_KVS_NO_DUP_KEYS, which tells the PMI code that there are no duplicate keys, so it skips the duplicate-key check, which is O(n^2) (sigh..):

export SLURM_PMI_KVS_NO_DUP_KEYS=yes

This should speed things up a bit if there are a lot of key/value pairs. David
Hello David, here are some new results from testing the different combinations of parameters. Thanks to Hugo Meiland for performing the tests, and thanks to the SARA admins for allowing us to run them on their cluster. It looks like these are the best settings:

export PMI_TIME=1
export SLURM_PMI_KVS_NO_DUP_KEYS=yes

You can find the results in the attached PDF. Yiannis
Created attachment 703 [details] intel pmi tests
Thank you guys. A couple of questions: what is the difference between the two sets of tests? The original performance numbers were gathered with 40 nodes, and looking at these numbers it seems we have improved the performance by 10 seconds. Is that what you see?
David, the difference between the two sets is the use of export SLURM_PMI_KVS_NO_DUP_KEYS=yes: the first set of tests uses this variable and the second does not. I think we should not compare these results with the initial results made on 40 nodes, because the hardware is different. What counts is that we have confirmed that with PMI_TIME < 500 and SLURM_PMI_KVS_NO_DUP_KEYS=yes we get better performance than with the defaults (PMI_TIME=500 and no SLURM_PMI_KVS_NO_DUP_KEYS). Using the environment variables is fine for us, but given the above results, would you consider changing the default values of these parameters? Thanks, Yiannis
Yiannis, then there is a typo in the document, since it shows export SLURM_PMI_KVS_NO_DUP_KEYS=yes for both cases; that's why I was confused. It makes sense now. :-) The best speedup seems to be at 128 nodes (3072 cores) with PMI_TIME=1. I think we can change the default for SLURM_PMI_KVS_NO_DUP_KEYS, since there are no duplicate keys by default, so we can save on the strcmp() calls. I am less sure about PMI_TIME: that parameter is there to prevent a wave of messages from hitting srun, so I am not sure about the possible side effects of lowering it. David
Hi Yiannis, I changed the default behavior to not check for duplicate keys. The code is in 14.03, commit 88cafae91de4d5. Can we close this ticket now? David
Fixed. David