Ticket 531 - srun is slower starting hybrid mpi/openmp jobs than mpirun
Summary: srun is slower starting hybrid mpi/openmp jobs than mpirun
Status: RESOLVED DUPLICATE of ticket 459
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 2.6.x
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2013-11-21 06:37 MST by Rod Schultz
Modified: 2013-11-21 07:13 MST

See Also:
Site: Meteo France


Attachments
Source tar file (769.59 KB, application/x-compressed)
2013-11-21 06:37 MST, Rod Schultz
Text of original report (1.95 KB, text/plain)
2013-11-21 06:37 MST, Rod Schultz

Description Rod Schultz 2013-11-21 06:37:10 MST
Created attachment 530 [details]
Source tar file

David,
We have another support request.
Meteo France has observed that it takes 8 times as long to start a hybrid MPI/OpenMP job with srun as with mpirun.
I have had trouble reproducing the problem because they are using Intel MPI, which we have not configured on our test cluster. In addition, they observed it using 24 nodes with 12 cores per node. I suppose I could try to emulate the cluster with --enable-front-end, but since the complaint is about performance, I’m not sure emulation wouldn’t introduce side effects.
So let me explain the environment. Maybe you will have an idea of how to evaluate the root problem using their program without reproducing their environment. If you want me to try the emulation route I can, but finding a better way of observing the problem would probably be more productive.
Attached is the tar file sent with the support request. I’ve also attached a text file containing the bug report we received.
The tar file contains a src folder, an Excel spreadsheet analyzing results, and a lot of sample outputs. I don’t think the spreadsheet and outputs are particularly useful other than as proof that there is a problem.
Here are their instructions for reproducing the problem.
I had to include in the path the location of mpiifort, but maybe that is because we haven’t configured intel-mpi. (PATH=/opt/intel/impi/4.1.1/bin:$PATH)
I also had to change their source program, reducing the size of the array by setting LG = 10000 on line 8; otherwise I got a segmentation fault.
    tar xvfoz mpirun_vs_srun.tgz
    cd mpirun_vs_srun
    ROOT=$PWD
    cd $ROOT/src
    icc -c wtime.C
    ifort -c starter.f90 
    ifort -o starter.exe starter.o wtime.o -lstdc++
    mpiifort -c -openmp main.f90
    mpiifort -openmp -o main.intel.exe main.o wtime.o -lstdc++
At this point, main.intel.exe has been built in src.
I believe the rest of the instructions are for automating runs with either mpirun or srun. The scripts contain some site-specific material, particularly about which distributed file system is present, that I don’t think is relevant to the problem.
## cd $ROOT
## nohup ./go &
## ./extract.sh mpirun
## ./extract.sh srun
## ls -l log.*run.txt

At this point, I can do
mpirun -np 2 main.intel.exe
On my system, when I do 
srun main.intel.exe
I get this error, which I suspect is because we haven’t configured intel-mpi.
Comment 1 Rod Schultz 2013-11-21 06:37:51 MST
Created attachment 531 [details]
Text of original report
Comment 2 David Bigagli 2013-11-21 07:10:54 MST
Hi Rod,
I will mark this as a duplicate of 459, reported by Yiannis.
We have tracked the slowdown to MPI_Init(), when the MPI library
calls the PMI module. There is not much we can do given the current
implementation.
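
For anyone who wants to compare the two launchers without the site's automation scripts, here is a minimal sketch of a timing helper. The function name time_launch is hypothetical, and the helper assumes bash and GNU date (for %N). It measures the total wall-clock time of one command, so it only approximates launch overhead when the job itself is short; it does not isolate the time spent in MPI_Init().

```shell
#!/bin/bash
# Hypothetical helper: print the wall-clock time of one command in milliseconds.
# Assumes GNU date, whose %N format gives nanoseconds.
time_launch() {
    local start end rc
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1      # run the command, discarding its output
    rc=$?
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
    return "$rc"              # pass the command's exit status through
}

# Example usage (task counts are placeholders, not the site's settings):
#   time_launch mpirun -np 2 ./main.intel.exe
#   time_launch srun -n 2 ./main.intel.exe
```

Running each launcher several times and comparing the spread would give a fairer picture than a single measurement.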

On 11/21/2013 12:58 PM, bugs@schedmd.com wrote:
> David Bigagli <mailto:david@schedmd.com> changed bug 531
> <http://bugs.schedmd.com/show_bug.cgi?id=531>
> What 	Removed 	Added
> Assignee 	jette@schedmd.com 	david@schedmd.com
Comment 3 David Bigagli 2013-11-21 07:13:46 MST
Thanks.

*** This ticket has been marked as a duplicate of ticket 459 ***