|
Description
Damien
2018-03-14 04:18:00 MDT
Hi Damien,

In my installation I see that OpenMPI links against the libslurm.so.31 version, I am investigating why.
Can you show me your ./configure line of Slurm and OpenMPI?
And also an "ldd ...lib/openmpi/*" and find what is linked in regards to Slurm?
It seems that OpenMPI is linking not only with libpmi but also with libslurm.

Is there any chance that you removed the old libslurm.so.x but kept the old libpmix.so.y, or more generally that you kept some old Slurm files/libraries?

And last:
mdtest: error: slurm_receive_msg: Invalid Protocol Version 8192 from uid=10189 at 172.16.200.9:60670
This seems to indicate that slurmd, slurmctld and/or the client tools are running different versions. Can you check that all daemons have been correctly restarted and are running the latest versions?

Hi Damien,

Is this problem still occurring? Can you refer to comment 3, comment 4 and comment 5?

Thanks!

Hi Felip
We are still seeing this in older versions of our software that depend on openmpi. For example:
---
>> /usr/local/gromacs/2016.4-openmpi-cuda8.0/bin/gmx_mpi:
>> error while loading shared libraries: libslurm.so.30: cannot open shared
>> object file: No such file or directory
>> /usr/local/gromacs/2016.4-openmpi-cuda8.0/bin/gmx_mpi: error while
>> loading shared libraries: libslurm.so.30: cannot open shared object file:
>> No such file or directory
>> /usr/local/gromacs/2016.4-openmpi-cuda8.0/bin/gmx_mpi: error while
>> loading shared libraries: libslurm.so.30: cannot open shared object file:
>> No such file or directory
>> /usr/local/gromacs/2016.4-openmpi-cuda8.0/bin/gmx_mpi: error while
>> loading shared libraries: libslurm.so.30: cannot open shared object file:
>> No such file or directory
>> /usr/local/gromacs/2016.4-openmpi-cuda8.0/bin/gmx_mpi: error while
>> loading shared libraries: libslurm.so.30: cannot open shared object file:
>> No such file or directory
>> /usr/local/gromacs/2016.4-openmpi-cuda8.0/bin/gmx_mpi: error while
>> loading shared libraries: libslurm.so.30: cannot open shared object file:
>> No such file or directory
>> /usr/local/gromacs/2016.4-openmpi-cuda8.0/bin/gmx_mpi: error while
>> loading shared libraries: libslurm.so.30: cannot open shared object file:
>> No such file or directory
>> /usr/local/gromacs/2016.4-openmpi-cuda8.0/bin/gmx_mpi: error while
>> loading shared libraries: libslurm.so.30: cannot open shared object file:
>> No such file or directory
>> /usr/local/gromacs/2016.4-openmpi-cuda8.0/bin/gmx_mpi: error while
>> loading shared libraries: libslurm.so.30: cannot open shared object file:
>> No such file or directory
---
Is your openmpi still linked with libslurm ?
Cheers
Damien
(In reply to Damien from comment #7)
> Hi Felip
> Is your openmpi still linked with libslurm ?

Yes, it seems so. Can you please refer to comment 3, comment 4 and comment 5?

Hi Felip

For OpenMPI
/openmpi-2.0.2/configure --prefix=/usr/local/openmpi/2.02-gcc4 --with-slurm --with-pmi=/opt/slurm-16.05.4 --enable-static --enable-shared --enable-mpi-fortran --with-mxm=/opt/mellanox/mxm --with-verbs

Is this the right practice ?

Cheers
Damien

(In reply to Felip Moll from comment #3)
> Hi Damien,
>
> In my installation I see that OpenMPI links against the libslurm.so.31
> version, I am investigating why.
>
> Can you show me your ./configure line of Slurm and OpenMPI?
>
> And also an "ldd ...lib/openmpi/*" and find what is linked in regards to
> Slurm?
>
> It seems that OpenMPI is linking not only with libpmi but also with libslurm.

(In reply to Damien from comment #10)
> Hi Felip
>
> For OpenMPI
>
> /openmpi-2.0.2/configure --prefix=/usr/local/openmpi/2.02-gcc4 --with-slurm
> --with-pmi=/opt/slurm-16.05.4 --enable-static --enable-shared
> --enable-mpi-fortran --with-mxm=/opt/mellanox/mxm --with-verbs
>
> Is this the right practice ?
>
> Cheers
>
> Damien

Hi Damien,

It seems I have finally identified the problem. In our .la file, the library description for libtool, we have the dlname pointing to 'libslurm.so.31'. In theory we should point it to dlname='libslurm.so.0' or dlname='libslurm.so'; these are two symlinks that point to the real libslurm.so.XY.Z.K library.
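As a rough illustration of the dlname entry discussed here, the sketch below reads that field out of a libtool .la file. The file it inspects is a hypothetical stand-in written to a scratch directory, and the helper name la_dlname is made up for this example; on a real system the path to the installed libslurm.la would be passed instead.

```shell
# Sketch: print the dlname recorded in a libtool .la file.
# The .la file below is a hypothetical sample, not a copy of a real Slurm file.
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT

cat > "$scratch/libslurm.la" <<'EOF'
# libslurm.la - a libtool library file (hypothetical sample)
dlname='libslurm.so.31'
library_names='libslurm.so.31.0.0 libslurm.so.31 libslurm.so'
EOF

# la_dlname FILE: extract the value of the dlname='...' line.
la_dlname() {
    sed -n "s/^dlname='\(.*\)'\$/\1/p" "$1"
}

la_dlname "$scratch/libslurm.la"
```

A versioned value here is what ends up recorded by anything that links through the .la file rather than through the plain library name.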
Can you run a test by creating a symlink from the missing libslurm.so.30 to the current libslurm.so library? Please tell me the results.

Damien,
There's another thing to comment here.
This:
/openmpi-2.0.2/configure --prefix=/usr/local/openmpi/2.02-gcc4 --with-slurm --with-pmi=/opt/slurm-16.05.4 --enable-static --enable-shared --enable-mpi-fortran --with-mxm=/opt/mellanox/mxm --with-verbs
will link openmpi against the library directories under /opt/slurm-16.05.4. This means that it will break your installation if you upgrade Slurm and try to use this openmpi with the new version.
Look at the output of 'ldd /usr/local/openmpi/2.02-gcc4/lib/openmpi/mca_pmix_s1.so', you will see the wrong paths in there, like the one for libpmi.so.0, in my case:
libpmi.so.0 => /home/user/slurm/17.02/llagosti/lib/libpmi.so.0 (0x00007f7755c2c000)
libslurm.so.31 => /home/user/slurm/17.02/llagosti/lib/libslurm.so.31 (0x00007f77557f9000)
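To pick out only the Slurm-related entries from a long ldd listing, a small filter along these lines may help. It is a sketch: the helper name slurm_libs is invented here, and the demo feeds it the two lines quoted above plus one made-up libc line so that it runs without an OpenMPI installation.

```shell
# Sketch: keep only libslurm*/libpmi* lines from ldd output read on stdin.
slurm_libs() {
    grep -E 'lib(slurm|pmi)[^[:space:]]*\.so'
}

# On a real system one would run something like:
#   ldd /usr/local/openmpi/2.02-gcc4/lib/openmpi/mca_pmix_s1.so | slurm_libs
# Demo input: the two lines from the comment above; the libc line is
# a hypothetical extra to show that unrelated libraries are dropped.
sample='libpmi.so.0 => /home/user/slurm/17.02/llagosti/lib/libpmi.so.0 (0x00007f7755c2c000)
libslurm.so.31 => /home/user/slurm/17.02/llagosti/lib/libslurm.so.31 (0x00007f77557f9000)
libc.so.6 => /lib64/libc.so.6 (0x00007f7755400000)'

printf '%s\n' "$sample" | slurm_libs
```

Any path in the filtered output that contains a Slurm version number is one that will stop resolving once that directory is removed.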
What you should do here is to create a symlink from something like /opt/slurm-current to /opt/slurm-XX.YY, and link-compile openmpi against this directory without version information. This way you will be able to support upgrades in the future.
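The versionless symlink layout could look roughly like the sketch below. It works in a scratch directory standing in for /opt, and the release directory names are only illustrative; the real ones depend on the site.

```shell
# Sketch: a versionless "slurm-current" symlink that moves on upgrade.
# $opt is a scratch directory standing in for /opt; names are illustrative.
opt=$(mktemp -d)
trap 'rm -rf "$opt"' EXIT
mkdir -p "$opt/slurm-16.05.4/lib" "$opt/slurm-17.11.4/lib"

# Point the versionless name at the currently installed release.
ln -sfn slurm-16.05.4 "$opt/slurm-current"

# OpenMPI would then be configured against the versionless path, e.g.
#   ./configure --with-slurm --with-pmi=/opt/slurm-current ...

# After a Slurm upgrade only the symlink is switched.
ln -sfn slurm-17.11.4 "$opt/slurm-current"
readlink "$opt/slurm-current"
```

This only keeps the directory paths stable. It does not change which libslurm.so.NN name is recorded inside binaries that were already built.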
At that point and if I am not wrong, if you recompile the same openmpi installation but now pointing to a different slurm directory, Gromacs and other software shouldn't have problems.
I still have to double-check about comment 12, but I think that what I just explained here will do the trick.
More on this...
Regarding the output of 'ldd /usr/local/openmpi/2.02-gcc4/lib/openmpi/mca_pmix_s1.so':
libpmi.so.0 => /home/user/slurm/17.02/llagosti/lib/libpmi.so.0 (0x00007f7755c2c000)
libslurm.so.31 => /home/user/slurm/17.02/llagosti/lib/libslurm.so.31 (0x00007f77557f9000)
I don't understand why this library is linked against libslurm.so.31 and not against libpmi.so.0 alone. The application should just link against libpmi.
I will investigate that, it probably has something to do with OpenMPI build scripts.
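One way to narrow this down: ldd also lists libraries pulled in indirectly, whereas the NEEDED entries printed by 'readelf -d' show only what the object itself was linked against. The sketch below extracts those entries. The helper name needed_libs is invented, and the sample input is a hypothetical readelf excerpt, not output captured from a real mca_pmix_s1.so.

```shell
# Sketch: list direct dependencies from `readelf -d` output read on stdin.
needed_libs() {
    sed -n 's/.*(NEEDED).*\[\(.*\)].*/\1/p'
}

# On a real system one would run something like:
#   readelf -d /usr/local/openmpi/2.02-gcc4/lib/openmpi/mca_pmix_s1.so | needed_libs
# Hypothetical sample in readelf's usual layout, for demonstration only.
sample=' 0x0000000000000001 (NEEDED)             Shared library: [libpmi.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libslurm.so.31]
 0x000000000000000e (SONAME)             Library soname: [mca_pmix_s1.so]'

printf '%s\n' "$sample" | needed_libs
```

If libslurm.so.NN shows up here, the object was linked against it directly; if it appears only in ldd, it is coming in through libpmi.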
Btw, I need more information on this comment 0:
> it breaks some MPI software like Gromacs.
> We resolved this by compiling a new openmpi module against this new version of SLURM,
Should I understand that all your software is linked against openmpi, and that after compiling a new openmpi module and putting it in place everything should work again?
We are discussing it internally too.
Hi Felip

I have recompiled gromacs 2018 based on the new slurm 17.11.04, but have not tried the sym-link method yet.
I am going to re-do VASP and will try it with sym-links instead. Hope it will work too.

Thanks.
Damien

(In reply to Felip Moll from comment #12)
> Hi Damien,
>
> It seems I have finally identified the problem. In our .la file, the library
> description for libtool, we have the dlname pointing to 'libslurm.so.31'. In
> theory we should point it to dlname='libslurm.so.0' or dlname='libslurm.so';
> these are two symlinks that point to the real libslurm.so.XY.Z.K library.
>
> Can you run a test by creating a symlink from the missing libslurm.so.30 to
> the current libslurm.so library?
>
> Please tell me the results.

Damien,

Final conclusion/diagnosis.

The problem is found in applications linked against Slurm's "libpmi.la" file.
The OpenMPI folks link incorrectly against this file, where they should link against libpmi.so.
You have to recompile your current OpenMPI against the new Slurm version, overwriting the current openmpi installation so as not to cause problems for other software linked against openmpi.
I am trying to find a solution for the future, maybe on the OpenMPI side.

(In reply to Felip Moll from comment #28)
> Damien,
>
> Final conclusion/diagnosis.
>
> The problem is found in applications linked against Slurm's "libpmi.la" file.
>
> The OpenMPI folks link incorrectly against this file, where they should link
> against libpmi.so.
>
> You have to recompile your current OpenMPI against the new Slurm version,
> overwriting the current openmpi installation so as not to cause problems for
> other software linked against openmpi.
>
> I am trying to find a solution for the future, maybe on the OpenMPI side.

Hi Damien,

Can you comment on the provided workaround? Does it work for you?

Thanks

Hi Felip

We are hoping that OpenMPI and Slurm will reach a compromise or a joint solution.
I believe other Slurm sites are asking for this too.

Cheers
Damien

(In reply to Damien from comment #31)
> Hi Felip
>
> We are hoping that OpenMPI and Slurm will reach a compromise or a joint
> solution.
>
> I believe other Slurm sites are asking for this too.
>
> Cheers
>
> Damien

Yes, this is what I am working on. I first want to make a consistent proposal, but at the moment there is no more news (I didn't have time to finish it).
In any case, you will need to recompile OpenMPI for sure.

Created attachment 6750
config.log
For reference this is a config.log of open mpi.
I am currently tracking this issue with the Open MPI folks in:
https://github.com/open-mpi/ompi/issues/5124

They changed the bug to an enhancement request:
https://github.com/open-mpi/ompi/issues/5145
On the Slurm side we can see whether we can always remove the .la files so that other software does not link against them.

Just a reminder, this discussion is currently stalled and we must still find an agreement on what to change in OpenMPI/Slurm to make it work.

Damien,

We've finally found a workaround for this.
We changed libpmi.so to link against libslurmfull.so instead of libslurm.so.xx. This should fix your issue as long as the signatures contained in libslurmfull.so are not changed. libslurmfull.so is not versioned, so the library will be found by the software linked to it, in your case OpenMPI.
This is not good practice but won't harm anything here. The general idea is that when you upgrade Slurm you must also upgrade the binaries that were linked to a particular version; this case is the exception. Think of it like an RPM system: when you upgrade libxyz to a newer major version, other packages also need to be upgraded, since libxyz is a dependency.
In any case, the change is committed in 18.08 in 364ef72fb27f512; it is safe to backport it and apply it to a 17.11 if you want (but without support).
Now libpmi.la contains a dependency lib pointing to lib/slurm/libslurmfull.la instead of libslurm.la, which removes the versioning problem.
Finally, we encourage most of our customers to link to pmi2 or pmix instead of pmi, so this workaround wouldn't be a major issue anymore.

I am going to close this bug now.

Regards,
Felip M

Hi Felip

Sorry, I might need to re-open this for a query.
How to link this via pmix ?
These are our ./configure flags for slurm and openmpi:

openmpi
./configure --prefix=/usr/local/openmpi/3.1.4-mlx --with-slurm --with-pmix --enable-static --enable-shared --with-mxm=/opt/mellanox/mxm --enable-mpi-fortran --with-verbs --with-pmi=/opt/slurm-latest

/opt/slurm-latest is a sym-link

slurm v18
./configure --prefix=/opt/slurm-18.08.6-2 --with-munge=/opt/munge-0.5.11 --enable-pam --with-pmix=/usr/local/pmix/latest

This is not working for openmpi.
Kindly advise.

Thanks
Damien

Hi Damien,

Felip's currently on vacation. In bug 7236 I recently added a full example of working configure lines for the three software components. I believe your problem is that you are passing --with-pmix as-is to OpenMPI while Slurm is configured to use the external --with-pmix=/usr/local/pmix/latest, and that looks inconsistent.

Hi Alejandro

Firstly, thanks for your reply. I am sorry to bother Felip.

Is there a recommended/approved pmix version under slurm/contribs, similar to pmi? Like:
--
/slurm/18.06/build/contribs/pmix
--

So we don't have to obtain and test the external source versions.

Cheers
Damien

(In reply to Damien from comment #51)
> Hi Alejandro
>
> Firstly, thanks for your reply. I am sorry to bother Felip.
>
> Is there a recommended/approved pmix version under slurm/contribs, similar to
> pmi? Like:
> --
> /slurm/18.06/build/contribs/pmix
> --
>
> So we don't have to obtain and test the external source versions.
>
> Cheers
>
> Damien

Hi Damien,

I am back already, and you are not bothering me, don't worry :)

Alex is correct, you must instruct both openmpi and slurm to configure against the same pmix.
There are no contribs for pmix; you must install a standalone one. Slurm 18.08+ supports PMIx v1.2+, v2.x and v3.x. I recommend using a git tag and not the master branch, since master is considered unstable and you can hit incompatibilities or other problems, e.g. OpenMPI didn't support 3.2.x some weeks ago.

You can even install different releases, for example 1.2.5, 3.1.2 and 2.1.7, and compile slurm with:
--with-pmix=path_to_pmix/1.2:path_to_pmix/2.1:path_to_pmix/3.1

Then they will be available in Slurm, check with 'srun --mpi=list'.

After that, you can have multiple OpenMPI, IntelMPI and other implementations compiled against the needed version, e.g. IntelMPI against pmix 1.2.4, and two OpenMPI, one against 2.1 and one against 3.1. Then the user would make the appropriate choice in Slurm:
module load <the selected mpi>
srun --mpi=pmix_vX foo.bar

Does it resolve your questions?

Offtopic: In cases like this bug, it would have been ok to open a new one since it is a separate issue and other colleagues could take the bug and respond earlier.

Damien,

I am marking this as infogiven. Feel free to open new bugs if questions arise.

Felip

Hi Felip

Thanks for your help and advice.

Cheers
Damien
|