Summary: | OpenMPI internal PMIx with Slurm | ||
---|---|---|---|
Product: | Slurm | Reporter: | Jed storey <jstorey2009> |
Component: | Build System and Packaging | Assignee: | Jacob Jenson <jacob> |
Status: | RESOLVED INVALID | QA Contact: | |
Severity: | 6 - No support contract | ||
Priority: | --- | CC: | joshua.schwartz, sts |
Version: | 17.11.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8625 | ||
Site: | -Other- | Alineos Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Description
Jed storey
2018-06-16 06:26:44 MDT
I can confirm that Slurm 17.11.7 and OpenMPI 3.1.0 work with the srun --mpi=pmi2 option. To get this to work, I installed all Slurm RPMs on my head node and all but the db and ctl RPMs on a compute node. Then I reconfigured and reinstalled OpenMPI with the following options: "--prefix=/opt/openmpi-3.1.0 --with-verbs --with-slurm --with-pmi=/usr". Oddly, using "--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/" didn't work because it couldn't find the libpmi files. There is currently an OpenMPI bug thread about this, and it's definitely an OpenMPI problem, not a Slurm problem. Anyway, srun --mpi=pmi2 now works, which is good. I still can't find answers to my previous questions, though.

I'm having many of the exact same issues with this. I can't get OpenMPI to build properly against an external PMIx because it decides that its internal version is newer. But then I can't get SLURM to build against that same PMIx version because the OpenMPI RPM install doesn't include the pmix_common.h for its internal PMIx. This is a real mess, and it shouldn't be so difficult to get SLURM, OpenMPI, and PMIx all installed with compatible versions.
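The version conflict described above (OpenMPI preferring its bundled PMIx when it considers it newer than the external one) can be checked before starting a build by comparing the two version strings. The following is only a minimal sketch: the helper name `version_ge` and both version numbers are hypothetical placeholders, not values taken from this report, and it assumes a `sort` that supports `-V` (as in GNU coreutils).

```shell
#!/bin/sh
# version_ge A B: succeed when version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available, e.g. from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Hypothetical version numbers, for illustration only; substitute the version
# of your external PMIx and the one bundled with your OpenMPI tarball.
external_pmix="2.1.1"
bundled_pmix="2.1.4"

if version_ge "$external_pmix" "$bundled_pmix"; then
    echo "external PMIx $external_pmix is at least as new as the bundled $bundled_pmix"
else
    echo "external PMIx $external_pmix is older than the bundled $bundled_pmix"
fi
```

If the external PMIx turns out to be older than the bundled one, that would be consistent with the behaviour reported here, where the --with-pmix option is set aside in favour of the internal copy.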
The closest I got was with using the --with-pmix options on both SLURM and OpenMPI, but then I get segfaults while running which google-fu seems to indicate are from mismatched versions (which isn't a surprise since OpenMPI discards my --with-pmix option in favor of its internal version):

    [hostname:177125] *** Process received signal ***
    [hostname:177125] Signal: Segmentation fault (11)
    [hostname:177125] Signal code: Invalid permissions (2)
    [hostname:177125] Failing at address: 0xa3fc28
    [hostname:177125] [ 0] /usr/lib64/libpthread.so.0(+0xf6d0)[0x7fd7da0926d0]
    [hostname:177125] [ 1] [0xa3fc28]
    [hostname:177125] *** End of error message ***

The magic for me to finally get this working: the following options to the OpenMPI rpmbuild:

    rpmbuild \
      --define "configure_options --with-pmix=${PMIX} --with-libevent=/usr --with-hwloc=/usr --with-ompi-pmix-rte --with-slurm" \
      -ba openmpi.spec

then the following options to the SLURM rpmbuild:

    rpmbuild \
      --define "_with_pmix --with-pmix=${PMIX}" \
      -ba slurm.spec

This document was the most helpful:
https://pmix.org/code/building-the-pmix-reference-server/

Make sure your hwloc-devel and libevent-devel package versions are acceptable based on:
https://pmix.org/code/getting-the-pmix-reference-server/

Also might be worth looking at a couple other issues I had to work around which may affect you depending on what you're doing:

OpenMPI:
https://github.com/open-mpi/ompi/issues/6900
https://github.com/open-mpi/ompi/issues/6914

SLURM:
https://bugs.schedmd.com/show_bug.cgi?id=7584
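The two rpmbuild invocations above depend on the same ${PMIX} prefix, so it may help to keep them in one script so that both packages are guaranteed to point at the same PMIx install. The sketch below is an assumption-laden wrapper, not part of the original report: the default prefix /opt/pmix, the DRY_RUN switch, and the function names run, build_openmpi and build_slurm are all hypothetical. By default it only prints the commands; the rpmbuild options themselves are copied from the comment above.

```shell
#!/bin/sh
# Hypothetical install prefix of the external PMIx; adjust to your system.
PMIX="${PMIX:-/opt/pmix}"

# Run a command, or only print it when DRY_RUN is 1 (the default here).
# Note: the printed form drops the quoting around the --define argument.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# OpenMPI build, pointed at the external PMIx (options from the comment above).
build_openmpi() {
    run rpmbuild \
        --define "configure_options --with-pmix=${PMIX} --with-libevent=/usr --with-hwloc=/usr --with-ompi-pmix-rte --with-slurm" \
        -ba openmpi.spec
}

# SLURM build against the same PMIx prefix.
build_slurm() {
    run rpmbuild \
        --define "_with_pmix --with-pmix=${PMIX}" \
        -ba slurm.spec
}

build_openmpi
build_slurm
```

Setting DRY_RUN=0 in the environment would make the script actually invoke rpmbuild, which assumes openmpi.spec and slurm.spec are in the current directory.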