Ticket 15594

Summary: Cannot get MPI to run via srun (pmix)
Summary: Cannot get MPI to run via srun (pmix)
Product: Slurm
Reporter: Mike Jarsulic <mjarsulic>
Component: PMIx
Assignee: Skyler Malinowski <skyler>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
CC: mcmullan
Version: 21.08.6
Hardware: Linux
OS: Linux
Site: University of Chicago
Linux Distro: RHEL
Machine Name: Randi

Description Mike Jarsulic 2022-12-09 09:39:00 MST
Hello,

This ticket is for configuration support, but the issue is holding us up from moving forward with our SLURM implementation. We are having issues getting both OpenMPI and MPICH to work via srun. Here is what the environment looks like:

1. SLURM is version 21.08.6 and was installed by our HPC vendor. I asked about how SLURM was configured and they responded that SLURM was built straight from the tarball with "rpmbuild -tb slurm-21.08.6.tar.bz2".

2. I have installed pmix 2.2.5 on the system with the following configuration:  ./configure --enable-static --prefix=/apps/default --disable-debug

3. We are using LMOD and have both openmpi and mpich installed for each compiler. At the moment, I am focusing on the gcc11 compiler for each.

4. OpenMPI was built with the following configuration:  ./configure --prefix=/apps/software/gcc-11.3.0/openmpi/4.1.4 --enable-mpirun-prefix-by-default --enable-static --without-verbs --with-ucx --with-knem=/opt/knem-1.1.4.90mlnx1 CC=gcc --with-cuda CXX=g++ FC=gfortran --with-pmi=/apps/default

5. MPICH was built with the following configuration:  ./configure --prefix=/apps/software/gcc-11.3.0/mpich/4.0.3-srun --with-pmilib=slurm --with-pmi=pmi2 --with-slurm --with-pm=none FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch --with-ucx --with-pmix=/apps/default
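One quick sanity check for a setup like the above (a hedged sketch; the library paths below are inferred from the prefixes in steps 4 and 5 and may differ on the actual system) is to confirm which PMI libraries each MPI stack actually linked against:

```shell
# Show the PMI-related runtime dependencies of each MPI library.
# Adjust the paths to match the actual install prefixes.
ldd /apps/software/gcc-11.3.0/openmpi/4.1.4/lib/libmpi.so | grep -i pmi
ldd /apps/software/gcc-11.3.0/mpich/4.0.3-srun/lib/libmpich.so | grep -i pmi
```

If the MPI libraries resolve a PMIx that Slurm's own pmix plugin was not built against, initialization failures like the ones below are a common symptom.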


OpenMPI:

I am running OpenMPI as follows.

# salloc -N 5 --exclusive
salloc: Granted job allocation 42
# module load gcc/11.3.0
# module load openmpi/4.1.4 
# srun /admin/mpi_test/mpi_hello-gcc11-openmpi 
[warn] event_active: event has no event_base set.
[warn] event_active: event has no event_base set.
[warn] event_active: event has no event_base set.
[warn] event_active: event has no event_base set.
[warn] event_active: event has no event_base set.

After printing the warnings above, the job just hangs. Using an interactive job, I confirmed that the environment variables set by LMOD have been passed through properly. I also confirmed that each of the assigned compute nodes has the process running on it (though the processes are not accumulating any CPU time).


MPICH:

For MPICH, I am running it in the following manner.

# salloc -N 5 --exclusive
salloc: Granted job allocation 43
# srun --mpi=pmi2 /admin/mpi_test/mpi_hello-gcc11-srun 
Abort(672779791): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59)....: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(221): 
MPID_Init(359).......: 
MPIR_pmi_init(151)...: PMIX_Init returned -44
[The same Abort/PMIX_Init error repeats for each of the remaining four tasks.]
srun: error: cri22cn122: task 1: Exited with exit code 15
srun: error: cri22cn123: task 2: Exited with exit code 15
srun: error: cri22cn124: task 3: Exited with exit code 15
srun: error: cri22cn121: task 0: Exited with exit code 15
srun: error: cri22cn125: task 4: Exited with exit code 15


I have the following questions:

1. Is there anything apparent that I am doing wrong?

2. I am planning on building and installing slurm-22.05.6 on the cluster, but was trying to get MPI working with the previous install first. I plan to configure 22.05.6 as:  ./configure --with-ucx --with-pmix=/apps/default --enable-pam

Should I move forward with this before MPI is working? Also, is this the proper way to build SLURM to use PMIx?
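For the RPM-based route used in the original install, a sketch of passing the PMIx prefix through rpmbuild might look like the following (hedged: the `_with_pmix` define follows the convention in Slurm's bundled slurm.spec, but verify the option names against the spec file shipped in your tarball before relying on them):

```shell
# Build the 22.05 RPMs so configure is run with an explicit PMIx prefix.
# /apps/default is the PMIx install prefix from this ticket.
rpmbuild -tb slurm-22.05.6.tar.bz2 \
    --define '_with_pmix --with-pmix=/apps/default' \
    --define '_with_ucx --with-ucx'
```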

Any advice on this would be appreciated.

Mike
Comment 1 Skyler Malinowski 2022-12-09 12:49:53 MST
Hi Mike,

When Slurm is configured and compiled, it must be able to find the PMIx libraries in order to integrate properly, either from a default path or from one set explicitly via --with-pmix=PATH. Do you know if the RPMs were created on a machine with PMIx present? Our documentation has some information about [MPI](https://slurm.schedmd.com/mpi_guide.html#pmix) integration that discusses the build process.

If it was not built with PMIx support, then that would most likely be the cause of the problem. You can also run the following command to see which MPI plugins are available to the Slurm client:
srun --mpi=list
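When the pmix plugin was built in, the output looks something like the sketch below (hedged: the exact plugin names vary by Slurm version and build options; the important thing is whether a pmix entry appears at all):

```shell
$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: pmix
```

If pmix is missing from that list, srun cannot bootstrap a PMIx-built MPI regardless of how the MPI libraries themselves were configured.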

If the old (21.08) RPMs were built without PMIx support, then it may be easier to just build the new (22.05) RPMs with PMIx support and then go from there.

-- Skyler
Comment 2 Skyler Malinowski 2022-12-13 10:02:51 MST
Dropping severity, awaiting reply.
Comment 3 Skyler Malinowski 2023-01-03 11:31:16 MST
I will assume this is no longer an issue given the severity and lack of reply and close the ticket. If this is still an issue, please reopen the ticket and I will be more than happy to assist.

Thanks,
Skyler