| Summary: | MPI_COMM_SPAWN fails with srun | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | issp2020support |
| Component: | Other | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.02.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=10092 | ||
| Site: | U of Tokyo | | |
|
Description
issp2020support
2020-10-12 23:42:41 MDT
Could you give me an update? Sorry for the late response, I was out last week.

Felip Moll:

(In reply to issp2020support from comment #1)
> Could you give me update?

This is something related to OpenMPI. I am doing some tests. Will let you know about it as soon as possible.

Felip Moll:

With Intel MPI it just works:

```
]$ srun -n4 ./master
I'm parent
I'm parent
I'm parent
I'm parent
I'm 0 of 1
I'm spawned by MPI_Comm_spawn
Master received value: 12345
I'm 0 of 1
I'm spawned by MPI_Comm_spawn
Master received value: 12345
I'm 0 of 1
I'm spawned by MPI_Comm_spawn
Master received value: 12345
I'm 0 of 1
I'm spawned by MPI_Comm_spawn
Master received value: 12345
```

Felip Moll:

Hi,

I investigated this issue further and also talked with the OpenMPI developers.

The point is that nowadays Slurm's PMI and PMI-2 APIs accept and handle comm_spawn() calls, but back when OpenMPI's Slurm PMI support was originally written, Slurm's PMI did not support comm_spawn, so OpenMPI did not implement it. Slurm has since changed, but there has been no effort in OpenMPI to implement this. So basically, if OpenMPI detects PMI or PMI-2, it will not pass the call to Slurm. Under PMIx it will pass the request to the Slurm daemon, but the Slurm PMIx plugin does not support that call and returns "not supported", so it does not work in that scenario either.

Note also that Intel MPI has implemented support for dynamic processes (mpi_comm_spawn()), but only in its most recent version, Intel MPI Library 2019 Update 8 (see the release notes item "PMI2 spawn support"). It is not in the 2021 (Beta) series until Beta Update 7 either.

To summarize: mpi_comm_spawn() is not supported with OpenMPI + Slurm because OpenMPI will not pass that call to Slurm's PMI or PMI-2, and the PMIx plugin does not support it either.

Does it make sense?

issp2020support:

Thank you for the update. I understand that I need to use "mpirun" for mpi_comm_spawn.

In this situation, I think I cannot get process accounting information such as CPU time and memory usage because I don't use "srun". Is that correct? My customer wants that information.
And could you give me permission to see bug #10092?

> You are not authorized to access bug #10092.

Felip Moll:

(In reply to issp2020support from comment #6)
> Thank you for the update.
> I understand that I need to use "mpirun" for mpi_comm_spawn.
>
> In this situation, I think I cannot get process accounting information such
> as cputime and memory usage because I don't use "srun".
> Is it correct?
> My customer wants those info.
>
> And could you give me permission to see bug #10092?
> >You are not authorized to access bug #10092.

You will have accounting for the job but not for the sub-processes, because you won't launch Slurm steps. Correct.

Why does your customer need to use mpi_comm_spawn()? Doesn't he have any other way to spawn processes?

issp2020support:

I don't know the reason at this moment. I will tell him about the limitation of OpenMPI with mpi_comm_spawn() and ask him to use the latest Intel MPI.

Felip Moll:

(In reply to issp2020support from comment #8)
> I don't know the reason at this moment.
> I will tell him the limitation WRT OpenMPI with mpi_comm_spawn() and ask him
> to use the latest Intel MPI.

OK. I am closing this issue; we have opened bug 10092 to document all of this. Thanks, and don't hesitate to ask again if more questions arise.
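For reference, the kind of test program discussed above can be written as a single self-spawning binary. This is only a sketch, not the reporter's actual code (the original `master` source is not attached to this report); the `12345` payload and the printed lines mirror the Intel MPI output shown above. The `MPI_Comm_spawn` call is the one that OpenMPI never forwards to Slurm's PMI/PMI-2, and that the Slurm PMIx plugin answers with "not supported".

```c
/* spawn_test.c -- sketch of a parent that spawns one child copy of itself.
 * Build and run (hypothetical session):
 *   mpicc spawn_test.c -o master
 *   srun -n4 ./master        (works with recent Intel MPI; fails with OpenMPI)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Launched by srun/mpirun: act as the parent. */
        int value = 0;
        printf("I'm parent\n");

        /* Spawn one child running this same binary. Under OpenMPI + Slurm
         * this call is the unsupported operation discussed above. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);

        MPI_Recv(&value, 1, MPI_INT, 0, 0, inter, MPI_STATUS_IGNORE);
        printf("Master received value: %d\n", value);
    } else {
        /* Created via MPI_Comm_spawn: act as the child. */
        int rank, size, value = 12345;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("I'm %d of %d\nI'm spawned by MPI_Comm_spawn\n", rank, size);
        MPI_Send(&value, 1, MPI_INT, 0, 0, parent);
    }

    MPI_Finalize();
    return 0;
}
```

Each parent rank spawns its own single-rank child over `MPI_COMM_SELF`, which is why every child reports "I'm 0 of 1" in the output above.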
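On the accounting point: since processes created by mpi_comm_spawn() under mpirun are not Slurm steps, sacct will only show the job-level records, not per-spawn usage. A command sketch (the job id 1234 is hypothetical):

```shell
# Inspect accounting for a job that used mpirun + MPI_Comm_spawn.
# Only the job/batch records appear; spawned children are ordinary forked
# processes of the job, so they have no step records of their own.
sacct -j 1234 --format=JobID,JobName,Elapsed,CPUTime,MaxRSS

# By contrast, work launched with srun inside the job shows up as its own
# step (e.g. 1234.0) with separate CPUTime and MaxRSS.
```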