Ticket 8785

Summary: Role of PMI_RANK environmental variable and whether it can be unset
Product: Slurm Reporter: BBP Administrator <bbp.administrator>
Component: Scheduling Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bbp.administrator
Version: 19.05.4   
Hardware: Linux   
OS: Linux   
Site: EPFL Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description BBP Administrator 2020-04-03 14:27:55 MDT
Dear Support team,

On our system we are using HPE's MPT MPI library and see the following behaviour:

1. Outside a Slurm allocation: an MPI-linked job can be run without an MPI launcher

$ ./hello_hpempi_221
Hello world from rank 0 on processor bbpv1.epfl.ch and core # 0


2. Inside a Slurm allocation: an MPI-linked job can NOT be run without an MPI launcher

$ ./hello_hpempi_221
MPT ERROR: PMI2_Init


3. But if I unset the PMI_RANK variable inside the Slurm allocation, everything works fine:

$ unset PMI_RANK
$ ./hello_hpempi_221
Hello world from rank 0 on processor r1i4n10 and core # 0


Typically we always use an MPI launcher for parallel / MPI-linked applications. But there are certain use cases where people would like to use an MPI-linked library with Python modules. In that case, Python or IPython is launched without an MPI launcher and results in the above-mentioned error.

The suggestion/solution from the MPT developers is to unset the PMI_RANK variable (e.g. for serial launches).

I would like to understand:

* What is the role of the PMI_RANK env variable under Slurm?
* If it is unset, what is the effect on MPI-linked applications launched via an MPI launcher?
* If it is unset, what is the effect on MPI-linked applications launched without an MPI launcher, i.e. serial execution?

I have seen that if I do a single allocation with salloc and then launch multiple srun commands, each srun gets a different PMI_RANK. If I unset PMI_RANK, does this affect anything?

This clarification will help us to solve software deployment issues we have on our cluster.

Thank you!
Comment 1 Felip Moll 2020-04-06 10:09:11 MDT
(In reply to BBP Administrator from comment #0)
> Dear Support team,
> 
> On our system we are using HPE's MPT MPI library and have following
> behaviour:
> 
> 1. Outside slurm allocation : MPI linked job can be run without MPI launcher
> 
> $ ./hello_hpempi_221
> Hello world from rank 0 on processor bbpv1.epfl.ch and core # 0

I'd like to see an strace of this execution. I am curious whether HPE's MPT MPI is calling PMI2_Init or initializing the PMI_RANK variables itself. Does HPE MPT MPI use Slurm's pmi2 or its own implementation? Can you also check whether your hello_hpempi_221 runs in an environment with PMI_RANK set?

> 2. Inside slurm allocation : MPI linked job can NOT be run without MPI
> launcher
> 
> $ ./hello_hpempi_221
> MPT ERROR: PMI2_Init

Can you show me exactly the process you follow until you call the binary? Is this an salloc or an sbatch? Do you load any environment modules?
Also an 'ldd hello_hpempi_221' would be useful.

Is this the exact only error line you get?

> The suggestion/solution from MPT developers is to unset PMI_RANK variable
> (e.g. for serial launcher).
> 
> I would like to understand:
> 
> * What is role of PMI_RANK env variable under SLURM?

Slurm's PMI2 implementation in contribs/pmi2 uses the PMI_RANK environment variable to identify the rank of the running task. The rank is set in the environment, among other env vars, by the PMI2 plugin under src/plugins/pmi2 when a single task is started by slurmstepd.

OpenMPI, for example, can be compiled and linked against Slurm's PMI2 implementation, which makes apps built with it use Slurm's pmi2. Then, when running with srun --mpi=pmi2, the plugin is used to set up the tasks.

I am curious why unsetting PMI_RANK before running the application prevents the issue, because in theory it is the pmi2 call that sets the rank when it is executed.

Here is the related path called from slurmstepd when starting a new task:

exec_task() -> _setup_mpi() -> mpi_hook_slurmstepd_task(job_mpi_info, &job->env)
                                                -> env_array_overwrite_fmt(env, "PMI_RANK", "%u", job->gtaskid); #We set others too: PMI_FD, PMI_JOBID, PMI_RANK, PMI_SIZE
                                                -> mpi_hook_slurmstepd_init(env) -> slurmstepd_init() -> mpi_hook_slurmstepd_init()

Once set up, the libpmi linked from Slurm's contribs/pmi2 reads PMI_RANK when it initializes, and then internally sets the variable PMI2_rank, which can later be queried with standard PMI2 API calls like PMI2_Job_GetRank.

contribs/pmi2/pmi2_api.c:

PMI2_Init()
            pmiid = getenv("PMI_RANK");
            if (pmiid) {
                init_kv_str(&pairs[npairs], PMIRANK_KEY, pmiid);
                PMI2_rank = strtol(pmiid, NULL, 10);
                ++npairs;
            }


So to summarize: slurmstepd sets PMI_RANK in the specific task's environment. The task then initializes pmi2, which allows the application to call the PMI2 API.
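As a rough illustration only (Slurm's actual implementation is the C code from contribs/pmi2/pmi2_api.c quoted above; this Python sketch just mirrors its rank lookup):

```python
import os

def pmi_rank_from_env(env=None):
    """Sketch of how a PMI2 client picks up its task rank: read PMI_RANK
    from the environment and parse it as a base-10 integer. Returns None
    when the variable is absent, i.e. the task was not started by
    slurmstepd / an MPI launcher."""
    env = os.environ if env is None else env
    pmiid = env.get("PMI_RANK")
    if pmiid is None:
        return None
    return int(pmiid, 10)

# Under `srun --mpi=pmi2` each task would see its global task id:
print(pmi_rank_from_env({"PMI_RANK": "3", "PMI_SIZE": "4"}))  # 3
# Outside any launcher nothing is set:
print(pmi_rank_from_env({}))  # None
```

This also shows why an interactive shell that inherits PMI_RANK=0 confuses the library: the variable alone makes it look like a launcher-started task.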


All of this happens when your app is linked against Slurm's PMI2 and run with srun --mpi=pmi2.
If it is linked against PMI-1 (which is deprecated), PMI_RANK is also used for communication with slurmstepd: the socket opened on the node matches the task rank id.
A quick look at PMIx doesn't show me any use of the PMI_RANK env variable.

> * If it is unset, what is effect on mpi linked applications launched via mpi
> launcher?

I would need a bit of clarification on what "mpi launcher" means. Can you explain it a bit more?

> * If it is unset, what is effect on mpi linked applications launched without
> mpi launcher i.e. serial execution?
> I have seen that if I do single allocation with salloc and then launch
> multiple srun commands then each srun gets different PMI_RANK. If I unset
> PMI_RANK, does this affect anything?

It shouldn't affect anything, because every srun will spawn a task whose rank is set by its own stepd.
The question is where this initial PMI_RANK comes from.



At first glance, I think removing the "initial" PMI_RANK shouldn't harm anything.
Please answer my questions and I will continue analyzing the issue.

Thanks!
Comment 3 BBP Administrator 2020-04-16 18:15:10 MDT
> I'd like to see an strace of this execution.

Sure!

```
kumbhar@r1i7n21:~$ strace ./hello_221
execve("./hello_221", ["./hello_221"], [/* 124 vars */]) = 0
brk(NULL)                               = 0x602000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedb01000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/tls/x86_64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/tls/x86_64", 0x7fffffffb7f0) = -1 ENOENT (No such file or directory)
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/tls/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/tls", 0x7fffffffb7f0) = -1 ENOENT (No such file or directory)
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/x86_64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/x86_64", 0x7fffffffb7f0) = -1 ENOENT (No such file or directory)
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib", {st_mode=S_IFDIR|S_ISGID|0755, st_size=4096, ...}) = 0
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=232622, ...}) = 0
mmap(NULL, 232622, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fffedac8000
close(3)                                = 0
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260l\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=141968, ...}) = 0
mmap(NULL, 2208904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffed6c7000
mprotect(0x7fffed6de000, 2093056, PROT_NONE) = 0
mmap(0x7fffed8dd000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7fffed8dd000
mmap(0x7fffed8df000, 13448, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffed8df000
close(3)                                = 0
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libcpuset.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/libcpuset.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360B\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=59464, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedac7000
mmap(NULL, 2152768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffed4b9000
mprotect(0x7fffed4c5000, 2097152, PROT_NONE) = 0
mmap(0x7fffed6c5000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xc000) = 0x7fffed6c5000
close(3)                                = 0
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libbitmask.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/libbitmask.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\20\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=15736, ...}) = 0
mmap(NULL, 2109576, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffed2b5000
mprotect(0x7fffed2b8000, 2093056, PROT_NONE) = 0
mmap(0x7fffed4b7000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7fffed4b7000
close(3)                                = 0
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libmpi.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\201\3\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=11222272, ...}) = 0
mmap(NULL, 4107312, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffececa000
mprotect(0x7fffed097000, 2093056, PROT_NONE) = 0
mmap(0x7fffed296000, 28672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1cc000) = 0x7fffed296000
mmap(0x7fffed29d000, 97328, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffed29d000
close(3)                                = 0
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240%\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2151672, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedac6000
mmap(NULL, 3981792, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffecafd000
mprotect(0x7fffeccbf000, 2097152, PROT_NONE) = 0
mmap(0x7fffecebf000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c2000) = 0x7fffecebf000
mmap(0x7fffecec5000, 16864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffecec5000
close(3)                                = 0
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libdl.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\r\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=19288, ...}) = 0
mmap(NULL, 2109712, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffec8f9000
mprotect(0x7fffec8fb000, 2097152, PROT_NONE) = 0
mmap(0x7fffecafb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7fffecafb000
close(3)                                = 0
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/librt.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/librt.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340!\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=43776, ...}) = 0
mmap(NULL, 2128920, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffec6f1000
mprotect(0x7fffec6f8000, 2093056, PROT_NONE) = 0
mmap(0x7fffec8f7000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7fffec8f7000
close(3)                                = 0
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220*\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=88776, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedac5000
mmap(NULL, 2184192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffec4db000
mprotect(0x7fffec4f0000, 2093056, PROT_NONE) = 0
mmap(0x7fffec6ef000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x7fffec6ef000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedac4000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedac2000
arch_prctl(ARCH_SET_FS, 0x7fffedac2740) = 0
mprotect(0x7fffecebf000, 16384, PROT_READ) = 0
mprotect(0x7fffec6ef000, 4096, PROT_READ) = 0
mprotect(0x7fffed8dd000, 4096, PROT_READ) = 0
mprotect(0x7fffec8f7000, 4096, PROT_READ) = 0
mprotect(0x7fffecafb000, 4096, PROT_READ) = 0
mprotect(0x7fffed4b7000, 4096, PROT_READ) = 0
mprotect(0x7fffed6c5000, 4096, PROT_READ) = 0
mprotect(0x7fffed296000, 8192, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ)     = 0
mprotect(0x7fffedb04000, 4096, PROT_READ) = 0
munmap(0x7fffedac8000, 232622)          = 0
set_tid_address(0x7fffedac2a10)         = 63005
set_robust_list(0x7fffedac2a20, 24)     = 0
rt_sigaction(SIGRTMIN, {0x7fffed6cd790, [], SA_RESTORER|SA_SIGINFO, 0x7fffed6d65d0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x7fffed6cd820, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7fffed6d65d0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=300000*1024, rlim_max=RLIM64_INFINITY}) = 0
futex(0x7fffecafc0b0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
brk(NULL)                               = 0x602000
brk(0x623000)                           = 0x623000
lstat("/gpfs", {st_mode=S_IFDIR|0755, st_size=25, ...}) = 0
lstat("/gpfs/bbp.cscs.ch", {st_mode=S_IFDIR|0755, st_size=83, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd", {st_mode=S_IFDIR|0755, st_size=262144, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0", {st_mode=S_IFDIR|0775, st_size=16384, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u", {st_mode=S_IFDIR|S_ISGID|0755, st_size=4096, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib", {st_mode=S_IFDIR|S_ISGID|0755, st_size=4096, ...}) = 0
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libmpi.so", {st_mode=S_IFLNK|0777, st_size=12, ...}) = 0
readlink("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libmpi.so", "libmpi_mt.so", 4095) = 12
lstat("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libmpi_mt.so", {st_mode=S_IFREG|0755, st_size=11222272, ...}) = 0
open("/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libpmi2.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=232622, ...}) = 0
mmap(NULL, 232622, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fffedac8000
close(3)                                = 0
open("/lib64/libpmi2.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\22\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=239232, ...}) = 0
mmap(NULL, 2199040, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffec2c2000
mprotect(0x7fffec2c9000, 2097152, PROT_NONE) = 0
mmap(0x7fffec4c9000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x7fffec4c9000
mmap(0x7fffec4cb000, 65024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffec4cb000
close(3)                                = 0
mprotect(0x7fffec4c9000, 4096, PROT_READ) = 0
munmap(0x7fffedac8000, 232622)          = 0
write(11, "cmd=init pmi_version=2 pmi_subve"..., 40) = -1 EBADF (Bad file descriptor)
write(2, "MPT ERROR: PMI2_Init\n", 21MPT ERROR: PMI2_Init
)  = 21
exit_group(-1)                          = ?
+++ exited with 255 +++
```

> Does HPE MPT MPI use Slurm's pmi2 or its own implementation?. 

It uses PMI2.

> Can you also check if your hello_hpempi_221 runs in an environment with PMI_RANK set?

I should have mentioned this before! On our system we have the following in the Slurm config:

SallocDefaultCommand="/usr/bin/srun -n1 -N1 --propagate=ALL --pty --preserve-env --mem-per-cpu=0 --mpi=pmi2 --gres=gpu:0 $SHELL -l"

IIRC this was set to let users land on a compute node by default. But it then also sets all the Slurm env variables, including PMI_RANK.


> Can you show me exactly the process you follow until you call the binary? Is this an > salloc or an sbatch? Do you load any environment modules?
> Also an 'ldd hello_hpempi_221' would be useful.

I think the SallocDefaultCommand clarifies the situation, but here is what I do:

kumbhar@bbpv1:~$ salloc -A proj16 -N 1 --ntasks-per-node=36 --exclusive  -p prod  -C "cpu|nvme" --time=8:00:00
salloc: Granted job allocation 507422
Hostname: r1i7n21
User: kumbhar
bash: /usr/local/bin/admin.sh: No such file or directory
kumbhar@r1i7n21:~$ env | grep PMI
PMI_SIZE=1
PMI_RANK=0
PMI_JOBID=507422.0
PMI_FD=11
kumbhar@r1i7n21:~$ ldd hello_221
	linux-vdso.so.1 =>  (0x00007fffedb02000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fffed6c7000)
	libcpuset.so.1 => /lib64/libcpuset.so.1 (0x00007fffed4b9000)
	libbitmask.so.1 => /lib64/libbitmask.so.1 (0x00007fffed2b5000)
	libmpi.so => /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/tools/2020-02-01/linux-rhel7-x86_64/gcc-8.3.0/hpe-mpi-2.21-7pbszh6v5u/lib/libmpi.so (0x00007fffececa000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fffecafd000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fffed8e3000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fffec8f9000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fffec6f1000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fffec4db000)
kumbhar@r1i7n21:~$ ./hello_221
MPT ERROR: PMI2_Init


> Once set up, the libpmi linked from Slurm's contribs/pmi2 uses the PMI_RANK first 
> when it inits, and then it sets internally the variable PMI2_rank which can be later > queried with standar PMI2 api calls, like PMI2_Job_GetRank.

OK, understood. Thank you very much for the detailed information.

> I would need a bit of clarification on what "mpi launcher" means. Can you explain me a bit more?

What I meant was: in my case, after salloc I land on a compute node where PMI_RANK is already set to 0. So the question was whether there would be any issue if I unset PMI_RANK and then run an MPI job with srun (which you have already answered).

Just to give you some additional context on why this question arose:

* We have certain C++ application libraries that are linked against the MPI library
* These C++ libraries are also used/linked with Python modules (say foo).
* When users allocate a node in an interactive session (with PMI_RANK set), they start Python and do "import foo"
* As Python is launched without srun, "import foo" fails there with the "MPT ERROR: PMI2_Init" error

And hence the question.
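For reference, a minimal sketch of the workaround the MPT developers suggested (unsetting PMI_RANK before the MPI-linked module is imported; clearing the other PMI_* variables from the `env` output above is an extra assumption of this sketch, not part of their advice):

```python
import os

# Drop the PMI_* variables that the SallocDefaultCommand's srun exported
# into the interactive shell, so the MPT library does not attempt PMI2_Init.
# Unsetting PMI_RANK is the MPT developers' suggestion; removing the other
# three is speculative housekeeping.
for var in ("PMI_RANK", "PMI_FD", "PMI_JOBID", "PMI_SIZE"):
    os.environ.pop(var, None)

# ...after which `import foo` (the MPI-linked Python module) should load
# without the "MPT ERROR: PMI2_Init" failure.
```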
Comment 4 Felip Moll 2020-04-22 06:13:27 MDT
Yeah, you found the issue:

SallocDefaultCommand="/usr/bin/srun -n1 -N1 --propagate=ALL --pty --preserve-env --mem-per-cpu=0 --mpi=pmi2 --gres=gpu:0 $SHELL -l"

This forces pmi2 to be initialized in the $SHELL run by srun inside salloc, and sets the PMI_RANK variable.

I recommend setting --mpi=none here, as in the examples shown in man slurm.conf. Further srun commands inside the salloc will then need to specify --mpi=pmi2 explicitly, or set the SLURM_MPI_TYPE variable beforehand, unless MpiDefault=pmi2 is set in slurm.conf, which is what I really recommend.

To summarize I would:

- Change SallocDefaultCommand so it does not initialize pmi2 on its own
- Set MpiDefault=pmi2 in slurm.conf if desired
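Concretely, a sketch of the change in slurm.conf (the SallocDefaultCommand line is your current one with only --mpi changed; verify against your site config):

```
MpiDefault=pmi2
SallocDefaultCommand="/usr/bin/srun -n1 -N1 --propagate=ALL --pty --preserve-env --mem-per-cpu=0 --mpi=none --gres=gpu:0 $SHELL -l"
```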

You already know the other workarounds.

Does it make sense?
Comment 5 Felip Moll 2020-04-27 05:51:42 MDT
Since we have the diagnosis, the workaround and the solution, and since it seems there are no more questions, I am closing the bug as INFOGIVEN.

Please, mark it as OPEN again if it is still unresolved for you.

Thanks!
Felip
Comment 6 BBP Administrator 2020-05-25 14:34:55 MDT
Thanks Felip. Now that we understand the issue, we will fix this in our application workflow.