Ticket 6392

Summary: PMIX 3.1.1 incompatible with slurm.
Product: Slurm Reporter: Greg Wickham <greg.wickham>
Component: Build System and Packaging Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex, artpol84, broderick, karasev.b, rhc
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
Site: KAUST Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Greg Wickham 2019-01-22 22:06:57 MST
Our apps team requested that Slurm be built with PMIx 3.1.1; however, it isn't possible.

I'm unsure whether the developers are aware of the PMIx 3.1.1 incompatibility, hence raising this ticket.

In the file:

	slurm-18.08.4/src/plugins/mpi/pmix/pmixp_client.c

line 147 (first instance):

	PMIX_VAL_SET(&kvp->value, flag, 0);

“PMIX_VAL_SET” is a macro from /usr/include/pmix_common.h (version 2.2.1)

In version 3.1.1 it is missing.

Digging further it is pmix commit 47b8a8022a9d6cea8819c4365afd800b047c508e 
(Sun Aug 12 11:27:28 2018 -0700) that removes the macros.

[The issue is also present in the git head]

 -Greg
Comment 2 Greg Wickham 2019-01-23 04:20:18 MST
From a follow-up on slurm-users, the issue was due to the removal of non-standard PMIx APIs (which Slurm uses).

Refer to https://github.com/pmix/pmix/issues/1082

  -g
Comment 3 Alejandro Sanchez 2019-01-23 04:23:57 MST
Yeah, I also detected something was wrong back in Nov. 2018 and opened an issue to coordinate with the Open MPI folks here:

https://github.com/open-mpi/ompi/issues/6095#issuecomment-440075506

Let me do some tests now and see how we should proceed.
Comment 4 Alejandro Sanchez 2019-01-23 06:04:53 MST
These are my tests with:

PMIx 3.1 at
https://github.com/pmix/pmix/commit/ed763d698127497c72c614d65bdc47f1a33617bc

Slurm 18.08 at
https://github.com/SchedMD/slurm/commit/18952af9413636120e708db9327d9c9530bb

OpenMPI v4.0.x at
https://github.com/open-mpi/ompi/commit/3ef8a8b253ac0df8d1f38717c2c0cf6ff6ae

(Slurm configured with multiple slurmds option).

alex@polaris:~/t$ srun --mpi=pmix_v3 -n2 -N2 /home/alex/repos/pmix/build/3.1/test/pmix_client -n 2 --job-fence -c
==3865== OK
==3866== OK
alex@polaris:~/t$

alex@polaris:~/t$ srun --mpi=pmix_v3 -N2 -n4 --ntasks-per-node=2 mpi/xthi
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  polaris
  System call: unlink(2) /dev/shm/vader_segment.polaris.f976504a.0
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  polaris
  System call: unlink(2) /dev/shm/vader_segment.polaris.f976504a.1
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
Hello from rank 0, thread 0, on compute1. (core affinity = 0,4)
Hello from rank 0, thread 1, on compute1. (core affinity = 0,4)
Hello from rank 2, thread 0, on compute2. (core affinity = 0,4)
Hello from rank 2, thread 1, on compute2. (core affinity = 0,4)
Hello from rank 1, thread 1, on compute1. (core affinity = 1,5)
Hello from rank 1, thread 0, on compute1. (core affinity = 1,5)
Hello from rank 3, thread 0, on compute2. (core affinity = 1,5)
Hello from rank 3, thread 1, on compute2. (core affinity = 1,5)
alex@polaris:~/t$

alex@polaris:~/slurm/18.08/slurm/testsuite/expect$ ./test1.88
============================================
TEST: 1.88
spawn /home/alex/repos/ompi/install/v4.0.x/bin/mpicc -o test1.88.prog test1.88.prog.c
spawn /home/alex/slurm/18.08/polaris/bin/sbatch -N1-6 -n6 --output=test1.88.output --error=test1.88.error -t1 test1.88.input
Submitted batch job 20005
Job 20005 is in state PENDING, desire DONE
Job 20005 is in state RUNNING, desire DONE
Job 20005 is in state RUNNING, desire DONE
Job 20005 is in state RUNNING, desire DONE
Job 20005 is in state RUNNING, desire DONE
Job 20005 is in state RUNNING, desire DONE
Job 20005 is in state RUNNING, desire DONE
Job 20005 is in state RUNNING, desire DONE
Job 20005 is in state RUNNING, desire DONE
Job 20005 is in state RUNNING, desire DONE
Job 20005 is DONE (TIMEOUT)
spawn cat test1.88.output
Wed 23 Jan 2019 01:37:40 PM CET
test1_N3_n6_cyclic

FAILURE: No MPI communications occurred
  The version of MPI you are using may be incompatible with the configured switch
  Core files may be present from failed MPI tasks

spawn head test1.88.error
[polaris:03298] *** Process received signal ***
[polaris:03298] Signal: Segmentation fault (11)
[polaris:03298] Signal code: Address not mapped (1)
[polaris:03298] Failing at address: 0x940
[polaris:03298] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12670)[0x7f687b1a1670]
[polaris:03298] [ 1] /home/alex/repos/ompi/install/v4.0.x/lib/openmpi/mca_btl_vader.so(+0x4924)[0x7f687992b924]
[polaris:03298] [ 2] /home/alex/repos/ompi/install/v4.0.x/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f687ae8bf3c]
[polaris:03298] [ 3] /home/alex/repos/ompi/install/v4.0.x/lib/libmpi.so.40(ompi_request_default_wait_all+0xcd)[0x7f687b21a6dd]
[polaris:03298] [ 4] /home/alex/repos/ompi/install/v4.0.x/lib/libmpi.so.40(PMPI_Waitall+0x7f)[0x7f687b25d7cf]
[polaris:03298] [ 5] /home/alex/slurm/18.08/slurm/testsuite/expect/./test1.88.prog(+0x12e9)[0x5586680992e9]
Check contents of test1.88.error
alex@polaris:~/slurm/18.08/slurm/testsuite/expect$

alex@polaris:~/slurm/18.08/slurm/testsuite/expect$ sudo coredumpctl gdb
           PID: 3298 (test1.88.prog)
           UID: 1000 (alex)
           GID: 1000 (alex)
        Signal: 11 (SEGV)
     Timestamp: Wed 2019-01-23 13:37:40 CET (2min 4s ago)
  Command Line: /home/alex/slurm/18.08/slurm/testsuite/expect/./test1.88.prog
    Executable: /home/alex/slurm/18.08/slurm/testsuite/expect/test1.88.prog
...
(gdb) thread apply all bt

Thread 2 (Thread 0x7f687a5e0700 (LWP 3305)):
#0  0x00007f687b0bcb39 in __GI___poll (fds=0x7f6870000b20, nfds=1, timeout=3599928) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f687a8437c8 in ?? () from /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6
#2  0x00007f687a83a329 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6
#3  0x00007f687ae9199e in progress_engine () from /home/alex/repos/ompi/install/v4.0.x/lib/libopen-pal.so.40
#4  0x00007f687b196fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007f687b0c77ef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f687a5f8680 (LWP 3298)):
#0  0x00007f687992b924 in mca_btl_vader_component_progress () from /home/alex/repos/ompi/install/v4.0.x/lib/openmpi/mca_btl_vader.so
#1  0x00007f687ae8bf3c in opal_progress () from /home/alex/repos/ompi/install/v4.0.x/lib/libopen-pal.so.40
#2  0x00007f687b21a6dd in ompi_request_default_wait_all () from /home/alex/repos/ompi/install/v4.0.x/lib/libmpi.so.40
#3  0x00007f687b25d7cf in PMPI_Waitall () from /home/alex/repos/ompi/install/v4.0.x/lib/libmpi.so.40
#4  0x00005586680992e9 in pass_its_neighbor ()
#5  0x0000558668099395 in main ()
(gdb)
Comment 5 Alejandro Sanchez 2019-01-23 06:12:36 MST
Setting OMPI_MCA_btl=self,tcp seems to make the errors go away. Not sure if that only masks the problem, though:

alex@polaris:~/t$ OMPI_MCA_btl=self,tcp srun --mpi=pmix_v3 -N2 -n4 --ntasks-per-node=2 mpi/xthi2
Hello from rank 0, thread 0, on polaris. (core affinity = 0,4)
Hello from rank 0, thread 1, on polaris. (core affinity = 0,4)
Hello from rank 2, thread 0, on polaris. (core affinity = 0,4)
Hello from rank 2, thread 1, on polaris. (core affinity = 0,4)
Hello from rank 1, thread 0, on polaris. (core affinity = 1,5)
Hello from rank 1, thread 1, on polaris. (core affinity = 1,5)
Hello from rank 3, thread 0, on polaris. (core affinity = 1,5)
Hello from rank 3, thread 1, on polaris. (core affinity = 1,5)
alex@polaris:~/t$
Comment 6 Artem Polyakov 2019-01-23 06:42:50 MST
Alejandro, thank you for the analysis.

At first sight this looks like a btl/vader issue, and the fact that disabling it helps backs that up.
To verify, it makes sense to launch the same test with mpirun.

As far as I know, our test suite shows no issues with v3.1 after the build fix that was mentioned earlier. But we need to double check that now. Boris will take it tomorrow.

Also, it is not clear to me which tests you are using: mpi/xthi & test1.88 do not ring a bell to me.
Comment 7 Alejandro Sanchez 2019-01-23 07:52:14 MST
test1.88 [1] is one of the many Expect tests in the testsuite shipped with Slurm that we use to exercise various use cases. The testsuite isn't meant to be executed on production machines. test1.88 exercises MPI functionality via srun: it submits a batch script with -N1-6 (nodes) and -n6 (tasks) which in turn requests several srun steps with a subset of those resources and different -m task distribution methods (block, cyclic, etc.), all executing the same test1.88.prog.c program [2], which pings peer neighbors using the MPI_Irecv/MPI_Isend functions.

Interestingly, I've now configured 6 Slurm nodes and re-executed this test, and it completed with SUCCESS; previously, when I got the SEGV, I only had 2 nodes configured, so it might be an edge case of how this program was designed, or of its interaction with the mentioned vader BTL. I'm inclined to think it's a vader BTL issue (I'm not very familiar with these OMPI terms) because simple mpi_hello, xthi, or similar programs I use to test MPI (besides the shipped test1.88) also fail in the unlink() syscall on the /dev/shm/vader_segment.polaris.f976504a.* vader segments, unless I set OMPI_MCA_btl=self,tcp as mentioned. xthi is just a simple mpi_hello that also spawns OpenMP threads, i.e. a simple hybrid MPI/OpenMP program.

Note that when configuring Slurm with the multiple-slurmd option, I have different internal Slurm nodes NodeName=compute[1-6] sharing the same NodeHostname=polaris and listening on different ports (Port=61201-61206, respectively). I also need to set slurm.conf TmpFS like TmpFS=/home/alex/slurm/18.08/polaris/spool/slurmd-tmpfs-%n and then create one subdirectory per compute node to make PMIx work with multiple slurmds. Looking at the unlink() OMPI error, I'm wondering whether the vader_segment.polaris segments should be named after their own NodeName (compute[1-6]) instead of NodeHostname (polaris), but that's something that escapes my limited MPI knowledge.
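[Editor's note: the setup described above corresponds roughly to a slurm.conf fragment like the following, assembled from the values in this comment; see the slurm.conf man page for exact syntax.]

```
# Six simulated nodes on the single physical host "polaris"
NodeName=compute[1-6] NodeHostname=polaris Port=61201-61206
# Per-node TmpFS so each simulated slurmd gets its own spool dir (%n = NodeName)
TmpFS=/home/alex/slurm/18.08/polaris/spool/slurmd-tmpfs-%n
```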

So I'm not sure whether we can assume the mentioned combination of PMIx/Slurm/OMPI works as expected (and so we can close this bug), with only this specific vader BTL mechanism for single-node multi-process shared-memory transfers failing, or whether this requires further investigation.

Thanks for looking into this.

[1] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/testsuite/expect/test1.88

[2] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/testsuite/expect/test1.88.prog.c
Comment 8 Alejandro Sanchez 2019-01-23 07:58:31 MST
Ah, forgot to test using mpirun instead of srun. It seems there's no difference:

alex@polaris:~/t$ salloc -N2 -n4 --ntasks-per-node=2
salloc: Granted job allocation 20014
salloc: Waiting for resource configuration
salloc: Nodes compute[1-2] are ready for job
alex@polaris:~/t$ ~/repos/ompi/install/v4.0.x/bin/mpirun mpi/mpi_hello
Hello world from process 0 of 4
Hello world from process 3 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  polaris
  System call: unlink(2) /dev/shm/vader_segment.polaris.92f80001.0
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
[polaris:11125] 1 more process has sent help message help-opal-shmem-mmap.txt / sys call fail
[polaris:11125] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
alex@polaris:~/t$
Comment 9 Alejandro Sanchez 2019-01-23 08:05:53 MST
Also, if I remove --ntasks-per-node=2 from the request, srun works fine but mpirun does not (in both cases without touching the OMPI_MCA_btl env var):

alex@polaris:~/t$ srun --mpi=pmix_v3 -N2 -n4 mpi/xthi2
Hello from rank 3, thread 1, on polaris. (core affinity = 0,4)
Hello from rank 3, thread 0, on polaris. (core affinity = 0,4)
Hello from rank 2, thread 0, on polaris. (core affinity = 2,6)
Hello from rank 2, thread 1, on polaris. (core affinity = 2,6)
Hello from rank 1, thread 1, on polaris. (core affinity = 1,5)
Hello from rank 1, thread 0, on polaris. (core affinity = 1,5)
Hello from rank 0, thread 0, on polaris. (core affinity = 0,4)
Hello from rank 0, thread 1, on polaris. (core affinity = 0,4)
alex@polaris:~/t$ srun --mpi=pmix_v3 -N2 -n4 mpi/xthi
Hello from rank 1, thread 1, on compute1. (core affinity = 1,5)
Hello from rank 1, thread 0, on compute1. (core affinity = 1,5)
Hello from rank 2, thread 0, on compute1. (core affinity = 2,6)
Hello from rank 2, thread 1, on compute1. (core affinity = 2,6)
Hello from rank 3, thread 0, on compute2. (core affinity = 0,4)
Hello from rank 3, thread 1, on compute2. (core affinity = 0,4)
Hello from rank 0, thread 0, on compute1. (core affinity = 0,4)
Hello from rank 0, thread 1, on compute1. (core affinity = 0,4)
alex@polaris:~/t$ srun --mpi=pmix_v3 -N2 -n4 mpi/mpi_hello
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 0 of 4
alex@polaris:~/t$ salloc -N2 -n4
salloc: Granted job allocation 20021
salloc: Waiting for resource configuration
salloc: Nodes compute[1-2] are ready for job
alex@polaris:~/t$ ~/repos/ompi/install/v4.0.x/bin/mpirun mpi/mpi_hello
Hello world from process 3 of 4
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  polaris
  System call: unlink(2) /dev/shm/vader_segment.polaris.89850001.0
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
[polaris:12296] 1 more process has sent help message help-opal-shmem-mmap.txt / sys call fail
[polaris:12296] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
alex@polaris:~/t$

So I'm inclined to close this bug as resolved/infogiven as per the pmix fix:

https://github.com/pmix/pmix/commit/ed763d698127497c72c614d65bdc47f1a33617bc

since things seem to work properly unless you think differently.
Comment 10 Ralph Castain 2019-01-23 08:45:52 MST
Vader is a shared memory messaging plugin, and so it only operates between processes on the same node. The fact that it works when you only run one process/node indicates that the problem is indeed with vader.

Note that OMPI doesn't detect the srun command option --ntasks-per-node. You have to set that on the mpirun cmd line itself using the OMPI syntax.
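[Editor's note: for example, a per-node process count would be expressed on the mpirun command line itself; this is a sketch using Open MPI's --map-by ppr:N:node mapping syntax.]

```
mpirun --map-by ppr:2:node mpi/xthi
```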

The root cause here lies right here:

"Note when configuring Slurm with multiple-slurmd's option, I've different NodeName=compute[1-6] internal Slurm nodes sharing the same NodeHostname=polaris"

The shared memory backing file is based on the node name. Your simulated nodes all share that name (i.e., the same name is returned by the system call gethostname), and so the files conflict. Hence the problem.

I believe the vader author has modified the name of the backing file to resolve that problem, but it likely hasn't been released yet. You should check with them.
Comment 11 Alejandro Sanchez 2019-01-23 09:11:41 MST
Jobs set the SLURMD_NODENAME output env var to identify the defined NodeName:

alex@polaris:~/t$ salloc -N1
salloc: Granted job allocation 20029
salloc: Waiting for resource configuration
salloc: Nodes compute1 are ready for job
alex@polaris:~/t$ srun hostname
polaris
alex@polaris:~/t$ srun printenv SLURMD_NODENAME
compute1
alex@polaris:~/t$

I guess Nathan Hjelm, Howard Pritchard or Geoff Paulsen are involved in opal/mca/btl/vader development. I'm gonna try to reach them and see if they know anything about the shm segment naming conventions. Thanks for your feedback.
Comment 12 Ralph Castain 2019-01-23 09:14:07 MST
Those are the correct people to contact. Note that I wrote most of the Slurm integration and we definitely ignore the SLURMD_NODENAME envar.
Comment 13 Alejandro Sanchez 2019-01-23 10:22:49 MST
Nathan mentioned the filename is created as follows:

rc = opal_asprintf(&sm_file, "%s" OPAL_PATH_SEP "vader_segment.%s.%x.%d", 
                   mca_btl_vader_component.backing_directory, 
                   opal_process_info.nodename, 
                   OPAL_PROC_MY_NAME.jobid, 
                   MCA_BTL_VADER_LOCAL_RANK);

In comment 4 I could see:

Local host:  polaris
  System call: unlink(2) /dev/shm/vader_segment.polaris.f976504a.0


Local host:  polaris
  System call: unlink(2) /dev/shm/vader_segment.polaris.f976504a.1

So I guess there's no need to change opal_process_info.nodename to SLURMD_NODENAME, since the files differ even on the same node: the differentiating factor is MCA_BTL_VADER_LOCAL_RANK.

I'm gonna close this as resolved/infogiven. Thanks for your feedback. This ompi issue can be closed too:

https://github.com/open-mpi/ompi/issues/6095