Ticket 9039 - Build openmpi (3.1.6) with pmix and ucx support
Summary: Build openmpi (3.1.6) with pmix and ucx support
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: PMIx
Version: 19.05.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-05-13 08:18 MDT by Damien
Modified: 2020-07-13 05:04 MDT
CC List: 1 user

See Also:
Site: Monash University
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmd log for m3a012 (test node) (1.70 MB, text/plain)
2020-05-29 10:21 MDT, Damien
Details
The correct slurmd.log (1.71 MB, text/plain)
2020-05-29 10:35 MDT, Damien
Details
Today's slurm log on Test Node (12.27 KB, application/x-gzip)
2020-06-13 00:52 MDT, Damien
Details
Core file generated during this run (4.76 MB, application/x-core)
2020-06-13 01:00 MDT, Damien
Details
Latest slurmd from Test node (m3a012) (25.88 KB, application/x-gzip)
2020-06-19 10:41 MDT, Damien
Details

Description Damien 2020-05-13 08:18:45 MDT
Hi Slurm Support

We are trying to build and test a version of openmpi (3.1.6) with pmix and ucx support on a test node, but have run into the issue below.


Slurm build:
./configure --prefix=/opt/slurm-19.05.04 --with-munge=/opt/munge --enable-pam --with-pmix=/usr/local/pmix/3.1.4  --with-ucx=/usr/local/ucx/1.8.0


openmpi build:
./configure --prefix=/usr/local/openmpi/3.1.6-ucx --with-slurm --with-pmix=/usr/local/pmix/3.1.4 --enable-static  --enable-shared --enable-mpi-fortran --with-libevent --with-ucx=/usr/local/ucx/1.8.0 --enable-wrapper-runpath --with-hwloc 



It runs correctly with srun and mpirun combined:

$ srun --nodelist=m3a012 --ntasks=4 mpirun -np 4 --mca btl self --mca pml ucx -x UCX_TLS=mm --bind-to core --map-by core --display-map ./a.out  
srun: job 14076970 queued and waiting for resources
srun: job 14076970 has been allocated resources
 Data for JOB [52197,1] offset 0 Total slots allocated 4
 Data for JOB [52194,1] offset 0 Total slots allocated 4
 Data for JOB [52195,1] offset 0 Total slots allocated 4

 ========================   JOB MAP   ========================

 Data for node: m3a012	Num slots: 4	Max slots: 0	Num procs: 4
 	Process OMPI jobid: [52195,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
 	Process OMPI jobid: [52195,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
 	Process OMPI jobid: [52195,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
 	Process OMPI jobid: [52195,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]

 =============================================================
 Data for JOB [52192,1] offset 0 Total slots allocated 4

 ========================   JOB MAP   ========================

 Data for node: m3a012	Num slots: 4	Max slots: 0	Num procs: 4
 	Process OMPI jobid: [52192,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
 	Process OMPI jobid: [52192,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
 	Process OMPI jobid: [52192,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
 	Process OMPI jobid: [52192,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]

 =============================================================

 ========================   JOB MAP   ========================

 Data for node: m3a012	Num slots: 4	Max slots: 0	Num procs: 4
 	Process OMPI jobid: [52197,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
 	Process OMPI jobid: [52197,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
 	Process OMPI jobid: [52197,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
 	Process OMPI jobid: [52197,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]

 =============================================================

 ========================   JOB MAP   ========================

 Data for node: m3a012	Num slots: 4	Max slots: 0	Num procs: 4
 	Process OMPI jobid: [52194,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
 	Process OMPI jobid: [52194,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
 	Process OMPI jobid: [52194,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
 	Process OMPI jobid: [52194,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]

 =============================================================
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
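A side note on the output above: "srun --ntasks=4 mpirun -np 4 ..." launches four independent copies of mpirun (one per Slurm task), which is why four separate JOB MAPs appear. The distinct launch modes look roughly like this (a sketch; binary names and options follow the examples in this ticket):

```shell
# Indirect launch: salloc reserves 4 slots and runs a single mpirun,
# which itself spawns the 4 ranks (avoids the duplicated mpiruns above)
salloc -n 4 mpirun -np 4 ./a.out

# Direct launch through Slurm's pmix plugin (requires Slurm built --with-pmix)
srun --mpi=pmix -n 4 ./a.out

# Direct launch through Slurm's pmi2 library (requires Open MPI built --with-pmi=$SLURM)
srun --mpi=pmi2 -n 4 ./a.out
```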


That works, but when I run srun standalone (direct launch, without mpirun), I encounter this:


[damienl@m3a012 tmp]$ srun --time=00:00:10 --nodelist=m3a012 --ntasks=4  ./c.out 
[m3a012:23420] OPAL ERROR: Not initialized in file ext2x_client.c at line 112
[m3a012:23421] OPAL ERROR: Not initialized in file ext2x_client.c at line 112
[m3a012:23422] OPAL ERROR: Not initialized in file ext2x_client.c at line 112
[m3a012:23423] OPAL ERROR: Not initialized in file ext2x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[m3a012:23420] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------


How should I investigate this? And which library versions are known to work together?


Kindly advise.


Many Thanks

Damien
Comment 1 Damien 2020-05-13 08:25:52 MDT
More Details:

$ module load openmpi/3.1.6-ucx
$ srun -V
slurm 19.05.4
$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmix
srun: pmi2
srun: openmpi



$ ucx_info -v
# UCT version=1.7.0 revision b02bab9
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni



$ ompi_info 
                 Package: Open MPI root@m3a012 Distribution
                Open MPI: 3.1.6
  Open MPI repo revision: v3.1.6
   Open MPI release date: Mar 18, 2020
                Open RTE: 3.1.6
  Open RTE repo revision: v3.1.6
   Open RTE release date: Mar 18, 2020
                    OPAL: 3.1.6
      OPAL repo revision: v3.1.6
       OPAL release date: Mar 18, 2020
                 MPI API: 3.1.0
            Ident string: 3.1.6
                  Prefix: /usr/local/openmpi/3.1.6-ucx
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: m3a012
           Configured by: root
           Configured on: Wed May 13 23:13:32 AEST 2020
          Configure host: m3a012
  Configure command line: '--prefix=/usr/local/openmpi/3.1.6-ucx'
                          '--with-slurm' '--with-pmix=/usr/local/pmix/latest'
                          '--enable-static' '--enable-shared'
                          '--enable-mpi-fortran' '--with-libevent'
                          '--with-ucx=/usr/local/ucx/1.8.0'
                          '--enable-wrapper-runpath' '--with-hwloc'
                Built by: root
                Built on: Wed May 13 23:24:43 AEST 2020
              Built host: m3a012
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (limited: overloading)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: no
 Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /bin/gcc
  C compiler family name: GNU
      C compiler version: 4.8.5
            C++ compiler: g++
   C++ compiler absolute: /bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /bin/gfortran
         Fort ignore TKR: no
   Fort 08 assumed shape: no
      Fort optional args: no
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: no
      Fort BIND(C) (all): no
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): no
       Fort TYPE,BIND(C): no
 Fort T,BIND(C,name="a"): no
            Fort PRIVATE: no
          Fort PROTECTED: no
           Fort ABSTRACT: no
       Fort ASYNCHRONOUS: no
          Fort PROCEDURE: no
         Fort USE...ONLY: no
           Fort C_FUNLOC: no
 Fort f08 using wrappers: no
         Fort MPI_SIZEOF: no
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: no
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
          MPI extensions: affinity, cuda
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v3.1.6)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v3.1.6)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA btl: self (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v3.1.6)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v3.1.6)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v3.1.6)
               MCA event: external (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v3.1.6)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v3.1.6)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v3.1.6)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v3.1.6)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA pmix: ext2x (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v3.1.6)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v3.1.6)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA dfs: app (MCA v2.1.0, API v1.0.0, Component v3.1.6)
                 MCA dfs: orted (MCA v2.1.0, API v1.0.0, Component v3.1.6)
                 MCA dfs: test (MCA v2.1.0, API v1.0.0, Component v3.1.6)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v3.1.6)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v3.1.6)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v3.1.6)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v3.1.6)
              MCA errmgr: dvm (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v3.1.6)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v3.1.6)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v3.1.6)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v3.1.6)
            MCA notifier: syslog (MCA v2.1.0, API v1.0.0, Component v3.1.6)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA oob: ud (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v3.1.6)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v3.1.6)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v3.1.6)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v3.1.6)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v3.1.6)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v3.1.6)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v3.1.6)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v3.1.6)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v3.1.6)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v3.1.6)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v3.1.6)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v3.1.6)
               MCA state: dvm (MCA v2.1.0, API v1.0.0, Component v3.1.6)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v3.1.6)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v3.1.6)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v3.1.6)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v3.1.6)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA coll: spacc (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
               MCA fcoll: static (MCA v2.1.0, API v2.0.0, Component v3.1.6)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
                  MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                  MCA io: romio314 (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v3.1.6)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v3.1.6)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v3.1.6)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v3.1.6)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v3.1.6)
Comment 2 Damien 2020-05-13 08:29:13 MDT
mpicc  /usr/local/hpcx/2.5.0-redhat7.7/ompi/tests/examples/hello_c.c  -o  c.out


$ cat  /usr/local/hpcx/2.5.0-redhat7.7/ompi/tests/examples/hello_c.c 


#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size, len;
    char version[MPI_MAX_LIBRARY_VERSION_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_library_version(version, &len);
    printf("Hello, world, I am %d of %d, (%s, %d)\n",
           rank, size, version, len);
    MPI_Finalize();

    return 0;
}
Comment 3 Damien 2020-05-13 18:46:42 MDT
If you can point me in the right direction, or to other sites' practice in this matter, I would be grateful.
Comment 5 Broderick Gardner 2020-05-14 11:08:18 MDT
I have been looking into this by testing different openmpi configurations related to yours, and in my tests, they all work so far. I don't currently have UCX installed, but that should not be related.

Do you have a configured MpiDefault in slurm.conf?
$ scontrol show config | grep Mpi

Since you intend to use pmix, that should be the value of MpiDefault, or specify it using --mpi=pmix.

If that doesn't work, we need to try pmi2. Make sure Slurm's pmi2 library is installed from the contribs directory and recompile openmpi with --with-pmi=$SLURM. Then recompile the test binary and run:
$ srun --mpi=pmi2 -N4 -n4 hello
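When testing, it can help to confirm which PMI flavor each task actually received: Slurm's pmix plugin exports PMIX_RANK into the task environment, while the pmi2 library exports PMI_RANK. A small helper for classifying an environment dump might look like this (a sketch; "classify_pmi" is a hypothetical helper, not part of Slurm or Open MPI):

```shell
#!/bin/sh
# Classify a task's PMI flavor from a dump of its environment:
# the pmix plugin exports PMIX_RANK, Slurm's pmi2 library exports PMI_RANK.
classify_pmi() {
    if printf '%s\n' "$1" | grep -q '^PMIX_RANK='; then
        echo pmix
    elif printf '%s\n' "$1" | grep -q '^PMI_RANK='; then
        echo pmi2
    else
        echo none
    fi
}

# Collect a dump inside a step with, e.g.:
#   srun --mpi=pmi2 -n 1 sh -c 'env | sort' > step_env.txt
classify_pmi "$(printf 'PMIX_RANK=0\nPMIX_NAMESPACE=x')"   # prints: pmix
```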


My tests:
slurm configure: --prefix=$SLURM --with-pmix=$PMIX

openmpi configure: --prefix=$OMPI --with-pmix=$PMIX --with-hwloc=/usr --with-libevent=/usr  --enable-static --enable-shared --enable-mpi-fortran --enable-wrapper-runpath --with-slurm [--with-pmi=$SLURM]
(I did the same tests with and without the option in brackets, but you should try with that -- make sure Slurm's libpmi2.so is built and installed from contribs)

(--mpi=pmix overrides any slurm.conf MpiDefault)
$ srun --mpi=pmix_v3 -N4 hello_mon
Hello, world, I am 3 of 4, (Open MPI v3.1.6rc3, package: Open MPI broderick@caesar Distribution, ident: 3.1.6rc3, repo rev: v3.1.6, Unreleased developer copy, 130)
Hello, world, I am 1 of 4, (Open MPI v3.1.6rc3, package: Open MPI broderick@caesar Distribution, ident: 3.1.6rc3, repo rev: v3.1.6, Unreleased developer copy, 130)
Hello, world, I am 0 of 4, (Open MPI v3.1.6rc3, package: Open MPI broderick@caesar Distribution, ident: 3.1.6rc3, repo rev: v3.1.6, Unreleased developer copy, 130)
Hello, world, I am 2 of 4, (Open MPI v3.1.6rc3, package: Open MPI broderick@caesar Distribution, ident: 3.1.6rc3, repo rev: v3.1.6, Unreleased developer copy, 130)
Comment 6 Damien 2020-05-14 16:53:13 MDT
Hi Broderick,

Thanks for your reply.

Is there a reason why ucx is not included in your tests? We are testing ucx because it is reported to give better performance; I am not too sure about its adoption in the wider openmpi community.


Our default is pmi2:

$ scontrol show config |grep pmi
MpiDefault              = pmi2


I am running the tests you mentioned now.



Cheers
Damien
Comment 7 Damien 2020-05-17 09:55:11 MDT
Hi Guys,

After multiple rounds of trial, error, and testing, the combination below seems to work on my test node. I am running more MPI tests right now.



My working combinations:


Slurm build:
./configure --prefix=/opt/slurm-19.05.04 --with-munge=/opt/munge --enable-pam --with-pmix=/usr/local/pmix/3.1.4  --with-ucx=/usr/local/ucx/1.8.0


# /opt/slurm-latest is sym-linked to /opt/slurm-19.05.04
 

openmpi build:
./configure --prefix=/usr/local/openmpi/3.1.6-ucx --with-slurm --with-pmix=/usr/local/pmix/3.1.4 --enable-static  --enable-shared --enable-mpi-fortran --with-libevent --with-ucx=/usr/local/ucx/1.8.0 --enable-wrapper-runpath --with-hwloc --without-verbs --with-pmi=/opt/slurm-latest



I really wonder why. I have already specified "--with-pmix=/usr/local/pmix/3.1.4" for openmpi, so why do I still need "--with-pmi=/opt/slurm-latest" as well? Or am I mistaken about this? The only difference in this test setup is the inclusion of "--with-ucx" for both Slurm and openmpi, as we want to try out this better-performing library.


Our hardware setup is not conventional: we are not running a traditional IB network but a RoCE v1 network (RDMA over Converged Ethernet). Does this make a difference?

I will set up another 3 VMs in a separate cluster and test again with the same slurm/openmpi/ucx configuration, to see whether I get the same results.



Cheers

Damien
Comment 8 Broderick Gardner 2020-05-18 09:18:31 MDT
(In reply to Damien from comment #7)
> I am really wonder why? In my openmpi, I have already specified
> "--with-pmix=/usr/local/pmix/3.1.4" , why do I still need
> "--with-pmi=/opt/slurm-latest" again, Or am I mistaken about this. The only
> different in this test setup, is the includes of "--with-ucx" for both Slurm
> and openmpi, as we want to test out this better performing library.
The only difference from what? The configure lines for slurm and openmpi in the first post on this ticket also have "--with-ucx". Do you mean you already had openmpi with pmix working, and you started having problems when adding ucx support? Just seeking clarification.

To use full pmix, pmix must be selected on the Slurm side (--mpi=pmix or MpiDefault=pmix). When using pmi2, you are expected to link openmpi against Slurm's libpmi2.so (--with-pmi=$SLURM) if you want to use srun rather than mpirun.

Now, as in my tests, and because pmix is backward compatible with pmi2, "--mpi=pmi2" should work with openmpi's pmix anyway (meaning whatever pmix openmpi has linked against). I have not been able to replicate your issue so far, but it could be some incompatibility between pmix and pmi2 on your system (perhaps due to UCX, the interconnect, or other hardware). That is what I would like to isolate, as I don't know why I can't reproduce your original issue right now.

So what works now? Are you still using --mpi=pmi2? Detail is appreciated, including what srun commands you are using to test.

Thanks

* I am looking into UCX; the only reason I haven't tested with it yet is that I never have before and thus didn't have it installed. We don't expect it to be the cause of your problems, but if everything started from adding UCX, then we may revise that opinion.
Comment 9 Damien 2020-05-18 10:15:51 MDT
Hi Broderick,

Many Thanks for your efforts and findings,


Yes, you are right: my 'srun --mpi=pmix' test is broken. Details:
srun --mpi=pmix  -n 4 --time=00:00:10 --nodelist=m3a012  ./abc.out 
.....
srun: Force Terminated job 14121359
srun: Job step 14121359.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
[damienl@m3a012 tmp]$ 




The working ones are:

[damienl@m3a012 tmp]$ srun --mpi=pmi2 -n 4 --nodelist=m3a012  ./abc.out 
srun: job 14092496 queued and waiting for resources
srun: job 14092496 has been allocated resources
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106) 



[damienl@m3a012 tmp]$ srun --nodelist=m3a012 --ntasks=4 mpirun -np 4 --mca btl self --mca pml ucx -x UCX_TLS=mm --bind-to core --map-by core --display-map ./abc.out 
 Data for JOB [62271,1] offset 0 Total slots allocated 4
 Data for JOB [62269,1] offset 0 Total slots allocated 4
 ========================  JOB MAP  ========================
 Data for node: m3a012	Num slots: 4	Max slots: 0	Num procs: 4
 	Process OMPI jobid: [62271,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
 	Process OMPI jobid: [62271,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
 	Process OMPI jobid: [62271,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
 	Process OMPI jobid: [62271,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]
.....
.....
.....
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)

[damienl@m3a012 tmp]$ 
[damienl@m3a012 tmp]$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmix
srun: pmi2
srun: openmpi



As mentioned before, we are running on a RoCE v1 network (RDMA over Converged Ethernet).

So from the above examples, my pmix setup is not working. I need to investigate why, and how to troubleshoot it.



Cheers

Damien
Comment 10 Damien 2020-05-18 17:55:41 MDT
Hi Broderick,

As mentioned, this could be some incompatibility between pmix and pmi2 on our side.


We are using:
--
slurm-19.05.04
pmix/3.1.4
--

Is there a Slurm-preferred version? Something like what ships under its ../src/contribs directory.


I am not sure how to troubleshoot this, other than running more MPI tests with different sets of parameters.



Cheers

Damien
Comment 11 Broderick Gardner 2020-05-27 15:31:53 MDT
To get more information about pmix failing, make sure SlurmdDebug=debug is set, then collect the slurmctld.log and the slurmd.log from a node where the job failed to launch.

(In reply to Damien from comment #10)
> Is there a Slurm preferred version ? Like under it's ../src/contribs
> directory.
There is not currently a preferred version. The latest pmix 3.x release should work.

Thanks
Comment 12 Damien 2020-05-29 10:21:26 MDT
Created attachment 14445 [details]
slurmd log  for m3a012 (test node)
Comment 13 Damien 2020-05-29 10:23:29 MDT
Hi Broderick

I hope that this is helpful in this investigation:

---

[root@m3a012 etc]# pwd
/opt/slurm-latest/etc
[root@m3a012 etc]# grep SlurmdDebug slurm.conf 
SlurmdDebug=debug



#srun mpi test 

[damienl@m3a012 ~]$ 
[damienl@m3a012 ~]$ srun --reservation=AWX --mpi=pmix  -n 4 --time=00:00:10 --nodelist=m3a012  ./abc.out 
....
....
Waiting 


[ec2-user@m3-login2 ~]$ squeue -u damienl
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          14361387      comp  abc.out  damienl  R       1:37      1 m3a012




[damienl@m3a012 ~]$ 
[damienl@m3a012 ~]$ srun --reservation=AWX --mpi=pmix  -n 4 --time=00:00:10 --nodelist=m3a012  ./abc.out 

srun: Force Terminated job 14361387
srun: Job step 14361387.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete


Attaching slurmd.log
Comment 14 Damien 2020-05-29 10:27:07 MDT
Comment on attachment 14445 [details]
slurmd log  for m3a012 (test node)

Attached the wrong log
Comment 15 Damien 2020-05-29 10:31:46 MDT
Hi Broderick

Kindly ignore my previous message; these are the proper ones.


---

[root@m3a012 etc]# pwd
/opt/slurm-latest/etc
[root@m3a012 etc]# grep SlurmdDebug slurm.conf 
SlurmdDebug=debug



#srun mpi test 

[damienl@m3a012 ~]$ 
[damienl@m3a012 ~]$ srun --reservation=AWX --mpi=pmix  -n 4 --time=00:00:10 --nodelist=m3a012  ./a.out 
....
....
Waiting 


[ec2-user@m3-login2 ~]$ squeue -u damienl
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          14361491      comp    a.out  damienl  R       1:18      1 m3a012


[damienl@m3a012 ~]$ srun --reservation=AWX --mpi=pmix  -n 4 --time=00:00:10 --nodelist=m3a012  ./a.out 

srun: Force Terminated job 14361491
srun: Job step 14361491.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
[damienl@m3a012 ~]$ 


Attaching the correct slurmd.log
Comment 16 Damien 2020-05-29 10:35:11 MDT
Created attachment 14447 [details]
The correct slurmd.log
Comment 17 Damien 2020-05-29 10:37:35 MDT
This is the working example without the pmix flag:

[damienl@m3a012 ~]$ 
[damienl@m3a012 ~]$ srun --reservation=AWX   -n 4 --time=00:00:10 --nodelist=m3a012  ./a.out 
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)




Kindly advise


Thanks

Damien
Comment 18 Damien 2020-05-29 10:54:30 MDT
slurmctld.log:
---

[2020-05-30T02:28:04.425]   gres_used:gpu:0
[2020-05-30T02:28:04.425] _fill_in_gres_fields JobId=14361491 gres_req:NONE gres_alloc:
[2020-05-30T02:28:04.425] select_nodes: JobId=14361491 gres:NONE gres_alloc:
[2020-05-30T02:28:04.427] sched: _slurm_rpc_allocate_resources JobId=14361491 NodeList=m3a012 usec=9628
[2020-05-30T02:28:05.131] job_submit/lua: /opt/slurm-19.05.4/etc/job_submit.lua: non-numeric return code
[2020-05-30T02:28:05.144] _slurm_rpc_submit_batch_job: JobId=14361492 InitPrio=72945 usec=13659

.....
.....
[2020-05-30T02:30:31.046] Time limit exhausted for JobId=14361491
[2020-05-30T02:30:43.768] _slurm_rpc_complete_job_allocation: JobId=14361491 error Job/step already completing or completed
Comment 19 Damien 2020-05-30 07:54:31 MDT
Hi 

I believe 'ucx' is working for us.

Details:

[damienl@m3a012 ~]$ 
[damienl@m3a012 ~]$ srun --nodelist=m3a012 --ntasks=2 mpirun -np 2 --mca btl self --mca pml ucx -x UCX_LOG_LEVEL=debug  --map-by core --display-map ./a.out 
srun: job 14372968 queued and waiting for resources
srun: job 14372968 has been allocated resources
 Data for JOB [58554,1] offset 0 Total slots allocated 2

 ========================   JOB MAP   ========================

 Data for node: m3a012	Num slots: 2	Max slots: 0	Num procs: 2
 	Process OMPI jobid: [58554,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B][.]
 	Process OMPI jobid: [58554,1] App: 0 Process rank: 1 Bound: socket 1[core 1[hwt 0]]:[.][B]

 =============================================================
 Data for JOB [58555,1] offset 0 Total slots allocated 2

 ========================   JOB MAP   ========================

 Data for node: m3a012	Num slots: 2	Max slots: 0	Num procs: 2
 	Process OMPI jobid: [58555,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B][.]
 	Process OMPI jobid: [58555,1] App: 0 Process rank: 1 Bound: socket 1[core 1[hwt 0]]:[.][B]

 =============================================================
[1590846758.194227] [m3a012:29247:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(self/memory knem/memory); 
[1590846758.194506] [m3a012:29247:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(posix/memory knem/memory); 
[1590846758.195442] [m3a012:29244:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(self/memory knem/memory); 
[1590846758.195821] [m3a012:29244:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(posix/memory knem/memory); 
[1590846758.197027] [m3a012:29246:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(self/memory knem/memory); 
[1590846758.197245] [m3a012:29246:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(posix/memory knem/memory); 
[1590846758.197857] [m3a012:29245:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(self/memory knem/memory); 
[1590846758.198083] [m3a012:29245:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(posix/memory knem/memory); 
Hello, world, I am 1 of 2, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 2, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 2, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 2, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
[damienl@m3a012 ~]$ 

---


So the problem is still with pmix.



Cheers

Damien
Comment 20 Felip Moll 2020-06-10 10:49:14 MDT
Hi Damien,

I am taking over this ticket from now on.

I don't like the following:

[2020-05-30T02:14:06.329] [14361387.0] debug:  (null) [0] mpi_pmix.c:153 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: start
[2020-05-30T02:14:06.330] [14361387.0] debug:  mpi/pmix: setup sockets
.....
[2020-05-30T02:15:25.789] debug:  _step_connect: connect() failed dir /opt/slurm/var/spool node m3a012 step 14361387.0 Connection refused
.....
[2020-05-30T02:16:17.842] debug:  _step_connect: connect() failed dir /opt/slurm/var/spool node m3a012 step 14361387.0 Connection refused


It seems pmix is not able to set up its sockets properly. This may or may not have something to do with RoCE v1.

Just to check it again, can you send me your config.log from openmpi and slurm?
In comment 1 I saw the output of ompi_info, and --with-pmix was pointing to pmix/latest instead of pmix/3.1.4.
I just want to be sure you have openmpi and slurm properly compiled/configured while I look at ways to debug the "setup sockets" issue.
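In the meantime, one quick thing to verify on the compute node (a sketch; the path comes from the _step_connect log line above, and the per-step socket naming is an assumption) is the state of the slurmd spool directory where those step sockets live:

```shell
# Sketch: inspect the spool dir that _step_connect failed to connect in.
# Path taken from the "connect() failed dir /opt/slurm/var/spool" log line.
SPOOL=/opt/slurm/var/spool
if [ -d "$SPOOL" ]; then
    ls -ld "$SPOOL"     # ownership and permissions of the directory itself
    ls -l "$SPOOL"      # per-step sockets should appear here while a step runs
else
    echo "spool dir not found: $SPOOL"
fi
```

A missing, wrongly-owned, or non-writable spool directory would explain the "Connection refused" errors without involving the network at all.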

Thanks
Comment 21 Damien 2020-06-10 18:49:23 MDT
Hi Felip

Good to hear from you.

Our ucx and pmix versions are symlinked to their latest installed versions. For example:
----
[damienl@m5-login5 ~]$ cd /usr/local/pmix/
[damienl@m5-login5 pmix]$ 
[damienl@m5-login5 pmix]$ ll
total 2
drwxr-xr-x 7 damienl  systems          7 Jun 19  2019 3.1.2
lrwxrwxrwx 1 root     root             5 Jun 19  2019 latest -> 3.1.2
drwxrwxr-x 6 damienl  systems          6 Dec  6  2018 v2.2 


damienl@m5-login5 ucx]$ 
[damienl@m5-login5 ucx]$ pwd
/usr/local/ucx
[damienl@m5-login5 ucx]$ ll
total 3
drwxr-xr-x 6 damienl systems 6 Nov  7  2019 1.6.1
drwxr-xr-x 6 damienl  systems 7 May  7 11:34 1.8.0
lrwxrwxrwx 1 root     root    5 May  7 11:34 latest -> 1.8.0
----

This makes management easier if we need to roll out newer versions in the future.

Will it be a problem if I compile against the 'latest' symlink?
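For what it's worth, the usual way to keep such a pointer safe to update is an atomic flip with `ln -sfn`; a self-contained sketch in a throwaway directory (the version names are only examples):

```shell
# Sketch: atomically repoint a 'latest' symlink, as done under /usr/local/pmix.
set -e
base=$(mktemp -d)                # stand-in for /usr/local/pmix
mkdir "$base/3.1.2" "$base/3.1.4"
ln -s 3.1.2 "$base/latest"       # original state: latest -> 3.1.2
ln -sfn 3.1.4 "$base/latest"     # -n replaces the link itself instead of descending into it
readlink "$base/latest"          # prints: 3.1.4
rm -rf "$base"
```

Note that with --enable-wrapper-runpath, binaries built against 'latest' bake that path into their runpath, so flipping the link later silently changes which library existing binaries load; that is the flexibility you want, but it is worth remembering when debugging version mismatches.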


Happy to provide you with more logs for this investigation. 



Thanks

Damien
Comment 22 Felip Moll 2020-06-11 05:52:12 MDT
> Will this be a problem if I compile them via their sym-link 'latest' ?

That's not a problem, I just wanted to be sure that 'latest' was a correct symlink.

If you can send me your config.log from openmpi and slurm it would be great. 

Also, if you log into a node, run a job, and then see this:

[2020-05-30T02:14:06.330] [14361387.0] debug:  mpi/pmix: setup sockets

could you attach gdb to the slurmstepd process and run a 'thread apply all bt full' to obtain a dump? I want to see where slurmstepd is stuck: in your logs, [14361387.0] stops writing anything after the "mpi/pmix: setup sockets" line, so I suspect there are stuck step 0 processes on the node.

I will wait for your feedback.

Thanks!
Comment 23 Damien 2020-06-11 06:53:22 MDT
Hi Felip,

I am not very familiar with 'gdb' usage.


Can you advise me on how to get a core dump/log from this task?
--
gdb to slurmstepd and do a 'thread apply all bt full'
--

We have prepared a test node for this.



Thanks

Damien
Comment 24 Felip Moll 2020-06-11 10:16:55 MDT
(In reply to Damien from comment #23)
> Hi Felip,
> 
> I am not very familiar with 'gdb' usage.
> 
> 
> Can you advise me on how to get core dump/log from this task ?
> --
> gdb to slurmstepd and do a 'thread apply all bt full'
> --

Sure, it is quite straightforward:

1. In the node where a job is 'stuck', identify the step .0, for example pid 109353 here is from job 2820 step 0:
]$ ps aux|grep slurmstepd
root      109336  0.2  0.0 282692  6232 ?        Sl   17:56   0:01 slurmstepd: [2820.extern]
root      109353  0.2  0.0 416216  6608 ?        Sl   17:56   0:01 slurmstepd: [2820.0]
lipi      109865  0.0  0.0 216236  2388 pts/7    S+   18:03   0:00 grep --color=auto slurmstepd

2. Since this process runs as root, we will need to attach to this process as root:

]$ su
Password: ***
]#

3. Attach to the process with gdb:
]# gdb attach 109353

When you are inside gdb and successfully attached to the process, you can generate a dump which will show us where the process is in the code:

> bt

We can also see which threads are alive:

> info threads

And we can see where every thread is in the code:

> thread apply all bt

When you're done:

> quit

Fyi, while you are attached to a process it is completely under gdb control, so it does not run any further instructions until explicitly told to with continue, next, nexti, step or stepi. In our example we don't need the process to proceed because we just want to see where it is in the code.

To summarize, I need you to run and then copy paste the output here:

> bt
> info threads
> thread apply all bt


If you prefer, this can all be done in one line, putting the output directly into a file:

]# gdb -ex "set confirm off" -ex "bt" -ex "info threads" -ex "thread apply all bt" -ex "quit" -p <the_pid_of_slurmstepd_step_0> > /tmp/output_gdb.txt
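The pid lookup can be scripted as well; a small sketch (the "slurmstepd: [jobid.0]" process-title pattern is an assumption based on the ps output in step 1, and stepd_bt is just an illustrative name):

```shell
# Sketch: dump backtraces for the slurmstepd that runs step 0 of a job.
stepd_bt() {
    jobid=$1
    # Match the process title shown by ps, e.g. "slurmstepd: [2820.0]"
    pid=$(pgrep -f "slurmstepd: \[$jobid\.0\]" | head -n1)
    if [ -z "$pid" ]; then
        echo "no slurmstepd for step $jobid.0"
        return 1
    fi
    # -batch exits after running the -ex commands; run this as root
    gdb -batch -ex "bt" -ex "info threads" -ex "thread apply all bt" \
        -p "$pid" > /tmp/output_gdb.txt 2>&1
    echo "wrote /tmp/output_gdb.txt"
}
```

Usage on the node while the step is stuck, as root: `stepd_bt 14361387`.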


> We have prepare a test node for this test.

Ok!
Comment 25 Damien 2020-06-12 10:30:07 MDT
Hi Felip, 

I hope this is helpful:



[root@m3a012 ~]# ps -ef |grep slurm
root     27976     1  0 May30 ?        00:00:14 /opt/slurm-19.05.4/sbin/slurmd
root     28816     1  0 02:27 ?        00:00:00 slurmstepd: [14760527.extern]
root     28838 28754  0 02:27 pts/1    00:00:00 grep --color=auto slurm
[root@m3a012 ~]# gdb attach 28816
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 28816
Reading symbols from /opt/slurm-19.05.4/sbin/slurmstepd...done.
Reading symbols from /opt/slurm-19.05.4/lib/slurm/libslurmfull.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/libslurmfull.so
Reading symbols from /usr/lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libdl.so.2
Reading symbols from /usr/lib64/libhwloc.so.5...Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /usr/lib64/libpam.so.0...Reading symbols from /usr/lib64/libpam.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam.so.0
Reading symbols from /usr/lib64/libpam_misc.so.0...Reading symbols from /usr/lib64/libpam_misc.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam_misc.so.0
Reading symbols from /usr/lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libutil.so.1
Reading symbols from /usr/lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 28820]
[New LWP 28819]
[New LWP 28818]
[New LWP 28817]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /usr/lib64/libpthread.so.0
Reading symbols from /usr/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /usr/lib64/libltdl.so.7...Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libltdl.so.7
Reading symbols from /usr/lib64/libaudit.so.1...Reading symbols from /usr/lib64/libaudit.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaudit.so.1
Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgcc_s.so.1
Reading symbols from /usr/lib64/libcap-ng.so.0...Reading symbols from /usr/lib64/libcap-ng.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcap-ng.so.0
Reading symbols from /usr/lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_files.so.2
Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cons_res.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cons_res.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/auth_munge.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/auth_munge.so
Reading symbols from /opt/munge-0.5.11/lib/libmunge.so.2...done.
Loaded symbols for /opt/munge-0.5.11/lib/libmunge.so.2
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_energy_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_profile_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_interconnect_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_interconnect_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_filesystem_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/gres_gpu.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/gres_gpu.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/jobacct_gather_cgroup.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/jobacct_gather_cgroup.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/core_spec_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/core_spec_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/proctrack_cgroup.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/proctrack_cgroup.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/task_affinity.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/task_affinity.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/task_cgroup.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/task_cgroup.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/checkpoint_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/checkpoint_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/cred_munge.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/cred_munge.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/job_container_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/job_container_none.so
0x00007fa76a8b94ca in wait4 () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 glibc-2.17-292.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 numactl-libs-2.0.12-3.el7_7.1.x86_64 pam-1.1.8-22.el7.x86_64
(gdb) bt
#0  0x00007fa76a8b94ca in wait4 () from /usr/lib64/libc.so.6
#1  0x0000000000410674 in _spawn_job_container (job=0xcdc0f0) at mgr.c:1142
#2  job_manager (job=job@entry=0xcdc0f0) at mgr.c:1251
#3  0x000000000040d291 in main (argc=1, argv=0x7ffe760774d8) at slurmstepd.c:179
(gdb) info threads
  Id   Target Id         Frame 
  5    Thread 0x7fa76be3d700 (LWP 28817) "acctg" 0x00007fa76abcd9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  4    Thread 0x7fa768437700 (LWP 28818) "acctg_prof" 0x00007fa76abcdda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  3    Thread 0x7fa768336700 (LWP 28819) "slurmstepd" 0x00007fa76a8e7bed in poll () from /usr/lib64/libc.so.6
  2    Thread 0x7fa7633ce700 (LWP 28820) "slurmstepd" 0x00007fa76a8e7bed in poll () from /usr/lib64/libc.so.6
* 1    Thread 0x7fa76be3e780 (LWP 28816) "slurmstepd" 0x00007fa76a8b94ca in wait4 () from /usr/lib64/libc.so.6
(gdb) thread apply all bt

Thread 5 (Thread 0x7fa76be3d700 (LWP 28817)):
#0  0x00007fa76abcd9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fa76b9175e3 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:366
#2  0x00007fa76abc9e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fa76a8f288d in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x7fa768437700 (LWP 28818)):
#0  0x00007fa76abcdda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fa76b910f70 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205
#2  0x00007fa76abc9e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fa76a8f288d in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x7fa768336700 (LWP 28819)):
#0  0x00007fa76a8e7bed in poll () from /usr/lib64/libc.so.6
#1  0x00007fa76b9b2688 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x7fa7640008d0) at eio.c:367
#2  eio_handle_mainloop (eio=0xcfd5d0) at eio.c:330
#3  0x000000000041ffcc in _msg_thr_internal (job_arg=0xcdc0f0) at req.c:289
#4  0x00007fa76abc9e65 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fa76a8f288d in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x7fa7633ce700 (LWP 28820)):
#0  0x00007fa76a8e7bed in poll () from /usr/lib64/libc.so.6
#1  0x00007fa7639dcbd2 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:493
#2  0x00007fa76abc9e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fa76a8f288d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7fa76be3e780 (LWP 28816)):
#0  0x00007fa76a8b94ca in wait4 () from /usr/lib64/libc.so.6
#1  0x0000000000410674 in _spawn_job_container (job=0xcdc0f0) at mgr.c:1142
#2  job_manager (job=job@entry=0xcdc0f0) at mgr.c:1251
#3  0x000000000040d291 in main (argc=1, argv=0x7ffe760774d8) at slurmstepd.c:179
(gdb) quit
A debugging session is active.

	Inferior 1 [process 28816] will be detached.

Quit anyway? (y or n) y
Detaching from program: /opt/slurm-19.05.4/sbin/slurmstepd, process 28816
Comment 26 Felip Moll 2020-06-12 10:56:57 MDT
(In reply to Damien from comment #25)
> Hi Felip, 
> 
> I hope this is helpful:
> 
> 
> 
> [root@m3a012 ~]# ps -ef |grep slurm
> root     27976     1  0 May30 ?        00:00:14
> /opt/slurm-19.05.4/sbin/slurmd
> root     28816     1  0 02:27 ?        00:00:00 slurmstepd: [14760527.extern]
> root     28838 28754  0 02:27 pts/1    00:00:00 grep --color=auto slurm


Unfortunately this is not helpful because the dump is from the extern step.

- Is job 28816 a failing one?
- Can you attach the slurmd log from m3a012?

Remember the pmix setup is done in step 0, e.g. in your past log:

[2020-05-30T02:14:06.329] [14361387.0] debug:  (null) [0] mpi_pmix.c:153 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: start
[2020-05-30T02:14:06.330] [14361387.0] debug:  mpi/pmix: setup sockets
Comment 27 Damien 2020-06-13 00:45:10 MDT
Hi Felip,

Sorry for that, I will redo this.

I hope that this helps:
---

[root@m3a012 ~]# ps -ef |grep srun
damienl   9706  9532  0 16:39 pts/1    00:00:00 srun --mpi=pmix --reservation=AWX --nodelist=m3a012 --ntasks=1 ./a.out
damienl   9720  9706  0 16:39 pts/1    00:00:00 srun --mpi=pmix --reservation=AWX --nodelist=m3a012 --ntasks=1 ./a.out
root      9736  9474  0 16:39 pts/0    00:00:00 grep --color=auto srun
[root@m3a012 ~]# ps -ef |grep slurm
root      9714     1  0 16:39 ?        00:00:00 slurmstepd: [14761112.extern]
root      9738  9474  0 16:39 pts/0    00:00:00 grep --color=auto slurm
root     27976     1  0 May30 ?        00:00:15 /opt/slurm-19.05.4/sbin/slurmd
[root@m3a012 ~]# gdb attach 9706
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 9706
Reading symbols from /opt/slurm-19.05.4/bin/srun...done.
Reading symbols from /lib64/libz.so.1...Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /opt/slurm-19.05.4/lib/slurm/libslurmfull.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/libslurmfull.so
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 9723]
[New LWP 9722]
[New LWP 9721]
[New LWP 9707]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cons_res.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cons_res.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cray_aries.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cray_aries.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cons_tres.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cons_tres.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_linear.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_linear.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_cray_aries.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_cray_aries.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_generic.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_generic.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/launch_slurm.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/launch_slurm.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/mpi_pmix.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/mpi_pmix.so
Reading symbols from /usr/lib64/libhwloc.so.5...Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /usr/lib64/libltdl.so.7...Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libltdl.so.7
Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgcc_s.so.1
Reading symbols from /usr/local/pmix/latest/lib/libpmix.so...(no debugging symbols found)...done.
Loaded symbols for /usr/local/pmix/latest/lib/libpmix.so
Reading symbols from /lib64/libevent_pthreads-2.0.so.5...Reading symbols from /lib64/libevent_pthreads-2.0.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libevent_pthreads-2.0.so.5
Reading symbols from /lib64/libevent-2.0.so.5...Reading symbols from /lib64/libevent-2.0.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libevent-2.0.so.5
Reading symbols from /opt/slurm-19.05.4/lib/slurm/auth_munge.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/auth_munge.so
Reading symbols from /opt/munge-0.5.11/lib/libmunge.so.2...done.
Loaded symbols for /opt/munge-0.5.11/lib/libmunge.so.2
Reading symbols from /lib64/libnss_sss.so.2...Reading symbols from /lib64/libnss_sss.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_sss.so.2
Reading symbols from /opt/slurm-19.05.4/lib/slurm/route_default.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/route_default.so
0x00007f0caf139da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 numactl-libs-2.0.12-3.el7_7.1.x86_64 sssd-client-1.16.4-21.el7_7.1.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007f0caf139da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0caf3aa0bb in slurm_step_launch_wait_start (ctx=0x1e16690) at step_launch.c:647
#2  0x00007f0cadae42eb in launch_p_step_launch (job=0x1e15340, cio_fds=<optimized out>, global_rc=<optimized out>, step_callbacks=<optimized out>, 
    opt_local=0x61de40 <opt>) at launch_slurm.c:848
#3  0x000000000040b9f5 in launch_g_step_launch (job=job@entry=0x1e15340, cio_fds=cio_fds@entry=0x7ffd4bbc6310, global_rc=global_rc@entry=0x61e7b0 <global_rc>, 
    step_callbacks=step_callbacks@entry=0x7ffd4bbc62e0, opt_local=opt_local@entry=0x61de40 <opt>) at launch.c:578
#4  0x0000000000407806 in _launch_one_app (data=0x1e23960) at srun.c:248
#5  0x0000000000408e37 in _launch_app (got_alloc=true, srun_job_list=0x0, job=0x1e15340) at srun.c:547
#6  srun (ac=<optimized out>, av=<optimized out>) at srun.c:202
#7  0x0000000000409246 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17
(gdb) info threads
  Id   Target Id         Frame 
  5    Thread 0x7f0cafd4a700 (LWP 9707) "srun" 0x00007f0caee53bed in poll () from /lib64/libc.so.6
  4    Thread 0x7f0cac1d5700 (LWP 9721) "srun" 0x00007f0caf13d381 in sigwait () from /lib64/libpthread.so.0
  3    Thread 0x7f0ca75f1700 (LWP 9722) "srun" 0x00007f0caee53bed in poll () from /lib64/libc.so.6
  2    Thread 0x7f0ca74f0700 (LWP 9723) "srun" 0x00007f0caee53bed in poll () from /lib64/libc.so.6
* 1    Thread 0x7f0cafd4b740 (LWP 9706) "srun" 0x00007f0caf139da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) thread apply all bt

Thread 5 (Thread 0x7f0cafd4a700 (LWP 9707)):
#0  0x00007f0caee53bed in poll () from /lib64/libc.so.6
#1  0x00007f0caf4c7688 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x7f0ca80008d0) at eio.c:367
#2  eio_handle_mainloop (eio=eio@entry=0x1e0f690) at eio.c:330
#3  0x00007f0caf38d998 in _msg_thr_internal (arg=0x1e0f690) at allocate_msg.c:89
#4  0x00007f0caf135e65 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f0caee5e88d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0cac1d5700 (LWP 9721)):
#0  0x00007f0caf13d381 in sigwait () from /lib64/libpthread.so.0
#1  0x00000000004113ef in _srun_signal_mgr (job_ptr=0x1e15340) at srun_job.c:2126
#2  0x00007f0caf135e65 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0caee5e88d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0ca75f1700 (LWP 9722)):
#0  0x00007f0caee53bed in poll () from /lib64/libc.so.6
#1  0x00007f0caf4c7688 in _poll_internal (shutdown_time=<optimized out>, nfds=3, pfds=0x7f0ca00008d0) at eio.c:367
#2  eio_handle_mainloop (eio=0x1e29660) at eio.c:330
#3  0x00007f0caf3a6d40 in _msg_thr_internal (arg=<optimized out>) at step_launch.c:1130
#4  0x00007f0caf135e65 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f0caee5e88d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0ca74f0700 (LWP 9723)):
#0  0x00007f0caee53bed in poll () from /lib64/libc.so.6
#1  0x00007f0caf4c7688 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x7f0c980008d0) at eio.c:367
#2  eio_handle_mainloop (eio=0x1e2c2b0) at eio.c:330
#3  0x00007f0caf3a5f8e in _io_thr_internal (cio_arg=0x1e2bba0) at step_io.c:816
#4  0x00007f0caf135e65 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f0caee5e88d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f0cafd4b740 (LWP 9706)):
#0  0x00007f0caf139da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0caf3aa0bb in slurm_step_launch_wait_start (ctx=0x1e16690) at step_launch.c:647
#2  0x00007f0cadae42eb in launch_p_step_launch (job=0x1e15340, cio_fds=<optimized out>, global_rc=<optimized out>, step_callbacks=<optimized out>, 
    opt_local=0x61de40 <opt>) at launch_slurm.c:848
#3  0x000000000040b9f5 in launch_g_step_launch (job=job@entry=0x1e15340, cio_fds=cio_fds@entry=0x7ffd4bbc6310, global_rc=global_rc@entry=0x61e7b0 <global_rc>, 
    step_callbacks=step_callbacks@entry=0x7ffd4bbc62e0, opt_local=opt_local@entry=0x61de40 <opt>) at launch.c:578
#4  0x0000000000407806 in _launch_one_app (data=0x1e23960) at srun.c:248
#5  0x0000000000408e37 in _launch_app (got_alloc=true, srun_job_list=0x0, job=0x1e15340) at srun.c:547
#6  srun (ac=<optimized out>, av=<optimized out>) at srun.c:202
#7  0x0000000000409246 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17
(gdb) 

(gdb) quit
A debugging session is active.

	Inferior 1 [process 9706] will be detached.

Quit anyway? (y or n) y
Detaching from program: /opt/slurm-19.05.4/bin/srun, process 9706
[root@m3a012 ~]# ps -ef |grep slurm
root      9714     1  0 16:39 ?        00:00:00 slurmstepd: [14761112.extern]
root      9767  9474  0 16:41 pts/0    00:00:00 grep --color=auto slurm
root     27976     1  0 May30 ?        00:00:15 /opt/slurm-19.05.4/sbin/slurmd
[root@m3a012 ~]# gdb attach 9714
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 9714
Reading symbols from /opt/slurm-19.05.4/sbin/slurmstepd...done.
Reading symbols from /opt/slurm-19.05.4/lib/slurm/libslurmfull.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/libslurmfull.so
Reading symbols from /usr/lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libdl.so.2
Reading symbols from /usr/lib64/libhwloc.so.5...Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /usr/lib64/libpam.so.0...Reading symbols from /usr/lib64/libpam.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam.so.0
Reading symbols from /usr/lib64/libpam_misc.so.0...Reading symbols from /usr/lib64/libpam_misc.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam_misc.so.0
Reading symbols from /usr/lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libutil.so.1
Reading symbols from /usr/lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 9718]
[New LWP 9717]
[New LWP 9716]
[New LWP 9715]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /usr/lib64/libpthread.so.0
Reading symbols from /usr/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /usr/lib64/libltdl.so.7...Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libltdl.so.7
Reading symbols from /usr/lib64/libaudit.so.1...Reading symbols from /usr/lib64/libaudit.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaudit.so.1
Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgcc_s.so.1
Reading symbols from /usr/lib64/libcap-ng.so.0...Reading symbols from /usr/lib64/libcap-ng.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcap-ng.so.0
Reading symbols from /usr/lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_files.so.2
Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cons_res.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cons_res.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/auth_munge.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/auth_munge.so
Reading symbols from /opt/munge-0.5.11/lib/libmunge.so.2...done.
Loaded symbols for /opt/munge-0.5.11/lib/libmunge.so.2
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_energy_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_profile_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_interconnect_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_interconnect_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_filesystem_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/gres_gpu.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/gres_gpu.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/jobacct_gather_cgroup.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/jobacct_gather_cgroup.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/core_spec_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/core_spec_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/proctrack_cgroup.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/proctrack_cgroup.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/task_affinity.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/task_affinity.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/task_cgroup.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/task_cgroup.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/checkpoint_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/checkpoint_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/cred_munge.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/cred_munge.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/job_container_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/job_container_none.so
0x00007fdfb71d34ca in wait4 () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 glibc-2.17-292.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 numactl-libs-2.0.12-3.el7_7.1.x86_64 pam-1.1.8-22.el7.x86_64
(gdb) bt
#0  0x00007fdfb71d34ca in wait4 () from /usr/lib64/libc.so.6
#1  0x0000000000410674 in _spawn_job_container (job=0x1eba0f0) at mgr.c:1142
#2  job_manager (job=job@entry=0x1eba0f0) at mgr.c:1251
#3  0x000000000040d291 in main (argc=1, argv=0x7ffd275cfea8) at slurmstepd.c:179
(gdb) info threads
  Id   Target Id         Frame 
  5    Thread 0x7fdfb8757700 (LWP 9715) "acctg" 0x00007fdfb74e79f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  4    Thread 0x7fdfb4d51700 (LWP 9716) "acctg_prof" 0x00007fdfb74e7da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  3    Thread 0x7fdfb4c50700 (LWP 9717) "slurmstepd" 0x00007fdfb7201bed in poll () from /usr/lib64/libc.so.6
  2    Thread 0x7fdfb4121700 (LWP 9718) "slurmstepd" 0x00007fdfb7201bed in poll () from /usr/lib64/libc.so.6
* 1    Thread 0x7fdfb8758780 (LWP 9714) "slurmstepd" 0x00007fdfb71d34ca in wait4 () from /usr/lib64/libc.so.6
(gdb) thread apply all bt

Thread 5 (Thread 0x7fdfb8757700 (LWP 9715)):
#0  0x00007fdfb74e79f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fdfb82315e3 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:366
#2  0x00007fdfb74e3e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fdfb720c88d in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x7fdfb4d51700 (LWP 9716)):
#0  0x00007fdfb74e7da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fdfb822af70 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205
#2  0x00007fdfb74e3e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fdfb720c88d in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x7fdfb4c50700 (LWP 9717)):
#0  0x00007fdfb7201bed in poll () from /usr/lib64/libc.so.6
#1  0x00007fdfb82cc688 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x7fdfb00008d0) at eio.c:367
#2  eio_handle_mainloop (eio=0x1edb5d0) at eio.c:330
#3  0x000000000041ffcc in _msg_thr_internal (job_arg=0x1eba0f0) at req.c:289
#4  0x00007fdfb74e3e65 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fdfb720c88d in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x7fdfb4121700 (LWP 9718)):
#0  0x00007fdfb7201bed in poll () from /usr/lib64/libc.so.6
#1  0x00007fdfb432abd2 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:493
#2  0x00007fdfb74e3e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fdfb720c88d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7fdfb8758780 (LWP 9714)):
#0  0x00007fdfb71d34ca in wait4 () from /usr/lib64/libc.so.6
#1  0x0000000000410674 in _spawn_job_container (job=0x1eba0f0) at mgr.c:1142
#2  job_manager (job=job@entry=0x1eba0f0) at mgr.c:1251
#3  0x000000000040d291 in main (argc=1, argv=0x7ffd275cfea8) at slurmstepd.c:179
(gdb) 

(gdb) quit
A debugging session is active.

	Inferior 1 [process 9714] will be detached.

Quit anyway? (y or n) y
Detaching from program: /opt/slurm-19.05.4/sbin/slurmstepd, process 9714
Comment 28 Damien 2020-06-13 00:52:13 MDT
Created attachment 14660 [details]
Today's slurm log on Test Node

Please review the current slurmd.log.



Thanks
Comment 29 Damien 2020-06-13 01:00:30 MDT
Created attachment 14661 [details]
Core file generated during this run

Core file generated during this run.


You might want to examine this.


Thanks

Damien
Comment 30 Felip Moll 2020-06-15 06:39:23 MDT
(In reply to Damien from comment #29)
> Created attachment 14661 [details]
> Core file generated during this run
> 
> Core file generated during this run.
> 
> 
> You might want to examine this.
> 
> 
> Thanks
> 
> Damien

Again, you can see that you dumped the .extern and srun processes, but not the .0 step.
I suspect your .0 step is gone, since it does not show up in your 'ps -ef' output.

So for the next step I need more verbosity on slurmd logs. I need you to:

1. Set the verbosity: SlurmdDebug=debug3
2. Run the test again, but this time use srun -vvvv and send me all the console output
3. Send me the slurmd logs again from the node
4. Send me the output of the 'dmesg' command on the node
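A minimal shell sketch of the four steps above. The slurm.conf location, slurmd log path, and test binary are assumptions based on this ticket, not verified site defaults; the commands are printed rather than executed so they can be reviewed before running them on a production node:

```shell
# Sketch only: adapt the paths below to your site before running anything.
DEBUG_STEPS='
# 1. raise slurmd verbosity, then push the new config to the node
sed -i "s/^SlurmdDebug=.*/SlurmdDebug=debug3/" slurm.conf
scontrol reconfigure
# 2. reproduce with maximum srun verbosity, keeping the console output
srun -vvvv --mpi=pmix --reservation=AWX --nodelist=m3a012 --ntasks=2 ./a.out 2>&1 | tee srun-vvvv.out
# 3. grab the slurmd log from the node
cp /var/log/slurm/slurmd.log slurmd-m3a012.log
# 4. kernel messages from the node
dmesg > dmesg-m3a012.out
'
printf '%s' "$DEBUG_STEPS"
```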

Thanks!
Comment 31 Damien 2020-06-16 08:14:12 MDT
Hi Felip,

Thanks for your investigation.

I wonder why I cannot see the slurmstepd ".0 step".

Do you have a standard test case? I can follow your test plan to generate the proper logs for this issue.



Cheers
Damien
Comment 32 Felip Moll 2020-06-16 08:47:12 MDT
(In reply to Damien from comment #31)
> Hi Felip,
> 
> Thanks for your investigation.
> 
> I wonder why I cannot see the slurmstepd ".0 step".

That's exactly what I am trying to figure out now.

The step .0 seems to disappear and I guess it has something to do with pmix initialization.

> Do you have a standard test case ? I can follow your test plan to simulate
> the proper logs for this matter.

No, the tests you're doing are actually what I need, but I also need the info requested in comment 30.

Just repeat the experiment following the indications in comment 30; I don't need the gdb output at the moment. Depending on what I see, I will ask for more things.
Comment 33 Damien 2020-06-19 10:38:06 MDT
Hi Felip,


Kindly review the following:


[root@m3a012 etc]# cat slurm.conf | grep SlurmdDebug
SlurmdDebug=debug3

[damienl@m3a012 ~]$ module list
Currently Loaded Modulefiles:
  1) openmpi/3.1.6-ucx
[damienl@m3a012 ~]$ srun -vvvv --mpi=pmix --reservation=AWX --nodelist=m3a012 --ntasks=2  /home/damienl/a.out 
srun: defined options
srun: -------------------- --------------------
srun: mpi                 : pmix
srun: nodelist            : m3a012
srun: ntasks              : 2
srun: reservation         : AWX
srun: verbose             : 4
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=4096
srun: debug:  propagating RLIMIT_NOFILE=1024
srun: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=36580
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 42955
srun: debug:  Entering _msg_thr_internal
srun: debug3: eio_message_socket_readable: shutdown 0 fd 4
srun: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/auth_munge.so
srun: debug:  Munge authentication plugin loaded
srun: debug3: Success.
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: debug:  Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec
srun: debug:  Waited 0.300000 sec and still waiting: next sleep for 0.300000 sec
srun: Nodes m3a012 are ready for job
srun: jobid 14801428: nodes(1):`m3a012', cpu counts: 2(x1)
srun: debug2: creating job with 2 tasks
srun: debug:  requesting job 14801428, user 10005, nodes 1 including (m3a012)
srun: debug:  cpus 2, tasks 2, name a.out, relative 65534
srun: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  (null) [0] mpi_pmix.c:212 [p_mpi_hook_client_prelaunch] mpi/pmix: setup process mapping in srun
srun: debug:  Entering _msg_thr_create()
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 8
srun: debug:  initialized stdio listening socket, port 38639
srun: debug:  Started IO server thread (139834612782848)
srun: debug:  Entering _launch_tasks
srun: debug3: IO thread pid = 10240
srun: debug2: Called _file_readable
srun: debug3:   false, all ioservers not yet initialized
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: launching 14801428.0 on host m3a012, 2 tasks: [0-1]
srun: debug3: uid:10005 gid:10025 cwd:/home/damienl 0
srun: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/route_default.so
srun: route default plugin loaded
srun: debug3: Success.
srun: debug2: Tree head got back 0 looking for 1
srun: debug3: Tree sending to m3a012
srun: debug2: Tree head got back 1
srun: debug:  launch returned msg_rc=0 err=0 type=8001


Wait 5 minutes, then Ctrl+C:

^C

srun: interrupt (one more within 1 sec to abort)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: sending Ctrl-C to job 14801428.0
srun: debug2: sending signal 2 to step 14801428.0 on hosts m3a012
srun: debug2: Tree head got back 0 looking for 1
srun: debug3: Tree sending to m3a012
srun: debug2: Tree head got back 1
srun: debug3: eio_message_socket_accept: start
srun: Job step 14801428.0 aborted before step completely launched.
srun: debug2: eio_message_socket_accept: got message connection from 172.16.200.79:46018 16
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: debug2: received job step complete message
srun: Complete job step 14801428.0 received
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 8
^Csrun: forcing job termination
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
srun: debug3: eio_message_socket_accept: start
srun: debug2: eio_message_socket_accept: got message connection from 172.16.200.79:46022 18
srun: debug2: received job step complete message
srun: Complete job step 14801428.0 received
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 8
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^Csrun: job abort in progress
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^C^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^C^C^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^C^C^Csrun: error: Timed out waiting for job step to complete
srun: debug3: eio_message_socket_accept: start
srun: debug2: eio_message_socket_accept: got message connection from 172.16.200.79:46028 18
srun: debug2: received job step complete message
srun: Complete job step 14801428.0 received
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 8
srun: debug3: eio_message_socket_readable: shutdown 1 fd 12
srun: debug2:   false, shutdown
srun: debug3: eio_message_socket_readable: shutdown 1 fd 8
srun: debug2:   false, shutdown
srun: debug2: Called _file_readable
srun: debug3:   false, shutdown
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug2:   false, shutdown
srun: debug:  IO thread exiting
srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
srun: debug3: eio_message_socket_readable: shutdown 1 fd 4
srun: debug2:   false, shutdown
srun: debug:  Leaving _msg_thr_internal






dmesg messages:

[damienl@m3a012 tmp]$ dmesg
[3917656.132106] echo (9975): drop_caches: 1
[3917712.010828] echo (10048): drop_caches: 1
[3917717.620904] echo (10057): drop_caches: 1
[3917755.767754] echo (10125): drop_caches: 1
[3917810.449355] LustreError: 10224:0:(layout.c:2121:__req_capsule_get()) @@@ Wrong buffer for field 'niobuf_inline' (7 of 7) in format 'LDLM_INTENT_OPEN', 0 vs. 0 (server)  req@ffff9ad53f260000 x1665840541408768/t549497789873(549497789873) o101->fs02-MDT0000-mdc-ffff9ad53cb8c000@172.16.192.3@o2ib:12/10 lens 616/600 e 0 to 0 dl 1592584327 ref 3 fl Complete:RPQU/4/0 rc 0/0 job:'bash.10005'
[3917824.703899] LustreError: 10230:0:(layout.c:2121:__req_capsule_get()) @@@ Wrong buffer for field 'niobuf_inline' (7 of 7) in format 'LDLM_INTENT_OPEN', 0 vs. 0 (server)  req@ffff9ac642378480 x1665840541415808/t549497841330(549497841330) o101->fs02-MDT0000-mdc-ffff9ad53cb8c000@172.16.192.3@o2ib:12/10 lens 616/600 e 0 to 0 dl 1592584341 ref 3 fl Complete:RPQU/4/0 rc 0/0 job:'bash.10005'
[3917829.915317] LustreError: 10236:0:(layout.c:2121:__req_capsule_get()) @@@ Wrong buffer for field 'niobuf_inline' (7 of 7) in format 'LDLM_INTENT_OPEN', 0 vs. 0 (server)  req@ffff9ac5b47a5580 x1665840541423040/t549497859215(549497859215) o101->fs02-MDT0000-mdc-ffff9ad53cb8c000@172.16.192.3@o2ib:12/10 lens 616/600 e 0 to 0 dl 1592584347 ref 3 fl Complete:RPQU/4/0 rc 0/0 job:'bash.10005'
[3917850.600014] echo (10246): drop_caches: 1
[3917929.517249] echo (10292): drop_caches: 1
Comment 34 Damien 2020-06-19 10:41:38 MDT
Created attachment 14736 [details]
Latest slurmd from Test node (m3a012)
Comment 35 Felip Moll 2020-06-22 05:45:12 MDT
From the logs, it seems pretty clear that the issue is happening in PMIx initialization, which may be making slurmstepd crash. We don't see anything more in the logs for step .0 after the call to pmixp_stepd_init()->pmixp_info_nspace_usock().

I see you're using:
/usr/local/pmix/3.1.4

We have a few options:
1. Try the latest 3.2 pmix; you need to recompile pmix, openmpi and slurm, pointing them at the new pmix.
2. Add a debug patch to slurmstepd, recompile slurm and try again. We'll get further logs from there.
3. Look for a slurmstepd core file. Depending on how you've configured your OS, an abort in a process should create a 'core file', which can later be inspected with gdb. Normally the core is generated using the command/path in /proc/sys/kernel/core_pattern, so check it. You're on RHEL 7, so I guess you will see the file in the abrt directory: /var/spool/abrt/

Can you check if there's any core file related to slurmstepd?

cat /var/spool/abrt/ccpp.../cmdline <--- this will tell you whether the directory relates to slurmstepd
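As a sketch of that check, the following POSIX-shell helper scans an abrt spool directory for dumps whose cmdline mentions slurmstepd. The function name is made up for illustration, and the RHEL 7 abrt layout (/var/spool/abrt/ccpp-*/cmdline) is assumed:

```shell
# Hypothetical helper: list abrt crash directories produced by slurmstepd.
# Assumes the RHEL 7 abrt layout /var/spool/abrt/ccpp-*/cmdline.
find_slurmstepd_cores() {
    spool="${1:-/var/spool/abrt}"
    for d in "$spool"/ccpp-*; do
        [ -f "$d/cmdline" ] || continue
        # cmdline holds the crashed process's command line; match on the binary name
        if grep -q slurmstepd "$d/cmdline"; then
            echo "slurmstepd core candidate: $d"
        fi
    done
}
find_slurmstepd_cores
```

If a matching directory turns up, the dump inside it (typically a file named coredump) can then be opened with gdb against /opt/slurm-19.05.4/sbin/slurmstepd to get a backtrace.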

----

Let's start with number 3; let me know if you find any core file. You can also look in Slurm's log dir on the node, since cores are put there if they cannot go anywhere else.
Comment 36 Felip Moll 2020-06-22 05:59:39 MDT
I have some more information.

Some time ago I worked with bug 7646

Please, see these comments:

https://bugs.schedmd.com/show_bug.cgi?id=7646#c26
https://bugs.schedmd.com/show_bug.cgi?id=7646#c28
https://bugs.schedmd.com/show_bug.cgi?id=7646#c30

Artem Polyakov is the developer of PMIx:

> Now I think what we should do to move forward is to document that UCX support is only available with rdma-core:
> * starting from MOFED 4.7 if rdma-core is explicitly enabled (need to double-check the exact instructions on how to enable)
> * starting from v5.0 - by default.

Could you try to enable rdma-core on your systems? I am not sure how to do this with your RoCE network though, which is out of my scope at the moment.
Comment 37 Felip Moll 2020-07-13 05:04:19 MDT
Damien,

I will time out this bug for the moment. Please check my latest comments and just mark it as open again if you feel we need to dig deeper into this issue.

Thanks for your understanding!
Thanks for your comprehension!