Hi Slurm Support,

We are trying to build and test a version of Open MPI (3.1.6) with PMIx and UCX support on a test node, but have encountered an issue.

Slurm build:

./configure --prefix=/opt/slurm-19.05.04 --with-munge=/opt/munge --enable-pam --with-pmix=/usr/local/pmix/3.1.4 --with-ucx=/usr/local/ucx/1.8.0

Open MPI build:

./configure --prefix=/usr/local/openmpi/3.1.6-ucx --with-slurm --with-pmix=/usr/local/pmix/3.1.4 --enable-static --enable-shared --enable-mpi-fortran --with-libevent --with-ucx=/usr/local/ucx/1.8.0 --enable-wrapper-runpath --with-hwloc

This runs correctly with srun and mpirun:

$ srun --nodelist=m3a012 --ntasks=4 mpirun -np 4 --mca btl self --mca pml ucx -x UCX_TLS=mm --bind-to core --map-by core --display-map ./a.out
srun: job 14076970 queued and waiting for resources
srun: job 14076970 has been allocated resources
Data for JOB [52197,1] offset 0
Total slots allocated 4
Data for JOB [52194,1] offset 0
Total slots allocated 4
Data for JOB [52195,1] offset 0
Total slots allocated 4

======================== JOB MAP ========================
Data for node: m3a012  Num slots: 4  Max slots: 0  Num procs: 4
    Process OMPI jobid: [52195,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
    Process OMPI jobid: [52195,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
    Process OMPI jobid: [52195,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
    Process OMPI jobid: [52195,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]
=============================================================
Data for JOB [52192,1] offset 0
Total slots allocated 4

======================== JOB MAP ========================
Data for node: m3a012  Num slots: 4  Max slots: 0  Num procs: 4
    Process OMPI jobid: [52192,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
    Process OMPI jobid: [52192,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
    Process OMPI jobid: [52192,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
    Process OMPI jobid: [52192,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]
=============================================================

======================== JOB MAP ========================
Data for node: m3a012  Num slots: 4  Max slots: 0  Num procs: 4
    Process OMPI jobid: [52197,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
    Process OMPI jobid: [52197,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
    Process OMPI jobid: [52197,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
    Process OMPI jobid: [52197,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]
=============================================================

======================== JOB MAP ========================
Data for node: m3a012  Num slots: 4  Max slots: 0  Num procs: 4
    Process OMPI jobid: [52194,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
    Process OMPI jobid: [52194,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
    Process OMPI jobid: [52194,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
    Process OMPI jobid: [52194,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]
=============================================================

Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)

Which is good, but when I try srun 'standalone' (direct launch), I encounter this:

[damienl@m3a012 tmp]$ srun --time=00:00:10 --nodelist=m3a012 --ntasks=4 ./c.out
[m3a012:23420] OPAL ERROR: Not initialized in file ext2x_client.c at line 112
[m3a012:23421] OPAL ERROR: Not initialized in file ext2x_client.c at line 112
[m3a012:23422] OPAL ERROR: Not initialized in file ext2x_client.c at line 112
[m3a012:23423] OPAL ERROR: Not initialized in file ext2x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[m3a012:23420] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------

How should I investigate this? Or which versions of these libraries are safe to use together? Kindly advise.

Many Thanks,
Damien
More details:

$ module load openmpi/3.1.6-ucx

$ srun -V
slurm 19.05.4

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmix
srun: pmi2
srun: openmpi

$ ucx_info -v
# UCT version=1.7.0 revision b02bab9
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni

$ ompi_info
Package: Open MPI root@m3a012 Distribution
Open MPI: 3.1.6
Open MPI repo revision: v3.1.6
Open MPI release date: Mar 18, 2020
Open RTE: 3.1.6
Open RTE repo revision: v3.1.6
Open RTE release date: Mar 18, 2020
OPAL: 3.1.6
OPAL repo revision: v3.1.6
OPAL release date: Mar 18, 2020
MPI API: 3.1.0
Ident string: 3.1.6
Prefix: /usr/local/openmpi/3.1.6-ucx
Configured architecture: x86_64-unknown-linux-gnu
Configure host: m3a012
Configured by: root
Configured on: Wed May 13 23:13:32 AEST 2020
Configure host: m3a012
Configure command line: '--prefix=/usr/local/openmpi/3.1.6-ucx' '--with-slurm' '--with-pmix=/usr/local/pmix/latest' '--enable-static' '--enable-shared' '--enable-mpi-fortran' '--with-libevent' '--with-ucx=/usr/local/ucx/1.8.0' '--enable-wrapper-runpath' '--with-hwloc'
Built by: root
Built on: Wed May 13 23:24:43 AEST 2020
Built host: m3a012
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (limited: overloading)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /bin/gcc
C compiler family name: GNU
C compiler version: 4.8.5
C++ compiler: g++
C++ compiler absolute: /bin/g++
Fort compiler: gfortran
Fort compiler abs: /bin/gfortran
Fort ignore TKR: no
Fort 08 assumed shape: no
Fort optional args: no
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: no
Fort BIND(C) (all): no
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): no
Fort TYPE,BIND(C): no
Fort T,BIND(C,name="a"): no
Fort PRIVATE: no
Fort PROTECTED: no
Fort ABSTRACT: no
Fort ASYNCHRONOUS: no
Fort PROCEDURE: no
Fort USE...ONLY: no
Fort C_FUNLOC: no
Fort f08 using wrappers: no
Fort MPI_SIZEOF: no
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: no
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: no
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
MPI extensions: affinity, cuda
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA btl: self (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA crs: none (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA event: external (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA pmix: ext2x (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v3.1.6)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA dfs: app (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA dfs: orted (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA dfs: test (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA errmgr: dvm (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA ess: env (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA notifier: syslog (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA odls: default (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA oob: ud (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA state: app (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA state: dvm (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA state: novm (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA state: orted (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA state: tool (MCA v2.1.0, API v1.0.0, Component v3.1.6)
MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA coll: self (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA coll: spacc (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA fcoll: static (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA io: romio314 (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v3.1.6)
MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA pml: v (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v3.1.6)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v3.1.6)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v3.1.6)
MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v3.1.6)
$ mpicc /usr/local/hpcx/2.5.0-redhat7.7/ompi/tests/examples/hello_c.c -o c.out

$ cat /usr/local/hpcx/2.5.0-redhat7.7/ompi/tests/examples/hello_c.c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size, len;
    char version[MPI_MAX_LIBRARY_VERSION_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_library_version(version, &len);
    printf("Hello, world, I am %d of %d, (%s, %d)\n", rank, size, version, len);
    MPI_Finalize();

    return 0;
}
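Since the OPAL error comes from the ext2x (external PMIx) component, it can help to confirm which PMI/PMIx libraries the build actually resolves at run time. A rough sketch, assuming the prefixes used above; the component filename mca_pmix_ext2x.so follows Open MPI's usual mca_<framework>_<component>.so naming, but verify the exact path on your install:

```shell
# Sketch: check what the MPI library and the ext2x PMIx component link against.
# Paths below are assumptions based on the prefixes in this ticket.
module load openmpi/3.1.6-ucx

# Libraries resolved by the main MPI library:
ldd /usr/local/openmpi/3.1.6-ucx/lib/libmpi.so | grep -Ei 'pmi|ucx'

# Libraries resolved by the external-PMIx component that raised the error:
ldd /usr/local/openmpi/3.1.6-ucx/lib/openmpi/mca_pmix_ext2x.so | grep -i pmix
```

If the component resolves a different libpmix than the one Slurm was built against, that mismatch is a likely suspect.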
If you can point me in the right direction, or to how other sites handle this, I would be grateful.
I have been looking into this by testing different Open MPI configurations related to yours, and in my tests they all work so far. I don't currently have UCX installed, but that should not be related.

Do you have a configured MpiDefault in slurm.conf?

$ scontrol show config | grep Mpi

Since you intend to use PMIx, that should be the value of MpiDefault, or specify it with --mpi=pmix. If that doesn't work, we need to try pmi2. Make sure Slurm's pmi2 is installed from the contribs directory, recompile Open MPI with --with-pmi=$SLURM, then recompile the test binary and test:

$ srun --mpi=pmi2 -N4 -n4 hello

My tests:

slurm configure: --prefix=$SLURM --with-pmix=$PMIX
openmpi configure: --prefix=$OMPI --with-pmix=$PMIX --with-hwloc=/usr --with-libevent=/usr --enable-static --enable-shared --enable-mpi-fortran --enable-wrapper-runpath --with-slurm [--with-pmi=$SLURM]

(I ran the same tests with and without the option in brackets, but you should try with it -- make sure Slurm's libpmi2.so is built and installed from contribs.)
(--mpi=pmix overrides any slurm.conf MpiDefault.)

$ srun --mpi=pmix_v3 -N4 hello_mon
Hello, world, I am 3 of 4, (Open MPI v3.1.6rc3, package: Open MPI broderick@caesar Distribution, ident: 3.1.6rc3, repo rev: v3.1.6, Unreleased developer copy, 130)
Hello, world, I am 1 of 4, (Open MPI v3.1.6rc3, package: Open MPI broderick@caesar Distribution, ident: 3.1.6rc3, repo rev: v3.1.6, Unreleased developer copy, 130)
Hello, world, I am 0 of 4, (Open MPI v3.1.6rc3, package: Open MPI broderick@caesar Distribution, ident: 3.1.6rc3, repo rev: v3.1.6, Unreleased developer copy, 130)
Hello, world, I am 2 of 4, (Open MPI v3.1.6rc3, package: Open MPI broderick@caesar Distribution, ident: 3.1.6rc3, repo rev: v3.1.6, Unreleased developer copy, 130)
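The check-and-rebuild sequence above can be sketched end to end as follows; $SLURM and $OMPI are placeholder install prefixes (as in the configure lines above), not paths taken from this ticket:

```shell
# Sketch of the suggested pmi2 fallback test; all paths are placeholders.
scontrol show config | grep -i MpiDefault   # what srun uses when --mpi is not given

# 1. Build and install Slurm's PMI-2 library from the source tree:
cd slurm-19.05.4/contribs/pmi2 && make && make install

# 2. Rebuild Open MPI against Slurm's libpmi2.so:
cd openmpi-3.1.6
./configure --prefix=$OMPI --with-slurm --with-pmi=$SLURM
make -j "$(nproc)" && make install

# 3. Recompile the test binary with the new wrappers, then direct-launch:
$OMPI/bin/mpicc hello_c.c -o hello
srun --mpi=pmi2 -N4 -n4 ./hello
```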
Hi Broderick,

Thanks for your reply. Is there a reason why UCX is not included in your test? We are testing UCX because it is said to have better performance; I am not too sure about its adoption in the wider Open MPI community.

Our default is pmi2:

$ scontrol show config |grep pmi
MpiDefault              = pmi2

I am running the tests you mentioned again.

Cheers,
Damien
Hi Guys,

After multiple tries, errors, and testing, I managed to get this to work; the following combination works on my test node. I am running more MPI tests right now.

My working combination:

Slurm build:

./configure --prefix=/opt/slurm-19.05.04 --with-munge=/opt/munge --enable-pam --with-pmix=/usr/local/pmix/3.1.4 --with-ucx=/usr/local/ucx/1.8.0
# /opt/slurm-latest is sym-linked to /opt/slurm-19.05.04

Open MPI build:

./configure --prefix=/usr/local/openmpi/3.1.6-ucx --with-slurm --with-pmix=/usr/local/pmix/3.1.4 --enable-static --enable-shared --enable-mpi-fortran --with-libevent --with-ucx=/usr/local/ucx/1.8.0 --enable-wrapper-runpath --with-hwloc --without-verbs --with-pmi=/opt/slurm-latest

I really wonder why. In my Open MPI build I have already specified "--with-pmix=/usr/local/pmix/3.1.4"; why do I still need "--with-pmi=/opt/slurm-latest" as well? Or am I mistaken about this? The only difference in this test setup is the inclusion of "--with-ucx" for both Slurm and Open MPI, as we want to test out this better-performing library.

Our hardware setup is not conventional: we are not running a traditional IB network, but a RoCE v1 network (RDMA over Converged Ethernet). Does this make a difference?

I will set up another 3 VMs in a separate cluster and test this again with the same Slurm/Open MPI/UCX configuration, to see if I get the same results.

Cheers,
Damien
(In reply to Damien from comment #7)
> I really wonder why. In my Open MPI build I have already specified
> "--with-pmix=/usr/local/pmix/3.1.4"; why do I still need
> "--with-pmi=/opt/slurm-latest" as well? Or am I mistaken about this? The only
> difference in this test setup is the inclusion of "--with-ucx" for both Slurm
> and Open MPI, as we want to test out this better-performing library.

The only difference from what? The configure lines for Slurm and Open MPI in the first post on this ticket also have "--with-ucx". Do you mean you already had Open MPI with PMIx working, and you started having problems when adding UCX support? Just seeking clarification.

To use full PMIx, pmix must be selected on the Slurm side (--mpi=pmix or MpiDefault=pmix). When using pmi2, it is expected that you link Open MPI to Slurm's libpmi2.so (--with-pmi=$SLURM) if you want to use srun and not mpirun. Now, as in my tests and due to PMIx's compatibility with PMI-2, "--mpi=pmi2" should work with Open MPI's PMIx anyway (meaning whatever PMIx Open MPI has linked to).

I have not been able to replicate your issue so far, but it could be some incompatibility between pmix and pmi2 on your system (maybe due to the use of UCX, the interconnect, or some other hardware). That's what I would like to isolate, as I don't know why I can't reproduce your original issue right now.

So what works now? Are you still using --mpi=pmi2? Detail is appreciated, including which srun commands you are using to test. Thanks

* I am looking into UCX; the only reason I haven't tested with it yet is that I never have before and thus didn't have it installed. We don't expect it to be the cause of your problems, but if everything started from adding UCX, then we may revise that opinion.
Hi Broderick,

Many thanks for your efforts and findings. Yes, you are right: my test with 'srun --mpi=pmix' is broken. Details:

$ srun --mpi=pmix -n 4 --time=00:00:10 --nodelist=m3a012 ./abc.out
.....
srun: Force Terminated job 14121359
srun: Job step 14121359.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
[damienl@m3a012 tmp]$

The working ones are:

[damienl@m3a012 tmp]$ srun --mpi=pmi2 -n 4 --nodelist=m3a012 ./abc.out
srun: job 14092496 queued and waiting for resources
srun: job 14092496 has been allocated resources
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)

[damienl@m3a012 tmp]$ srun --nodelist=m3a012 --ntasks=4 mpirun -np 4 --mca btl self --mca pml ucx -x UCX_TLS=mm --bind-to core --map-by core --display-map ./abc.out
Data for JOB [62271,1] offset 0
Total slots allocated 4
Data for JOB [62269,1] offset 0
Total slots allocated 4

======================== JOB MAP ========================
Data for node: m3a012  Num slots: 4  Max slots: 0  Num procs: 4
    Process OMPI jobid: [62271,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./.]
    Process OMPI jobid: [62271,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./.]
    Process OMPI jobid: [62271,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/.]
    Process OMPI jobid: [62271,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B]
.....
.....
.....
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
[damienl@m3a012 tmp]$

[damienl@m3a012 tmp]$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmix
srun: pmi2
srun: openmpi

As mentioned before, we are running on a RoCE v1 network (RDMA over Converged Ethernet).

So from the above examples, my 'pmix' implementation is not working; I need to investigate why, and how to troubleshoot this.

Cheers,
Damien
Hi Broderick,

As you mentioned, this could be some incompatibility between pmix and pmi2 on our side. We are using:
-- slurm-19.05.04
-- pmix/3.1.4

Is there a PMIx version that Slurm prefers, like the one under its src/contribs directory? I am not sure how to troubleshoot this other than running more MPI tests with different sets of parameters.

Cheers,
Damien
To get more information about pmix failing, make sure SlurmdDebug=debug, and collect for me the slurmctld.log and the slurmd.log from a node where the job failed to launch.

(In reply to Damien from comment #10)
> Is there a PMIx version that Slurm prefers, like the one under its
> src/contribs directory?

There is not currently a preferred version. The latest pmix 3.*.* release should work. Thanks
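The log-collection steps above amount to roughly the following; the log file paths are assumptions (they are whatever SlurmdLogFile/SlurmctldLogFile point to in your slurm.conf):

```shell
# Sketch: raise slurmd verbosity and reproduce the failing step.
# In slurm.conf on the affected node(s):
#   SlurmdDebug=debug
# Then push the change out and re-run the failing launch:
scontrol reconfigure
srun --mpi=pmix -n 4 --nodelist=m3a012 ./a.out

# Collect the logs (paths are assumptions; check slurm.conf for the real ones):
grep Slurm.*LogFile /opt/slurm-latest/etc/slurm.conf
less /var/log/slurmctld.log /var/log/slurmd.log
```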
Created attachment 14445 [details] slurmd log for m3a012 (test node)
Hi Broderick,

I hope this is helpful for the investigation:

---
[root@m3a012 etc]# pwd
/opt/slurm-latest/etc
[root@m3a012 etc]# grep SlurmdDebug slurm.conf
SlurmdDebug=debug

# srun mpi test
[damienl@m3a012 ~]$ srun --reservation=AWX --mpi=pmix -n 4 --time=00:00:10 --nodelist=m3a012 ./abc.out
....
.... (waiting)

[ec2-user@m3-login2 ~]$ squeue -u damienl
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
14361387      comp  abc.out  damienl  R  1:37     1 m3a012

[damienl@m3a012 ~]$ srun --reservation=AWX --mpi=pmix -n 4 --time=00:00:10 --nodelist=m3a012 ./abc.out
srun: Force Terminated job 14361387
srun: Job step 14361387.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Attaching slurmd.log
Comment on attachment 14445 [details] slurmd log for m3a012 (test node) Attached the wrong log
Hi Broderick,

Kindly ignore my previous message; this is the correct one:

---
[root@m3a012 etc]# pwd
/opt/slurm-latest/etc
[root@m3a012 etc]# grep SlurmdDebug slurm.conf
SlurmdDebug=debug

# srun mpi test
[damienl@m3a012 ~]$ srun --reservation=AWX --mpi=pmix -n 4 --time=00:00:10 --nodelist=m3a012 ./a.out
....
.... (waiting)

[ec2-user@m3-login2 ~]$ squeue -u damienl
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
14361491      comp    a.out  damienl  R  1:18     1 m3a012

[damienl@m3a012 ~]$ srun --reservation=AWX --mpi=pmix -n 4 --time=00:00:10 --nodelist=m3a012 ./a.out
srun: Force Terminated job 14361491
srun: Job step 14361491.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
[damienl@m3a012 ~]$

Attaching the correct slurmd.log
Created attachment 14447 [details] The correct slurmd.log
This is the working example without the pmix flag:

[damienl@m3a012 ~]$ srun --reservation=AWX -n 4 --time=00:00:10 --nodelist=m3a012 ./a.out
Hello, world, I am 3 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 2 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 4, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)

Kindly advise.

Thanks,
Damien
slurmctld.log:
---
[2020-05-30T02:28:04.425] gres_used:gpu:0
[2020-05-30T02:28:04.425] _fill_in_gres_fields JobId=14361491 gres_req:NONE gres_alloc:
[2020-05-30T02:28:04.425] select_nodes: JobId=14361491 gres:NONE gres_alloc:
[2020-05-30T02:28:04.427] sched: _slurm_rpc_allocate_resources JobId=14361491 NodeList=m3a012 usec=9628
[2020-05-30T02:28:05.131] job_submit/lua: /opt/slurm-19.05.4/etc/job_submit.lua: non-numeric return code
[2020-05-30T02:28:05.144] _slurm_rpc_submit_batch_job: JobId=14361492 InitPrio=72945 usec=13659
.....
.....
[2020-05-30T02:30:31.046] Time limit exhausted for JobId=14361491
[2020-05-30T02:30:43.768] _slurm_rpc_complete_job_allocation: JobId=14361491 error Job/step already completing or completed
Hi,

I believe 'ucx' itself is working for us. Details:

[damienl@m3a012 ~]$ srun --nodelist=m3a012 --ntasks=2 mpirun -np 2 --mca btl self --mca pml ucx -x UCX_LOG_LEVEL=debug --map-by core --display-map ./a.out
srun: job 14372968 queued and waiting for resources
srun: job 14372968 has been allocated resources
Data for JOB [58554,1] offset 0
Total slots allocated 2

======================== JOB MAP ========================
Data for node: m3a012  Num slots: 2  Max slots: 0  Num procs: 2
    Process OMPI jobid: [58554,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B][.]
    Process OMPI jobid: [58554,1] App: 0 Process rank: 1 Bound: socket 1[core 1[hwt 0]]:[.][B]
=============================================================
Data for JOB [58555,1] offset 0
Total slots allocated 2

======================== JOB MAP ========================
Data for node: m3a012  Num slots: 2  Max slots: 0  Num procs: 2
    Process OMPI jobid: [58555,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B][.]
    Process OMPI jobid: [58555,1] App: 0 Process rank: 1 Bound: socket 1[core 1[hwt 0]]:[.][B]
=============================================================
[1590846758.194227] [m3a012:29247:0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[1590846758.194506] [m3a012:29247:0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(posix/memory knem/memory);
[1590846758.195442] [m3a012:29244:0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[1590846758.195821] [m3a012:29244:0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(posix/memory knem/memory);
[1590846758.197027] [m3a012:29246:0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[1590846758.197245] [m3a012:29246:0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(posix/memory knem/memory);
[1590846758.197857] [m3a012:29245:0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[1590846758.198083] [m3a012:29245:0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(posix/memory knem/memory);
Hello, world, I am 1 of 2, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 2, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 1 of 2, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
Hello, world, I am 0 of 2, (Open MPI v3.1.6, package: Open MPI root@m3a012 Distribution, ident: 3.1.6, repo rev: v3.1.6, Mar 18, 2020, 106)
[damienl@m3a012 ~]$
---

So the problem is still with pmix.

Cheers,
Damien
Hi Damien, I am taking over this bug from now on. I don't like the following:

[2020-05-30T02:14:06.329] [14361387.0] debug: (null) [0] mpi_pmix.c:153 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: start
[2020-05-30T02:14:06.330] [14361387.0] debug: mpi/pmix: setup sockets
.....
[2020-05-30T02:15:25.789] debug: _step_connect: connect() failed dir /opt/slurm/var/spool node m3a012 step 14361387.0 Connection refused
.....
[2020-05-30T02:16:17.842] debug: _step_connect: connect() failed dir /opt/slurm/var/spool node m3a012 step 14361387.0 Connection refused

It seems pmix is not able to set up its sockets properly. This may or may not have something to do with RoCE v1. Just to check it again, can you send me your config.log from both openmpi and slurm? In comment 1 I saw in the ompi_info output that --with-pmix was pointing to pmix/latest instead of pmix/3.1.4. I just want to be sure you have openmpi and slurm properly compiled/configured while I look at ways to debug the "setup sockets" stage. Thanks
Hi Felip, Thanks for getting back to us. Our ucx and pmix installs are symlinked to their latest installed versions. For example:

----
[damienl@m5-login5 ~]$ cd /usr/local/pmix/
[damienl@m5-login5 pmix]$ ll
total 2
drwxr-xr-x 7 damienl systems 7 Jun 19 2019 3.1.2
lrwxrwxrwx 1 root root 5 Jun 19 2019 latest -> 3.1.2
drwxrwxr-x 6 damienl systems 6 Dec 6 2018 v2.2
[damienl@m5-login5 ucx]$ pwd
/usr/local/ucx
[damienl@m5-login5 ucx]$ ll
total 3
drwxr-xr-x 6 damienl systems 6 Nov 7 2019 1.6.1
drwxr-xr-x 6 damienl systems 7 May 7 11:34 1.8.0
lrwxrwxrwx 1 root root 5 May 7 11:34 latest -> 1.8.0
----

This makes them easier to manage if we need newer versions in the future. Will this be a problem if I compile them via their 'latest' symlink? Happy to provide you with more logs for this investigation. Thanks Damien
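For what it's worth, the versioned-prefix-plus-'latest'-symlink layout described above is usually maintained with `ln -sfn`. A small illustration (a scratch directory stands in for /usr/local/ucx here; version numbers are only examples):

```shell
# Demo in a scratch directory; on the real system the prefix would be
# /usr/local/ucx (or /usr/local/pmix). Each release lives in its own
# directory and 'latest' points at the active one.
PREFIX=$(mktemp -d)
mkdir -p "$PREFIX/1.6.1" "$PREFIX/1.8.0"

# -s symbolic, -f replace an existing link, -n treat an existing
# symlink-to-a-directory as the link itself (don't create inside it).
ln -sfn 1.8.0 "$PREFIX/latest"

readlink "$PREFIX/latest"   # prints: 1.8.0
```

Because the link target is relative ('1.8.0', not an absolute path), the tree can be moved or mounted elsewhere without breaking the symlink, and rolling back is a single `ln -sfn` to the previous version.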
> Will this be a problem if I compile them via their sym-link 'latest' ?

That's not a problem; I just wanted to be sure that 'latest' was a correct symlink. If you can send me your config.log from both openmpi and slurm, that would be great. Also, log into a node, run a job, and when you see this:

[2020-05-30T02:14:06.330] [14361387.0] debug: mpi/pmix: setup sockets

attach gdb to that slurmstepd and run 'thread apply all bt full' to obtain a dump. I want to see where slurmstepd is stuck: in your logs, [14361387.0] stops writing anything after the mpi/pmix "setup sockets" line, so I bet there are stuck step 0 processes on the node. I will wait for your feedback. Thanks!
Hi Felip, I am not very familiar with 'gdb' usage. Can you advise me on how to get a dump/log for this task?
--
gdb to slurmstepd and do a 'thread apply all bt full'
--
We have prepared a test node for this test. Thanks Damien
(In reply to Damien from comment #23)
> Hi Felip,
>
> I am not very familiar with 'gdb' usage.
>
> Can you advise me on how to get core dump/log from this task ?
> --
> gdb to slurmstepd and do a 'thread apply all bt full'
> --

Sure, it's quite straightforward:

1. On the node where a job is 'stuck', identify the step .0 process. For example, pid 109353 here is from job 2820 step 0:

]$ ps aux|grep slurmstepd
root 109336 0.2 0.0 282692 6232 ? Sl 17:56 0:01 slurmstepd: [2820.extern]
root 109353 0.2 0.0 416216 6608 ? Sl 17:56 0:01 slurmstepd: [2820.0]
lipi 109865 0.0 0.0 216236 2388 pts/7 S+ 18:03 0:00 grep --color=auto slurmstepd

2. Since this process runs as root, we need to attach to it as root:

]$ su
Password: ***
]#

3. Attach to the process with gdb:

]# gdb attach 109353

Once you are inside gdb and successfully attached to the process, you can generate a backtrace, which will show us where the process is in the code:

> bt

We can also see which threads are alive:

> info threads

And we can see where every thread is in the code:

> thread apply all bt

When you're done:

> quit

Fyi, while you are attached to a process it is completely under gdb control, so it does not execute further instructions until explicitly told to with continue, next, nexti, step or stepi. In our example we don't need the process to proceed, because we just want to see where it is in the code.

To summarize, I need you to run these and then copy-paste the output here:

> bt
> info threads
> thread apply all bt

If you prefer, this can be done in one line instead, with the output going directly into a file:

]# gdb -ex "set confirm off" -ex "bt" -ex "info threads" -ex "thread apply all bt" -ex "quit" -p <the_pid_of_slurmstepd_step_0> > /tmp/output_gdb.txt

> We have prepare a test node for this test.

Ok!
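To make it harder to grab the wrong pid (the .extern step instead of .0), the pid lookup from step 1 can be scripted. This is only a sketch, assuming the `slurmstepd: [<jobid>.0]` process-title format shown above; the helper function reads `ps` output on stdin so the matching logic is easy to test:

```shell
# Print the pid of the step-0 slurmstepd for a given job id, reading
# "PID ARGS" lines (as produced by `ps -eo pid=,args=`) on stdin.
step0_pid_from_ps() {
    jobid=$1
    # awk's index() does a literal substring match, so the brackets in
    # the process title need no escaping.
    awk -v pat="slurmstepd: [${jobid}.0]" 'index($0, pat) { print $1; exit }'
}

# On the node it would be combined with gdb roughly like this (as root):
#   pid=$(ps -eo pid=,args= | step0_pid_from_ps 2820)
#   gdb -ex "set confirm off" -ex bt -ex "info threads" \
#       -ex "thread apply all bt" -ex quit -p "$pid" > /tmp/output_gdb.txt
```

The closing bracket in the pattern matters: without it, job 2820 step 0 would also match a hypothetical title containing "[2820.01]".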
Hi Felip, I hope this is helpful: [root@m3a012 ~]# ps -ef |grep slurm root 27976 1 0 May30 ? 00:00:14 /opt/slurm-19.05.4/sbin/slurmd root 28816 1 0 02:27 ? 00:00:00 slurmstepd: [14760527.extern] root 28838 28754 0 02:27 pts/1 00:00:00 grep --color=auto slurm [root@m3a012 ~]# gdb attach 28816 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... attach: No such file or directory. Attaching to process 28816 Reading symbols from /opt/slurm-19.05.4/sbin/slurmstepd...done. Reading symbols from /opt/slurm-19.05.4/lib/slurm/libslurmfull.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/libslurmfull.so Reading symbols from /usr/lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libdl.so.2 Reading symbols from /usr/lib64/libhwloc.so.5...Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libhwloc.so.5 Reading symbols from /usr/lib64/libpam.so.0...Reading symbols from /usr/lib64/libpam.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libpam.so.0 Reading symbols from /usr/lib64/libpam_misc.so.0...Reading symbols from /usr/lib64/libpam_misc.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libpam_misc.so.0 Reading symbols from /usr/lib64/libutil.so.1...(no debugging symbols found)...done. 
Loaded symbols for /usr/lib64/libutil.so.1 Reading symbols from /usr/lib64/libpthread.so.0...(no debugging symbols found)...done. [New LWP 28820] [New LWP 28819] [New LWP 28818] [New LWP 28817] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Loaded symbols for /usr/lib64/libpthread.so.0 Reading symbols from /usr/lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libm.so.6 Reading symbols from /usr/lib64/libnuma.so.1...Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libnuma.so.1 Reading symbols from /usr/lib64/libltdl.so.7...Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libltdl.so.7 Reading symbols from /usr/lib64/libaudit.so.1...Reading symbols from /usr/lib64/libaudit.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libaudit.so.1 Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libgcc_s.so.1 Reading symbols from /usr/lib64/libcap-ng.so.0...Reading symbols from /usr/lib64/libcap-ng.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libcap-ng.so.0 Reading symbols from /usr/lib64/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libnss_files.so.2 Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cons_res.so...done. 
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cons_res.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/auth_munge.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/auth_munge.so Reading symbols from /opt/munge-0.5.11/lib/libmunge.so.2...done. Loaded symbols for /opt/munge-0.5.11/lib/libmunge.so.2 Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_energy_none.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_energy_none.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_profile_none.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_profile_none.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_interconnect_none.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_interconnect_none.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_filesystem_none.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_filesystem_none.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_none.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_none.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/gres_gpu.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/gres_gpu.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/jobacct_gather_cgroup.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/jobacct_gather_cgroup.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/core_spec_none.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/core_spec_none.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/proctrack_cgroup.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/proctrack_cgroup.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/task_affinity.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/task_affinity.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/task_cgroup.so...done. 
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/task_cgroup.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/checkpoint_none.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/checkpoint_none.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/cred_munge.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/cred_munge.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/job_container_none.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/job_container_none.so 0x00007fa76a8b94ca in wait4 () from /usr/lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 glibc-2.17-292.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 numactl-libs-2.0.12-3.el7_7.1.x86_64 pam-1.1.8-22.el7.x86_64 (gdb) bt #0 0x00007fa76a8b94ca in wait4 () from /usr/lib64/libc.so.6 #1 0x0000000000410674 in _spawn_job_container (job=0xcdc0f0) at mgr.c:1142 #2 job_manager (job=job@entry=0xcdc0f0) at mgr.c:1251 #3 0x000000000040d291 in main (argc=1, argv=0x7ffe760774d8) at slurmstepd.c:179 (gdb) info threads Id Target Id Frame 5 Thread 0x7fa76be3d700 (LWP 28817) "acctg" 0x00007fa76abcd9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 4 Thread 0x7fa768437700 (LWP 28818) "acctg_prof" 0x00007fa76abcdda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 3 Thread 0x7fa768336700 (LWP 28819) "slurmstepd" 0x00007fa76a8e7bed in poll () from /usr/lib64/libc.so.6 2 Thread 0x7fa7633ce700 (LWP 28820) "slurmstepd" 0x00007fa76a8e7bed in poll () from /usr/lib64/libc.so.6 * 1 Thread 0x7fa76be3e780 (LWP 28816) "slurmstepd" 0x00007fa76a8b94ca in wait4 () from /usr/lib64/libc.so.6 (gdb) thread apply all bt Thread 5 (Thread 0x7fa76be3d700 (LWP 28817)): #0 0x00007fa76abcd9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007fa76b9175e3 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:366 
#2 0x00007fa76abc9e65 in start_thread () from /usr/lib64/libpthread.so.0 #3 0x00007fa76a8f288d in clone () from /usr/lib64/libc.so.6 Thread 4 (Thread 0x7fa768437700 (LWP 28818)): #0 0x00007fa76abcdda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00007fa76b910f70 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205 #2 0x00007fa76abc9e65 in start_thread () from /usr/lib64/libpthread.so.0 #3 0x00007fa76a8f288d in clone () from /usr/lib64/libc.so.6 Thread 3 (Thread 0x7fa768336700 (LWP 28819)): #0 0x00007fa76a8e7bed in poll () from /usr/lib64/libc.so.6 #1 0x00007fa76b9b2688 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x7fa7640008d0) at eio.c:367 #2 eio_handle_mainloop (eio=0xcfd5d0) at eio.c:330 #3 0x000000000041ffcc in _msg_thr_internal (job_arg=0xcdc0f0) at req.c:289 #4 0x00007fa76abc9e65 in start_thread () from /usr/lib64/libpthread.so.0 #5 0x00007fa76a8f288d in clone () from /usr/lib64/libc.so.6 Thread 2 (Thread 0x7fa7633ce700 (LWP 28820)): #0 0x00007fa76a8e7bed in poll () from /usr/lib64/libc.so.6 #1 0x00007fa7639dcbd2 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:493 #2 0x00007fa76abc9e65 in start_thread () from /usr/lib64/libpthread.so.0 #3 0x00007fa76a8f288d in clone () from /usr/lib64/libc.so.6 Thread 1 (Thread 0x7fa76be3e780 (LWP 28816)): #0 0x00007fa76a8b94ca in wait4 () from /usr/lib64/libc.so.6 #1 0x0000000000410674 in _spawn_job_container (job=0xcdc0f0) at mgr.c:1142 #2 job_manager (job=job@entry=0xcdc0f0) at mgr.c:1251 #3 0x000000000040d291 in main (argc=1, argv=0x7ffe760774d8) at slurmstepd.c:179 (gdb) quit A debugging session is active. Inferior 1 [process 28816] will be detached. Quit anyway? (y or n) y Detaching from program: /opt/slurm-19.05.4/sbin/slurmstepd, process 28816
(In reply to Damien from comment #25)
> Hi Felip,
>
> I hope this is helpful:
>
> [root@m3a012 ~]# ps -ef |grep slurm
> root 27976 1 0 May30 ? 00:00:14 /opt/slurm-19.05.4/sbin/slurmd
> root 28816 1 0 02:27 ? 00:00:00 slurmstepd: [14760527.extern]
> root 28838 28754 0 02:27 pts/1 00:00:00 grep --color=auto slurm

Unfortunately this is not helpful, because the dump is from the extern step, not from step 0.

- Is pid 28816 from a failing job?
- Can you attach the slurmd log from m3a012?

Remember the pmix setup is done in step 0, e.g. in your earlier log:

[2020-05-30T02:14:06.329] [14361387.0] debug: (null) [0] mpi_pmix.c:153 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: start
[2020-05-30T02:14:06.330] [14361387.0] debug: mpi/pmix: setup sockets
Hi Felip, Sorry for that, I will redo this. I hope that this helps: --- [root@m3a012 ~]# ps -ef |grep srun damienl 9706 9532 0 16:39 pts/1 00:00:00 srun --mpi=pmix --reservation=AWX --nodelist=m3a012 --ntasks=1 ./a.out damienl 9720 9706 0 16:39 pts/1 00:00:00 srun --mpi=pmix --reservation=AWX --nodelist=m3a012 --ntasks=1 ./a.out root 9736 9474 0 16:39 pts/0 00:00:00 grep --color=auto srun [root@m3a012 ~]# ps -ef |grep slurm root 9714 1 0 16:39 ? 00:00:00 slurmstepd: [14761112.extern] root 9738 9474 0 16:39 pts/0 00:00:00 grep --color=auto slurm root 27976 1 0 May30 ? 00:00:15 /opt/slurm-19.05.4/sbin/slurmd [root@m3a012 ~]# gdb attach 9706 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... attach: No such file or directory. Attaching to process 9706 Reading symbols from /opt/slurm-19.05.4/bin/srun...done. Reading symbols from /lib64/libz.so.1...Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libz.so.1 Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /opt/slurm-19.05.4/lib/slurm/libslurmfull.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/libslurmfull.so Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. [New LWP 9723] [New LWP 9722] [New LWP 9721] [New LWP 9707] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". 
Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_files.so.2 Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cons_res.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cons_res.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cray_aries.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cray_aries.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cons_tres.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cons_tres.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_linear.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_linear.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_cray_aries.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_cray_aries.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_generic.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_generic.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_none.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_none.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/launch_slurm.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/launch_slurm.so Reading symbols from /opt/slurm-19.05.4/lib/slurm/mpi_pmix.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/mpi_pmix.so Reading symbols from /usr/lib64/libhwloc.so.5...Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libhwloc.so.5 Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done. 
Loaded symbols for /usr/lib64/libm.so.6 Reading symbols from /usr/lib64/libnuma.so.1...Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libnuma.so.1 Reading symbols from /usr/lib64/libltdl.so.7...Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/libltdl.so.7 Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libgcc_s.so.1 Reading symbols from /usr/local/pmix/latest/lib/libpmix.so...(no debugging symbols found)...done. Loaded symbols for /usr/local/pmix/latest/lib/libpmix.so Reading symbols from /lib64/libevent_pthreads-2.0.so.5...Reading symbols from /lib64/libevent_pthreads-2.0.so.5...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libevent_pthreads-2.0.so.5 Reading symbols from /lib64/libevent-2.0.so.5...Reading symbols from /lib64/libevent-2.0.so.5...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libevent-2.0.so.5 Reading symbols from /opt/slurm-19.05.4/lib/slurm/auth_munge.so...done. Loaded symbols for /opt/slurm-19.05.4/lib/slurm/auth_munge.so Reading symbols from /opt/munge-0.5.11/lib/libmunge.so.2...done. Loaded symbols for /opt/munge-0.5.11/lib/libmunge.so.2 Reading symbols from /lib64/libnss_sss.so.2...Reading symbols from /lib64/libnss_sss.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libnss_sss.so.2 Reading symbols from /opt/slurm-19.05.4/lib/slurm/route_default.so...done. 
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/route_default.so 0x00007f0caf139da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 numactl-libs-2.0.12-3.el7_7.1.x86_64 sssd-client-1.16.4-21.el7_7.1.x86_64 zlib-1.2.7-18.el7.x86_64 (gdb) bt #0 0x00007f0caf139da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f0caf3aa0bb in slurm_step_launch_wait_start (ctx=0x1e16690) at step_launch.c:647 #2 0x00007f0cadae42eb in launch_p_step_launch (job=0x1e15340, cio_fds=<optimized out>, global_rc=<optimized out>, step_callbacks=<optimized out>, opt_local=0x61de40 <opt>) at launch_slurm.c:848 #3 0x000000000040b9f5 in launch_g_step_launch (job=job@entry=0x1e15340, cio_fds=cio_fds@entry=0x7ffd4bbc6310, global_rc=global_rc@entry=0x61e7b0 <global_rc>, step_callbacks=step_callbacks@entry=0x7ffd4bbc62e0, opt_local=opt_local@entry=0x61de40 <opt>) at launch.c:578 #4 0x0000000000407806 in _launch_one_app (data=0x1e23960) at srun.c:248 #5 0x0000000000408e37 in _launch_app (got_alloc=true, srun_job_list=0x0, job=0x1e15340) at srun.c:547 #6 srun (ac=<optimized out>, av=<optimized out>) at srun.c:202 #7 0x0000000000409246 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17 (gdb) info threads Id Target Id Frame 5 Thread 0x7f0cafd4a700 (LWP 9707) "srun" 0x00007f0caee53bed in poll () from /lib64/libc.so.6 4 Thread 0x7f0cac1d5700 (LWP 9721) "srun" 0x00007f0caf13d381 in sigwait () from /lib64/libpthread.so.0 3 Thread 0x7f0ca75f1700 (LWP 9722) "srun" 0x00007f0caee53bed in poll () from /lib64/libc.so.6 2 Thread 0x7f0ca74f0700 (LWP 9723) "srun" 0x00007f0caee53bed in poll () from /lib64/libc.so.6 * 1 Thread 0x7f0cafd4b740 (LWP 9706) "srun" 0x00007f0caf139da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 
(gdb) thread apply all bt Thread 5 (Thread 0x7f0cafd4a700 (LWP 9707)): #0 0x00007f0caee53bed in poll () from /lib64/libc.so.6 #1 0x00007f0caf4c7688 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x7f0ca80008d0) at eio.c:367 #2 eio_handle_mainloop (eio=eio@entry=0x1e0f690) at eio.c:330 #3 0x00007f0caf38d998 in _msg_thr_internal (arg=0x1e0f690) at allocate_msg.c:89 #4 0x00007f0caf135e65 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f0caee5e88d in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x7f0cac1d5700 (LWP 9721)): #0 0x00007f0caf13d381 in sigwait () from /lib64/libpthread.so.0 #1 0x00000000004113ef in _srun_signal_mgr (job_ptr=0x1e15340) at srun_job.c:2126 #2 0x00007f0caf135e65 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f0caee5e88d in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x7f0ca75f1700 (LWP 9722)): #0 0x00007f0caee53bed in poll () from /lib64/libc.so.6 #1 0x00007f0caf4c7688 in _poll_internal (shutdown_time=<optimized out>, nfds=3, pfds=0x7f0ca00008d0) at eio.c:367 #2 eio_handle_mainloop (eio=0x1e29660) at eio.c:330 #3 0x00007f0caf3a6d40 in _msg_thr_internal (arg=<optimized out>) at step_launch.c:1130 #4 0x00007f0caf135e65 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f0caee5e88d in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x7f0ca74f0700 (LWP 9723)): #0 0x00007f0caee53bed in poll () from /lib64/libc.so.6 #1 0x00007f0caf4c7688 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x7f0c980008d0) at eio.c:367 #2 eio_handle_mainloop (eio=0x1e2c2b0) at eio.c:330 #3 0x00007f0caf3a5f8e in _io_thr_internal (cio_arg=0x1e2bba0) at step_io.c:816 #4 0x00007f0caf135e65 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f0caee5e88d in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x7f0cafd4b740 (LWP 9706)): #0 0x00007f0caf139da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f0caf3aa0bb in slurm_step_launch_wait_start (ctx=0x1e16690) at step_launch.c:647 
#2 0x00007f0cadae42eb in launch_p_step_launch (job=0x1e15340, cio_fds=<optimized out>, global_rc=<optimized out>, step_callbacks=<optimized out>, opt_local=0x61de40 <opt>) at launch_slurm.c:848 #3 0x000000000040b9f5 in launch_g_step_launch (job=job@entry=0x1e15340, cio_fds=cio_fds@entry=0x7ffd4bbc6310, global_rc=global_rc@entry=0x61e7b0 <global_rc>, step_callbacks=step_callbacks@entry=0x7ffd4bbc62e0, opt_local=opt_local@entry=0x61de40 <opt>) at launch.c:578 #4 0x0000000000407806 in _launch_one_app (data=0x1e23960) at srun.c:248 #5 0x0000000000408e37 in _launch_app (got_alloc=true, srun_job_list=0x0, job=0x1e15340) at srun.c:547 #6 srun (ac=<optimized out>, av=<optimized out>) at srun.c:202 #7 0x0000000000409246 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17 (gdb) quit A debugging session is active. Inferior 1 [process 9706] will be detached. Quit anyway? (y or n) y Detaching from program: /opt/slurm-19.05.4/bin/srun, process 9706 [root@m3a012 ~]# ps -ef |grep slurm root 9714 1 0 16:39 ? 00:00:00 slurmstepd: [14761112.extern] root 9767 9474 0 16:41 pts/0 00:00:00 grep --color=auto slurm root 27976 1 0 May30 ?
00:00:15 /opt/slurm-19.05.4/sbin/slurmd

[root@m3a012 ~]# gdb attach 9714
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 9714
Reading symbols from /opt/slurm-19.05.4/sbin/slurmstepd...done.
Reading symbols from /opt/slurm-19.05.4/lib/slurm/libslurmfull.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/libslurmfull.so
Reading symbols from /usr/lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libdl.so.2
Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /usr/lib64/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam.so.0
Reading symbols from /usr/lib64/libpam_misc.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam_misc.so.0
Reading symbols from /usr/lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libutil.so.1
Reading symbols from /usr/lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 9718]
[New LWP 9717]
[New LWP 9716]
[New LWP 9715]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /usr/lib64/libpthread.so.0
Reading symbols from /usr/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libltdl.so.7
Reading symbols from /usr/lib64/libaudit.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaudit.so.1
Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgcc_s.so.1
Reading symbols from /usr/lib64/libcap-ng.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcap-ng.so.0
Reading symbols from /usr/lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_files.so.2
Reading symbols from /opt/slurm-19.05.4/lib/slurm/select_cons_res.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/select_cons_res.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/auth_munge.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/auth_munge.so
Reading symbols from /opt/munge-0.5.11/lib/libmunge.so.2...done.
Loaded symbols for /opt/munge-0.5.11/lib/libmunge.so.2
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_energy_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_profile_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_interconnect_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_interconnect_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/acct_gather_filesystem_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/switch_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/switch_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/gres_gpu.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/gres_gpu.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/jobacct_gather_cgroup.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/jobacct_gather_cgroup.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/core_spec_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/core_spec_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/proctrack_cgroup.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/proctrack_cgroup.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/task_affinity.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/task_affinity.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/task_cgroup.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/task_cgroup.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/checkpoint_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/checkpoint_none.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/cred_munge.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/cred_munge.so
Reading symbols from /opt/slurm-19.05.4/lib/slurm/job_container_none.so...done.
Loaded symbols for /opt/slurm-19.05.4/lib/slurm/job_container_none.so
0x00007fdfb71d34ca in wait4 () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 glibc-2.17-292.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 numactl-libs-2.0.12-3.el7_7.1.x86_64 pam-1.1.8-22.el7.x86_64
(gdb) bt
#0  0x00007fdfb71d34ca in wait4 () from /usr/lib64/libc.so.6
#1  0x0000000000410674 in _spawn_job_container (job=0x1eba0f0) at mgr.c:1142
#2  job_manager (job=job@entry=0x1eba0f0) at mgr.c:1251
#3  0x000000000040d291 in main (argc=1, argv=0x7ffd275cfea8) at slurmstepd.c:179
(gdb) info threads
  Id   Target Id         Frame
  5    Thread 0x7fdfb8757700 (LWP 9715) "acctg"      0x00007fdfb74e79f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  4    Thread 0x7fdfb4d51700 (LWP 9716) "acctg_prof" 0x00007fdfb74e7da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  3    Thread 0x7fdfb4c50700 (LWP 9717) "slurmstepd" 0x00007fdfb7201bed in poll () from /usr/lib64/libc.so.6
  2    Thread 0x7fdfb4121700 (LWP 9718) "slurmstepd" 0x00007fdfb7201bed in poll () from /usr/lib64/libc.so.6
* 1    Thread 0x7fdfb8758780 (LWP 9714) "slurmstepd" 0x00007fdfb71d34ca in wait4 () from /usr/lib64/libc.so.6
(gdb) thread apply all bt

Thread 5 (Thread 0x7fdfb8757700 (LWP 9715)):
#0  0x00007fdfb74e79f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fdfb82315e3 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:366
#2  0x00007fdfb74e3e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fdfb720c88d in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x7fdfb4d51700 (LWP 9716)):
#0  0x00007fdfb74e7da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fdfb822af70 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:205
#2  0x00007fdfb74e3e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fdfb720c88d in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x7fdfb4c50700 (LWP 9717)):
#0  0x00007fdfb7201bed in poll () from /usr/lib64/libc.so.6
#1  0x00007fdfb82cc688 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x7fdfb00008d0) at eio.c:367
#2  eio_handle_mainloop (eio=0x1edb5d0) at eio.c:330
#3  0x000000000041ffcc in _msg_thr_internal (job_arg=0x1eba0f0) at req.c:289
#4  0x00007fdfb74e3e65 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fdfb720c88d in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x7fdfb4121700 (LWP 9718)):
#0  0x00007fdfb7201bed in poll () from /usr/lib64/libc.so.6
#1  0x00007fdfb432abd2 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:493
#2  0x00007fdfb74e3e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007fdfb720c88d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7fdfb8758780 (LWP 9714)):
#0  0x00007fdfb71d34ca in wait4 () from /usr/lib64/libc.so.6
#1  0x0000000000410674 in _spawn_job_container (job=0x1eba0f0) at mgr.c:1142
#2  job_manager (job=job@entry=0x1eba0f0) at mgr.c:1251
#3  0x000000000040d291 in main (argc=1, argv=0x7ffd275cfea8) at slurmstepd.c:179
(gdb) quit
A debugging session is active.
        Inferior 1 [process 9714] will be detached.
Quit anyway? (y or n) y
Detaching from program: /opt/slurm-19.05.4/sbin/slurmstepd, process 9714
Created attachment 14660 [details]
Today's slurm log on Test Node

Please review the current slurmd.log.

Thanks
Created attachment 14661 [details]
Core file generated during this run

Core file generated during this run. You might want to examine it.

Thanks

Damien
(In reply to Damien from comment #29)
> Created attachment 14661 [details]
> Core file generated during this run
>
> Core file generated during this run.
>
> You might to examine this.
>
> Thanks
>
> Damien

Again, your dump shows the .extern and srun processes, but not the .0 step. I guess your .0 step is gone, since it does not show up in your 'ps -ef' output.

For the next step I need more verbosity in the slurmd logs. Please:

1. Set the verbosity with: SlurmdDebug=debug3
2. Run the test again, but this time use 'srun -vvvv', and send me all the console output
3. Send me the slurmd logs from the node again
4. Send me the output of the 'dmesg' command on the node

Thanks!
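For reference, step 1 can be scripted. The sketch below demonstrates the edit on a scratch copy of slurm.conf so it is safe to run anywhere; on the real node you would point it at your actual slurm.conf (commonly /etc/slurm/slurm.conf, an assumption about your layout) and then restart or reconfigure slurmd so the new level takes effect.

```shell
# Demonstrated on a scratch copy; set "conf" to the real slurm.conf on the node.
conf=$(mktemp)
printf 'SlurmctldDebug=info\nSlurmdDebug=info\n' > "$conf"

# Set SlurmdDebug=debug3, appending the line if it is missing.
if grep -q '^SlurmdDebug=' "$conf"; then
    sed -i 's/^SlurmdDebug=.*/SlurmdDebug=debug3/' "$conf"
else
    echo 'SlurmdDebug=debug3' >> "$conf"
fi
grep '^SlurmdDebug=' "$conf"

# On the node you would then restart slurmd (e.g. systemctl restart slurmd)
# and rerun the failing case with extra client-side verbosity, e.g.:
#   srun -vvvv --mpi=pmix --nodelist=m3a012 --ntasks=2 ./a.out
```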
Hi Felip,

Thanks for your investigation.

I wonder why I can't see the slurmstepd ".0 step". Do you have a standard test case? I can follow your test plan to generate the proper logs for this matter.

Cheers

Damien
(In reply to Damien from comment #31)
> Hi Felip,
>
> Thanks for your investigation.
>
> I wonder why I can see the slurmstep ".0 step".

That's exactly what I am trying to figure out now. The .0 step seems to disappear, and I guess it has something to do with PMIx initialization.

> Do you have a standard test case ? I can follow your test plan to simulate
> the proper logs for this matter.

No, the tests you're doing are actually what I need, but I also need the info requested in comment 30. Just repeat the experiment following the indications in comment 30; I don't need the gdb output at the moment. Depending on what I see, I will ask for more things.
Hi Felip,

Kindly review the following:

[root@m3a012 etc]# cat slurm.conf | grep SlurmdDebug
SlurmdDebug=debug3

[damienl@m3a012 ~]$ module list
Currently Loaded Modulefiles:
  1) openmpi/3.1.6-ucx

[damienl@m3a012 ~]$ srun -vvvv --mpi=pmix --reservation=AWX --nodelist=m3a012 --ntasks=2 /home/damienl/a.out
srun: defined options
srun: -------------------- --------------------
srun: mpi                 : pmix
srun: nodelist            : m3a012
srun: ntasks              : 2
srun: reservation         : AWX
srun: verbose             : 4
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=4096
srun: debug:  propagating RLIMIT_NOFILE=1024
srun: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=36580
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 42955
srun: debug:  Entering _msg_thr_internal
srun: debug3: eio_message_socket_readable: shutdown 0 fd 4
srun: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/auth_munge.so
srun: debug:  Munge authentication plugin loaded
srun: debug3: Success.
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: debug:  Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec
srun: debug:  Waited 0.300000 sec and still waiting: next sleep for 0.300000 sec
srun: Nodes m3a012 are ready for job
srun: jobid 14801428: nodes(1):`m3a012', cpu counts: 2(x1)
srun: debug2: creating job with 2 tasks
srun: debug:  requesting job 14801428, user 10005, nodes 1 including (m3a012)
srun: debug:  cpus 2, tasks 2, name a.out, relative 65534
srun: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  (null) [0] mpi_pmix.c:212 [p_mpi_hook_client_prelaunch] mpi/pmix: setup process mapping in srun
srun: debug:  Entering _msg_thr_create()
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 8
srun: debug:  initialized stdio listening socket, port 38639
srun: debug:  Started IO server thread (139834612782848)
srun: debug:  Entering _launch_tasks
srun: debug3: IO thread pid = 10240
srun: debug2: Called _file_readable
srun: debug3:   false, all ioservers not yet initialized
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: launching 14801428.0 on host m3a012, 2 tasks: [0-1]
srun: debug3: uid:10005 gid:10025 cwd:/home/damienl 0
srun: debug3: Trying to load plugin /opt/slurm-19.05.4/lib/slurm/route_default.so
srun: route default plugin loaded
srun: debug3: Success.
srun: debug2: Tree head got back 0 looking for 1
srun: debug3: Tree sending to m3a012
srun: debug2: Tree head got back 1
srun: debug:  launch returned msg_rc=0 err=0 type=8001

Waited 5 mins, then pressed Ctrl+C:

^Csrun: interrupt (one more within 1 sec to abort)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: sending Ctrl-C to job 14801428.0
srun: debug2: sending signal 2 to step 14801428.0 on hosts m3a012
srun: debug2: Tree head got back 0 looking for 1
srun: debug3: Tree sending to m3a012
srun: debug2: Tree head got back 1
srun: debug3: eio_message_socket_accept: start
srun: Job step 14801428.0 aborted before step completely launched.
srun: debug2: eio_message_socket_accept: got message connection from 172.16.200.79:46018 16
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: debug2: received job step complete message
srun: Complete job step 14801428.0 received
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 8
^Csrun: forcing job termination
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
srun: debug3: eio_message_socket_accept: start
srun: debug2: eio_message_socket_accept: got message connection from 172.16.200.79:46022 18
srun: debug2: received job step complete message
srun: Complete job step 14801428.0 received
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 8
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^Csrun: job abort in progress
^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^C^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^C^C^Csrun: interrupt (abort already in progress)
srun: step:14801428.0 tasks 0-1: unknown
^Csrun: job abort in progress
^C^C^C^Csrun: error: Timed out waiting for job step to complete
srun: debug3: eio_message_socket_accept: start
srun: debug2: eio_message_socket_accept: got message connection from 172.16.200.79:46028 18
srun: debug2: received job step complete message
srun: Complete job step 14801428.0 received
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 8
srun: debug3: eio_message_socket_readable: shutdown 1 fd 12
srun: debug2:   false, shutdown
srun: debug3: eio_message_socket_readable: shutdown 1 fd 8
srun: debug2:   false, shutdown
srun: debug2: Called _file_readable
srun: debug3:   false, shutdown
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug2:   false, shutdown
srun: debug:  IO thread exiting
srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
srun: debug3: eio_message_socket_readable: shutdown 1 fd 4
srun: debug2:   false, shutdown
srun: debug:  Leaving _msg_thr_internal

dmesg messages:

[damienl@m3a012 tmp]$ dmesg
[3917656.132106] echo (9975): drop_caches: 1
[3917712.010828] echo (10048): drop_caches: 1
[3917717.620904] echo (10057): drop_caches: 1
[3917755.767754] echo (10125): drop_caches: 1
[3917810.449355] LustreError: 10224:0:(layout.c:2121:__req_capsule_get()) @@@ Wrong buffer for field 'niobuf_inline' (7 of 7) in format 'LDLM_INTENT_OPEN', 0 vs. 0 (server)  req@ffff9ad53f260000 x1665840541408768/t549497789873(549497789873) o101->fs02-MDT0000-mdc-ffff9ad53cb8c000@172.16.192.3@o2ib:12/10 lens 616/600 e 0 to 0 dl 1592584327 ref 3 fl Complete:RPQU/4/0 rc 0/0 job:'bash.10005'
[3917824.703899] LustreError: 10230:0:(layout.c:2121:__req_capsule_get()) @@@ Wrong buffer for field 'niobuf_inline' (7 of 7) in format 'LDLM_INTENT_OPEN', 0 vs. 0 (server)  req@ffff9ac642378480 x1665840541415808/t549497841330(549497841330) o101->fs02-MDT0000-mdc-ffff9ad53cb8c000@172.16.192.3@o2ib:12/10 lens 616/600 e 0 to 0 dl 1592584341 ref 3 fl Complete:RPQU/4/0 rc 0/0 job:'bash.10005'
[3917829.915317] LustreError: 10236:0:(layout.c:2121:__req_capsule_get()) @@@ Wrong buffer for field 'niobuf_inline' (7 of 7) in format 'LDLM_INTENT_OPEN', 0 vs. 0 (server)  req@ffff9ac5b47a5580 x1665840541423040/t549497859215(549497859215) o101->fs02-MDT0000-mdc-ffff9ad53cb8c000@172.16.192.3@o2ib:12/10 lens 616/600 e 0 to 0 dl 1592584347 ref 3 fl Complete:RPQU/4/0 rc 0/0 job:'bash.10005'
[3917850.600014] echo (10246): drop_caches: 1
[3917929.517249] echo (10292): drop_caches: 1
Created attachment 14736 [details]
Latest slurmd log from Test node (m3a012)
From the logs, it seems pretty clear that the issue is happening during PMIx initialization, which may be making slurmstepd crash. We don't see anything more in the logs for step 0 after the call to pmixp_stepd_init()->pmixp_info_nspace_usock().

I see you're using /usr/local/pmix/3.1.4. We have a few options:

1. Try the latest 3.2 PMIx. You would need to recompile PMIx, Open MPI, and Slurm, pointing them at the new PMIx.
2. Add a debug patch to slurmstepd, recompile Slurm, and try again. We'll get further logs from there.
3. Look for a slurmstepd core file. Depending on how your OS is configured, an abort in a process should create a core file, which can later be inspected with gdb. Normally the core is generated using the command/path in /proc/sys/kernel/core_pattern, so check that. You're on RHEL 7, so I guess you will see the file in the abrt directory, /var/spool/abrt/. Can you check if there's any core file related to slurmstepd?

   cat /var/spool/abrt/ccpp.../cmdline   <--- this will tell you whether the directory relates to slurmstepd

Let's start with number 3; let me know if you find any core file. You can also look in Slurm's log directory on the node; cores are put there if they cannot go anywhere else.
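As a rough sketch, the core-file hunt in option 3 could look like the following. The /var/spool/abrt path is the RHEL 7 default and an assumption about your node; the loop is harmless to run even where abrt is not installed.

```shell
# 1. Where does the kernel send core dumps?
cat /proc/sys/kernel/core_pattern

# 2. With abrt, each crash gets a directory under /var/spool/abrt;
#    its "cmdline" file names the binary that crashed.
for d in /var/spool/abrt/ccpp-*; do
    [ -e "$d/cmdline" ] || continue
    if grep -q slurmstepd "$d/cmdline"; then
        echo "slurmstepd core in: $d"
    fi
done
```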
I have some more information. Some time ago I worked on bug 7646. Please see these comments:

https://bugs.schedmd.com/show_bug.cgi?id=7646#c26
https://bugs.schedmd.com/show_bug.cgi?id=7646#c28
https://bugs.schedmd.com/show_bug.cgi?id=7646#c30

Artem Polyakov is the developer of PMIx:

> Now I think what we should do to move forward is to document that UCX support is only available with rdma-core:
> * starting from MOFED 4.7 if rdma-core is explicitly enabled (need to double-check the exact instructions on how to enable)
> * starting from v5.0 - by default.

Could you try to enable rdma-core on your systems? I am not sure how to do this with your RoCE network though, which is out of my scope at the moment.
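As a quick sanity check before rebuilding anything, it may help to see which transports UCX currently detects on the node. The ucx_info tool ships with UCX, and the install prefix below matches the build line earlier in this ticket; treat both as assumptions and adjust for your environment.

```shell
# Put the UCX 1.8.0 install from this ticket on PATH (adjust as needed).
export PATH=/usr/local/ucx/1.8.0/bin:$PATH

if command -v ucx_info >/dev/null 2>&1; then
    # Lists memory domains, devices and transports; with rdma-core in place
    # you would expect verbs-based entries (rc/ud/dc) for the RoCE ports.
    ucx_info -d | grep -Ei 'transport|device' || true
else
    echo "ucx_info not found on PATH"
fi
```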
Damien, I will time out this bug for the moment. Please check my latest comments, and just mark the bug as open again if you feel we need to go deeper into this issue.

Thanks for your understanding!