Ticket 12115

Summary: srun does not trigger core dumps with SIGABRT
Product: Slurm    Reporter: Marc Caubet Serrabou <marc.caubet>
Component: slurmd    Assignee: Nate Rini <nate>
Status: RESOLVED TIMEDOUT    QA Contact:
Severity: 4 - Minor Issue
Priority: ---    CC: nate
Version: 20.11.7
Hardware: Linux
OS: Linux
Site: Paul Scherrer    Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: mpi_endlessloop.80s-65756,merlin-c-219.psi.ch.btr
mpi_endlessloop.80s-65755,merlin-c-219.psi.ch.btr
mpi_endlessloop.80s-65753,merlin-c-219.psi.ch.btr
slurm-519762.out.tar.gz
slurm-519701.out.tar.gz
slurmd_merlin-c-220.tar.gz
cgroup.conf
slurm.conf
mpi_endlessloop.80s-32107,merlin-c-320.psi.ch.btr
slurm-528502.out
slurm-528501.out
slurm-539438.out
slurm-539437.out
mpi_endlessloop.80s-60583,merlin-c-023.psi.ch.btr
slurm-544159.out
slurmd.log

Description Marc Caubet Serrabou 2021-07-23 09:47:55 MDT
Hi,

For some reason, srun does not produce a core dump when software crashes with SIGABRT, while mpirun handles it perfectly. This is simple to reproduce with the following code:

#### Software example
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 3) {
    /* Rank 3 sleeps forever so the job keeps running. */
    printf("PID %d is endless waiting\n", getpid());
    fflush(stdout);
    for (;;)
      sleep(10);
  } else {
    /* All other ranks abort, which should trigger a core dump. */
    printf("PID %d I am waiting\n", getpid());
    fflush(stdout);
    abort();
  }
}

#### Running software
ulimit -c unlimited
sbatch -n 4 --wrap "srun ./mpi_endlessloop"    # core dumps not generated
sbatch -n 4 --wrap "mpirun ./mpi_endlessloop"  # core dumps generated

Software run with srun only generates a couple of backtrace files, while software run with mpirun correctly generates the core dump files.
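As a reference point independent of Slurm: abort() raises SIGABRT (signal 6), and a shell reports a process killed by it with exit status 128 + 6 = 134. A minimal local sketch (whether a core file then actually appears depends on kernel.core_pattern, which may pipe the dump to a handler such as systemd-coredump instead of writing a file):

```shell
# A child shell killed by SIGABRT, as abort() would do, exits with 128 + 6 = 134.
ulimit -S -c unlimited 2>/dev/null || true   # raise the soft core limit if the hard limit allows
bash -c 'kill -ABRT $$'                      # stand-in for a program calling abort()
echo "exit=$?"                               # exit=134
```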

Thanks a lot,
Marc
Comment 1 Nate Rini 2021-07-23 09:55:16 MDT
Please call this:
> sbatch -n 4 --wrap "cat /proc/sys/kernel/core_pattern"
> sbatch -n 4 --wrap "srun cat /proc/sys/kernel/core_pattern"
Comment 2 Marc Caubet Serrabou 2021-07-23 09:57:54 MDT
Hi, thanks for taking care of it:

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$  sbatch -n 4 --wrap "cat /proc/sys/kernel/core_pattern"
Submitted batch job 504951
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ sbatch -n 4 --wrap "srun cat /proc/sys/kernel/core_pattern"
Submitted batch job 504952

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ cat slurm-504951.out
core
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ cat slurm-504952.out
core
core
core
core
Comment 3 Nate Rini 2021-07-23 10:05:49 MDT
(In reply to Marc Caubet Serrabou from comment #2)
> (base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ cat slurm-504951.out
> core
> (base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ cat slurm-504952.out
> core

Looks like it dumps as core in the current working directory of the job.

Please try this:
> sbatch -n 4 --wrap "srun bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"
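As background for why the ulimit in that one-liner matters: RLIMIT_CORE is inherited by child processes, which is the same mechanism srun relies on when it forwards the submitting shell's limit into the step. A quick local check, no Slurm needed:

```shell
# Child processes inherit the soft core-file limit (RLIMIT_CORE).
ulimit -S -c unlimited 2>/dev/null || true   # best effort; the hard limit may forbid raising it
parent=$(ulimit -S -c)
child=$(bash -c 'ulimit -S -c')
echo "parent=$parent child=$child"           # the two values always match
```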
Comment 4 Marc Caubet Serrabou 2021-07-26 00:24:24 MDT
Hi,

Thanks for replying. It looks like "core" is not found:

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ sbatch -n 4 --wrap "mpirun bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"
Submitted batch job 512906

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ cat slurm-512904.out
unlink: cannot unlink ‘core’: No such file or directory
unlink: cannot unlink ‘core’: No such file or directory
unlink: cannot unlink ‘core’: No such file or directory
unlink: cannot unlink ‘core’: No such file or directory
PID 47129 waiting
PID 47127 I am not rank 3
PID 47126 I am not rank 3

mpi_endlessloop:47127 terminated with signal 6 at PC=2b66727603d7 SP=7fff42beef48.  Backtrace:

mpi_endlessloop:47126 terminated with signal 6 at PC=2b1e0e34d3d7 SP=7ffe5b6abfa8.  Backtrace:
PID 47130 I am not rank 3

mpi_endlessloop:47130 terminated with signal 6 at PC=2b0d878703d7 SP=7ffe9c160288.  Backtrace:
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b66727603d7]
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b1e0e34d3d7]
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b0d878703d7]
/usr/lib64/libc.so.6(abort+0x148)[0x2b1e0e34eac8]
./mpi_endlessloop[0x40095b]
/usr/lib64/libc.so.6(abort+0x148)[0x2b0d87871ac8]
./mpi_endlessloop[0x40095b]
/usr/lib64/libc.so.6(abort+0x148)[0x2b6672761ac8]
./mpi_endlessloop[0x40095b]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b667274c555]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1e0e339555]
./mpi_endlessloop[0x4007c9]
./mpi_endlessloop[0x4007c9]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b0d8785c555]
./mpi_endlessloop[0x4007c9]
[1627280397.900741] [merlin-c-113:47126:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b819 apid 30000b816 is not released, refcount 1
[1627280397.900746] [merlin-c-113:47126:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b81a apid 20000b816 is not released, refcount 1
[1627280397.900748] [merlin-c-113:47126:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b817 apid 40000b816 is not released, refcount 1
[1627280397.900749] [merlin-c-113:47126:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b816 apid 10000b816 is not released, refcount 1
[1627280397.900811] [merlin-c-113:47127:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b819 apid 40000b817 is not released, refcount 1
[1627280397.900817] [merlin-c-113:47127:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b81a apid 30000b817 is not released, refcount 1
[1627280397.900819] [merlin-c-113:47127:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b817 apid 10000b817 is not released, refcount 1
[1627280397.900820] [merlin-c-113:47127:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b816 apid 20000b817 is not released, refcount 1
[1627280397.900823] [merlin-c-113:47130:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b819 apid 20000b81a is not released, refcount 1
[1627280397.900828] [merlin-c-113:47130:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b81a apid 10000b81a is not released, refcount 1
[1627280397.900830] [merlin-c-113:47130:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b817 apid 30000b81a is not released, refcount 1
[1627280397.900831] [merlin-c-113:47130:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 20000b816 apid 40000b81a is not released, refcount 1
core: cannot open (No such file or directory)
core: cannot open (No such file or directory)
core: cannot open (No such file or directory)
Comment 5 Nate Rini 2021-07-26 09:06:49 MDT
(In reply to Marc Caubet Serrabou from comment #4)
> [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$
> sbatch -n 4 --wrap "mpirun bash -c 'unlink core; ulimit -c unlimited;
> ./mpi_endlessloop; file core'"
> Submitted batch job 512906
>
> core: cannot open (No such file or directory)

Is it possible to modify the core_pattern to see if this is a file permission issue?
> echo '/tmp/core_%e_%g_%P_%s_%u' > /proc/sys/kernel/core_pattern
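For reference, the %-tokens in that pattern are expanded by the kernel as described in core(5): %e is the executable name, %g the real gid, %P the global PID, %s the signal number, and %u the real uid. A sketch of the resulting file name, using made-up values:

```shell
# Expand the suggested core_pattern with hypothetical values (tokens per core(5)).
pattern='/tmp/core_%e_%g_%P_%s_%u'
printf '%s\n' "$pattern" \
  | sed -e 's/%e/mpi_endlessloop/' -e 's/%g/1000/' \
        -e 's/%P/47127/' -e 's/%s/6/' -e 's/%u/39177/'
# -> /tmp/core_mpi_endlessloop_1000_47127_6_39177
```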
Comment 6 Marc Caubet Serrabou 2021-07-26 09:41:47 MDT
Hi,

after changing the core pattern as suggested:

[root@merlin-c-314 ~]# echo '/tmp/core_%e_%g_%P_%s_%u' > /proc/sys/kernel/core_pattern
[root@merlin-c-314 ~]# cat /proc/sys/kernel/core_pattern
/tmp/core_%e_%g_%P_%s_%u

what I see is the following:
  * For "mpirun", it generates the proper core file, with my username as the owner of the file, and the job gets aborted (from within the code).
  * For "srun", the job stays running (while it should be aborted once "abort()" is called), so one needs to cancel it. Cancelling via scancel with SIGABRT generates the core file, with root as the owner.

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ squeue -u caubet_m -a
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            515371 cpu-maint     wrap caubet_m  R       0:52      1 merlin-c-314

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ scancel --signal=SIGABRT 515371

Given that, it seems to be a file permission issue with "srun", which is not able to write a core file as the user running the job, while this works with "mpirun".

When cancelling a job (with SIGABRT) via scancel, it is done as "root" (slurmd), which therefore has the rights needed to generate the core file.

Is this correct?

Thanks a lot,
Marc
Comment 7 Nate Rini 2021-07-26 09:46:56 MDT
(In reply to Marc Caubet Serrabou from comment #6)
> Hi,
> 
> after changing the core pattern as suggested:
> 
> [root@merlin-c-314 ~]# echo '/tmp/core_%e_%g_%P_%s_%u' >
> /proc/sys/kernel/core_pattern
> [root@merlin-c-314 ~]# cat /proc/sys/kernel/core_pattern
> /tmp/core_%e_%g_%P_%s_%u

Please make sure to set this back to whatever is appropriate for your site afterwards, as this suggestion may have security implications.
 
> what I see is the following: 
>   * For "mpirun" it generates the proper core file, with my username as the
> owner of the file, and job gets aborted (from within the code). 
>   * For "srun", the job stays running (while it should be aborted once
> "abort()" is called). Then, one needs to cancel it. By cancelling via
> scancel with SIGABRT it generates the core file, with root as owner. 

This suggests that the Slurm integration isn't active on the MPI layer. Which MPI is being used?
Comment 8 Marc Caubet Serrabou 2021-07-27 01:59:07 MDT
Hi,


Thanks for the reminder. Yes, when I did the test I removed that node from production and then added it back with the proper pattern.


Regarding your question, I use OpenMPI v4.0.5. This is compiled with Slurm-related options; for example, some of the most relevant ones (from ompi_info -c):


  Configure command line: '--prefix=/opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0'
                          '--with-cuda=/opt/psi/Programming/cuda/11.1.0'
                          '--prefix=/opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0'
                          '--enable-mpi-cxx' '--enable-mpi-cxx-seek'
                          '--enable-orterun-prefix-by-default'
                          '--enable-shared' '--enable-static'
                          '--with-sge=yes' '--with-ucx'
                          '--with-hwloc=internal' '--with-slurm=yes'
                          '--with-pmi' '--with-pmi-libdir=/usr/lib64/'
                          '--enable-mpi-fortran' '--without-verbs'


Cheers,

Marc
Comment 9 Nate Rini 2021-07-28 08:58:09 MDT
Please provide the output of:
> orte-info
Comment 10 Marc Caubet Serrabou 2021-07-28 09:01:27 MDT
Hi,


here it is:


(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ orte-info
                Open RTE: 4.0.5
  Open RTE repo revision: v4.0.5
   Open RTE release date: Aug 26, 2020
                  Prefix: /opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: merlin-l-002.psi.ch
           Configured by: caubet_m
           Configured on: Mon Nov  9 21:11:08 CET 2020
          Configure host: merlin-l-002.psi.ch
  Configure command line: '--prefix=/opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0' '--with-cuda=/opt/psi/Programming/cuda/11.1.0'
                          '--prefix=/opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0' '--enable-mpi-cxx' '--enable-mpi-cxx-seek' '--enable-orterun-prefix-by-default'
                          '--enable-shared' '--enable-static' '--with-sge=yes' '--with-ucx' '--with-hwloc=internal' '--with-slurm=yes' '--with-pmi'
                          '--with-pmi-libdir=/usr/lib64/' '--enable-mpi-fortran' '--without-verbs'
                Built by: caubet_m
                Built on: Mon Nov  9 21:18:19 CET 2020
              Built host: merlin-l-002.psi.ch
              C compiler: /opt/psi/Programming/gcc/9.3.0/bin/gcc
     C compiler absolute:
  C compiler family name: GNU
      C compiler version: 9.3.0
          Thread support: posix (OPAL: yes, ORTE progress: yes, Event lib: yes)
  Internal debug support: no
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
orterun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   FT Checkpoint support: no (checkpoint thread: no)
           MCA allocator: basic (MCA v2.1, API v2.0, Component v4.0.5)
           MCA allocator: bucket (MCA v2.1, API v2.0, Component v4.0.5)
           MCA backtrace: execinfo (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA btl: self (MCA v2.1, API v3.1, Component v4.0.5)
                 MCA btl: smcuda (MCA v2.1, API v3.1, Component v4.0.5)
                 MCA btl: tcp (MCA v2.1, API v3.1, Component v4.0.5)
                 MCA btl: usnic (MCA v2.1, API v3.1, Component v4.0.5)
                 MCA btl: vader (MCA v2.1, API v3.1, Component v4.0.5)
            MCA compress: bzip (MCA v2.1, API v2.0, Component v4.0.5)
            MCA compress: gzip (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA crs: none (MCA v2.1, API v2.0, Component v4.0.5)
                  MCA dl: dlopen (MCA v2.1, API v1.0, Component v4.0.5)
               MCA event: libevent2022 (MCA v2.1, API v2.0, Component v4.0.5)
               MCA hwloc: hwloc201 (MCA v2.1, API v2.0, Component v4.0.5)
                  MCA if: linux_ipv6 (MCA v2.1, API v2.0, Component v4.0.5)
                  MCA if: posix_ipv4 (MCA v2.1, API v2.0, Component v4.0.5)
         MCA installdirs: env (MCA v2.1, API v2.0, Component v4.0.5)
         MCA installdirs: config (MCA v2.1, API v2.0, Component v4.0.5)
              MCA memory: patcher (MCA v2.1, API v2.0, Component v4.0.5)
               MCA mpool: hugepage (MCA v2.1, API v3.0, Component v4.0.5)
             MCA patcher: overwrite (MCA v2.1, API v1.0, Component v4.0.5)
                MCA pmix: isolated (MCA v2.1, API v2.0, Component v4.0.5)
                MCA pmix: pmix3x (MCA v2.1, API v2.0, Component v4.0.5)
                MCA pmix: s1 (MCA v2.1, API v2.0, Component v4.0.5)
                MCA pmix: s2 (MCA v2.1, API v2.0, Component v4.0.5)
               MCA pstat: linux (MCA v2.1, API v2.0, Component v4.0.5)
              MCA rcache: grdma (MCA v2.1, API v3.3, Component v4.0.5)
              MCA rcache: gpusm (MCA v2.1, API v3.3, Component v4.0.5)
              MCA rcache: rgpusm (MCA v2.1, API v3.3, Component v4.0.5)
           MCA reachable: weighted (MCA v2.1, API v2.0, Component v4.0.5)
           MCA reachable: netlink (MCA v2.1, API v2.0, Component v4.0.5)
               MCA shmem: mmap (MCA v2.1, API v2.0, Component v4.0.5)
               MCA shmem: posix (MCA v2.1, API v2.0, Component v4.0.5)
               MCA shmem: sysv (MCA v2.1, API v2.0, Component v4.0.5)
               MCA timer: linux (MCA v2.1, API v2.0, Component v4.0.5)
              MCA errmgr: default_app (MCA v2.1, API v3.0, Component v4.0.5)
              MCA errmgr: default_hnp (MCA v2.1, API v3.0, Component v4.0.5)
              MCA errmgr: default_orted (MCA v2.1, API v3.0, Component v4.0.5)
              MCA errmgr: default_tool (MCA v2.1, API v3.0, Component v4.0.5)
                 MCA ess: env (MCA v2.1, API v3.0, Component v4.0.5)
                 MCA ess: hnp (MCA v2.1, API v3.0, Component v4.0.5)
                 MCA ess: pmi (MCA v2.1, API v3.0, Component v4.0.5)
                 MCA ess: singleton (MCA v2.1, API v3.0, Component v4.0.5)
                 MCA ess: tool (MCA v2.1, API v3.0, Component v4.0.5)
                 MCA ess: slurm (MCA v2.1, API v3.0, Component v4.0.5)
               MCA filem: raw (MCA v2.1, API v2.0, Component v4.0.5)
             MCA grpcomm: direct (MCA v2.1, API v3.0, Component v4.0.5)
                 MCA iof: hnp (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA iof: orted (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA iof: tool (MCA v2.1, API v2.0, Component v4.0.5)
                MCA odls: default (MCA v2.1, API v2.0, Component v4.0.5)
                MCA odls: pspawn (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA oob: tcp (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA plm: isolated (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA plm: rsh (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA plm: slurm (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA ras: simulator (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA ras: gridengine (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA ras: slurm (MCA v2.1, API v2.0, Component v4.0.5)
                MCA regx: fwd (MCA v2.1, API v1.0, Component v4.0.5)
                MCA regx: naive (MCA v2.1, API v1.0, Component v4.0.5)
                MCA regx: reverse (MCA v2.1, API v1.0, Component v4.0.5)
               MCA rmaps: mindist (MCA v2.1, API v2.0, Component v4.0.5)
               MCA rmaps: ppr (MCA v2.1, API v2.0, Component v4.0.5)
               MCA rmaps: rank_file (MCA v2.1, API v2.0, Component v4.0.5)
               MCA rmaps: resilient (MCA v2.1, API v2.0, Component v4.0.5)
               MCA rmaps: round_robin (MCA v2.1, API v2.0, Component v4.0.5)
               MCA rmaps: seq (MCA v2.1, API v2.0, Component v4.0.5)
                 MCA rml: oob (MCA v2.1, API v3.0, Component v4.0.5)
              MCA routed: binomial (MCA v2.1, API v3.0, Component v4.0.5)
              MCA routed: direct (MCA v2.1, API v3.0, Component v4.0.5)
              MCA routed: radix (MCA v2.1, API v3.0, Component v4.0.5)
                 MCA rtc: hwloc (MCA v2.1, API v1.0, Component v4.0.5)
              MCA schizo: flux (MCA v2.1, API v1.0, Component v4.0.5)
              MCA schizo: ompi (MCA v2.1, API v1.0, Component v4.0.5)
              MCA schizo: orte (MCA v2.1, API v1.0, Component v4.0.5)
              MCA schizo: slurm (MCA v2.1, API v1.0, Component v4.0.5)
               MCA state: app (MCA v2.1, API v1.0, Component v4.0.5)
               MCA state: hnp (MCA v2.1, API v1.0, Component v4.0.5)
               MCA state: novm (MCA v2.1, API v1.0, Component v4.0.5)
               MCA state: orted (MCA v2.1, API v1.0, Component v4.0.5)
               MCA state: tool (MCA v2.1, API v1.0, Component v4.0.5)
Thanks a lot,
Marc
Comment 11 Nate Rini 2021-07-28 09:05:44 MDT
Please also call:
> srun --mpi=list
> ls -la /usr/lib64/lib*pmi*
Comment 12 Marc Caubet Serrabou 2021-07-28 09:08:43 MDT
Hi,


we only provide pmi2 support for srun; pmix and others will be provided soon too:


(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: cray_shasta

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ ls -la /usr/lib64/lib*pmi*
lrwxrwxrwx. 1 root root      21  6. Jul 2020  /usr/lib64/libfreeipmi.so -> libfreeipmi.so.17.1.4
lrwxrwxrwx. 1 root root      21  6. Jul 2020  /usr/lib64/libfreeipmi.so.17 -> libfreeipmi.so.17.1.4
-rwxr-xr-x. 1 root root 5156016 27. Mär 2019  /usr/lib64/libfreeipmi.so.17.1.4
lrwxrwxrwx. 1 root root      23  6. Jul 2020  /usr/lib64/libipmiconsole.so -> libipmiconsole.so.2.3.4
lrwxrwxrwx. 1 root root      23  6. Jul 2020  /usr/lib64/libipmiconsole.so.2 -> libipmiconsole.so.2.3.4
-rwxr-xr-x. 1 root root  249592 27. Mär 2019  /usr/lib64/libipmiconsole.so.2.3.4
lrwxrwxrwx. 1 root root      22  6. Jul 2020  /usr/lib64/libipmidetect.so -> libipmidetect.so.0.0.0
lrwxrwxrwx. 1 root root      22  6. Jul 2020  /usr/lib64/libipmidetect.so.0 -> libipmidetect.so.0.0.0
-rwxr-xr-x. 1 root root   62824 27. Mär 2019  /usr/lib64/libipmidetect.so.0.0.0
lrwxrwxrwx. 1 root root      26  6. Jul 2020  /usr/lib64/libipmimonitoring.so -> libipmimonitoring.so.6.0.6
lrwxrwxrwx. 1 root root      26  6. Jul 2020  /usr/lib64/libipmimonitoring.so.6 -> libipmimonitoring.so.6.0.6
-rwxr-xr-x. 1 root root  121416 27. Mär 2019  /usr/lib64/libipmimonitoring.so.6.0.6
lrwxrwxrwx. 1 root root      16 26. Mai 12:17 /usr/lib64/libpmi2.so -> libpmi2.so.0.0.0
lrwxrwxrwx. 1 root root      16 26. Mai 12:17 /usr/lib64/libpmi2.so.0 -> libpmi2.so.0.0.0
-rwxr-xr-x. 1 root root  239872 14. Mai 10:44 /usr/lib64/libpmi2.so.0.0.0
lrwxrwxrwx. 1 root root      15 26. Mai 12:17 /usr/lib64/libpmi.so -> libpmi.so.0.0.0
lrwxrwxrwx. 1 root root      15 26. Mai 12:17 /usr/lib64/libpmi.so.0 -> libpmi.so.0.0.0
-rwxr-xr-x. 1 root root  230896 14. Mai 10:44 /usr/lib64/libpmi.so.0.0.0
lrwxrwxrwx. 1 root root      17 26. Mai 10:36 /usr/lib64/librpmio.so.3 -> librpmio.so.3.2.2
-rwxr-xr-x. 1 root root  178928  2. Jun 2020  /usr/lib64/librpmio.so.3.2.2

Cheers,

Marc
Comment 13 Nate Rini 2021-07-28 09:15:56 MDT
(In reply to Marc Caubet Serrabou from comment #12)
> we only provide pmi2 support for srun, pmix and other will be soon provided
> too:

My main concern is that Slurm and the job (via openmpi) are using the correct pmi2 and not the pmi1 compatibility layer in pmi2 (or no PMI at all).

Please try this:
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"
Comment 14 Marc Caubet Serrabou 2021-07-28 09:24:09 MDT
Created attachment 20574 [details]
mpi_endlessloop.80s-65756,merlin-c-219.psi.ch.btr

Here it is:

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ cat slurm-519654.out
srun: defined options
srun: -------------------- --------------------
srun: (null)              : merlin-c-219
srun: jobid               : 519654
srun: job-name            : wrap
srun: mem-per-cpu         : 4000
srun: mpi                 : pmi2
srun: nodes               : 1
srun: ntasks              : 4
srun: verbose             : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CORE=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=45457
srun: debug:  auth/munge: init: Munge authentication plugin loaded
srun: jobid 519654: nodes(1):`merlin-c-219', cpu counts: 8(x1)
srun: debug2: creating job with 4 tasks
srun: debug:  requesting job 519654, user 39177, nodes 1 including ((null))
srun: debug:  cpus 4, tasks 4, name bash, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  mpi/pmi2: p_mpi_hook_client_prelaunch: mpi/pmi2: client_prelaunch
srun: debug:  mpi/pmi2: _get_proc_mapping: mpi/pmi2: processor mapping: (vector,(0,1,4))
srun: debug:  mpi/pmi2: _setup_srun_socket: mpi/pmi2: srun pmi port: 44382
srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable
srun: debug:  mpi/pmi2: pmi2_start_agent: mpi/pmi2: started agent thread
srun: debug:  Entering _msg_thr_create()
srun: debug:  initialized stdio listening socket, port 36265
srun: debug:  Started IO server thread (47631167780608)
srun: debug:  Entering _launch_tasks
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: launching StepId=519654.0 on host merlin-c-219, 4 tasks: [0-3]
srun: route/default: init: route default plugin loaded
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Tree head got back 1
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug2: Activity on IO listening socket 15
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving  io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving  io_init_msg_validate
srun: debug2: Validated IO connection from 129.129.185.99:56418, node rank 0, sd=16
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_read
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 129.129.185.99:38236 17
srun: debug2: received task launch
srun: launch/slurm: _task_start: Node merlin-c-219, 4 tasks started
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
unlink: cannot unlink ‘core’: No such file or directory
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
unlink: cannot unlink ‘core’: No such file or directory
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
unlink: cannot unlink ‘core’: No such file or directory
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
unlink: cannot unlink ‘core’: No such file or directory
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: mpi/pmi2: _tree_listen_read: mpi/pmi2: _tree_listen_read
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Tree head got back 1
srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable
srun: debug2: mpi/pmi2: _tree_listen_read: mpi/pmi2: _tree_listen_read
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Tree head got back 1
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
PID 65756 I am waiting
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
PID 65753 I am waiting
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
PID 65754 is endless waiting
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write

mpi_endlessloop:65756 terminated with signal 6 at PC=2b68714be3d7 SP=7fff76d48da8.  Backtrace:
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write

mpi_endlessloop:65753 terminated with signal 6 at PC=2b3d7e65e3d7 SP=7ffdbd426308.  Backtrace:
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
PID 65755 I am waiting
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write

mpi_endlessloop:65755 terminated with signal 6 at PC=2b88023433d7 SP=7ffe2168eac8.  Backtrace:
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b68714be3d7]
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d7e65e3d7]
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b88023433d7]
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(abort+0x148)[0x2b68714bfac8]
./mpi_endlessloop[0x4009eb]
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(abort+0x148)[0x2b3d7e65fac8]
./mpi_endlessloop[0x4009eb]
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b68714aa555]
./mpi_endlessloop[0x400859]
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(abort+0x148)[0x2b8802344ac8]
./mpi_endlessloop[0x4009eb]
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3d7e64a555]
./mpi_endlessloop[0x400859]
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b880232f555]
./mpi_endlessloop[0x400859]
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
[1627485536.719843] [merlin-c-219:65756:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100d9 apid 3000100dc is not released, refcount 1
[1627485536.719849] [merlin-c-219:65756:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100db apid 2000100dc is not released, refcount 1
[1627485536.719851] [merlin-c-219:65756:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100da apid 4000100dc is not released, refcount 1
[1627485536.719852] [merlin-c-219:65756:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100dc apid 1000100dc is not released, refcount 1
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
[1627485536.720234] [merlin-c-219:65753:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100d9 apid 1000100d9 is not released, refcount 1
[1627485536.720242] [merlin-c-219:65753:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100db apid 4000100d9 is not released, refcount 1
[1627485536.720244] [merlin-c-219:65753:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100da apid 2000100d9 is not released, refcount 1
[1627485536.720245] [merlin-c-219:65753:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100dc apid 3000100d9 is not released, refcount 1
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
[1627485536.721140] [merlin-c-219:65755:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100d9 apid 2000100db is not released, refcount 1
[1627485536.721153] [merlin-c-219:65755:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100db apid 1000100db is not released, refcount 1
[1627485536.721159] [merlin-c-219:65755:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100da apid 3000100db is not released, refcount 1
[1627485536.721163] [merlin-c-219:65755:0]       mm_xpmem.c:85   UCX  WARN  remote segment id 2000100dc apid 4000100db is not released, refcount 1
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
core: cannot open (No such file or directory)
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
core: cannot open (No such file or directory)
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
core: cannot open (No such file or directory)
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 129.129.185.99:38252 17
srun: debug2: received task exit
srun: launch/slurm: _task_finish: Received task exit notification for 3 tasks of StepId=519654.0 (status=0x0000).
srun: launch/slurm: _task_finish: merlin-c-219: tasks 0-2: Completed
srun: debug:  task 0 done
srun: debug:  task 1 done
srun: debug:  task 2 done

BTR files (see attached files) are generated in these cases with srun, instead of the core dumps that mpirun produces.

Thanks a lot,
Marc
Comment 15 Marc Caubet Serrabou 2021-07-28 09:24:09 MDT
Created attachment 20575 [details]
mpi_endlessloop.80s-65755,merlin-c-219.psi.ch.btr
Comment 16 Marc Caubet Serrabou 2021-07-28 09:24:09 MDT
Created attachment 20576 [details]
mpi_endlessloop.80s-65753,merlin-c-219.psi.ch.btr
Comment 17 Nate Rini 2021-07-28 09:39:32 MDT
(In reply to Nate Rini from comment #13)
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"

Let's go for a more verbose backtrace:
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; env SEGFAULT_SIGNALS=\"fault bus abrt\" catchsegv ./mpi_endlessloop; file core'"
Comment 18 Marc Caubet Serrabou 2021-07-28 09:53:34 MDT
Created attachment 20577 [details]
slurm-519762.out.tar.gz

Now it generates the core dumps. I attach the files in a compressed file as these were pretty big:


  - slurm-519701.out: contains the abort() in all ranks except rank 3, which is why it is much bigger (the loop in rank 3 keeps running). It generates 3 core files.

  - slurm-519762.out: I moved the abort() inside the loop, so the file is smaller and only 1 core dump is generated (I did it because it may be easier to debug).


Thanks a lot,

Marc
Comment 19 Marc Caubet Serrabou 2021-07-28 09:53:34 MDT
Created attachment 20578 [details]
slurm-519701.out.tar.gz
Comment 20 Nate Rini 2021-07-28 10:05:09 MDT
(In reply to Marc Caubet Serrabou from comment #19)
> Created attachment 20578 [details]
> slurm-519701.out.tar.gz

The logs confirm PMI is loaded, but both library versions are mapped:
> 2b6ba5bf0000-2b6ba5bf1000 r--p 00007000 fd:00 14199414 /usr/lib64/libpmi2.so.0.0.0
> 2b6ba5c02000-2b6ba5c07000 r-xp 00000000 fd:00 14199411 /usr/lib64/libpmi.so.0.0.0

Since catchsegv is working, we also know that the kernel has no issue generating core dumps. It looks like the issue lies with the process's resource limits instead.

Let's call this:
> sbatch -n 4 --wrap "srun --mpi=pmi2 bash -c 'ulimit -H -c; ulimit -S -c'"

Please also attach the slurmd log from 'merlin-c-220'.
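(The ulimit checks above can also be done programmatically from inside a task. A minimal sketch, not part of the ticket, using Python's standard resource module:)

```python
import resource

# Query the soft and hard core-file-size limits of the current process,
# the programmatic equivalent of "ulimit -S -c" and "ulimit -H -c".
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

def fmt(limit):
    # resource.RLIM_INFINITY corresponds to "unlimited" in ulimit output.
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)

print("soft core limit:", fmt(soft))
print("hard core limit:", fmt(hard))
```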
Comment 21 Marc Caubet Serrabou 2021-07-28 10:11:05 MDT
Created attachment 20579 [details]
slurmd_merlin-c-220.tar.gz

Here it is:


(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ sbatch -n 4 --wrap "srun --mpi=pmi2 bash -c 'ulimit -H -c; ulimit -S -c'"
Submitted batch job 520018

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ cat slurm-520018.out

unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited


I attach the log file for merlin-c-220.


Thanks a lot!

Marc
Comment 22 Nate Rini 2021-07-28 10:15:15 MDT
(In reply to Marc Caubet Serrabou from comment #21)
> Created attachment 20579 [details]
> slurmd_merlin-c-220.tar.gz

This is probably unrelated to this ticket but the influxdb server appears to be quite unhappy:
> {"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=216"}

Note that when InfluxDB updates fail, Slurm will cache the update and retry (forever), which will likely slow down the slurmds and use a considerable amount of space in the node's spool directory.
Comment 23 Nate Rini 2021-07-28 10:48:58 MDT
Please provide an updated copy of your slurm.conf and cgroup.conf (if present).
Comment 24 Marc Caubet Serrabou 2021-08-02 07:05:30 MDT
Created attachment 20630 [details]
cgroup.conf

Hi,


attached both, cgroup.conf and slurm.conf.


Thanks a lot for pointing out the problem with InfluxDB. I was aware of it: our InfluxDB is not able to store all the generated entries (I wanted to integrate it into our monitoring system), so I will probably move the data to a different format (HDF5), process what I need, and then send it to InfluxDB. I will change the configuration to get rid of these messages.
Comment 25 Marc Caubet Serrabou 2021-08-02 07:05:31 MDT
Created attachment 20631 [details]
slurm.conf
Comment 26 Nate Rini 2021-08-02 10:03:08 MDT
(In reply to Marc Caubet Serrabou from comment #24)
> Created attachment 20630 [details]
> cgroup.conf
>
> ConstrainDevices=no
> AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

AllowedDevicesFile is no longer required to constrain devices and can safely be removed from your config.
Comment 27 Nate Rini 2021-08-04 11:22:30 MDT
(In reply to Marc Caubet Serrabou from comment #25)
> Created attachment 20631 [details]
> slurm.conf
Please change
> PropagateResourceLimitsExcept=AS,CPU,DATA,FSIZE,MEMLOCK,NOFILE,NPROC,RSS,STACK
to
> PropagateResourceLimitsExcept=AS,CPU,DATA,FSIZE,MEMLOCK,NOFILE,NPROC,RSS,STACK,CORE

This will require a restart of all Slurm daemons.

Then call:
> ulimit -c unlimited; sbatch -n 4 --wrap "ulimit -S -c; srun --mpi=pmi2 bash -c 'ulimit -H -c; ulimit -S -c'"
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"
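(What adding CORE to PropagateResourceLimitsExcept changes is whether Slurm copies the submission shell's RLIMIT_CORE onto the launched tasks; with CORE excepted, the compute node's own default applies instead. A hedged sketch of the underlying mechanism, using Python's resource module rather than Slurm's internal calls:)

```python
import resource

# Roughly what "ulimit -c unlimited" does before launching a job:
# raise the soft core-file limit up to the hard limit. Propagation in
# Slurm means the submit host's values get re-applied on the compute
# node; excepting CORE stops that copy for the core-file limit.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print("soft core limit now equals hard limit:", hard)
```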
Comment 28 Marc Caubet Serrabou 2021-08-06 05:18:09 MDT
Created attachment 20707 [details]
mpi_endlessloop.80s-32107,merlin-c-320.psi.ch.btr

Hi,

I prepared a partition for it, modified the slurm daemon as proposed, and restarted slurmd.

(base) caubet_m@caubet-laptop:~/vxargs$ ./exec_vxargs.sh merlin6/mu3e "sed -i 's/PropagateResourceLimitsExcept=.*/PropagateResourceLimitsExcept=AS,CPU,DATA,FSIZE,MEMLOCK,NOFILE,NPROC,RSS,STACK,CORE/g' /etc/slurm/slurm.conf; systemctl restart slurmd"
exit code 0: 6 job(s)
total number of jobs: 6

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ ulimit -c unlimited; sbatch -n 4 --partition=mu3e --wrap "ulimit -S -c; srun --mpi=pmi2 bash -c 'ulimit -H -c; ulimit -S -c'"
Submitted batch job 528501

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ sbatch -n 4 --partition mu3e --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"
Submitted batch job 528502

Attached corresponding log files
Comment 29 Marc Caubet Serrabou 2021-08-06 05:18:09 MDT
Created attachment 20708 [details]
slurm-528502.out
Comment 30 Marc Caubet Serrabou 2021-08-06 05:18:09 MDT
Created attachment 20709 [details]
slurm-528501.out
Comment 31 Nate Rini 2021-08-13 13:05:46 MDT
(In reply to Marc Caubet Serrabou from comment #10)
>            MCA backtrace: execinfo (MCA v2.1, API v2.0, Component v4.0.5)

I suspect the problem is that a target for the cores already exists, causing the kernel not to dump them a second time.

Please call:
> orte-info --param backtrace all
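(One way to test that suspicion is to inspect the kernel's core pattern: if it pipes dumps to a userspace helper, the helper, not the kernel, decides what lands on disk. A minimal sketch, assuming a Linux node with /proc mounted:)

```python
from pathlib import Path

# The kernel consults /proc/sys/kernel/core_pattern to decide where a
# core dump goes. A pattern starting with '|' pipes the dump to a
# userspace helper process, which may write something other than a
# plain "core" file (or nothing at all).
pattern = Path("/proc/sys/kernel/core_pattern").read_text().strip()
if pattern.startswith("|"):
    print("cores piped to helper:", pattern[1:].split()[0])
else:
    print("cores written using pattern:", pattern)
```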
Comment 32 Marc Caubet Serrabou 2021-08-17 08:16:15 MDT
Hi,


here is the output for that command:


(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/User/sobbia_r/TEST-RSM]$ orte-info --param backtrace all
           MCA backtrace: parameter "backtrace" (current value: "", data source: default, level: 2 user/detail, type: string)
                          Default selection set of components for the backtrace framework (<none> means use all components that can be found)
           MCA backtrace: parameter "backtrace_base_verbose" (current value: "error", data source: default, level: 8 dev/detail, type: int)
                          Verbosity level for the backtrace framework (default: 0)
                          Valid values: -1:"none", 0:"error", 10:"component", 20:"warn", 40:"info", 60:"trace", 80:"debug", 100:"max", 0 - 100
Comment 34 Nate Rini 2021-08-17 11:54:20 MDT
Please try this:
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; export OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64 ./mpi_endlessloop; file core'"
Comment 35 Marc Caubet Serrabou 2021-08-19 00:16:30 MDT
Created attachment 20907 [details]
slurm-539438.out

I did it as follows:

(base) ❄ [caubet_m@merlin-l-001 abort_example]$ sbatch -n 4 --partition=cpu-maint --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; export OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64 ./mpi_endlessloop; file core'"
Submitted batch job 539437

then, the same command but by setting "ulimit -c unlimited" before running the job:

(base) ❄ [caubet_m@merlin-l-001 abort_example]$ ulimit -c unlimited
(base) ❄ [caubet_m@merlin-l-001 abort_example]$ sbatch -n 4 --partition=cpu-maint --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; export OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64 ./mpi_endlessloop; file core'"
Submitted batch job 539438

Both outputs are attached.

Thanks a lot for your help,
Marc
Comment 36 Marc Caubet Serrabou 2021-08-19 00:16:30 MDT
Created attachment 20908 [details]
slurm-539437.out
Comment 37 Nate Rini 2021-08-20 10:53:48 MDT
Looks like there was a typo in the command:

Please call this instead (env instead of export):
> (base) ❄ [caubet_m@merlin-l-001 abort_example]$ sbatch -n 4
> --partition=cpu-maint --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core;
> env OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64
> ./mpi_endlessloop; file core'"
> Submitted batch job 539437
Comment 38 Marc Caubet Serrabou 2021-08-23 10:00:30 MDT
Created attachment 20976 [details]
mpi_endlessloop.80s-60583,merlin-c-023.psi.ch.btr

Hi,


attached are the output and the generated BTR files for:


(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ sbatch -n 4 --partition=cpu-maint --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; env OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64 ./mpi_endlessloop; file core'"
Submitted batch job 544159

Thanks a lot,

Marc
Comment 39 Marc Caubet Serrabou 2021-08-23 10:00:31 MDT
Created attachment 20977 [details]
slurm-544159.out
Comment 40 Nate Rini 2021-08-23 10:09:48 MDT
(In reply to Marc Caubet Serrabou from comment #39)
> Created attachment 20977 [details]
> slurm-544159.out

Please also attach the slurmd log during the test.
Comment 41 Marc Caubet Serrabou 2021-08-24 04:17:45 MDT
Created attachment 20990 [details]
slurmd.log

Hi,


attached the log file for the node.


Thanks a lot,

Marc
Comment 42 Nate Rini 2021-08-24 10:36:05 MDT
Which version of UCX is the job using?
Comment 43 Marc Caubet Serrabou 2021-09-02 02:49:21 MDT
Hi,

sorry, I thought I had already answered, and I just realized that I never replied. We run UCX v1.10:

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ ucx_info -v
# UCT version=1.10.0 revision a212a09
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --with-xpmem --without-ugni --with-cuda=/usr/local/cuda-10.2

Sorry for the delay,
Marc
Comment 44 Nate Rini 2021-09-02 10:47:19 MDT
Is it possible to recompile UCX without these options?
> --disable-logging --disable-debug
Comment 45 Marc Caubet Serrabou 2021-09-02 10:47:35 MDT
Until Tuesday, September 7, I will be out of the office.


For all urgent matters please contact:

  *   PSI@CSCS projects: psi-hpc-at-cscs-admin@lists.psi.ch
  *   MeG cluster: meg-admins@lists.psi.ch
  *   Merlin Clusters: merlin-admins@lists.psi.ch


Sorry for any inconvenience and best regards,

Marc Caubet Serrabou
Comment 46 Marc Caubet Serrabou 2021-09-03 00:29:06 MDT
Hi,


this build comes from the system packages in the Mellanox OFED repositories. However, I will make a separate build excluding those options. It will take some time; I will update you as soon as possible.
Comment 47 Nate Rini 2021-09-03 09:13:22 MDT
(In reply to Marc Caubet Serrabou from comment #46)
> this is the compilation for system packages coming from Mellanox OFED
> repositories.
Which version of MOFED includes it?

> However, I will make a different compilation excluding those
> options. It would take some time, I will update you as soon as possible.
Great, so far my testing with UCX has not been able to replicate the issue even with the same version.
Comment 48 Nate Rini 2021-10-04 09:11:15 MDT
(In reply to Nate Rini from comment #47)
> (In reply to Marc Caubet Serrabou from comment #46)
> > this is the compilation for system packages coming from Mellanox OFED
> > repositories.
> Which version of MOFED includes it?

Which version of MOFED includes it?
Comment 49 Marc Caubet Serrabou 2021-10-04 09:23:16 MDT
Hi Nate,

the version is OFED v5.2-2.2.0.1 for rhel7u9.

A couple of weeks ago I compiled it, but I was not able to make it work. I will run further tests this week.
Comment 50 Nate Rini 2021-10-13 11:20:56 MDT
Marc

I'm going to time this ticket out while we wait for your test results. Please reply and we can continue debugging.

Thanks,
--Nate