Please call this:
> sbatch -n 4 --wrap "cat /proc/sys/kernel/core_pattern"
> sbatch -n 4 --wrap "srun cat /proc/sys/kernel/core_pattern"
Hi, thanks for taking care of it:

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ sbatch -n 4 --wrap "cat /proc/sys/kernel/core_pattern"
Submitted batch job 504951
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ sbatch -n 4 --wrap "srun cat /proc/sys/kernel/core_pattern"
Submitted batch job 504952
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ cat slurm-504951.out
core
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ cat slurm-504952.out
core
core
core
core

(In reply to Marc Caubet Serrabou from comment #2)
> (base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ cat slurm-504951.out
> core
> (base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ cat slurm-504952.out
> core

Looks like it dumps as core in the current working directory of the job. Please try this:
> sbatch -n 4 --wrap "srun bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"

Hi, thanks for replying. Looks like "core" is not found:

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ sbatch -n 4 --wrap "mpirun bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"
Submitted batch job 512906
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ cat slurm-512904.out
unlink: cannot unlink ‘core’: No such file or directory
unlink: cannot unlink ‘core’: No such file or directory
unlink: cannot unlink ‘core’: No such file or directory
unlink: cannot unlink ‘core’: No such file or directory
PID 47129 waiting
PID 47127 I am not rank 3
PID 47126 I am not rank 3
mpi_endlessloop:47127 terminated with signal 6 at PC=2b66727603d7 SP=7fff42beef48. Backtrace:
mpi_endlessloop:47126 terminated with signal 6 at PC=2b1e0e34d3d7 SP=7ffe5b6abfa8. Backtrace:
PID 47130 I am not rank 3
mpi_endlessloop:47130 terminated with signal 6 at PC=2b0d878703d7 SP=7ffe9c160288. Backtrace:
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b66727603d7]
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b1e0e34d3d7]
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b0d878703d7]
/usr/lib64/libc.so.6(abort+0x148)[0x2b1e0e34eac8]
./mpi_endlessloop[0x40095b]
/usr/lib64/libc.so.6(abort+0x148)[0x2b0d87871ac8]
./mpi_endlessloop[0x40095b]
/usr/lib64/libc.so.6(abort+0x148)[0x2b6672761ac8]
./mpi_endlessloop[0x40095b]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b667274c555]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1e0e339555]
./mpi_endlessloop[0x4007c9]
./mpi_endlessloop[0x4007c9]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b0d8785c555]
./mpi_endlessloop[0x4007c9]
[1627280397.900741] [merlin-c-113:47126:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b819 apid 30000b816 is not released, refcount 1
[1627280397.900746] [merlin-c-113:47126:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b81a apid 20000b816 is not released, refcount 1
[1627280397.900748] [merlin-c-113:47126:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b817 apid 40000b816 is not released, refcount 1
[1627280397.900749] [merlin-c-113:47126:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b816 apid 10000b816 is not released, refcount 1
[1627280397.900811] [merlin-c-113:47127:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b819 apid 40000b817 is not released, refcount 1
[1627280397.900817] [merlin-c-113:47127:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b81a apid 30000b817 is not released, refcount 1
[1627280397.900819] [merlin-c-113:47127:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b817 apid 10000b817 is not released, refcount 1
[1627280397.900820] [merlin-c-113:47127:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b816 apid 20000b817 is not released, refcount 1
[1627280397.900823] [merlin-c-113:47130:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b819 apid 20000b81a is not released, refcount 1
[1627280397.900828] [merlin-c-113:47130:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b81a apid 10000b81a is not released, refcount 1
[1627280397.900830] [merlin-c-113:47130:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b817 apid 30000b81a is not released, refcount 1
[1627280397.900831] [merlin-c-113:47130:0] mm_xpmem.c:85 UCX WARN remote segment id 20000b816 apid 40000b81a is not released, refcount 1
core: cannot open (No such file or directory)
core: cannot open (No such file or directory)
core: cannot open (No such file or directory)

(In reply to Marc Caubet Serrabou from comment #4)
> [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$
> sbatch -n 4 --wrap "mpirun bash -c 'unlink core; ulimit -c unlimited;
> ./mpi_endlessloop; file core'"
> Submitted batch job 512906
>
> core: cannot open (No such file or directory)

Is it possible to modify the core_pattern to see if this is a file permission issue?
> echo '/tmp/core_%e_%g_%P_%s_%u' > /proc/sys/kernel/core_pattern

Hi,
after changing the core pattern as suggested:
[root@merlin-c-314 ~]# echo '/tmp/core_%e_%g_%P_%s_%u' > /proc/sys/kernel/core_pattern
[root@merlin-c-314 ~]# cat /proc/sys/kernel/core_pattern
/tmp/core_%e_%g_%P_%s_%u
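The %-specifiers in that pattern are expanded by the kernel at dump time (per core(5): %e is the executable name, %u the real UID, %s the signal number). As an illustration only, a pure-shell mock of that expansion (the kernel performs the real substitution itself):

```shell
# Illustrative only: mimic the kernel's expansion of a few core_pattern
# specifiers. The real expansion happens inside the kernel at dump time.
expand_core_pattern() {
    local pattern=$1 exe=$2 uid=$3 sig=$4
    pattern=${pattern//%e/$exe}   # executable name
    pattern=${pattern//%u/$uid}   # real UID
    pattern=${pattern//%s/$sig}   # signal number
    printf '%s\n' "$pattern"
}
expand_core_pattern '/tmp/core_%e_%s_%u' mpi_endlessloop 39177 6
# -> /tmp/core_mpi_endlessloop_6_39177
```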
what I see is the following:
* For "mpirun", it generates the proper core file, with my username as the owner of the file, and the job gets aborted (from within the code).
* For "srun", the job stays running (while it should be aborted once "abort()" is called), and one needs to cancel it. Cancelling via scancel with SIGABRT generates the core file, with root as owner.
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ squeue -u caubet_m -a
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
515371 cpu-maint wrap caubet_m R 0:52 1 merlin-c-314
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ scancel --signal=SIGABRT 515371
Given that, it seems to be a file permission issue: "srun" is not able to write a core file as the user running the job, while this works with "mpirun".
When cancelling a job (with SIGABRT) via scancel, the signal is delivered as "root" (slurmd), which has the necessary rights to generate the core file.
Is this correct?
Thanks a lot,
Marc
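For context, the signal-6 terminations above can be reproduced without Slurm or MPI: a process that dies from SIGABRT (what abort() raises, and what scancel --signal=SIGABRT delivers) is reported by the parent shell with exit status 128+6, regardless of whether a core file could actually be written. A minimal sketch:

```shell
# A stand-in for an aborting MPI rank: the process kills itself with
# SIGABRT; the parent shell sees exit status 128 + 6 = 134. Whether a
# core file also appears depends on ulimit -c, core_pattern, and write
# permission in the dump directory.
if bash -c 'kill -ABRT $$'; then st=0; else st=$?; fi
echo "exit status: $st"   # -> exit status: 134
```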
(In reply to Marc Caubet Serrabou from comment #6)
> Hi,
>
> after changing the core pattern as suggested:
>
> [root@merlin-c-314 ~]# echo '/tmp/core_%e_%g_%P_%s_%u' >
> /proc/sys/kernel/core_pattern
> [root@merlin-c-314 ~]# cat /proc/sys/kernel/core_pattern
> /tmp/core_%e_%g_%P_%s_%u

Please make sure to set this back to whatever is appropriate for your site afterwards. This suggestion may have security implications.

> what I see is the following:
> * For "mpirun" it generates the proper core file, with my username as the
> owner of the file, and job gets aborted (from within the code).
> * For "srun", the job stays running (while it should be aborted once
> "abort()" is called). Then, one needs to cancel it. By cancelling via
> scancel with SIGABRT it generates the core file, with root as owner.

This suggests that the Slurm integration isn't active on the MPI layer. Which MPI is being used?

Hi,
Thanks for the reminder. Yes, when I did the test I removed that node from production and then added it back with the proper pattern.
Regarding your question, I use OpenMPI v4.0.5. It is compiled with Slurm-related options; for example, some of the most relevant ones (from ompi_info -c):
Configure command line: '--prefix=/opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0'
'--with-cuda=/opt/psi/Programming/cuda/11.1.0'
'--prefix=/opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0'
'--enable-mpi-cxx' '--enable-mpi-cxx-seek'
'--enable-orterun-prefix-by-default'
'--enable-shared' '--enable-static'
'--with-sge=yes' '--with-ucx'
'--with-hwloc=internal' '--with-slurm=yes'
'--with-pmi' '--with-pmi-libdir=/usr/lib64/'
'--enable-mpi-fortran' '--without-verbs'
Cheers,
Marc
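A quick way to confirm the slurm components were actually built (not just passed at configure time) is to grep the component list. The sample lines below are simulated, since running ompi_info requires the actual install; on the real system the check would simply be `ompi_info | grep slurm`:

```shell
# Simulated ompi_info component lines; the real command would be:
#   ompi_info | grep slurm
sample='MCA plm: slurm (MCA v2.1, API v2.0, Component v4.0.5)
MCA ess: slurm (MCA v2.1, API v3.0, Component v4.0.5)'
printf '%s\n' "$sample" | grep -c slurm   # -> 2
```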
Please provide the output of:
> orte-info
Hi,
here it is:
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ orte-info
Open RTE: 4.0.5
Open RTE repo revision: v4.0.5
Open RTE release date: Aug 26, 2020
Prefix: /opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0
Configured architecture: x86_64-unknown-linux-gnu
Configure host: merlin-l-002.psi.ch
Configured by: caubet_m
Configured on: Mon Nov 9 21:11:08 CET 2020
Configure command line: '--prefix=/opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0' '--with-cuda=/opt/psi/Programming/cuda/11.1.0'
'--prefix=/opt/psi/Compiler/openmpi/4.0.5_slurm/gcc/9.3.0' '--enable-mpi-cxx' '--enable-mpi-cxx-seek' '--enable-orterun-prefix-by-default'
'--enable-shared' '--enable-static' '--with-sge=yes' '--with-ucx' '--with-hwloc=internal' '--with-slurm=yes' '--with-pmi'
'--with-pmi-libdir=/usr/lib64/' '--enable-mpi-fortran' '--without-verbs'
Built by: caubet_m
Built on: Mon Nov 9 21:18:19 CET 2020
Built host: merlin-l-002.psi.ch
C compiler: /opt/psi/Programming/gcc/9.3.0/bin/gcc
C compiler absolute:
C compiler family name: GNU
C compiler version: 9.3.0
Thread support: posix (OPAL: yes, ORTE progress: yes, Event lib: yes)
Internal debug support: no
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
orterun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
FT Checkpoint support: no (checkpoint thread: no)
MCA allocator: basic (MCA v2.1, API v2.0, Component v4.0.5)
MCA allocator: bucket (MCA v2.1, API v2.0, Component v4.0.5)
MCA backtrace: execinfo (MCA v2.1, API v2.0, Component v4.0.5)
MCA btl: self (MCA v2.1, API v3.1, Component v4.0.5)
MCA btl: smcuda (MCA v2.1, API v3.1, Component v4.0.5)
MCA btl: tcp (MCA v2.1, API v3.1, Component v4.0.5)
MCA btl: usnic (MCA v2.1, API v3.1, Component v4.0.5)
MCA btl: vader (MCA v2.1, API v3.1, Component v4.0.5)
MCA compress: bzip (MCA v2.1, API v2.0, Component v4.0.5)
MCA compress: gzip (MCA v2.1, API v2.0, Component v4.0.5)
MCA crs: none (MCA v2.1, API v2.0, Component v4.0.5)
MCA dl: dlopen (MCA v2.1, API v1.0, Component v4.0.5)
MCA event: libevent2022 (MCA v2.1, API v2.0, Component v4.0.5)
MCA hwloc: hwloc201 (MCA v2.1, API v2.0, Component v4.0.5)
MCA if: linux_ipv6 (MCA v2.1, API v2.0, Component v4.0.5)
MCA if: posix_ipv4 (MCA v2.1, API v2.0, Component v4.0.5)
MCA installdirs: env (MCA v2.1, API v2.0, Component v4.0.5)
MCA installdirs: config (MCA v2.1, API v2.0, Component v4.0.5)
MCA memory: patcher (MCA v2.1, API v2.0, Component v4.0.5)
MCA mpool: hugepage (MCA v2.1, API v3.0, Component v4.0.5)
MCA patcher: overwrite (MCA v2.1, API v1.0, Component v4.0.5)
MCA pmix: isolated (MCA v2.1, API v2.0, Component v4.0.5)
MCA pmix: pmix3x (MCA v2.1, API v2.0, Component v4.0.5)
MCA pmix: s1 (MCA v2.1, API v2.0, Component v4.0.5)
MCA pmix: s2 (MCA v2.1, API v2.0, Component v4.0.5)
MCA pstat: linux (MCA v2.1, API v2.0, Component v4.0.5)
MCA rcache: grdma (MCA v2.1, API v3.3, Component v4.0.5)
MCA rcache: gpusm (MCA v2.1, API v3.3, Component v4.0.5)
MCA rcache: rgpusm (MCA v2.1, API v3.3, Component v4.0.5)
MCA reachable: weighted (MCA v2.1, API v2.0, Component v4.0.5)
MCA reachable: netlink (MCA v2.1, API v2.0, Component v4.0.5)
MCA shmem: mmap (MCA v2.1, API v2.0, Component v4.0.5)
MCA shmem: posix (MCA v2.1, API v2.0, Component v4.0.5)
MCA shmem: sysv (MCA v2.1, API v2.0, Component v4.0.5)
MCA timer: linux (MCA v2.1, API v2.0, Component v4.0.5)
MCA errmgr: default_app (MCA v2.1, API v3.0, Component v4.0.5)
MCA errmgr: default_hnp (MCA v2.1, API v3.0, Component v4.0.5)
MCA errmgr: default_orted (MCA v2.1, API v3.0, Component v4.0.5)
MCA errmgr: default_tool (MCA v2.1, API v3.0, Component v4.0.5)
MCA ess: env (MCA v2.1, API v3.0, Component v4.0.5)
MCA ess: hnp (MCA v2.1, API v3.0, Component v4.0.5)
MCA ess: pmi (MCA v2.1, API v3.0, Component v4.0.5)
MCA ess: singleton (MCA v2.1, API v3.0, Component v4.0.5)
MCA ess: tool (MCA v2.1, API v3.0, Component v4.0.5)
MCA ess: slurm (MCA v2.1, API v3.0, Component v4.0.5)
MCA filem: raw (MCA v2.1, API v2.0, Component v4.0.5)
MCA grpcomm: direct (MCA v2.1, API v3.0, Component v4.0.5)
MCA iof: hnp (MCA v2.1, API v2.0, Component v4.0.5)
MCA iof: orted (MCA v2.1, API v2.0, Component v4.0.5)
MCA iof: tool (MCA v2.1, API v2.0, Component v4.0.5)
MCA odls: default (MCA v2.1, API v2.0, Component v4.0.5)
MCA odls: pspawn (MCA v2.1, API v2.0, Component v4.0.5)
MCA oob: tcp (MCA v2.1, API v2.0, Component v4.0.5)
MCA plm: isolated (MCA v2.1, API v2.0, Component v4.0.5)
MCA plm: rsh (MCA v2.1, API v2.0, Component v4.0.5)
MCA plm: slurm (MCA v2.1, API v2.0, Component v4.0.5)
MCA ras: simulator (MCA v2.1, API v2.0, Component v4.0.5)
MCA ras: gridengine (MCA v2.1, API v2.0, Component v4.0.5)
MCA ras: slurm (MCA v2.1, API v2.0, Component v4.0.5)
MCA regx: fwd (MCA v2.1, API v1.0, Component v4.0.5)
MCA regx: naive (MCA v2.1, API v1.0, Component v4.0.5)
MCA regx: reverse (MCA v2.1, API v1.0, Component v4.0.5)
MCA rmaps: mindist (MCA v2.1, API v2.0, Component v4.0.5)
MCA rmaps: ppr (MCA v2.1, API v2.0, Component v4.0.5)
MCA rmaps: rank_file (MCA v2.1, API v2.0, Component v4.0.5)
MCA rmaps: resilient (MCA v2.1, API v2.0, Component v4.0.5)
MCA rmaps: round_robin (MCA v2.1, API v2.0, Component v4.0.5)
MCA rmaps: seq (MCA v2.1, API v2.0, Component v4.0.5)
MCA rml: oob (MCA v2.1, API v3.0, Component v4.0.5)
MCA routed: binomial (MCA v2.1, API v3.0, Component v4.0.5)
MCA routed: direct (MCA v2.1, API v3.0, Component v4.0.5)
MCA routed: radix (MCA v2.1, API v3.0, Component v4.0.5)
MCA rtc: hwloc (MCA v2.1, API v1.0, Component v4.0.5)
MCA schizo: flux (MCA v2.1, API v1.0, Component v4.0.5)
MCA schizo: ompi (MCA v2.1, API v1.0, Component v4.0.5)
MCA schizo: orte (MCA v2.1, API v1.0, Component v4.0.5)
MCA schizo: slurm (MCA v2.1, API v1.0, Component v4.0.5)
MCA state: app (MCA v2.1, API v1.0, Component v4.0.5)
MCA state: hnp (MCA v2.1, API v1.0, Component v4.0.5)
MCA state: novm (MCA v2.1, API v1.0, Component v4.0.5)
MCA state: orted (MCA v2.1, API v1.0, Component v4.0.5)
MCA state: tool (MCA v2.1, API v1.0, Component v4.0.5)
Thanks a lot,
Marc
Please also call:
> srun --mpi=list
> ls -la /usr/lib64/lib*pmi*
Hi, we only provide pmi2 support for srun; pmix and others will be provided soon too:

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: cray_shasta
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ ls -la /usr/lib64/lib*pmi*
lrwxrwxrwx. 1 root root 21 6. Jul 2020 /usr/lib64/libfreeipmi.so -> libfreeipmi.so.17.1.4
lrwxrwxrwx. 1 root root 21 6. Jul 2020 /usr/lib64/libfreeipmi.so.17 -> libfreeipmi.so.17.1.4
-rwxr-xr-x. 1 root root 5156016 27. Mär 2019 /usr/lib64/libfreeipmi.so.17.1.4
lrwxrwxrwx. 1 root root 23 6. Jul 2020 /usr/lib64/libipmiconsole.so -> libipmiconsole.so.2.3.4
lrwxrwxrwx. 1 root root 23 6. Jul 2020 /usr/lib64/libipmiconsole.so.2 -> libipmiconsole.so.2.3.4
-rwxr-xr-x. 1 root root 249592 27. Mär 2019 /usr/lib64/libipmiconsole.so.2.3.4
lrwxrwxrwx. 1 root root 22 6. Jul 2020 /usr/lib64/libipmidetect.so -> libipmidetect.so.0.0.0
lrwxrwxrwx. 1 root root 22 6. Jul 2020 /usr/lib64/libipmidetect.so.0 -> libipmidetect.so.0.0.0
-rwxr-xr-x. 1 root root 62824 27. Mär 2019 /usr/lib64/libipmidetect.so.0.0.0
lrwxrwxrwx. 1 root root 26 6. Jul 2020 /usr/lib64/libipmimonitoring.so -> libipmimonitoring.so.6.0.6
lrwxrwxrwx. 1 root root 26 6. Jul 2020 /usr/lib64/libipmimonitoring.so.6 -> libipmimonitoring.so.6.0.6
-rwxr-xr-x. 1 root root 121416 27. Mär 2019 /usr/lib64/libipmimonitoring.so.6.0.6
lrwxrwxrwx. 1 root root 16 26. Mai 12:17 /usr/lib64/libpmi2.so -> libpmi2.so.0.0.0
lrwxrwxrwx. 1 root root 16 26. Mai 12:17 /usr/lib64/libpmi2.so.0 -> libpmi2.so.0.0.0
-rwxr-xr-x. 1 root root 239872 14. Mai 10:44 /usr/lib64/libpmi2.so.0.0.0
lrwxrwxrwx. 1 root root 15 26. Mai 12:17 /usr/lib64/libpmi.so -> libpmi.so.0.0.0
lrwxrwxrwx. 1 root root 15 26. Mai 12:17 /usr/lib64/libpmi.so.0 -> libpmi.so.0.0.0
-rwxr-xr-x. 1 root root 230896 14. Mai 10:44 /usr/lib64/libpmi.so.0.0.0
lrwxrwxrwx. 1 root root 17 26. Mai 10:36 /usr/lib64/librpmio.so.3 -> librpmio.so.3.2.2
-rwxr-xr-x. 1 root root 178928 2. Jun 2020 /usr/lib64/librpmio.so.3.2.2

Cheers,
Marc

(In reply to Marc Caubet Serrabou from comment #12)
> we only provide pmi2 support for srun, pmix and other will be soon provided
> too:

My main concern is that Slurm and the job (via openmpi) are using the correct pmi2 and not the pmi1 compatibility layer in pmi2 (or just not at all). Please try this:
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"

Created attachment 20574 [details]
mpi_endlessloop.80s-65756,merlin-c-219.psi.ch.btr
Here it is:
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ cat slurm-519654.out
srun: defined options
srun: -------------------- --------------------
srun: (null) : merlin-c-219
srun: jobid : 519654
srun: job-name : wrap
srun: mem-per-cpu : 4000
srun: mpi : pmi2
srun: nodes : 1
srun: ntasks : 4
srun: verbose : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CORE=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug2: srun PMI messages to port=45457
srun: debug: auth/munge: init: Munge authentication plugin loaded
srun: jobid 519654: nodes(1):`merlin-c-219', cpu counts: 8(x1)
srun: debug2: creating job with 4 tasks
srun: debug: requesting job 519654, user 39177, nodes 1 including ((null))
srun: debug: cpus 4, tasks 4, name bash, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: mpi/pmi2: p_mpi_hook_client_prelaunch: mpi/pmi2: client_prelaunch
srun: debug: mpi/pmi2: _get_proc_mapping: mpi/pmi2: processor mapping: (vector,(0,1,4))
srun: debug: mpi/pmi2: _setup_srun_socket: mpi/pmi2: srun pmi port: 44382
srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable
srun: debug: mpi/pmi2: pmi2_start_agent: mpi/pmi2: started agent thread
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 36265
srun: debug: Started IO server thread (47631167780608)
srun: debug: Entering _launch_tasks
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: launching StepId=519654.0 on host merlin-c-219, 4 tasks: [0-3]
srun: route/default: init: route default plugin loaded
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Tree head got back 1
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: debug2: Activity on IO listening socket 15
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving io_init_msg_validate
srun: debug2: Validated IO connection from 129.129.185.99:56418, node rank 0, sd=16
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_read
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 129.129.185.99:38236 17
srun: debug2: received task launch
srun: launch/slurm: _task_start: Node merlin-c-219, 4 tasks started
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
unlink: cannot unlink ‘core’: No such file or directory
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
unlink: cannot unlink ‘core’: No such file or directory
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
unlink: cannot unlink ‘core’: No such file or directory
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
unlink: cannot unlink ‘core’: No such file or directory
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: mpi/pmi2: _tree_listen_read: mpi/pmi2: _tree_listen_read
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Tree head got back 1
srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable
srun: debug2: mpi/pmi2: _tree_listen_read: mpi/pmi2: _tree_listen_read
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Tree head got back 1
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
PID 65756 I am waiting
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
PID 65753 I am waiting
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
PID 65754 is endless waiting
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
mpi_endlessloop:65756 terminated with signal 6 at PC=2b68714be3d7 SP=7fff76d48da8. Backtrace:
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
mpi_endlessloop:65753 terminated with signal 6 at PC=2b3d7e65e3d7 SP=7ffdbd426308. Backtrace:
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
PID 65755 I am waiting
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
mpi_endlessloop:65755 terminated with signal 6 at PC=2b88023433d7 SP=7ffe2168eac8. Backtrace:
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b68714be3d7]
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d7e65e3d7]
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(gsignal+0x37)[0x2b88023433d7]
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(abort+0x148)[0x2b68714bfac8]
./mpi_endlessloop[0x4009eb]
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(abort+0x148)[0x2b3d7e65fac8]
./mpi_endlessloop[0x4009eb]
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b68714aa555]
./mpi_endlessloop[0x400859]
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(abort+0x148)[0x2b8802344ac8]
./mpi_endlessloop[0x4009eb]
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3d7e64a555]
./mpi_endlessloop[0x400859]
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b880232f555]
./mpi_endlessloop[0x400859]
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
[1627485536.719843] [merlin-c-219:65756:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100d9 apid 3000100dc is not released, refcount 1
[1627485536.719849] [merlin-c-219:65756:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100db apid 2000100dc is not released, refcount 1
[1627485536.719851] [merlin-c-219:65756:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100da apid 4000100dc is not released, refcount 1
[1627485536.719852] [merlin-c-219:65756:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100dc apid 1000100dc is not released, refcount 1
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
[1627485536.720234] [merlin-c-219:65753:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100d9 apid 1000100d9 is not released, refcount 1
[1627485536.720242] [merlin-c-219:65753:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100db apid 4000100d9 is not released, refcount 1
[1627485536.720244] [merlin-c-219:65753:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100da apid 2000100d9 is not released, refcount 1
[1627485536.720245] [merlin-c-219:65753:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100dc apid 3000100d9 is not released, refcount 1
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
[1627485536.721140] [merlin-c-219:65755:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100d9 apid 2000100db is not released, refcount 1
[1627485536.721153] [merlin-c-219:65755:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100db apid 1000100db is not released, refcount 1
[1627485536.721159] [merlin-c-219:65755:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100da apid 3000100db is not released, refcount 1
[1627485536.721163] [merlin-c-219:65755:0] mm_xpmem.c:85 UCX WARN remote segment id 2000100dc apid 4000100db is not released, refcount 1
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
core: cannot open (No such file or directory)
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
core: cannot open (No such file or directory)
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
core: cannot open (No such file or directory)
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 129.129.185.99:38252 17
srun: debug2: received task exit
srun: launch/slurm: _task_finish: Received task exit notification for 3 tasks of StepId=519654.0 (status=0x0000).
srun: launch/slurm: _task_finish: merlin-c-219: tasks 0-2: Completed
srun: debug: task 0 done
srun: debug: task 1 done
srun: debug: task 2 done
BTR files (see attached files) are generated in such cases (srun), instead of core dumps (as with mpirun).
Thanks a lot,
Marc
Created attachment 20575 [details]
mpi_endlessloop.80s-65755,merlin-c-219.psi.ch.btr
Created attachment 20576 [details]
mpi_endlessloop.80s-65753,merlin-c-219.psi.ch.btr
(In reply to Nate Rini from comment #13)
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"

Let's go for a more verbose backtrace (note the escaped inner quotes, so SEGFAULT_SIGNALS survives the --wrap quoting):
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; env SEGFAULT_SIGNALS=\"fault bus abrt\" catchsegv ./mpi_endlessloop; file core'"

Created attachment 20577 [details]
slurm-519762.out.tar.gz
Now it generates the core dumps. I attach the files in a compressed archive as these were pretty big:
- slurm-519701.out: contains the abort() in all ranks except rank 3, which is why it is much bigger (the loop in rank 3 continues). So it generates 3 core files.
- slurm-519762.out: I moved the abort() inside the loop, so the file gets smaller and only 1 core dump is generated (I did this because it may be easier to debug).
Thanks a lot,
Marc
Created attachment 20578 [details]
slurm-519701.out.tar.gz
(In reply to Marc Caubet Serrabou from comment #19)
> Created attachment 20578 [details]
> slurm-519701.out.tar.gz

Logs confirm pmi is loaded, but it has both versions:
> 2b6ba5bf0000-2b6ba5bf1000 r--p 00007000 fd:00 14199414 /usr/lib64/libpmi2.so.0.0.0
> 2b6ba5c02000-2b6ba5c07000 r-xp 00000000 fd:00 14199411 /usr/lib64/libpmi.so.0.0.0

Since catchsegv is working, we also know that the kernel has no issue generating core dumps. Looks like the issue is more with the limits of the process. Let's call this:
> sbatch -n 4 --wrap "srun --mpi=pmi2 bash -c 'ulimit -H -c; ulimit -S -c'"

Please also attach the slurmd log from 'merlin-c-220'.

Created attachment 20579 [details]
slurmd_merlin-c-220.tar.gz
Here it is:
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ sbatch -n 4 --wrap "srun --mpi=pmi2 bash -c 'ulimit -H -c; ulimit -S -c'"
Submitted batch job 520018
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ cat slurm-520018.out
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
unlimited
I attach the log file for merlin-c-220.
Thanks a lot!
Marc
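The eight "unlimited" lines are the hard and soft limits as reported by each of the four tasks. The distinction matters because only the soft limit gates core dumps, and lowering it needs no privileges; a quick local illustration, with no Slurm involved:

```shell
# Lowering the soft core limit is always permitted and silently disables
# core dumps for that process and its children.
bash -c 'ulimit -S -c 0; ulimit -S -c'   # prints 0
```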
(In reply to Marc Caubet Serrabou from comment #21)
> Created attachment 20579 [details]
> slurmd_merlin-c-220.tar.gz

This is probably unrelated to this ticket, but the influxdb server appears to be quite unhappy:
> {"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=216"}

Note that when influxdb updates fail, Slurm will cache the update and try again (forever), which will likely slow down the slurmds and use a considerable amount of space in the node's spooldir.

Please provide an updated copy of your slurm.conf and cgroup.conf (if present).

Created attachment 20630 [details]
cgroup.conf
Hi,
attached both, cgroup.conf and slurm.conf.
Thanks a lot for pointing out the problem with InfluxDB. I was aware of it: our InfluxDB is not able to store all the generated entries (I wanted to integrate it into our monitoring system), so I will probably move the data to a different format (HDF5), process what I need, and then send it to InfluxDB. I will change the configuration to get rid of these messages.
Created attachment 20631 [details]
slurm.conf
(In reply to Marc Caubet Serrabou from comment #24)
> Created attachment 20630 [details]
> cgroup.conf
>
> ConstrainDevices=no
> AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
AllowedDevicesFile is no longer required to constrain devices and can safely be removed from your config.
(In reply to Marc Caubet Serrabou from comment #25)
> Created attachment 20631 [details]
> slurm.conf
Please change
> PropagateResourceLimitsExcept=AS,CPU,DATA,FSIZE,MEMLOCK,NOFILE,NPROC,RSS,STACK
to
> PropagateResourceLimitsExcept=AS,CPU,DATA,FSIZE,MEMLOCK,NOFILE,NPROC,RSS,STACK,CORE
This will require a restart of all Slurm daemons. Then call:
> ulimit -c unlimited; sbatch -n 4 --wrap "ulimit -S -c; srun --mpi=pmi2 bash -c 'ulimit -H -c; ulimit -S -c'"
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"
Created attachment 20707 [details]
mpi_endlessloop.80s-32107,merlin-c-320.psi.ch.btr
Hi,
I prepared a partition for it, modified the Slurm configuration as proposed, and restarted slurmd.
(base) caubet_m@caubet-laptop:~/vxargs$ ./exec_vxargs.sh merlin6/mu3e "sed -i 's/PropagateResourceLimitsExcept=.*/PropagateResourceLimitsExcept=AS,CPU,DATA,FSIZE,MEMLOCK,NOFILE,NPROC,RSS,STACK,CORE/g' /etc/slurm/slurm.conf; systemctl restart slurmd"
exit code 0: 6 job(s)
total number of jobs: 6
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ ulimit -c unlimited; sbatch -n 4 --partition=mu3e --wrap "ulimit -S -c; srun --mpi=pmi2 bash -c 'ulimit -H -c; ulimit -S -c'"
Submitted batch job 528501
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ sbatch -n 4 --partition mu3e --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; ulimit -c unlimited; ./mpi_endlessloop; file core'"
Submitted batch job 528502
The corresponding log files are attached.
Created attachment 20708 [details]
slurm-528502.out
Created attachment 20709 [details]
slurm-528501.out
(In reply to Marc Caubet Serrabou from comment #10)
> MCA backtrace: execinfo (MCA v2.1, API v2.0, Component v4.0.5)
I suspect the problem is that there is already a target for the cores, causing the kernel not to dump them twice. Please call:
> orte-info --param backtrace all
Hi,
here is the output for that command:
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/User/sobbia_r/TEST-RSM]$ orte-info --param backtrace all
MCA backtrace: parameter "backtrace" (current value: "", data source: default, level: 2 user/detail, type: string)
Default selection set of components for the backtrace framework (<none> means use all components that can be found)
MCA backtrace: parameter "backtrace_base_verbose" (current value: "error", data source: default, level: 8 dev/detail, type: int)
Verbosity level for the backtrace framework (default: 0)
Valid values: -1:"none", 0:"error", 10:"component", 20:"warn", 40:"info", 60:"trace", 80:"debug", 100:"max", 0 - 100
Please try this:
> sbatch -n 4 --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; export OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64 ./mpi_endlessloop; file core'"
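For context: if I recall correctly, OMPI_MCA_opal_signal lists the signals for which Open MPI installs its own (backtrace) handlers, and a process that handles SIGABRT itself will never trigger the kernel's default "dump core" action. A minimal shell illustration of that effect (a plain trap handler, not Open MPI itself):

```shell
# A process whose SIGABRT disposition is a custom handler will not core-dump:
# the handler runs instead of the default "terminate and dump core" action.
ulimit -c unlimited
sh -c 'trap "echo handler ran, no core dump" ABRT; kill -ABRT $$; :'
```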
Created attachment 20907 [details]
slurm-539438.out
I did it as follows:
(base) ❄ [caubet_m@merlin-l-001 abort_example]$ sbatch -n 4 --partition=cpu-maint --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; export OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64 ./mpi_endlessloop; file core'"
Submitted batch job 539437
Then I ran the same command, but set "ulimit -c unlimited" before submitting the job:
(base) ❄ [caubet_m@merlin-l-001 abort_example]$ ulimit -c unlimited
(base) ❄ [caubet_m@merlin-l-001 abort_example]$ sbatch -n 4 --partition=cpu-maint --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; export OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64 ./mpi_endlessloop; file core'"
Submitted batch job 539438
Both outputs are attached.
Thanks a lot for your help,
Marc
Created attachment 20908 [details]
slurm-539437.out
Looks like there was a typo in the command. Please call this instead (env instead of export):
> (base) ❄ [caubet_m@merlin-l-001 abort_example]$ sbatch -n 4
> --partition=cpu-maint --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core;
> env OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64
> ./mpi_endlessloop; file core'"
> Submitted batch job 539437
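The distinction matters because 'export VAR=1 ./prog' never actually runs the program: export only marks names for export, and treats './prog' as one more (invalid) variable name. A small standalone illustration, outside Slurm/MPI (DEMO and ./some_prog are made-up names):

```shell
# 'export DEMO=1 ./some_prog' does not execute ./some_prog; './some_prog' is
# not a valid variable name, so the export fails and no program is started.
( export DEMO=1 ./some_prog ) 2>/dev/null || echo "export: program was not run"

# 'env DEMO=1 <cmd>' puts DEMO into the child's environment and executes it.
env DEMO=1 sh -c 'echo "DEMO is $DEMO"'
```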
Created attachment 20976 [details]
mpi_endlessloop.80s-60583,merlin-c-023.psi.ch.btr
Hi,
attached are the output and the generated BTR files for:
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m/Software/ADMIN/abort_example]$ sbatch -n 4 --partition=cpu-maint --wrap "srun --mpi=pmi2 -vvv bash -c 'unlink core; env OMPI_MCA_opal_set_max_sys_limits=1 OMPI_MCA_opal_signal=64 ./mpi_endlessloop; file core'"
Submitted batch job 544159
Thanks a lot,
Marc
Created attachment 20977 [details]
slurm-544159.out
(In reply to Marc Caubet Serrabou from comment #39)
> Created attachment 20977 [details]
> slurm-544159.out
Please also attach the slurmd log during the test.
Created attachment 20990 [details]
slurmd.log
Hi,
attached is the log file for the node.
Thanks a lot,
Marc
Which version of UCX is the job using?
Hi,
sorry, I thought I had already answered, and I just realized that I never replied back. We run UCX v1.10:
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]$ ucx_info -v
# UCT version=1.10.0 revision a212a09
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --with-xpmem --without-ugni --with-cuda=/usr/local/cuda-10.2
Sorry for the delay,
Marc
Is it possible to recompile UCX without these options?
> --disable-logging --disable-debug
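Such a rebuild might look like the following sketch (the prefix is an example, and the flag set is an assumption based on the configure line reported by ucx_info -v; adjust to your packaging):

```shell
# Hypothetical out-of-tree UCX build with logging and debug support kept,
# so UCX can report more detail. The transport flags mirror the ones shown
# by 'ucx_info -v'; install to a separate prefix to avoid touching the
# MOFED-provided packages.
cd ucx-1.10.0
./configure --prefix=/opt/ucx-1.10.0-debug \
    --enable-logging --enable-debug \
    --enable-mt --enable-cma --with-verbs --with-knem --with-rdmacm --with-xpmem
make -j"$(nproc)"
make install
```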
Until Tuesday, September 7, I will be out of the office. For all urgent matters please contact:
* PSI@CSCS projects: psi-hpc-at-cscs-admin@lists.psi.ch
* MeG cluster: meg-admins@lists.psi.ch
* Merlin Clusters: merlin-admins@lists.psi.ch
Sorry for any inconvenience and best regards,
Marc Caubet Serrabou
Hi,
these are the compilation options of the system packages coming from the Mellanox OFED repositories. However, I will make a different build excluding those options. It may take some time; I will update you as soon as possible.
(In reply to Marc Caubet Serrabou from comment #46)
> this is the compilation for system packages coming from Mellanox OFED
> repositories.
Which version of MOFED includes it?
> However, I will make a different compilation excluding those
> options. It would take some time, I will update you as soon as possible.
Great, so far my testing with UCX has not been able to replicate the issue, even with the same version.
(In reply to Nate Rini from comment #47)
> Which version of MOFED includes it?
Hi Nate,
the version is OFED v5.2-2.2.0.1 for rhel7u9. A couple of weeks ago I compiled it, but I was not able to make it work. I will run further tests this week.
Marc
I'm going to time this ticket out while we wait for your test results. Please reply and we can continue debugging.
Thanks,
--Nate
Hi,
for some reason, srun does not produce a core dump for software crashing due to SIGABRT, while mpirun handles it perfectly. This is simple to reproduce with the following code:

#### Software example

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 3) {
        int i = 0;
        printf("PID %d is endless waiting\n", getpid());
        fflush(stdout);
        while (i == 0)
            sleep(10);
    } else {
        printf("PID %d I am waiting\n", getpid());
        abort();
    }
}

#### Running software

ulimit -c unlimited
sbatch -n 4 --wrap "srun ./mpi_endlessloop"   # core dumps not generated
sbatch -n 4 --wrap "mpirun ./mpi_endlessloop" # core dumps generated

Software running with srun only generated a couple of backtrace files, while software running with mpirun correctly generates the core dump files.
Thanks a lot,
Marc