Ticket 3509

Summary: Fix MPIR_partial_attach_ok issues for parallel debuggers
Product: Slurm
Reporter: Don Lipari <lipari1>
Component: Contributions
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: peter.thompson
Version: 17.11.x   
Hardware: Linux   
OS: Linux   
Site: LLNL
Version Fixed: 17.11.0-0pre1

Description Don Lipari 2017-02-27 09:05:28 MST
Please merge the following commit to the master branch.

https://github.com/dongahn/slurm/commit/896cba6218de5de52e30434240d173b98f7a865d

We fixed this in our chaos Slurm.

https://github.com/chaos/slurm/commit/201461e05c9cf3dd7fe85de24bbdf2170232b1dc

The rationale:

As specified in the MPIR debug interface
(https://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf),
the presence of the MPIR_partial_attach_ok symbol
should inform the debugger that the initial startup synchronization
is implemented in such a way that the tool need not attach to
or continue MPI processes that the user is not interested in controlling.

To implement this, SLURM chose to send SIGCONT to those processes that
the debugger has not attached to.

However, the old code does not reliably detect the condition
in which a process is traced by the debugger, and this
has led to various side effects.

On some systems (e.g., TOSS2), the old code sends SIGCONT to
all of the target processes including those attached by the debugger.
On newer systems (e.g., TOSS3), it does not send SIGCONT
to the target processes at all.

It seems that one of the reasons for such undefined behavior
is the use of CLONE_PTRACE.
@grondo found no documentation that indicates
CLONE_PTRACE is meant for the case where a debugger is attached
to the process.
More importantly, the old code matches clone(2) flags against
proc(5) process flags, which are not the same: proc(5) reports
task->flags, which holds the PF_* flags defined in the kernel source
include/linux/sched.h.
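
To illustrate the mismatch, here is a minimal Python sketch (not the
Slurm code, which is C). It extracts the flags field (field 9 per proc(5))
from a /proc/<pid>/stat line and tests the CLONE_PTRACE bit against it.
The helper name stat_flags and the sample stat line are made up for
illustration, and the PF_USED_MATH value is an assumption based on
include/linux/sched.h in common kernel versions; check it against your
own kernel source.

```python
# clone(2) flag value from the kernel's uapi <linux/sched.h>.
CLONE_PTRACE = 0x00002000
# Assumption: a per-task PF_* bit with the same numeric value in
# include/linux/sched.h of common kernels; verify for your kernel.
PF_USED_MATH = 0x00002000


def stat_flags(stat_line: str) -> int:
    """Return the flags field (field 9 in proc(5)) of a /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesized and may contain spaces, so split
    # after the last ')' and count fields from field 3 (state) onward.
    fields = stat_line.rsplit(")", 1)[1].split()
    return int(fields[6])  # state, ppid, pgrp, session, tty_nr, tpgid, flags


# Hypothetical stat line for a process that is NOT being traced.
sample = "4242 (my app) S 1 4242 4242 0 -1 4202496 100 0 0 0"
flags = stat_flags(sample)

# The CLONE_PTRACE bit is set here, but in task->flags that bit is a PF_*
# flag unrelated to ptrace, so this test says nothing about tracing.
looks_traced = bool(flags & CLONE_PTRACE)
print(hex(flags), looks_traced)
```

In this made-up sample the bit test reports "traced" for a process that
no debugger has touched, which is the kind of false result the flag
comparison can produce.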

This patch fixes these problems by replacing
the old detection logic with logic based on the TracerPid field
in /proc/<pid>/status.

From proc(5), TracerPid: PID of process tracing this process (0 if not
being traced).
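
A minimal Python sketch of the TracerPid-based check follows. It is only
an illustration of the idea, not the committed patch (which is C); the
function names parse_tracer_pid, is_traced and wake_if_untraced are
hypothetical.

```python
import os
import signal
from pathlib import Path


def parse_tracer_pid(status_text: str) -> int:
    """Return the TracerPid value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no TracerPid field found")


def is_traced(pid: int) -> bool:
    """True if some process is tracing pid (TracerPid is nonzero)."""
    text = Path(f"/proc/{pid}/status").read_text()
    return parse_tracer_pid(text) != 0


def wake_if_untraced(pid: int) -> bool:
    """Send SIGCONT to pid only when no debugger is attached to it.

    Returns True if the signal was sent.
    """
    if is_traced(pid):
        return False  # leave the process to its debugger
    os.kill(pid, signal.SIGCONT)
    return True
```

Note that the check and the signal are two separate steps, so a debugger
could still attach in between; the sketch does not try to close that window.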

Comment 1 Danny Auble 2017-04-14 15:31:07 MDT
Thanks Don, this has been committed to 17.11 as commit 18e3d6fbc85604.