Ticket 2456

Summary: mpi/pmi2 plugin error
Product: Slurm
Reporter: Martin Perry <martin.perry>
Component: Other
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: alex
Version: 16.05.x
Hardware: Linux
OS: Linux
Site: Atos/Eviden Sites (CEA)
Version Fixed: 16.05.0-pre2
Attachments: patch to fix incorrect stack use in pmi2

Description Martin Perry 2016-02-18 08:28:57 MST
src/plugins/mpi/pmi2/tree.c:tree_msg_to_stepds sets forward_data_msg_t:address to point to a global char array. This causes an assertion failure on the xfree of that pointer in src/common/slurm_protocol_defs.c:slurm_free_forward_data_msgs.

Example:

[trek2] (slurm) mpihello> srun --mpi=pmi2 ./mpihello
srun: error: () Error: from slurm_protocol_defs.c:1678: (): Assertion (p[0] == 0x42) failed
Aborted (core dumped)

The code is the same in 15.08, so I'm not sure why we don't see the problem there. Maybe the assertion check has been tightened in 16.05?
Comment 1 Alejandro Sanchez 2016-02-18 20:24:28 MST
Martin,

could you please send the backtrace from the core dump?

$ gdb /path/to/slurmctld /path/to/core
$ bt full
Comment 2 Alejandro Sanchez 2016-02-18 20:38:32 MST
In this case it should be:

$ gdb /path/to/srun /path/to/core
(gdb) bt full
Comment 3 Tim Wickberg 2016-02-18 23:46:07 MST
No need for the backtrace; you highlighted exactly where the issue is. I'm not sure why that structure was ever used off the stack: any message components need to be xmalloc'd to prevent exactly this issue.

I'm guessing that you may have built 16.05 with --enable-developer or some additional flag that you did not use for 15.08? The assertions aren't enabled in non-developer builds, so that's likely the only reason this didn't assert on 15.08 for you.

I'll have a proposed patch shortly - do you mind testing it out?

- Tim
Comment 4 Tim Wickberg 2016-02-18 23:58:19 MST
Created attachment 2752 [details]
patch to fix incorrect stack use in pmi2
Comment 5 Martin Perry 2016-02-19 02:34:38 MST
Tim/Alejandro,

I'll test the patch shortly, and let you know the result. Thanks.

I use the same config options in 15.08 and 16.05: both --enable-developer and --enable-debug. That's why I thought it was strange that I only see the problem on 16.05.
Comment 6 Martin Perry 2016-02-19 04:05:32 MST
Unfortunately the patch does not fix the problem. tree_sock_addr is still defined as a global char array rather than being xmalloc'd, so the xfree(msg->address) in slurm_free_forward_data_msg still gets the assertion failure. I tried enhancing your patch to allocate tree_sock_addr with xmalloc, but that resulted in an assertion failure on a different xfree.

As a temporary workaround I have commented out the call to slurm_free_forward_data_msg in slurm_protocol_defs.c. That allows my MPI tests to run so I can continue my development work.

You should be able to reproduce the problem easily. Just run any MPI program that uses PMI-2, as follows:

srun --mpi=pmi2 <mpi program>
Comment 7 Tim Wickberg 2016-02-19 04:18:59 MST
Okay, thank you for trying that out. It looks like there's a number of other places to fix unfortunately.

Comparing 15.08 and 16.05, there are a number of places that now xfree data structures where they did not before; I believe these were all meant to close out memory leaks, but they are now finding locations where data was on the stack rather than the heap. I'll look into it further and get back to you with a more extensive patch once I've had a chance to test it.
Comment 8 Martin Perry 2016-02-19 04:24:26 MST
Ok, thanks. Let me know if you want me to test any more patches.
Comment 9 Danny Auble 2016-02-19 04:58:58 MST
Martin, could you try commit 1d1ce7d56ef and see if that works for you?
Comment 10 Martin Perry 2016-02-19 06:33:50 MST
Danny, that works. Thanks!
Comment 11 Danny Auble 2016-02-19 06:55:53 MST
No problem Martin, thanks for reporting!
Comment 12 Martin Perry 2016-02-19 07:03:31 MST
I'm now seeing a similar problem in PMI-1. It's different code, so I'll create a separate bug report for it.
Comment 13 Danny Auble 2016-02-19 07:07:46 MST
That is interesting; there doesn't appear to be any similar code there. If you could include a backtrace when you open the bug, that would be handy. Thanks!