src/plugins/mpi/pmi2/tree.c:tree_msg_to_stepds sets forward_data_msg_t:address to point to a global char array. This causes an assertion failure on the xfree of that pointer in src/common/slurm_protocol_defs.c:slurm_free_forward_data_msgs. Example: [trek2] (slurm) mpihello> srun --mpi=pmi2 ./mpihello srun: error: () Error: from slurm_protocol_defs.c:1678: (): Assertion (p[0] == 0x42) failed Aborted (core dumped) The code is the same in 15.08, so I'm not sure why we don't see the problem there. Maybe the assertion check has been tightened in 16.05?
Martin, could you please send the backtrace from the core dump? $ gdb /path/to/slurmctld /path/to/core $ bt full
In this case it should be: $ gdb /path/to/srun /path/to/core (gdb) bt full
No need for the backtrace, you highlighted exactly where the issue is - I'm not sure why that structure was ever used off the stack, any message components need to be xmalloc'd to prevent exactly this issue. I'm guessing that you may have built 16.05 with --enable-developer or some additional flag that you did not use for 15.08? The assertions aren't enabled in non-developer builds, so that's likely the only reason this didn't assert on 15.08 for you. I'll have a proposed patch shortly - do you mind testing it out? - Tim
Created attachment 2752 [details] patch to fix incorrect stack use in pmi2
Tim/Alejandro, I'll test the patch shortly, and let you know the result. Thanks. I use the same config options in 15.08 and 16.05: both --enable-developer and --enable-debug. That's why I thought it was strange that I only see the problem on 16.05.
Unfortunately the patch does not fix the problem. tree_sock_addr is still defined as a global char array rather than being xmalloc'd, so the xfree(msg->address) in slurm_free_forward_data_msg still gets the assertion failure. I tried enhancing your patch to allocate tree_sock_addr with xmalloc, but that resulted in an assertion failure on a different xfree. As a temporary workaround I have commented out the call to slurm_free_forward_data_msg in slurm_protocol_defs.c. That allows my MPI tests to run so I can continue my development work. You should be able to reproduce the problem easily. Just run any MPI program that uses PMI-2, as follows: srun --mpi=pmi2 <mpi program>
Okay, thank you for trying that out. It looks like there's a number of other places to fix unfortunately. Looking between the 15.08 and 16.05, there are a number of places that are set to xfree data structures now that were not before; I believe these were all meant to close out memory leaks, but are now finding locations where data was on the stack rather than the heap. I'll look into it further and get back to you with a more extensive and patch once I've had a chance to test it.
Ok, thanks. Let me know if you want me to test any more patches.
Martin, could you try commit 1d1ce7d56ef and see if that works for you?
Danny, that works. Thanks!
No problem Martin, thanks for reporting!
I'm now seeing a similar problem in PMI-1. It's different code, so I'll create a separate bug report for it.
That is interesting, there doesn't appear to be any similar code there. If you could give a backtrace when you open the bug that would be handy, thanks!