| Summary: | mpi/pmi2 plugin error | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Martin Perry <martin.perry> |
| Component: | Other | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | alex |
| Version: | 16.05.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Atos/Eviden Sites | Alineos Sites: | --- |
| Atos/Eviden Sites: | CEA | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 16.05.0-pre2 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | patch to fix incorrect stack use in pmi2 | ||
|
Description
Martin Perry
2016-02-18 08:28:57 MST
Martin, could you please send the backtrace from the core dump? $ gdb /path/to/slurmctld /path/to/core $ bt full In this case it should be: $ gdb /path/to/srun /path/to/core (gdb) bt full No need for the backtrace, you highlighted exactly where the issue is - I'm not sure why that structure was ever used off the stack, any message components need to be xmalloc'd to prevent exactly this issue. I'm guessing that you may have built 16.05 with --enable-developer or some additional flag that you did not use for 15.08? The assertions aren't enabled in non-developer builds, so that's likely the only reason this didn't assert on 15.08 for you. I'll have a proposed patch shortly - do you mind testing it out? - Tim Created attachment 2752 [details]
patch to fix incorrect stack use in pmi2
Tim/Alejandro, I'll test the patch shortly, and let you know the result. Thanks. I use the same config options in 15.08 and 16.05: both --enable-developer and --enable-debug. That's why I thought it was strange that I only see the problem on 16.05. Unfortunately the patch does not fix the problem. tree_sock_addr is still defined as a global char array rather than being xmalloc'd, so the xfree(msg->address) in slurm_free_forward_data_msg still gets the assertion failure. I tried enhancing your patch to allocate tree_sock_addr with xmalloc, but that resulted in an assertion failure on a different xfree. As a temporary workaround I have commented out the call to slurm_free_forward_data_msg in slurm_protocol_defs.c. That allows my MPI tests to run so I can continue my development work. You should be able to reproduce the problem easily. Just run any MPI program that uses PMI-2, as follows: srun --mpi=pmi2 <mpi program> Okay, thank you for trying that out. It looks like there's a number of other places to fix unfortunately. Looking between the 15.08 and 16.05, there are a number of places that are set to xfree data structures now that were not before; I believe these were all meant to close out memory leaks, but are now finding locations where data was on the stack rather than the heap. I'll look into it further and get back to you with a more extensive and patch once I've had a chance to test it. Ok, thanks. Let me know if you want me to test any more patches. Martin, could you try commit 1d1ce7d56ef and see if that works for you? Danny, that works. Thanks! No problem Martin, thanks for reporting! I'm now seeing a similar problem in PMI-1. It's different code, so I'll create a separate bug report for it. That is interesting, there doesn't appear to be any similar code there. If you could give a backtrace when you open the bug that would be handy, thanks! |