| Summary: | pmix: fix search paths and linking issues in x_ac_pmix.m4 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Philip Kovacs <pkdevel> |
| Component: | Contributions | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | da |
| Version: | 17.11.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Slinky Site: | --- |
| Attachments: | slurm_x_ac_pmix.m4 | ||
Feel free to use this as a local patch, but I cannot guarantee it will work in the future. We can't reproduce this problem in house. If pmix is installed in /usr things should work. I would look at your configure log to see what the problem is.

I wouldn't expect the first one to matter, and both work for me. Note that PMIX_LIBS should have been removed from the Makefile.am in the mpi/pmix repo and will be removed going forward, so using it is not a good idea. We no longer link to that and do a dlopen on it instead. My guess is your pmix install isn't correct. I would look at your configure log to see what the problem is.

At least let me comment before you close. I took the time to post on behalf of Fedora/Red Hat.

Problem #1
----------
My pmix headers are installed in /usr/include/pmix, as is permitted with pmix configure (--includedir=/usr/include/pmix). This is standard distro organization of headers. You don't pollute /usr/include with headers, you organize them, just as slurm headers are installed to /usr/include/slurm.

    # ls -l /usr/include/pmix
    -rw-r--r--. 1 root root 19725 Aug 19 13:28 pmi2.h
    -rw-r--r--. 1 root root 25849 Aug 19 13:28 pmi.h
    -rw-r--r--. 1 root root 103251 Aug 19 13:28 pmix_common.h
    -rw-r--r--. 1 root root 27457 Aug 19 13:28 pmix.h
    -rw-r--r--. 1 root root 3413 Jan 7 13:31 pmix_rename.h
    -rw-r--r--. 1 root root 30176 Aug 19 13:28 pmix_server.h
    -rw-r--r--. 1 root root 4673 Aug 19 13:28 pmix_tool.h
    -rw-r--r--. 1 root root 322 Jan 7 13:31 pmix_version.h

    # Using a stock, unpatched slurm-17.11.2 release tarball
    # tar -xvf slurm-17.11.2.tar.bz2
    # cd slurm-17.11.2

    ~~ FAIL ~~
    # ./configure
    # grep pmix config.log
    # configure:21530: checking for pmix installation
    # configure:21661: WARNING: unable to locate pmix installation

    ~~ FAIL ~~
    # ./configure --with-pmix=/usr
    # grep pmix config.log
    # configure:21530: checking for pmix installation
    # configure:21661: WARNING: unable to locate pmix installation

    ~~ FAIL ~~
    # ./configure --with-pmix=/usr/include
    # grep pmix config.log
    # configure:21530: checking for pmix installation
    # configure:21661: WARNING: unable to locate pmix installation

    ~~ FAIL ~~
    # ./configure --with-pmix=/usr/include/pmix
    # grep pmix config.log
    # configure:21530: checking for pmix installation
    # configure:21661: WARNING: unable to locate pmix installation

And my next comment will be on why the dlopen problem exists and is a problem.

I can see your point, but where your patch works for you it would fail for anyone else, as it only works for the --includedir you chose and not for anyone else's. I suppose it would work if we added an include dir different than your lib directory for pmix. But we would probably have to do the same with any of the other things we link against (hwloc, ucx, etc.), which all rely on the same logic.

If you are looking to not pollute the include dir, I usually install in a common prefix instead of just moving the includes to a different location than the rest of the package. Perhaps /opt/pmix or something like that. By only moving the includes you are still polluting /usr/lib in a similar manner.

In any case, I would urge you to use the --prefix option instead of --includedir, or contact pmix to install it this way. As you know, Slurm installs its headers by default in includedir/slurm. PMIx for one reason or another has chosen not to do this. It looks to be the same for hwloc on CentOS 7 as well (that is the only one I checked).
It installs hwloc.h directly in /usr/include, then installs the rest in the hwloc dir. A bunch of other projects appear to do it that way as well. But it doesn't seem like there is a standard everyone is following. I really don't want to pollute our configure with a lib and an install option for everything we link against.

OK, fair enough. On the second problem, the issue is again the mandated use of hardened builds (full relro, "now" linkage builds) for Fedora/Red Hat. The core of the issue is that lazy linkage is disabled and thus you cannot dlopen a plugin if it has unresolved symbols.

    # srun -N1 --mpi=pmix /bin/true
    srun: error: plugin_load_from_file: dlopen(/usr/lib64/slurm/mpi_pmix.so): /usr/lib64/slurm/mpi_pmix.so: undefined symbol: PMIx_server_finalize
    srun: error: Couldn't load specified plugin name for mpi/pmix: Dlopen of plugin file failed
    srun: error: cannot create mpi context for mpi/pmix
    srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types

I have posted another bug in which I provide a patch to "weaken" the symbols for such builds. I recognize that this isn't on your radar and may never be, but for packagers like myself it's a reality I must deal with.

Just for fun, try building slurm using these flags.

    CPPFLAGS="-Wl,-z,relro,-z,now" ./configure...

All the shared libraries will have a BIND_NOW setting (readelf -a *.so). That will disable all lazy linkage and give you clarity on why I post bugs like these.

It sounds pretty rough. I don't get your issue though.

    CPPFLAGS = -Wl,-z,relro,-z,now

is in all makefiles...

    ...
    libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../../../../slurm/src/plugins/mpi/pmix -I../../../.. -I../../../../slurm -I../../../../../../slurm -I../../../../../../slurm/src/common -I/usr/include -I/home/da/pmix/2/include -DHAVE_PMIX_VER=2 -Wl,-z,relro,-z,now -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread -Werror -ggdb3 -Wall -g -O1 -fno-strict-aliasing -MT mpi_pmix_v2_la-pmixp_client_v2.lo -MD -MP -MF .deps/mpi_pmix_v2_la-pmixp_client_v2.Tpo -c ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client_v2.c -fPIC -DPIC -o .libs/mpi_pmix_v2_la-pmixp_client_v2.o
    libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../../../../slurm/src/plugins/mpi/pmix -I../../../.. -I../../../../slurm -I../../../../../../slurm -I../../../../../../slurm/src/common -I/usr/include -I/home/da/pmix/2/include -DHAVE_PMIX_VER=2 -Wl,-z,relro,-z,now -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread -Werror -ggdb3 -Wall -g -O1 -fno-strict-aliasing -MT mpi_pmix_v2_la-pmixp_client_v2.lo -MD -MP -MF .deps/mpi_pmix_v2_la-pmixp_client_v2.Tpo -c ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client_v2.c -o mpi_pmix_v2_la-pmixp_client_v2.o >/dev/null 2>&1
    mv -f .deps/mpi_pmix_v2_la-pmixp_client_v2.Tpo .deps/mpi_pmix_v2_la-pmixp_client_v2.Plo
    / ...

    srun --mpi=pmix hostname
    snowflake

This is with Ubuntu 17.10 though, so perhaps something is different.

Must be heavily patched, just like mine. Cheers.

Mine is vanilla. I am guessing your patches may be the issue? Anyway, I'll let you figure it out. Thanks!

Hmm. Run this over your plugins and tell me if you see the BIND_NOW entries for each plugin. If you do, there is no way the installation is fully functional.
$ find /usr/lib64/slurm -name \*.so -exec readelf -a {} \; | grep BIND_NOW
...
0x0000000000000018 (BIND_NOW)
0x0000000000000018 (BIND_NOW)
0x0000000000000018 (BIND_NOW)
...
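
The one-liner above prints the matching dynamic-section entries but not which plugin each one belongs to. A minimal sketch that lists the file names instead follows. The function name `list_bind_now` is hypothetical, and the sketch assumes binutils `readelf` is installed and that matching the string BIND_NOW in the dynamic section (`readelf -d`) is a sufficient test, as in the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print each shared object under
# the given directory whose dynamic section mentions BIND_NOW, i.e. the
# object was linked with lazy binding disabled.
list_bind_now() {
    find "$1" -name '*.so' -type f | while IFS= read -r so; do
        # readelf errors (e.g. on non-ELF files) are silenced on purpose.
        if readelf -d "$so" 2>/dev/null | grep -q 'BIND_NOW'; then
            printf '%s\n' "$so"
        fi
    done
}

# Example invocation (path taken from the thread):
#   list_bind_now /usr/lib64/slurm
```

If mpi_pmix.so shows up in that list and also has unresolved PMIx_ symbols, the dlopen failure reported earlier in the thread would be the expected result.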
|
Created attachment 5858: slurm_x_ac_pmix.m4

There are two problems in x_ac_pmix.m4:

Problem #1. The m4 module is unable to locate pmix headers if they are installed in a $d/pmix subdirectory. To reproduce:

---configure and install a pmix package (I used 2.0.2) and specify:

    ./configure --includedir=/usr/include/pmix ...
    make
    make install

---configure slurm 17.11.2 and specify:

    ./configure ...
    *or*
    ./configure --with-pmix=/usr ...

Either should work, but neither does. The pmix installation will not be found.

Problem #2. Commit c539d34684 removed the set of PMIX_LIBS from x_ac_pmix.m4. This causes the plugin mpi/pmix to be built with no link to libpmix and thus the symbols PMIx_ are unresolved.

My attached patch fixes both problems.
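
The header lookup described in Problem #1 can be illustrated with a small shell sketch. This is not the code in x_ac_pmix.m4 or in the attached patch. The function name `find_pmix_include` and the exact candidate list are assumptions made for illustration. For each prefix it probes both `$d/include` and `$d/include/pmix` for `pmix.h`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the first directory that
# contains pmix.h, looking under each given prefix in both include/ and
# include/pmix/. Returns non-zero when no candidate matches.
find_pmix_include() {
    for d in "$@"; do
        for inc in "$d/include" "$d/include/pmix"; do
            if [ -f "$inc/pmix.h" ]; then
                printf '%s\n' "$inc"
                return 0
            fi
        done
    done
    return 1
}

# Example invocation:
#   find_pmix_include /usr /usr/local || echo "pmix headers not found" >&2
```

With headers installed via --includedir=/usr/include/pmix, a search of this shape would report /usr/include/pmix for the prefix /usr, whereas a search that only checks `$d/include` would miss them.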