Ticket 4344 - Update of slurm packages leads to seg fault of some applications using non native systems libs
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 17.11.x
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2017-11-08 01:32 MST by Regine Gaudin
Modified: 2017-11-09 09:06 MST
Site: CEA
Version Fixed: 17.11.0-rc3


Description Regine Gaudin 2017-11-08 01:32:38 MST
The application HPCdrive started failing with a seg fault after a Slurm update. The cause-and-effect relationship was not obvious, but our investigation revealed the following surprising scenario:

Consider a node with an NVIDIA card on which Slurm was installed first, followed by the NVIDIA driver (the driver installs /usr/lib64/nvidia/libGl.so).

The loader for HPCdrive normally loaded the following libs in this order:
/usr/lib64/libGl.so
/usr/lib64/nvidia/libGl.so (used by HPCdrive ok) 
/usr/lib64/slurm/libslurm.so

Now if we build new Slurm packages with your slurm.spec and update to them, the spec does the following:
echo '%{_libdir}
%{_libdir}/slurm' > $RPM_BUILD_ROOT/etc/ld.so.conf.d/slurm.conf

and the loader then loads the libs in this new order:
/usr/lib64/libGl.so
/usr/lib64/nvidia/libGl.so 
/usr/lib64/slurm/libslurm.so
/usr/lib64/libGl.so (used by HPCdrive: not ok, seg fault)

When HPCdrive uses /usr/lib64/libGl.so instead of /usr/lib64/nvidia/libGl.so,
it fails with a seg fault.

There is no reason for Slurm to modify how system libs are found. However, it is justified for Slurm to add the path to its own libs.
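
To check which packages have added a given directory to the loader configuration, something like the following could be used. This is a minimal sketch, not part of Slurm: the function name list_conf_entries is hypothetical, and it assumes the drop-in files live in /etc/ld.so.conf.d unless another directory is passed.

```shell
# list_conf_entries DIR [CONF_DIR]
# Print every *.conf file in CONF_DIR that contains DIR as a whole line.
# CONF_DIR defaults to /etc/ld.so.conf.d (assumption: standard layout).
# The function name is hypothetical, chosen for this illustration.
list_conf_entries() {
    dir=$1
    conf_dir=${2:-/etc/ld.so.conf.d}
    for f in "$conf_dir"/*.conf; do
        # Skip the unexpanded pattern when no .conf file exists.
        [ -f "$f" ] || continue
        # -x: match the whole line, -F: treat DIR as a fixed string.
        if grep -q -x -F -- "$dir" "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Example: list_conf_entries /usr/lib64
```

On an affected node, this would be expected to list slurm.conf for /usr/lib64. The resulting cache can then be inspected with `ldconfig -p | grep libGl`.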

We used the following workaround in the slurm.spec:
-  echo '%{_libdir}
%{_libdir}/slurm' > $RPM_BUILD_ROOT/etc/ld.so.conf.d/slurm.conf
+  echo '%{_libdir}/slurm' > $RPM_BUILD_ROOT/etc/ld.so.conf.d/slurm.conf'

As it took a lot of time to find the non-obvious relationship between Slurm and the HPCdrive seg fault, and the same problem could be encountered by other applications using non-native libs,
would it be possible to fix slurm.spec in a future release?

Thanks
Regine
Comment 1 Regine Gaudin 2017-11-08 01:36:07 MST
We used the following workaround in the slurm.spec:
-  echo '%{_libdir}
%{_libdir}/slurm' > $RPM_BUILD_ROOT/etc/ld.so.conf.d/slurm.conf
+  echo '%{_libdir}/slurm' > $RPM_BUILD_ROOT/etc/ld.so.conf.d/slurm.conf

without "'" at the end
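
Outside of an RPM build, the effect of the patched line can be sketched as below. This is an illustration under stated assumptions: the function name write_slurm_ld_conf is hypothetical, its first argument stands in for $RPM_BUILD_ROOT, and the /usr/lib64 default stands in for %{_libdir}.

```shell
# write_slurm_ld_conf BUILD_ROOT [LIBDIR]
# Write an ld.so.conf.d entry that lists only Slurm's own directory,
# leaving the system library directory out of the file.
# BUILD_ROOT stands in for $RPM_BUILD_ROOT, LIBDIR for %{_libdir};
# the function name and the /usr/lib64 default are assumptions.
write_slurm_ld_conf() {
    build_root=$1
    libdir=${2:-/usr/lib64}
    mkdir -p "$build_root/etc/ld.so.conf.d" || return 1
    printf '%s\n' "$libdir/slurm" > "$build_root/etc/ld.so.conf.d/slurm.conf"
}

# Example: write_slurm_ld_conf /tmp/buildroot
```

The generated file then contains the single line /usr/lib64/slurm.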
Comment 2 Tim Wickberg 2017-11-08 09:27:50 MST
Hi Regine -

I'm looking into it now.

I think you can, as a possible alternative, just remove that file entirely.

I believe the ld.conf files are written for the benefit of applications linking against libslurm; everything in the %{_libdir}/slurm directory is internal to the Slurm commands and daemons themselves, and they don't rely on ld to look up the path.
Comment 4 Tim Wickberg 2017-11-08 11:42:07 MST
I've committed a fix (6dc7201194696) to this for 17.11 where we've been overhauling our slurm.spec file, but will not be applying this to the 17.02 branch.

As mentioned, as a workaround I believe you can safely delete that file, or patch the slurm.spec script to avoid generating it, or rely on your local patch.

- Tim
Comment 5 Regine Gaudin 2017-11-09 01:33:59 MST
Hi

"the ld.conf files are written for the benefit of applications linking against libslurm"

To be precise, we are using such libs, so we are not sure we can use the workaround of deleting the ld.conf file. The workaround of modifying the spec file is OK.

a fix is better

Thanks

Regine
Comment 6 Tim Wickberg 2017-11-09 09:06:18 MST
(In reply to Gaudin from comment #5)
> Hi
> 
> "the ld.conf files are written for the benefit of applications linking
> against libslurm"
> 
> To be precise, we are using such libs, so we are not sure we can use the
> workaround of deleting the ld.conf file. The workaround of modifying the
> spec file is OK.

/usr/lib64 is already always in the library search path. Having Slurm install a file pointing to that same directory was redundant, and it can be removed.

The libraries in /usr/lib64/slurm are Slurm's internal plugins, and the path to these is already set using rpath within the binaries themselves.

Thus, neither external applications nor Slurm itself has been benefiting from that file on your system, and I believe it can be removed safely.
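
One way to confirm the rpath on a given installation is to read the dynamic section of a Slurm binary. The following is a minimal sketch: the function name show_rpath is hypothetical, it assumes readelf from binutils is installed, and the binary path in the example is an assumption for a default install.

```shell
# show_rpath BINARY
# Print the RPATH/RUNPATH entries from the dynamic section of an ELF file.
# Returns 1 if the file is missing or has no such entry, 2 if readelf
# is not installed. The function name is hypothetical.
show_rpath() {
    bin=$1
    [ -f "$bin" ] || { echo "no such file: $bin" >&2; return 1; }
    command -v readelf >/dev/null 2>&1 || { echo "readelf not found" >&2; return 2; }
    # readelf -d lists the dynamic section; keep only the rpath lines.
    readelf -d "$bin" | grep -E 'RPATH|RUNPATH'
}

# Example (path is an assumption): show_rpath /usr/bin/srun
```

If the output includes the Slurm plugin directory, the binaries locate their plugins without any ld.so.conf.d entry.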

The only place that this can cause some minor complications is if you built the RPMs with an alternate --prefix. But in such a case, I'd rather assume that the person packaging Slurm has some other plan to handle library locations (such as Lmod / modulefiles), and thus having the package itself force the library path that way is improper.

> a fix is better

IMO, that is the best long-term fix.

Digging back through the commit log, this was originally done to support some quirks of the Cray operating system, and I don't believe it should have been applied to all Linux installations.

cheers,
- Tim