Summary: | slurmd does not start when built in hardened environment | ||
---|---|---|---|
Product: | Slurm | Reporter: | Nenad Vukicevic <nenad> |
Component: | Contributions | Assignee: | Unassigned Developer <dev-unassigned> |
Status: | OPEN --- | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | adam.huffman, Andrew.Elwell, cvalvin, dmjacobsen, jcldc13, john.donners, mschmit, patrick.roberts, pkdevel, regine.gaudin, sergey_meirovich, sofya.urbaniec, sts |
Version: | 19.05.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4627 https://bugs.schedmd.com/show_bug.cgi?id=7806 https://bugs.schedmd.com/show_bug.cgi?id=8499 | ||
Site: | -Other- | Alineos Sites: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: | slurm elf weak symbols for full relro patch |
Description
Nenad Vukicevic
2016-02-12 07:36:24 MST
Updating status flags. I need to test this a bit further internally, but this looks like the best approach so far to handle issues around -z,now.

*** Ticket 2373 has been marked as a duplicate of this ticket. ***

*** Ticket 2440 has been marked as a duplicate of this ticket. ***

Just to add that, as I pointed out in the other bugs, I'm able to build without disabling the hardened build. In my case, I export the CFLAGS and LDFLAGS in the spec file.

FYI, I'm seeing this with slurm-17.02.0-0rc1 on Fedora 25. I haven't looked deeply into it yet, but a vanilla rpmbuild -tb <tarball> on a vanilla Fedora 25 can start slurmctld but not slurmd for this same reason.

    [root@dmjdev slurm]# /usr/sbin/slurmd -Dvvv
    slurmd: error: plugin_load_from_file: dlopen(/usr/lib64/slurm/select_cons_res.so): /usr/lib64/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
    slurmd: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
    slurmd: fatal: Can't find plugin for select/cons_res
    [root@dmjdev slurm]#

Created attachment 5545 [details]
slurm elf weak symbols for full relro patch
I've developed a patch that allows slurm to operate when built with full relro flags: -Wl,-z,relro,-z,now. These changes allow slurm to meet the hardening standards of certain distros, e.g. Fedora. The full relro build allows the GOT sections of the ELF binaries to be marked read-only and thus makes the software more secure.

The idea is to mark as "weak" those plugin symbols which might not be resolved in every context in which the plugin operates. Since attribute tagging varies from compiler to compiler, I opted to use __GNUC__ and __ELF__ guards around the declarations, so the code is compiled only for gcc on ELF systems.

Each plugin in the patch has a comment of the form:

TEST_CASE: sacct vs slurmctld

meaning that if a full relro build is compiled without the patch, you can expect an immediate dlopen failure in one of the listed programs when using that plugin. Both programs should operate normally with the patch. Other programs may be involved -- the test cases are minimal examples only.

The patch is benign in the sense that it introduces no functional changes to the code and merely tags as weak the problematic functions. There already is similar code for __APPLE__ builds in which certain data is tagged "weak_import".

I want to mention also that plugins named *cray*, *bluegene*, *alps*, *cncu* are not included in this patch. I am not able to test on cray or bluegene systems.

The patch is clean against the slurm-17.02 and slurm-17.11 branches as of today.

Phil

For us on 17.02.10 that patch is needed to get ANSYS v19.0 to work. Not sure why.

Forgot to add: we are on stock RHEL6, so in our case this is not related to hardening of the OS or Slurm itself.

Hello.

We ran into the same bug with slurm 17.11.7 and Ansys 19 on CentOS Linux release 7.3.1611 (Core).

Should we apply this patch or is it better to upgrade to slurm-18.08.3?

Thank you,
Sofya

The fundamental architectural problem w.r.t. hardening is that you cannot have program A load plugin B if B uses symbols in A. 
You need to move those symbols into a third party area, library C, and then link plugin B to C so that there are no unresolved symbols when it loads into A.

I know the slurm devs recognize the problem as I have brought it to their attention on countless other occasions (bugs), but they have not addressed it yet to my knowledge. Pretty sure 18.x has the same problem. The fix is to configure and compile slurm with minimal hardening and lazy linkage. slurm predates many of the modern hardening techniques we use commonly on fedora and rhel and centos.

This patch was my attempt to allow slurm to be fully hardened by marking as weak the symbols in program A that are required when plugin B loads. Do not count on this patch working beyond the version I wrote it for. The proper way to approach this is to rebuild slurm with lesser hardening as I mentioned, until the slurm devs prioritize hardening.

(In reply to sofya from comment #10)
> Hello.
>
> We ran into the same bug with slurm 17.11.7 and Ansys 19 on CentOS Linux
> release 7.3.1611 (Core).
>
> Should we apply this patch or is it better to upgrade to slurm-18.08.3?

Sofya - can you file a separate issue? I'm having a hard time seeing how this would show up specifically with that combination of CentOS alongside Ansys, and that'd be better tackled as a separate support issue until we're sure it's related.

(In reply to Philip Kovacs from comment #11)
> I know the slurm devs recognize the problem as I have brought it to their
> attention on countless other occasions (bugs), but they have not addressed
> it yet to my knowledge. Pretty sure 18.x has the same problem. The fix is
> to configure and compile slurm with minimal hardening and lazy linkage.
> slurm predates many of the modern hardening techniques we use commonly on
> fedora and rhel and centos.

To be clear - "hardening" in this context refers to enabling a set of restrictive linker flags. 
My understanding is that this is viewed as "safer" by some security folks insofar as the behavior that is now being disallowed is not a common pattern in most applications, and that behavior can make certain classes of system exploit simpler. But Slurm isn't most applications, and lazy linking is something Slurm has always relied on within our plugin infrastructure. Use of lazy linking, as Slurm prefers, is not inherently "unsafe" itself, despite protestations from those security folks trying to force this into different packaging systems.

The patch Philip has provided works around this by tagging a number of symbols as weak (thus avoiding some of these complications), but I do not believe it will work on 18.08, and would need to be updated there. It's not something I'm looking to apply upstream at the moment. In my view it's a band-aid around the build environment trying to force overly restrictive linker options upon us, and not a permanent fix.

> This patch was my attempt to allow slurm to be fully hardened by marking as
> weak the symbols in program A that are required when plugin B loads.
> Do not count on this patch working beyond the version I wrote it for. The
> proper way to approach this is to rebuild slurm with lesser hardening as I
> mentioned, until the slurm devs prioritize hardening.

Correct. We do not recommend the use of these "hardening" options at this time.

Hi Tim,

I opened a separate bug here: https://bugs.schedmd.com/show_bug.cgi?id=6112

Can you please take a look? It's high impact for us.

Thank you,
Sofya

Hello Sofya,
I can't see or comment on #6112, so I comment here. The problem with Ansys 19.0 is their use of LD_BIND_NOW=1 in combination with srun. You can change the ansys190 script as follows:
```
diff /ansys_inc/v190/ansys/bin/ansys190*
309c309
< command="${distcmd} ${extra_mpi_args} -np ${mpinp} ${dansys_script} ${ansysargs}"
---
> command="LD_BIND_NOW='' ${distcmd} -genv LD_BIND_NOW=\"$LD_BIND_NOW\" -genvall ${extra_mpi_args} -np ${mpinp} ${dansys_script} ${ansysargs}"
```
and it should work.
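
The reason an empty `LD_BIND_NOW=''` is enough in the change above is that, per the ld.so(8) manual page, the variable only forces immediate binding when it is set to a non-empty string. The following is a minimal sketch of that rule only; `bind_now_requested` and `bind_now_requested_in_env` are hypothetical helper names, not part of glibc, Slurm, or the Ansys scripts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: models the documented LD_BIND_NOW rule.
 * Immediate binding is requested only when the variable is set
 * AND its value is a non-empty string.
 */
bool bind_now_requested(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Applies the same rule to the current process environment. */
bool bind_now_requested_in_env(void)
{
    return bind_now_requested(getenv("LD_BIND_NOW"));
}
```

Under this rule, the launcher started with `LD_BIND_NOW=''` keeps lazy binding, while the original value is still passed to the application ranks through `-genv`.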
Hello,

I get the same issue on CentOS 8 with the slurm package slurm-19.05.3-2:

    [root@lmaster slurm-19.05.3-2]# slurmd -D -vvvv
    slurmd: error: plugin_load_from_file: dlopen(/usr/lib64/slurm/select_cons_res.so): /usr/lib64/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
    slurmd: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
    slurmd: fatal: Can't find plugin for select/cons_res

Jean-Charles

*** Ticket 8438 has been marked as a duplicate of this ticket. ***
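
For readers unfamiliar with the weak-symbol technique described in the patch comment earlier in this ticket, here is a minimal sketch of the general idea. It is not taken from the attached patch: `host_get_power_cap`, `PLUGIN_WEAK`, and `plugin_power_cap_or` are hypothetical names, and the NULL check is an addition for illustration (the patch comment says it only tags symbols as weak and makes no functional changes).

```c
#include <stddef.h>

/*
 * Guard the attribute the way the patch comment describes:
 * only gcc-compatible compilers on ELF systems get the weak tag.
 */
#if defined(__GNUC__) && defined(__ELF__)
#define PLUGIN_WEAK __attribute__((weak))
#else
#define PLUGIN_WEAK
#endif

/*
 * Hypothetical symbol that lives in the host program. Some hosts
 * that load the plugin define it, others do not.
 */
extern int host_get_power_cap(void) PLUGIN_WEAK;

/*
 * Hypothetical plugin function: uses the host symbol when it was
 * resolved, otherwise returns the caller-supplied fallback.
 */
int plugin_power_cap_or(int fallback)
{
#if defined(__GNUC__) && defined(__ELF__)
    /* An unresolved weak function reference has address zero. */
    if (host_get_power_cap == NULL)
        return fallback;
#endif
    return host_get_power_cap();
}
```

With `-z,now`, every symbol reference in a plugin is resolved at dlopen time, so a strong undefined reference fails the load with an "undefined symbol" error like the one above. A weak undefined reference resolves to address zero instead, which lets the load succeed as long as the code path that needs the symbol is never taken in that host.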