| Summary: | Srun fails with error messages about missing Cray libraries | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Raghu Reddy <Raghu.Reddy> |
| Component: | User Commands | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 18.08.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6112 https://bugs.schedmd.com/show_bug.cgi?id=4269 | ||
| Site: | NOAA | Slinky Site: | --- |
| NOAA Site: | NESCC | NoveTech Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Attachments: | slurm.conf file | ||
(In reply to Raghu Reddy from comment #0)
> It appears that this happens when users set:
> export LD_BIND_NOW=1

Slurm does not support running with LD_BIND_NOW set. Slurm uses a plugin architecture that is not compatible with LD_BIND_NOW.

> We were able to remove this line and errors have gone away before.
>
> But there are instances where users *need* to set this for their application
> to work, so we would like to know if there is a way for this to work even when
> this environment variable is set.

Instead of setting LD_BIND_NOW in the job script, it can be set for the MPI job only, with something as simple as:
> srun env LD_BIND_NOW=1 $MPIJOB

Raghu,

I'm going to close this bug; please reply if you have any more questions.

--Nate
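The difference between the two approaches comes down to which process sees the variable. The sketch below is only an illustration and runs without Slurm: a plain `sh -c` child stands in for the MPI task, and the names `child_value` and `parent_value` are hypothetical. It shows that `env LD_BIND_NOW=1 <command>` sets the variable for the launched command only, while the calling environment (the one srun itself would run in) stays clean.

```shell
#!/bin/sh
# Keep LD_BIND_NOW out of the environment that srun itself would run in.
unset LD_BIND_NOW

# Stand-in for the real task: a child shell that reports what it sees.
# On the cluster the child would be the MPI binary started by srun.
child_value=$(env LD_BIND_NOW=1 sh -c 'printf %s "${LD_BIND_NOW:-unset}"')

# The calling script (and therefore srun) does not have the variable.
parent_value=${LD_BIND_NOW:-unset}

echo "task environment:   LD_BIND_NOW=$child_value"
echo "script environment: LD_BIND_NOW=$parent_value"

# The corresponding launch line from the reply (not executed here):
#   srun env LD_BIND_NOW=1 $MPIJOB
```

With this arrangement srun loads its plugins under the default lazy binding, and only the application is started with immediate binding.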
Created attachment 9383
slurm.conf file

Hi,

We don't have a Cray XC type of system, and we have not explicitly configured Slurm for a Cray, yet we are getting these errors:

    + srun --ntasks=1152 ./fv3.exe
    srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_cray.so): /apps/slurm/18.08.3/lib/slurm/select_cray.so: undefined symbol: post_job_step
    srun: error: Couldn't load specified plugin name for select/cray: Dlopen of plugin file failed
    srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_serial.so): /apps/slurm/18.08.3/lib/slurm/select_serial.so: undefined symbol: drain_nodes
    srun: error: Couldn't load specified plugin name for select/serial: Dlopen of plugin file failed
    srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_cons_res.so): /apps/slurm/18.08.3/lib/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
    srun: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
    srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_linear.so): /apps/slurm/18.08.3/lib/slurm/select_linear.so: undefined symbol: slurm_job_preempt_mode
    srun: error: Couldn't load specified plugin name for select/linear: Dlopen of plugin file failed
    srun: fatal: Can't find plugin for select/linear
    ++ date
    + echo 'Model ended: ' Fri Mar 1 21:52:12 GMT 2019
    Model ended: Fri Mar 1 21:52:12 GMT 2019
    + exit

Is there some config flag missing? Attached is our slurm.conf.

It appears that this happens when users set:

    export LD_BIND_NOW=1

We were able to remove this line, and the errors have gone away before.

But there are instances where users *need* to set this for their application to work, so we would like to know if there is a way for this to work even when this environment variable is set.

Thanks!
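For job scripts that already contain the export and cannot easily be restructured, one possible pattern is to move the value from the script environment onto the task command line just before launching. This is a sketch under assumptions, not a tested site recipe: the `export` line stands in for the user's existing setting, a plain `sh -c` command stands in for `./fv3.exe`, the variable `task_output` is hypothetical, and the srun line is left as a comment so the example runs without Slurm.

```shell
#!/bin/sh
# Hypothetical stand-in for an export made earlier in the job script.
export LD_BIND_NOW=1

# Start from the task command (a stand-in for ./fv3.exe here), then
# prepend "env LD_BIND_NOW=..." only when the variable was set.
set -- sh -c 'echo "task sees LD_BIND_NOW=${LD_BIND_NOW:-unset}"'
if [ -n "${LD_BIND_NOW:-}" ]; then
  set -- env "LD_BIND_NOW=$LD_BIND_NOW" "$@"
  # Remove it from the script environment so srun would not inherit it.
  unset LD_BIND_NOW
fi

# On the cluster this would be: srun --ntasks=1152 "$@"
task_output=$("$@")
echo "$task_output"
```

The positional parameters are used here instead of a bash array so the sketch also runs under a plain POSIX `sh`.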