Ticket 6619 - Srun fails with error messages about missing Cray libraries
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 18.08.5
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
Reported: 2019-03-01 15:17 MST by Raghu Reddy
Modified: 2019-03-07 12:40 MST
Site: NOAA
NOAA Site: NESCC


Attachments
slurm.conf file (7.66 KB, text/plain)
2019-03-01 15:17 MST, Raghu Reddy

Description Raghu Reddy 2019-03-01 15:17:48 MST
Created attachment 9383
slurm.conf file

Hi,

We don't have a Cray XC type of system and have not explicitly configured Slurm for a Cray, yet we are getting these errors:

+ srun --ntasks=1152  ./fv3.exe
srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_cray.so): /apps/slurm/18.08.3/lib/slurm/select_cray.so: undefined symbol: post_job_step
srun: error: Couldn't load specified plugin name for select/cray: Dlopen of plugin file failed
srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_serial.so): /apps/slurm/18.08.3/lib/slurm/select_serial.so: undefined symbol: drain_nodes
srun: error: Couldn't load specified plugin name for select/serial: Dlopen of plugin file failed
srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_cons_res.so): /apps/slurm/18.08.3/lib/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
srun: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_linear.so): /apps/slurm/18.08.3/lib/slurm/select_linear.so: undefined symbol: slurm_job_preempt_mode
srun: error: Couldn't load specified plugin name for select/linear: Dlopen of plugin file failed
srun: fatal: Can't find plugin for select/linear
++ date
+ echo 'Model ended:    ' Fri Mar 1 21:52:12 GMT 2019
Model ended:     Fri Mar 1 21:52:12 GMT 2019
+ exit

Is there some config flag missing?

Attached is our slurm.conf

It appears that this happens when users set:

export LD_BIND_NOW=1 

When we removed this line in the past, the errors went away.

But there are instances where users *need* to set this for their application to work, so we would like to know if there is a way for this to work even when this environment variable is set.

Thanks!
Comment 1 Nate Rini 2019-03-01 15:22:46 MST
(In reply to Raghu Reddy from comment #0)
> It appears that this happens when users set:
> export LD_BIND_NOW=1 
Slurm does not support running with LD_BIND_NOW set. Slurm uses a plugin architecture that is not compatible with LD_BIND_NOW.

> When we removed this line in the past, the errors went away.
> 
> But there are instances where users *need* to set this for their application
> to work, so we would like to know if there is a way for this to work even when
> this environment variable is set.

Instead of setting LD_BIND_NOW in the job script, it can be added to the MPI job with something as simple as:
> srun env LD_BIND_NOW=1 $MPIJOB
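
The command above works because env(1) sets the variable only for the process it launches, so srun itself never runs with LD_BIND_NOW in its environment and can still load its plugins with lazy symbol binding. A minimal sketch of that scoping follows. It assumes a hypothetical stand-in function, fake_srun, in place of srun so that it can run without a Slurm installation, and uses a trivial sh command in place of the MPI application:

```shell
#!/bin/sh
# fake_srun is a hypothetical stand-in for srun (illustration only).
# It reports whether LD_BIND_NOW is visible to the launcher itself,
# then runs the command it was given, as srun would launch a task.
fake_srun() {
    echo "launcher sees LD_BIND_NOW=${LD_BIND_NOW:-unset}"
    "$@"
}

# Keep the variable out of the job script's own environment, so the
# launcher does not inherit it (for example from a profile or module).
unset LD_BIND_NOW

# env sets LD_BIND_NOW only for the launched task, not for the launcher.
launcher_out=$(fake_srun env LD_BIND_NOW=1 sh -c 'echo "task sees LD_BIND_NOW=$LD_BIND_NOW"')
echo "$launcher_out"
```

In a real job script, the same pattern applied to the failing command from the description would be `srun --ntasks=1152 env LD_BIND_NOW=1 ./fv3.exe`, with no `export LD_BIND_NOW=1` earlier in the script.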
Comment 2 Nate Rini 2019-03-07 12:40:02 MST
Raghu,

I'm going to close this bug. Please reply if you have any more questions.

--Nate