Ticket 6619

Summary: Srun fails with error messages about missing Cray libraries
Product: Slurm
Reporter: Raghu Reddy <Raghu.Reddy>
Component: User Commands
Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 18.08.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6112
https://bugs.schedmd.com/show_bug.cgi?id=4269
Site: NOAA
NOAA Site: NESCC
Attachments: slurm.conf file

Description Raghu Reddy 2019-03-01 15:17:48 MST
Created attachment 9383 [details]
slurm.conf file

Hi,

We don't have a Cray XC type of system and have not explicitly configured Slurm for Cray, yet we are getting these errors:

+ srun --ntasks=1152  ./fv3.exe
srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_cray.so): /apps/slurm/18.08.3/lib/slurm/select_cray.so: undefined symbol: post_job_step
srun: error: Couldn't load specified plugin name for select/cray: Dlopen of plugin file failed
srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_serial.so): /apps/slurm/18.08.3/lib/slurm/select_serial.so: undefined symbol: drain_nodes
srun: error: Couldn't load specified plugin name for select/serial: Dlopen of plugin file failed
srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_cons_res.so): /apps/slurm/18.08.3/lib/slurm/select_cons_res.so: undefined symbol: powercap_get_cluster_current_cap
srun: error: Couldn't load specified plugin name for select/cons_res: Dlopen of plugin file failed
srun: error: plugin_load_from_file: dlopen(/apps/slurm/18.08.3/lib/slurm/select_linear.so): /apps/slurm/18.08.3/lib/slurm/select_linear.so: undefined symbol: slurm_job_preempt_mode
srun: error: Couldn't load specified plugin name for select/linear: Dlopen of plugin file failed
srun: fatal: Can't find plugin for select/linear
++ date
+ echo 'Model ended:    ' Fri Mar 1 21:52:12 GMT 2019
Model ended:     Fri Mar 1 21:52:12 GMT 2019
+ exit

Is there some config flag missing?

Attached is our slurm.conf

It appears that this happens when users set:

export LD_BIND_NOW=1 

When we removed this line in the past, the errors went away.

But there are instances where users *need* to set this for their application to work, so we would like to know if there is a way for this to work even when this environment variable is set.

Thanks!
Comment 1 Nate Rini 2019-03-01 15:22:46 MST
(In reply to Raghu Reddy from comment #0)
> It appears that this happens when users set:
> export LD_BIND_NOW=1 
Slurm does not support running with LD_BIND_NOW set. Slurm uses a plugin architecture that is not compatible with LD_BIND_NOW.
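
As a rough illustration of guarding against this in a job script, the sketch below refuses to continue to srun when LD_BIND_NOW is set to a non-empty value, which is the condition under which the dynamic linker resolves all symbols at load time. The function name check_bind_now is hypothetical and is not part of Slurm.

```shell
# Hypothetical helper (not a Slurm command): report whether the current
# environment would make srun start with LD_BIND_NOW in effect.
check_bind_now() {
    # ld.so treats any non-empty value of LD_BIND_NOW as "enabled".
    if [ -n "${LD_BIND_NOW:-}" ]; then
        echo "LD_BIND_NOW is set; srun may fail to load its plugins" >&2
        return 1
    fi
    return 0
}

# Possible use in a job script, before the srun line:
#   check_bind_now || exit 1
```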

> When we removed this line in the past, the errors went away.
> 
> But there are instances where users *need* to set this for their application
> to work, so we would like to know if there is a way for this to work even when
> this environment variable is set.

Instead of setting LD_BIND_NOW in the job script, set it only for the MPI application, so that srun itself runs without it. Something as simple as this works:
> srun env LD_BIND_NOW=1 $MPIJOB
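
The effect of the env prefix can be checked without Slurm. In the sketch below, launcher is a hypothetical stand-in for srun: it reports what it sees in its own environment and then runs its arguments, the way srun starts the task command. Only the process started through env gets the variable.

```shell
# Hypothetical stand-in for srun (an assumption for illustration only):
# print the launcher's own view of LD_BIND_NOW, then run the given command.
launcher() {
    echo "launcher sees LD_BIND_NOW=${LD_BIND_NOW:-unset}"
    "$@"
}

# Keep the variable out of the job script's environment ...
unset LD_BIND_NOW

# ... and let env(1) set it only for the task that is started.
launcher env LD_BIND_NOW=1 sh -c 'echo "task sees LD_BIND_NOW=$LD_BIND_NOW"'
# prints:
#   launcher sees LD_BIND_NOW=unset
#   task sees LD_BIND_NOW=1
```

In a real job script the last line would be the srun command quoted above, with the unset kept in front of it in case the variable is inherited from the submitting shell.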
Comment 2 Nate Rini 2019-03-07 12:40:02 MST
Raghu

I'm going to close this bug. Please reply if you have any more questions.

--Nate