Ticket 4389

Summary: sinfo --cluster returns "an unknown select plugin_id 108" error
Product: Slurm
Reporter: Brian F Gilmer <brian.gilmer>
Component: Configuration
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: bart, da, fabrice.cantos
Version: 17.02.9   
Hardware: Linux   
OS: Linux   
Site: CRAY
Cray Sites: Other
Machine Name: Kupe
Version Fixed: 17.11.0-rc4

Description Brian F Gilmer 2017-11-15 16:09:15 MST
The site has an XC (kupe) and two VM clusters (kupe_mp and kupe_librarian).  The slurmdbd is running outside of the XC mainframe.  When trying to access sinfo from a 'login' host I get:

[root@ec-login01 munge]# sinfo --cluster=kupe
sinfo: error: Cluster 'kupe' has an unknown select plugin_id 108
sinfo: error: 'kupe' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters.


[root@ec-login01 munge]# sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
      kupe 192.168.235.165         6817  7936         1                                                                                           normal
kupe_libr+                            0     0         1                                                                                           normal
   kupe_mp   10.64.125.139         6817  7936         1                                                                                           normal

Select plugin ID 108 corresponds to SELECT_PLUGIN_CRAY_CONS_RES.  Since that is not really a plugin of its own, it is not found when looking through the available select plugins.
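
A minimal sketch of how the ID in the error message can be pulled out and named when triaging this kind of failure. The helper names extract_plugin_id and select_plugin_name are hypothetical and not part of Slurm. Only the 108 mapping comes from this ticket; other IDs are deliberately left out rather than guessed.

```shell
# Hypothetical helper: read sinfo error text on stdin and print the
# numeric plugin_id from an "unknown select plugin_id N" message.
extract_plugin_id() {
  sed -n 's/.*unknown select plugin_id \([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: map a select plugin_id to its symbolic name.
# Only 108 is taken from this ticket; anything else prints "unknown".
select_plugin_name() {
  case "$1" in
    108) echo "SELECT_PLUGIN_CRAY_CONS_RES" ;;
    *)   echo "unknown" ;;
  esac
}
```

For example, `sinfo --cluster=kupe 2>&1 | extract_plugin_id` would print the ID from the error shown above, which can then be passed to select_plugin_name.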
Comment 1 Moe Jette 2017-11-15 16:30:08 MST
Changing component from "Federation" to "Configuration".
Someone should pursue this more on Thursday.
Comment 2 Brian F Gilmer 2017-11-16 12:30:58 MST
Thanks,

The customer has a workflow manager that runs jobs on both the XC and CS systems.  They were relying on the --cluster feature.
Comment 3 Dominik Bartkiewicz 2017-11-17 04:58:29 MST
Hi

For now I have found only a workaround.
Can you add other_cons_res to SelectTypeParameters on ec-login01?
If you don't use select_alps, this shouldn't change anything.

Dominik
Comment 4 Dominik Bartkiewicz 2017-11-17 07:19:31 MST
Just in case: if you use select_cray with select_linear on other machines, this workaround is wrong.
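
A minimal sketch for checking whether the workaround is already present in a given slurm.conf. The function name has_other_cons_res is hypothetical, and the config path is whatever applies on the host being checked. It only looks at uncommented SelectTypeParameters lines and does not validate the rest of the select configuration.

```shell
# Hypothetical helper: succeed if an active (uncommented)
# SelectTypeParameters line in the given file mentions other_cons_res.
# $1: path to a slurm.conf file
has_other_cons_res() {
  grep -Ei '^[[:space:]]*SelectTypeParameters[[:space:]]*=' "$1" \
    | grep -qi 'other_cons_res'
}
```

Usage might look like `has_other_cons_res /etc/slurm/slurm.conf && echo "workaround present"`, where the path is an assumption. As noted in the caveat above, the setting is not appropriate where select_cray is combined with select_linear.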
Comment 10 Brian F Gilmer 2017-11-22 06:56:12 MST
Hello

I modified slurm.conf on kupe_mp (VM cluster).  I added SelectTypeParameters=other_cons_res as a workaround for this problem.
Comment 15 Dominik Bartkiewicz 2017-11-27 07:39:23 MST
Hi

Commit https://github.com/SchedMD/slurm/commit/d3338956fe9 should fix this issue.
It will be included in the 17.11 release.
Could you confirm whether this solves the problem in your environment?

Dominik
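
A rough sketch for checking whether an installed version is at or past the 17.11 release that carries the fix. The function name version_has_fix is hypothetical. It compares only the major and minor fields, so it does not distinguish 17.11.0 release candidates earlier than rc4 (the "Version Fixed" value on this ticket) from later builds.

```shell
# Hypothetical helper: succeed if a Slurm version string is 17.11 or newer.
# $1: version string such as "17.02.9" or "17.11.0-rc4"
version_has_fix() {
  ver=${1%%-*}        # drop any suffix such as "-rc4"
  major=${ver%%.*}    # text before the first dot
  rest=${ver#*.}      # text after the first dot
  minor=${rest%%.*}   # second field
  if [ "$major" -gt 17 ]; then
    return 0
  fi
  [ "$major" -eq 17 ] && [ "$minor" -ge 11 ]
}
```

One way to use it is `version_has_fix "$(sinfo --version | awk '{print $2}')"`, assuming the version output has the form "slurm 17.11.0", followed by re-running `sinfo --cluster=kupe` from the login host to confirm the error is gone.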
Comment 16 Dominik Bartkiewicz 2017-11-28 09:35:53 MST
Hi

I'm going to go ahead and mark this as Resolved/Fixed. Please feel free to re-open this if there's anything else we can help with.

Dominik