Ticket 19580

Summary: Post-23.11 patches for Slingshot switch plugin
Product: Slurm Reporter: Jim Nordby <james.nordby>
Component: HPE Slingshot Assignee: Tim Wickberg <tim>
Status: OPEN --- QA Contact:
Severity: C - Contributions    
Priority: --- CC: david.gloe, james.nordby
Version: 23.11.5   
Hardware: Cray Shasta   
OS: Linux   
Site: CRAY Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: Cray Internal
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: Concatenated patches for Slingshot plugin for 23.11 and beyond
Fix for switch config bug introduced by last patch

Description Jim Nordby 2024-04-11 12:06:24 MDT
Created attachment 35820
Concatenated patches for Slingshot plugin for 23.11 and beyond

Attached are the latest fixes for the HPE Slingshot switch plugin (concatenated into one patch).
Comment 1 Jim Nordby 2024-04-15 09:37:18 MDT
Created attachment 35865
Fix for switch config bug introduced by last patch
Comment 2 Tim Wickberg 2024-04-15 13:12:21 MDT
Jim -

Based on the commit descriptions, am I correct in assuming that the "fabric manager" aka "jackaloped" part of this plugin does not work in Slurm 23.11 currently?

Are these patches being provided out-of-band to any customers, or is this meant as a longer-term evolution of those interfaces in anticipation of customer demand?

thanks,
- Tim
Comment 3 Jim Nordby 2024-04-15 13:53:34 MDT
(In reply to Tim Wickberg from comment #2)
> Jim -
> 
> Based on the commit descriptions, am I correct in assuming that the "fabric
> manager" aka "jackaloped" part of this plugin does not work in Slurm 23.11
> currently?
> 
> Are these patches being provided out-of-band to any customers, or is this
> meant as a longer-term evolution of those interfaces in anticipation of
> customer demand?
> 
> thanks,
> - Tim

The Slingshot fabric manager and jackaloped are two different REST interfaces, used for different functionality (jackaloped for "instant on", fabric manager for Slingshot accelerated collectives). Both work, although we're not recommending using the instant on functionality, as it hasn't been tested for scalability. The collectives feature will be used by customers in future Slingshot releases (hard to say exactly when that will be released, but the plugin code is needed for testing).
Comment 4 Tim Wickberg 2024-04-15 14:29:10 MDT
> The slingshot fabric manager and jackaloped are two different REST
> interfaces,
> used for different functionality (jackaloped for "instant on", fabric
> manager for Slingshot accelerated collectives). 

Ah, my mistake conflating those.

> Both work, although we're
> not recommending using the instant on functionality as it hasn't been tested
> for scalability.

I would agree that this should not be recommended.

> The collectives feature will be used by customers in
> future Slingshot releases (hard to say exactly when that will be released,
> but the plugin code is needed for testing).

Given the "will be", can I assume this is used by zero customers today?

We need to have a higher-level discussion on how to manage these plugins going forward. I'll split that discussion to a direct email thread.