Ticket 19580 - Post-23.11 patches for Slingshot switch plugin
Summary: Post-23.11 patches for Slingshot switch plugin
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: HPE Slingshot
Version: 23.11.5
Hardware: Cray Shasta Linux
: C - Contributions
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-04-11 12:06 MDT by Jim Nordby
Modified: 2024-05-06 19:20 MDT

See Also:
Site: CRAY
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: Cray Internal
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Concatenated patches for Slingshot plugin for 23.11 and beyond (184.51 KB, application/mbox)
2024-04-11 12:06 MDT, Jim Nordby
Fix for switch config bug introduced by last patch (14.98 KB, patch)
2024-04-15 09:37 MDT, Jim Nordby

Description Jim Nordby 2024-04-11 12:06:24 MDT
Created attachment 35820
Concatenated patches for Slingshot plugin for 23.11 and beyond

Attached are the latest fixes for the HPE Slingshot switch plugin (concatenated into one patch).
Comment 1 Jim Nordby 2024-04-15 09:37:18 MDT
Created attachment 35865
Fix for switch config bug introduced by last patch
Comment 2 Tim Wickberg 2024-04-15 13:12:21 MDT
Jim -

Based on the commit descriptions, am I correct in assuming that the "fabric manager" aka "jackaloped" part of this plugin does not work in Slurm 23.11 currently?

Are these patches being provided out-of-band to any customers, or is this meant as a longer-term evolution of those interfaces in anticipation of customer demand?

thanks,
- Tim
Comment 3 Jim Nordby 2024-04-15 13:53:34 MDT
(In reply to Tim Wickberg from comment #2)
> Jim -
> 
> Based on the commit descriptions, am I correct in assuming that the "fabric
> manager" aka "jackaloped" part of this plugin does not work in Slurm 23.11
> currently?
> 
> Are these patches being provided out-of-band to any customers, or is this
> meant as a longer-term evolution of those interfaces in anticipation of
> customer demand?
> 
> thanks,
> - Tim

The Slingshot fabric manager and jackaloped are two different REST interfaces, used for different functionality (jackaloped for "instant on", fabric manager for Slingshot accelerated collectives). Both work, although we're not recommending using the instant on functionality, as it hasn't been tested for scalability. The collectives feature will be used by customers in future Slingshot releases (it's hard to say exactly when that will be released, but the plugin code is needed for testing).
Comment 4 Tim Wickberg 2024-04-15 14:29:10 MDT
> The slingshot fabric manager and jackaloped are two different REST
> interfaces,
> used for different functionality (jackaloped for "instant on", fabric
> manager for Slingshot accelerated collectives). 

Ah, my mistake conflating those.

> Both work, although we're
> not recommending using the instant on functionality as it hasn't been tested
> for scalability.

I would agree that this should not be recommended.

> The collectives feature will be used by customers in
> future Slingshot releases (hard to say exactly when that will be released,
> but the plugin code is needed for testing).

Given the "will be", can I assume this is used by zero customers today?

We need to have a higher-level discussion on how to manage these plugins going forward. I'll split that discussion to a direct email thread.