Ticket 22439 - Fix Slingshot plugin 401 error handling
Summary: Fix Slingshot plugin 401 error handling
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: HPE Slingshot
Version: 25.05.x
Hardware: Other Linux
: C - Contributions
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-03-26 10:35 MDT by Jim Nordby
Modified: 2025-03-26 15:18 MDT

See Also:
Site: CRAY
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: Cray Internal
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Fix Slingshot plugin 401 error handling (3.28 KB, application/mbox)
2025-03-26 10:35 MDT, Jim Nordby

Description Jim Nordby 2025-03-26 10:35:07 MDT
Created attachment 41270
Fix Slingshot plugin 401 error handling

A patch is attached to fix HPE Slingshot plugin handling of Fabric Manager token expiration. The patch's comment is below:

<start>
At init time, the Slurm Slingshot plugin uses a Fabric Manager (FM)
login endpoint to get a token used for authn/authz for subsequent
REST calls to the FM.  When that token expires, the plugin is
supposed to log in again and get a new token.

However, when the token expired, the FM was returning an HTML
error string (not JSON as expected), so the response-to-JSON
routine failed without trying to re-acquire the token.

To fix, don't error out when we can't convert the response to JSON
(or when we get no response to the REST call). Instead, keep
going and re-acquire the token on HTTP 401/403.
Also added a way to test token expiration without relying on the FM.

Tested on sawmill with MAX_CACHE_USED=3 (which corrupts the
authorization header after every 3 calls) to simulate token
expiration.
<end>
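
The control flow described in the patch comment can be sketched roughly as follows. This is an illustration only, written in Python rather than the plugin's C, and it is not the code from the attached patch. FmClient, parse_json_or_none, and the send/login callables are hypothetical names introduced here. The point is that a response body that is empty or not JSON is treated as "no data" rather than as a fatal error, so the HTTP status check still runs and the token is re-acquired on 401/403.

```python
import json
from typing import Any, Callable, Optional, Tuple

# Hypothetical transport: takes (url, token), returns (http_status, body).
# The body may be None (no response) or non-JSON text (e.g. an HTML page).
Transport = Callable[[str, str], Tuple[int, Optional[str]]]


def parse_json_or_none(body: Optional[str]) -> Any:
    """Decode a JSON body; return None instead of failing on bad input."""
    if not body:
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # An HTML error string lands here; the caller keeps going.
        return None


class FmClient:
    """Minimal stand-in for a REST client that holds a login token."""

    def __init__(self, send: Transport, login: Callable[[], str]):
        self._send = send
        self._login = login
        self._token = login()  # token acquired once at init time

    def call(self, url: str) -> Tuple[int, Any]:
        status, body = self._send(url, self._token)
        data = parse_json_or_none(body)  # a parse failure is not fatal
        if status in (401, 403):
            # Token presumably expired: log in again and retry once.
            self._token = self._login()
            status, body = self._send(url, self._token)
            data = parse_json_or_none(body)
        return status, data
```

A single retry is used in this sketch so that a persistent authorization failure is still reported to the caller instead of looping.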

The patch has a hack that can be used to test the change without actually expiring tokens.
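
One way such a test hook could look is sketched below, again in Python for illustration and not taken from the patch. CorruptingAuth and its header() method are hypothetical. The "Bearer" scheme and the choice to corrupt every Nth call are assumptions, since the ticket does not show how MAX_CACHE_USED is set or exactly which call gets the corrupted header.

```python
class CorruptingAuth:
    """Builds Authorization headers, corrupting every Nth one on purpose.

    max_cache_used mirrors the MAX_CACHE_USED setting named in the patch
    comment; 0 disables corruption. The counting rule is an assumption.
    """

    def __init__(self, max_cache_used: int = 0):
        self.max_cache_used = max_cache_used
        self.calls = 0

    def header(self, token: str) -> dict:
        self.calls += 1
        corrupt = (
            self.max_cache_used > 0
            and self.calls % self.max_cache_used == 0
        )
        # A corrupted token should make the server answer 401/403,
        # which exercises the re-login path without waiting for expiry.
        value = token + "-corrupted" if corrupt else token
        return {"Authorization": "Bearer " + value}
```

With max_cache_used=3, the third, sixth, ninth, and later multiples-of-three calls carry a bad token, so the re-acquire path is hit regularly during an ordinary test run.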
Comment 1 Tim Wickberg 2025-03-26 11:03:14 MDT
Can you comment on when/if this is getting fixed in the FM?
Comment 2 Jim Nordby 2025-03-26 15:18:17 MDT
> Can you comment on when/if this is getting fixed in the FM?

I wouldn't plan on it getting fixed; I believe the HTML error
is actually coming from an nginx proxy, and I haven't had good
luck asking folks to change the look of the error.