Summary: | Fix Slingshot plugin 401 error handling | ||
---|---|---|---|
Product: | Slurm | Reporter: | Jim Nordby <james.nordby> |
Component: | HPE Slingshot | Assignee: | Tim Wickberg <tim> |
Status: | OPEN --- | QA Contact: | |
Severity: | C - Contributions | ||
Priority: | --- | CC: | david.gloe, james.nordby |
Version: | 25.05.x | ||
Hardware: | Other | ||
OS: | Linux | ||
Site: | CRAY | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | Cray Internal |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Attachments: | Fix Slingshot plugin 401 error handling |
Can you comment on when/if this is getting fixed in the FM?

> Can you comment on when/if this is getting fixed in the FM?
I wouldn't plan on it getting fixed; I believe the HTML error
is actually coming from an nginx proxy, and I haven't had good
luck asking folks to change the look of the error.
Created attachment 41270 [details]
Fix Slingshot plugin 401 error handling

A patch is attached to fix HPE Slingshot plugin handling of Fabric Manager token expiration. The patch's comment is below:

<start>
At init time, the Slurm Slingshot plugin uses a Fabric Manager (FM) login endpoint to get a token used for authn/authz for subsequent REST calls to the FM. When that token expires, the plugin is supposed to log in again and get a new token. However, when the token expired, the FM was returning an HTML error string (not JSON as expected), so the response-to-JSON routine failed without trying to re-acquire the token.

To fix, don't error out when we can't convert the response to JSON (or if we don't get any response to the REST call); just keep going and re-acquire the token on HTTP 401/403.

Also added a way to test token expiration without relying on the FM. Tested on sawmill with MAX_CACHE_USED=3 (which corrupted the authorization header after every 3 calls) to simulate token expiration.
<end>

The patch has a hack that can be used to test the change without actually expiring tokens.
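
For readers who want the control flow without reading the patch, here is a minimal Python sketch of the behavior described in the patch comment. It is not the plugin's code (the plugin is written in C), and every name in it (`Client`, `FakeFM`, `parse_json_or_none`, the paths, and the HTML error body) is hypothetical. It only illustrates the idea: a response that cannot be parsed as JSON is not fatal by itself, and an HTTP 401/403 triggers a fresh login and one retry.

```python
import json

AUTH_ERRORS = (401, 403)


def parse_json_or_none(body):
    """Return decoded JSON, or None if the body is empty or not JSON
    (for example, an HTML error page from a proxy)."""
    if not body:
        return None
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


class Client:
    """Hypothetical REST client that re-acquires its token on 401/403."""

    def __init__(self, login, send):
        self.login = login    # assumed callable: () -> token string
        self.send = send      # assumed callable: (token, path) -> (status, body)
        self.token = login()  # token acquired once at init time

    def request(self, path):
        status, body = self.send(self.token, path)
        data = parse_json_or_none(body)  # a parse failure alone is not an error
        if status in AUTH_ERRORS:
            # Token presumably expired: log in again and retry once.
            self.token = self.login()
            status, body = self.send(self.token, path)
            data = parse_json_or_none(body)
        if status != 200 or data is None:
            raise RuntimeError(f"request to {path} failed with HTTP {status}")
        return data


class FakeFM:
    """Stand-in for the Fabric Manager, used only to exercise the sketch."""

    def __init__(self):
        self.issued = 0
        self.valid = None

    def login(self):
        self.issued += 1
        self.valid = f"token-{self.issued}"
        return self.valid

    def expire(self):
        self.valid = None  # simulate token expiration

    def send(self, token, path):
        if token != self.valid:
            # Non-JSON error body, as described in the patch comment.
            return 401, "<html><body>401 Authorization Required</body></html>"
        return 200, json.dumps({"path": path})


if __name__ == "__main__":
    fm = FakeFM()
    client = Client(fm.login, fm.send)
    print(client.request("/first"))
    fm.expire()
    print(client.request("/second"))  # succeeds after a transparent re-login
    print("logins:", fm.issued)
```

The `FakeFM.expire()` call plays a role similar to the test hack mentioned above: it forces an authorization failure without waiting for a real token to expire.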