Ticket 22289 - Use of single_node_vni in HPE slingshot
Summary: Use of single_node_vni in HPE slingshot
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: HPE Slingshot
Version: 23.02.6
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Tim McMullan
Reported: 2025-03-07 04:33 MST by Thomas.green
Modified: 2025-03-07 05:52 MST
Site: Bristol AI
Description Thomas.green 2025-03-07 04:33:53 MST
For single-node jobs using HPE Slingshot, we are seeing the following error printed in the job output:

srun: error: Unable to create step for job 73324: Error configuring interconnect

This appears to be related to single_node_vni, since the error above is printed when running with:

srun --network=single_node_vni namd3 stmv.namd

If I run without the --network option, I get a traceback from NAMD that includes this message:

Reason: OFI::LrtsInit::fi_domain error, for single node use try --network=single_node_vni

Looking at the code, I see there is a SwitchParameters option with a similar name, but I may not always want to have it enabled. I assumed the job-level setting would override it, but the code suggests it checks the config value. I am not sure whether it is working. I would expect an error to be reported in the slurmd log, but I do not see any errors in the logs, which is strange.

So 2 questions:

1. Should single_node_vni be set in SwitchParameters to be able to use it?
2. Should I be seeing errors in the logs?

Our log settings are:

SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldSyslogDebug    = (null)
SlurmdLogFile           = (null)
SlurmdSyslogDebug       = (null)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
Comment 1 Thomas.green 2025-03-07 04:39:46 MST
I just checked slurmctld.log and see:

error: Single-node VNI requested by user, but 'single_node_vni=<all|user>' not set in SwitchParameters

What seems odd is that setting this would make it always be used. I had assumed I could override it per job with the --network option.
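
For anyone checking their own controller log for the same message, here is a minimal sketch (not part of Slurm) that filters log lines for it. The function name find_vni_config_errors is hypothetical, and the path in the comment is simply the SlurmctldLogFile value from the description above.

```python
import re
from typing import Iterable, List

# Matches the slurmctld error quoted above; the quoted option text in the
# middle is matched loosely in case its exact wording differs by version.
VNI_ERROR = re.compile(
    r"Single-node VNI requested by user, but .* not set in SwitchParameters"
)


def find_vni_config_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the single-node VNI config error."""
    return [line.rstrip("\n") for line in lines if VNI_ERROR.search(line)]


# Possible usage (path taken from the SlurmctldLogFile setting above):
#   with open("/var/log/slurm/slurmctld.log") as fh:
#       for hit in find_vni_config_errors(fh):
#           print(hit)
```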
Comment 2 Tim McMullan 2025-03-07 05:30:14 MST
You will probably want to set this in the slurm.conf file:
> SwitchParameters=single_node_vni=user

By default it is set to "none", which is why you are seeing the "requested by user, but ... not set in SwitchParameters" error.

single_node_vni=user should only allocate the VNI when a user requests it, which sounds like the behavior you are expecting!
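
To make the three settings concrete, here is a small Python sketch of the behavior as described in this ticket. It is an illustration only, not Slurm's actual code: the names VniMode, parse_single_node_vni and step_outcome are hypothetical, the comma-separated parsing of SwitchParameters is an assumption, and the "all" case follows the reading in comment 1 that it always uses the VNI.

```python
from enum import Enum


class VniMode(Enum):
    NONE = "none"  # default per this ticket
    USER = "user"  # allocate only when the user asks for it
    ALL = "all"    # assumed: always allocate


def parse_single_node_vni(switch_parameters: str) -> VniMode:
    """Pull the single_node_vni value out of a SwitchParameters string.

    Assumes a comma-separated list of key=value items. Values other than
    none/user/all raise ValueError; that is a simplification of this sketch.
    """
    for item in switch_parameters.split(","):
        key, _, value = item.strip().partition("=")
        if key == "single_node_vni":
            return VniMode(value.lower())
    return VniMode.NONE


def step_outcome(mode: VniMode, user_requested: bool) -> str:
    """Outcome for a single-node step: 'vni', 'no_vni', or 'error'."""
    if mode is VniMode.ALL:
        return "vni"
    if mode is VniMode.USER:
        return "vni" if user_requested else "no_vni"
    # mode is NONE: a user request cannot be honored, as in the log message
    return "error" if user_requested else "no_vni"


mode = parse_single_node_vni("single_node_vni=user")
print(step_outcome(mode, user_requested=True))
print(step_outcome(VniMode.NONE, user_requested=True))
```

Here user_requested stands for running srun with --network=single_node_vni. With the default of "none", that request maps to the "error" case, which matches the "Error configuring interconnect" failure reported above.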

Let me know if this helps!
Thanks!
--Tim
Comment 3 Thomas.green 2025-03-07 05:36:11 MST
Hi,

Thanks for the quick reply.  I didn't realise that "user" is what signifies that; I thought it was related to how the VNI is configured in Slingshot.  That makes perfect sense. I will put in a change request locally to add that option.

Thanks.
Comment 4 Tim McMullan 2025-03-07 05:52:13 MST
(In reply to Thomas.green from comment #3)

Sure thing, I'm glad I could help!

Let me know how the change goes!
--Tim