Ticket 22289

Summary: Use of single_node_vni in HPE slingshot
Product: Slurm
Reporter: Thomas.green
Component: HPE Slingshot
Assignee: Tim McMullan <mcmullan>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 23.02.6   
Hardware: Linux   
OS: Linux   
Site: Bristol AI

Description Thomas.green 2025-03-07 04:33:53 MST
For single-node jobs using HPE Slingshot, we are seeing the following error printed in the job output:

srun: error: Unable to create step for job 73324: Error configuring interconnect

This seems to be related to single_node_vni, because the error above is printed when running with:

srun --network=single_node_vni namd3 stmv.namd

If I run without the --network option, I get a traceback from NAMD that includes this message:

Reason: OFI::LrtsInit::fi_domain error, for single node use try --network=single_node_vni

Looking at the code, I see there is a SwitchParameters option with a similar name, but I may not always want it enabled. I assumed the job-level setting would override it, but the code suggests it checks the config value. I am not sure whether it is working. I would expect an error to be reported in the slurmd log, but I do not see any errors in the logs, which is strange.

So 2 questions:

1. Should single_node_vni be set in SwitchParameters to be able to use it?
2. Should I be seeing errors in the logs?

Our log settings are:

SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldSyslogDebug    = (null)
SlurmdLogFile           = (null)
SlurmdSyslogDebug       = (null)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
Comment 1 Thomas.green 2025-03-07 04:39:46 MST
Just checked slurmctld.log and see:

error: Single-node VNI requested by user, but 'single_node_vni=<all|user>' not set in SwitchParameters

What seems odd is that setting that parameter would always enable it. I would have assumed I could override it with the --network option.
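
A minimal sketch of how the log check above could be repeated, assuming a POSIX shell on the controller host. The function name `find_vni_errors` is hypothetical and not part of Slurm; it only searches for the message text quoted in this comment. Note that the message here was written by slurmctld rather than slurmd, which may be why nothing appeared in the slurmd logs.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): list single-node VNI errors
# found in a log file, with their line numbers.
#   $1 = path to a log file, e.g. the SlurmctldLogFile from the settings above
find_vni_errors() {
    grep -n "error: Single-node VNI requested by user" "$1"
}

# Example call, using the path reported in this ticket:
#   find_vni_errors /var/log/slurm/slurmctld.log
```

The helper exits non-zero when no matching line is found, which is the usual `grep` behaviour.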
Comment 2 Tim McMullan 2025-03-07 05:30:14 MST
You will probably want to set this in the slurm.conf file:
> SwitchParameters=single_node_vni=user

By default it is set to "none", which is why you are seeing the "requested by user, but ... not set in SwitchParameters" error.

single_node_vni=user should only allocate the VNI when a user requests it, which sounds like the behavior you are expecting!

Let me know if this helps!
Thanks!
--Tim
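
The three settings discussed in this thread can be summarised as a small decision table. The sketch below is only a model of the behaviour described in the comments above (it does not call Slurm), and the function name `single_node_vni_outcome` and its "allocate"/"skip"/"error" labels are made up for illustration.

```shell
#!/bin/sh
# Hypothetical model (not Slurm code) of what happens for a single-node job step.
#   $1 = single_node_vni value from SwitchParameters: all | user | none
#   $2 = "yes" if the step was run with --network=single_node_vni, otherwise "no"
# Prints: allocate (a VNI is allocated), skip (no VNI), or error (step rejected).
single_node_vni_outcome() {
    case "$1" in
        all)
            # Always allocate, whether or not the user asked for it.
            echo "allocate" ;;
        user)
            # Allocate only when the user requests it on the command line.
            if [ "$2" = "yes" ]; then echo "allocate"; else echo "skip"; fi ;;
        none)
            # Default: a user request fails with the error seen in slurmctld.log.
            if [ "$2" = "yes" ]; then echo "error"; else echo "skip"; fi ;;
        *)
            echo "unknown single_node_vni value: $1" >&2
            return 1 ;;
    esac
}
```

For example, `single_node_vni_outcome none yes` prints `error`, which matches the situation reported in this ticket, while `single_node_vni_outcome user yes` prints `allocate`.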
Comment 3 Thomas.green 2025-03-07 05:36:11 MST
Hi,

Thanks for the quick reply.  I didn't realise that "user" is what signifies that; I thought it was related to how the VNI is configured in Slingshot.  Makes perfect sense. I will put in a change request locally to add that option.

Thanks.
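
Once the change is in place, the setting can be checked with a quick script. This is a rough sketch, assuming the option is written directly in a slurm.conf-style file rather than in an included file; `get_single_node_vni` is a hypothetical helper name, and it reports "none" when the option is absent, following the default described in comment 2.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the single_node_vni value
# found in SwitchParameters in a slurm.conf-style file, or "none" if unset.
#   $1 = path to the configuration file
get_single_node_vni() {
    awk '
        {
            line = $0
            sub(/#.*/, "", line)                 # drop comments
            eq = index(line, "=")
            if (eq == 0) next                    # not a key=value line
            key = tolower(substr(line, 1, eq - 1))
            gsub(/[ \t]/, "", key)
            if (key != "switchparameters") next
            # Split the comma-separated option list and look for our option.
            n = split(substr(line, eq + 1), opts, ",")
            for (i = 1; i <= n; i++) {
                opt = opts[i]
                gsub(/[ \t]/, "", opt)
                if (opt ~ /^single_node_vni=/)
                    value = substr(opt, length("single_node_vni=") + 1)
            }
        }
        END { print (value == "" ? "none" : value) }
    ' "$1"
}
```

The path to pass in depends on the site; the running configuration reported by `scontrol show config` remains the authoritative source.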
Comment 4 Tim McMullan 2025-03-07 05:52:13 MST
(In reply to Thomas.green from comment #3)

Sure thing, I'm glad I could help!

Let me know how the change goes!
--Tim
Comment 5 Tim McMullan 2025-04-03 06:43:32 MDT
Hi!

Since I haven't heard back on this, I'm guessing that the change went OK and it's done what you needed it to do.

I'll close this for now, but if you have any questions please let us know!

Thanks!
--Tim