For single-node jobs on HPE Slingshot, we are seeing the following error in the job output:

    srun: error: Unable to create step for job 73324: Error configuring interconnect

This appears to be related to single_node_vni, since the error is printed when running with:

    srun --network=single_node_vni namd3 stmv.namd

If I run without the --network option, I get a traceback from NAMD with this message:

    Reason: OFI::LrtsInit::fi_domain error, for single node use try --network=single_node_vni

Looking at the code, I see there is a SwitchParameters option with a similar name, but I may not always want it enabled. I assumed the job-level setting would override the configuration, but the code suggests it checks the config value. I am not sure whether it is working. I would expect an error to be reported in the slurmd log, but I do not see any errors in the logs, which is strange.

So, two questions:

1. Should single_node_vni be set in SwitchParameters to be able to use it?
2. Should I be seeing errors in the logs?

Our log settings are:

    SlurmctldLogFile = /var/log/slurm/slurmctld.log
    SlurmctldSyslogDebug = (null)
    SlurmdLogFile = (null)
    SlurmdSyslogDebug = (null)
    SlurmSchedLogFile = (null)
    SlurmSchedLogLevel = 0
I just checked slurmctld.log and see:

    error: Single-node VNI requested by user, but 'single_node_vni=<all|user>' not set in SwitchParameters

What seems odd is that setting it in SwitchParameters would always enable it. I would have assumed I could override it per job with --network.
You will probably want to set this in the slurm.conf file:

> SwitchParameters=single_node_vni=user

By default it is set to "none", which is why you are seeing the "requested by user, but ... not set in SwitchParameters" error. single_node_vni=user should only allocate the VNI when a user requests it, which sounds like the behavior you are expecting!

Let me know if this helps!

Thanks!
--Tim
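For reference, a minimal sketch of how the relevant slurm.conf lines might look with that change. The SwitchType line is an assumption about this site's existing Slingshot setup and is not taken from this ticket. Only the SwitchParameters value comes from the suggestion above.

```
# slurm.conf (fragment, illustrative only)

# Assumption: the cluster already uses the HPE Slingshot switch plugin.
SwitchType=switch/hpe_slingshot

# Allocate a single-node VNI only when the job asks for one,
# e.g. via: srun --network=single_node_vni ...
SwitchParameters=single_node_vni=user
```

Once the daemons have picked up the change (whether a reconfigure is enough or a restart is needed is not confirmed here), the active value can be checked with `scontrol show config | grep -i SwitchParameters`. The original `srun --network=single_node_vni namd3 stmv.namd` command can then be retried.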
Hi, Thanks for the quick reply. I didn't realise that user is what signifies that - I thought it was related to how VNI is configured in Slingshot. Makes perfect sense - will put in a change request locally to add that option. Thanks.
(In reply to Thomas.green from comment #3)
> Hi,
>
> Thanks for the quick reply. I didn't realise that user is what signifies
> that - I thought it was related to how VNI is configured in Slingshot.
> Makes perfect sense - will put in a change request locally to add that
> option.
>
> Thanks.

Sure thing, I'm glad I could help! Let me know how the change goes!

--Tim