For single-node jobs using HPE Slingshot, we are seeing the following error in the job output:

    srun: error: Unable to create step for job 73324: Error configuring interconnect

This appears to be related to single_node_vni, since the error is printed when running with:

    srun --network=single_node_vni namd3 stmv.namd

If I run without the --network option, I get a traceback from NAMD with this message:

    Reason: OFI::LrtsInit::fi_domain error, for single node use try --network=single_node_vni

Looking at the code, I see there is a SwitchParameters option with a similar name. I may not always want it enabled, and I assumed the per-job setting would override it, but the code suggests it checks the config value. I am not sure whether it is working. I expected an error to be reported in the slurmd log, but I do not see any errors in the logs, which is strange.

So, two questions:

1. Should single_node_vni be set in SwitchParameters to be able to use it?
2. Should I be seeing errors in the logs?

Our log settings are:

    SlurmctldLogFile = /var/log/slurm/slurmctld.log
    SlurmctldSyslogDebug = (null)
    SlurmdLogFile = (null)
    SlurmdSyslogDebug = (null)
    SlurmSchedLogFile = (null)
    SlurmSchedLogLevel = 0
Just checked slurmctld.log and see:

    error: Single-node VNI requested by user, but 'single_node_vni=<all|user>' not set in SwitchParameters

What is odd is that setting it there would always use it. I would have assumed I could override it per job with --network.
You will probably want to set this in the slurm.conf file:

> SwitchParameters=single_node_vni=user

By default it is set to "none", which is why you are seeing the "requested by user, but not set in SwitchParameters" error. With single_node_vni=user, the VNI should only be allocated when a user requests it, which sounds like the behavior you are expecting!

Let me know if this helps!

Thanks!
--Tim
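To confirm which mode a given slurm.conf selects before and after the change, a small shell helper along the following lines may be useful. This is only a sketch: the function name `single_node_vni_mode` is hypothetical, it only recognises the `single_node_vni=<value>` form quoted in the slurmctld error, and it reports "none" when the option is absent, on the assumption (from the reply above) that "none" is the default.

```shell
# Hypothetical helper (not part of Slurm): print the single_node_vni mode
# found in a slurm.conf-style file, or "none" if it is not set.
single_node_vni_mode() {
    conf="$1"
    # Take the last SwitchParameters line, split its comma-separated
    # options onto separate lines, and extract the single_node_vni value.
    mode=$(grep -i '^[[:space:]]*SwitchParameters[[:space:]]*=' "$conf" \
        | tail -n 1 \
        | tr ',' '\n' \
        | sed -n 's/.*single_node_vni=\([A-Za-z]*\).*/\1/p' \
        | tail -n 1)
    # Assumption: an unset option behaves like "none".
    echo "${mode:-none}"
}
```

For example, `single_node_vni_mode /etc/slurm/slurm.conf` (the path is an assumption and varies by site) would be expected to print `user` once the suggested line is in place. On a running system, `scontrol show config | grep -i SwitchParameters` shows the value the controller has actually loaded, which may differ from the file until the daemons have re-read their configuration.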
Hi, Thanks for the quick reply. I didn't realise that user is what signifies that - I thought it was related to how VNI is configured in Slingshot. Makes perfect sense - will put in a change request locally to add that option. Thanks.
(In reply to Thomas.green from comment #3)
> Hi,
>
> Thanks for the quick reply. I didn't realise that user is what signifies
> that - I thought it was related to how VNI is configured in Slingshot.
> Makes perfect sense - will put in a change request locally to add that
> option.
>
> Thanks.

Sure thing, I'm glad I could help! Let me know how the change goes!

--Tim
Hi! Since I haven't heard back on this, I'm guessing that the change went OK and it's done what you needed it to do. I'll close this for now, but if you have any questions, please let us know!

Thanks!
--Tim