Ticket 14687

Summary: UCX error
Product: Slurm Reporter: Shraddha Kiran <Shraddha_Kiran>
Component: Build System and Packaging    Assignee: Director of Support <support>
Status: RESOLVED INVALID QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: felip.moll
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: AMAT Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: ucx-error.log

Description Shraddha Kiran 2022-08-04 08:53:25 MDT
Created attachment 26158 [details]
ucx-error.log

Hello,

This is with reference to Bug 13622. The Slurm version is 19.05 and we plan to migrate to the latest supported versions.

We are trying to run an MPI-based application but are facing a few issues, shown below:

MPI startup(): Could not import some environment variables. Intel MPI process pinning will not be used.
               Possible reason: Using the Slurm srun command. In this case, Slurm pinning will be used.
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /cm/shared/apps/slurm/19.05.7/lib64/libpmi.so
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /cm/shared/apps/slurm/19.05.7/lib64/libpmi.so
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /cm/shared/apps/slurm/19.05.7/lib64/libpmi.so
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: mlx
[1659623014.793865] [dcaldh001:7315 :0]         select.c:445  UCX  ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1659623014.793877] [dcaldh001:7318 :0]         select.c:445  UCX  ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1659623014.793879] [dcaldh001:7308 :0]         select.c:445  UCX  ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(1149)..............:
MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed
In: PMI_Abort(1091215, Fatal error in PMPI_Init: Other MPI error, error stack


The workaround implemented for now is: export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/19.05.7/lib64/libpmi.so
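For reference, one way to apply this workaround consistently is to export the variable in the batch script before launching the job. This is only a sketch: the resource requests and the application name (my_mpi_app) are placeholders, and the library path is the one quoted above for this site's Slurm 19.05 install.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Point Intel MPI at Slurm's PMI library so srun can bootstrap the ranks.
# Adjust the path to match the local Slurm installation.
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/19.05.7/lib64/libpmi.so

# my_mpi_app is a placeholder for the actual application binary.
srun ./my_mpi_app
```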

The error messages above appeared after implementing the workaround on the development cluster. However, after applying the workaround on the production nodes, no such errors are observed and the job completes successfully.

We were advised to check with ucx_info. We were able to get the information on our development cluster, but unfortunately got no information on the production cluster.
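For comparison between the two clusters, the standard UCX diagnostic commands below list what UCX was built with and which devices/transports it detects on a node; a "no active messages transport ... self/memory" error like the one above typically means UCX found no usable transport beyond self and shared memory on that node. These commands are run on each node in question and their output is environment-dependent.

```shell
# UCX version and build configuration
ucx_info -v

# Devices and transports UCX detects on this node
# (e.g. self, shared memory, InfiniBand HCAs)
ucx_info -d
```

Comparing the `ucx_info -d` output from a development node against a production node should show whether the same transports are available on both.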

Please let us know your suggestions. If possible, please let me know whether a meeting can be scheduled so that our team and your team can troubleshoot this together.

Thank You
Shraddha
Comment 2 Jason Booth 2022-08-04 09:55:40 MDT
Issues logged against a test/development cluster should always be logged as severity 4. Severity should reflect the impact on production.

Regarding the issue you are experiencing: support for 19.05 ended with the release of 20.02 in 2020. Since then we have released 20.11, 21.08, and 22.05. Our support for that version is limited; you will need to upgrade to a supported version before asking us to look into this further. It is simply infeasible for us to support a version that is three years old.


Furthermore, a quick search suggests that the OFI community no longer supports libfabric over UCX: OFI removed the Mellanox (UCX-based) provider in a later release.

More info here:

https://github.com/openucx/ucx/issues/4742