Ticket 18991 - Broken MPI heterogeneous job using SLURM dynamic node configuration
Summary: Broken MPI heterogeneous job using SLURM dynamic node configuration
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: PMIx
Version: 23.11.3
Hardware: Linux
OS: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-02-15 00:10 MST by Denis Bertini
Modified: 2024-02-15 00:10 MST

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Denis Bertini 2024-02-15 00:10:06 MST
MPI heterogeneous jobs systematically fail in the initialisation phase when using a dynamic node configuration.

The typical error reads:

slurmstepd: error:  mpi/pmix_v4: _pmix_p2p_send_core: lxbk0746 [0]: pmixp_utils.c:410: Can't find address for host lxbk0754, check slurm.conf
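
For reference, a minimal reproduction sketch: a two-component heterogeneous launch under the PMIx plugin. The node/task counts and the ./mpi_hello binary are placeholders, not the exact commands used:

$ srun --mpi=pmix -N1 -n1 ./mpi_hello : -N1 -n1 ./mpi_hello

When the components land on dynamic nodes, the step fails during initialisation with the pmixp_utils.c error above.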


The Slurm mpi/pmix_v4 plugin (Slurm 23.11.3, PMIx 4.2.7) still tries to look up the host's IP address in slurm.conf, and for a dynamic node this lookup fails.
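
For context, a minimal sketch of the kind of dynamic-node setup involved, assuming the MaxNodeCount style of configuration; all names and sizes below are illustrative:

# slurm.conf: no NodeName entries for the dynamic nodes, only an upper bound
MaxNodeCount=64
PartitionName=dyn Nodes=ALL Default=YES

# each dynamic node registers itself when slurmd starts
$ slurmd -Z --conf "CPUs=16 RealMemory=64000"

Because dynamic nodes have no NodeName line in slurm.conf, a host-address lookup keyed on that file has nothing to resolve against.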

The tests were done using standard MPI codes as well as test case 38.7 from the official Slurm test suite:

https://github.com/SchedMD/slurm/blob/e29f486b261e34d1dea55636bf93bf6dde02c62c/testsuite/expect/test38.7#L348

Furthermore, PMIx communication works within a single het-group on dynamic nodes.

All MPI heterogeneous job test cases work with a static node configuration.
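
To illustrate the two observations above inside one heterogeneous job, a sketch with illustrative command forms:

$ srun --mpi=pmix --het-group=0 ./mpi_hello     # single het-group: works on dynamic nodes
$ srun --mpi=pmix --het-group=0,1 ./mpi_hello   # across het-groups: fails on dynamic nodes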