Ticket 14600

Summary: IntelMPI 2021 update 6 issues with slurm
Product: Slurm Reporter: Erin Boland <erin.k.boland>
Component: Other Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 21.08.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=13495
https://bugs.schedmd.com/show_bug.cgi?id=16173
Site: Raytheon Missile, Space and Airborne
Linux Distro: RHEL Machine Name:
CLE Version: Version Fixed: N/A
Target Release: --- DevPrio: ---

Description Erin Boland 2022-07-22 16:44:07 MDT
Using Intel MPI 2021 update 6 with slurm and seeing: 

 In: PMI_Abort(2664079, Fatal error in PMPI_Init_thread: Other MPI error, error stack:
  7 MPIR_Init_thread(138).........:
  6 MPID_Init(1117)...............:
  5 MPIDI_SHMI_mpi_init_hook(29)..:
  4 MPIDI_POSIX_mpi_init_hook(141):
  3 MPIDI_POSIX_eager_init(2268)..:
  2 MPIDU_shm_seg_commit(296).....: unable to allocate shared memory)
Comment 1 Felip Moll 2022-07-24 01:27:07 MDT
(In reply to Erin Boland from comment #0)
> Using Intel MPI 2021 update 6 with slurm and seeing: 
> 
>  In: PMI_Abort(2664079, Fatal error in PMPI_Init_thread: Other MPI error,
> error stack:
>   7 MPIR_Init_thread(138).........:
>   6 MPID_Init(1117)...............:
>   5 MPIDI_SHMI_mpi_init_hook(29)..:
>   4 MPIDI_POSIX_mpi_init_hook(141):
>   3 MPIDI_POSIX_eager_init(2268)..:
>   2 MPIDU_shm_seg_commit(296).....: unable to allocate shared memory)

Can you please try to set this environment variable before running the job?

export I_MPI_PMI_LIBRARY=/path_to_slurm/lib/libpmi2.so

and try again?
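
For reference, a minimal sketch of the suggestion above, assuming Slurm's PMI2 library is installed under <prefix>/lib/libpmi2.so. The helper name set_pmi2_library and the prefix argument are hypothetical, and /path_to_slurm remains a placeholder for the real install location. The helper only exports I_MPI_PMI_LIBRARY when the library file is actually readable, so a wrong path shows up before the job is launched.

```shell
# Hypothetical helper: point Intel MPI at Slurm's PMI2 library.
# $1 is the Slurm install prefix (an assumption; adjust to your site).
set_pmi2_library() {
    _pmi2_lib="$1/lib/libpmi2.so"
    if [ -r "$_pmi2_lib" ]; then
        # Library exists and is readable: export it for the MPI runtime.
        export I_MPI_PMI_LIBRARY="$_pmi2_lib"
        return 0
    fi
    # Library missing or unreadable: leave the environment untouched.
    echo "warning: $_pmi2_lib not found or not readable" >&2
    return 1
}

# Example usage (placeholder path and application name):
#   set_pmi2_library /path_to_slurm && srun --mpi=pmi2 ./my_mpi_app
```

The usage line pairs the variable with srun --mpi=pmi2 so that the PMI version requested from Slurm matches the library that Intel MPI loads.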
Comment 2 Erin Boland 2022-07-25 11:17:39 MDT
Hi Felip,

I already had that environment variable set for this run.

Erin
Comment 4 Erin Boland 2022-07-26 09:47:39 MDT
We got a fix - going to close the bug.
Comment 5 Felip Moll 2022-07-26 10:34:44 MDT
(In reply to Erin Boland from comment #4)
> We got a fix - going to close the bug.

Hi Erin,

Can you explain exactly what the fix was? That could be useful in the future.
Comment 6 Felip Moll 2022-07-28 07:54:24 MDT
I am marking the bug as infogiven.

If possible, I would appreciate some information about how you fixed the issue. It would be a great help to us for future responses and diagnostics.

Thanks!!