Ticket 22432 - Munge PMIx srun psec "requested component was not found"
Summary: Munge PMIx srun psec "requested component was not found"
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: PMIx
Version: 25.05.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Joel Criado
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-03-25 11:51 MDT by Michael Janczyk
Modified: 2025-03-28 09:02 MDT

See Also:
Site: UFR
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Michael Janczyk 2025-03-25 11:51:24 MDT
Hi,

We have recently switched to SLURM. I am also new to SLURM. When I run an MPI job with srun and PMIx, I get the following error:
=====
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.
Host:      node01
Framework: psec
Component: munge
=====

Munge is running on the nodes. I could build PMIx v4 myself if this is recommended anyway (and fixes the error).

Best regards,
Michael

System (configless):

-r-------- 1 munge munge 1024 Mar 19 19:31 /etc/munge/munge.key

# cat /etc/slurm/mpi.conf 
PMIxDirectConnUCX=true
PMIxEnv=PMIX_MCA_psec=none

rpmbuild --clean --define "%_topdir $(pwd)/build" --define "_with_nvml --with-nvml=/usr/local/cuda-12.8" --with slurmrestd --with pmix --with ucx -ta "slurm-24.05.6.tar.bz2"

hwloc-2.4.1-5.el9.x86_64
pmix-3.2.3-5.el9.x86_64
munge-0.5.13-13.el9.x86_64
slurm-24.05.6-1.el9.x86_64
ucx-1.18.0-1.2410068.x86_64
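
For reference, I assume the relevant checks would be something like the following (the plugin file name and path are my assumption for this RPM-based install):

# srun --mpi=list
# ldd /usr/lib64/slurm/mpi_pmix_v3.so | grep -i pmix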
Comment 2 Joel Criado 2025-03-26 02:55:28 MDT
Hi,

We recommend that any application compiled against PMIx use the same PMIx that Slurm is using. I recommend checking the PMIx section of our MPI guide [1]. There are some notes at the bottom that should be relevant to your case.
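
A quick way to verify the match is to compare which libpmix your application resolves with the one the Slurm PMIx plugin is linked against; for example (the application path below is only a placeholder, and the plugin path assumes an RPM-based install):

ldd /path/to/your_mpi_application | grep libpmix
ldd /usr/lib64/slurm/mpi_pmix_v3.so | grep libpmix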

Kind regards,
Joel

[1] https://slurm.schedmd.com/mpi_guide.html#pmix
Comment 3 Michael Janczyk 2025-03-26 11:33:46 MDT
Hi Joel,

We use Rocky 9.5's PMIx v3 for all installations. As far as I know, they should all use munge. I already tried psec=native, but then it hangs indefinitely. I think we'll stick with PMIX_MCA_psec=^munge until I get around to recompiling everything with PMIx v4.
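
Concretely, the workaround I mean is this in mpi.conf (excluding only the munge psec component rather than disabling psec entirely):

PMIxDirectConnUCX=true
PMIxEnv=PMIX_MCA_psec=^munge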

Best regards,
Michael
Comment 4 Michael Janczyk 2025-03-26 12:02:15 MDT
Hi Joel,

Sorry, I have found the problem. I used an EasyBuild module, and that loaded a newer PMIx, v4.2.4. /usr/share/pmix/pmix_client and other software then use the newer PMIx (without Munge).

OK, this could be a general problem, as I can't be sure which PMIx is loaded by EasyBuild. And I can't compile SLURM against one specific PMIx version, as it might differ for each piece of software. I could go for PMIx 4.2.4 and hope that at least 4.2.4+ works. I believe LUMI and CSCS also use EasyBuild.

I will have a look at v4.2.4 and probably rebuild Slurm with this version.
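
If the spec forwards a _with_pmix define to configure the same way it does for NVML (that part is my assumption), I would rebuild along these lines, with /opt/pmix/4.2.4 as a placeholder install prefix:

rpmbuild --clean --define "%_topdir $(pwd)/build" --define "_with_pmix --with-pmix=/opt/pmix/4.2.4" --define "_with_nvml --with-nvml=/usr/local/cuda-12.8" --with slurmrestd --with pmix --with ucx -ta "slurm-24.05.6.tar.bz2"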

Best regards,
Michael
Comment 5 Joel Criado 2025-03-28 02:03:37 MDT
Hi,

I'm glad you could find the root of the issue.

I will proceed to close the ticket then.

Kind regards,
Joel
Comment 6 Michael Janczyk 2025-03-28 07:27:56 MDT
Joel,

Before you close this ticket: are there any recommendations from SchedMD?
I could recompile all PMIx installations with Munge, but users would need to be informed to do the same if they bring their own software (containers, Conda, EasyBuild). Most will probably just use the default configuration, which is without Munge.
But if I understand it correctly, this is the only way, since we have to use "at least a PMIx with the same security domain"? In our case, that is Munge.
Or should we move to one of the newer methods (Slurm/SACK/JWT)?

Best regards,
Michael
Comment 7 Joel Criado 2025-03-28 08:45:03 MDT
Hi,

You can compile Slurm against several PMIx versions. You could do that with every PMIx installation you offer system-wide. The guide [1] provides info about that.
Of course, if some user decides to use a custom one without munge, the issue might resurface. If you think your users are likely to do that, pivoting to auth/slurm might be a good solution. You can find the info on how to configure auth/slurm here [2].
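
For reference, when building from source, configure accepts a colon-separated list of PMIx installations, roughly like this (the install prefixes below are just examples); srun --mpi=list should then show one pmix plugin per version:

./configure --with-pmix=/opt/pmix/3.2.3:/opt/pmix/4.2.4
srun --mpi=list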

Kind regards,
Joel

[1] https://slurm.schedmd.com/mpi_guide.html#pmix
[2] https://slurm.schedmd.com/authentication.html#slurm
Comment 8 Michael Janczyk 2025-03-28 09:02:25 MDT
Hi Joel,

Thank you! I will try the "single key setup" as it looks like a simple 1:1 replacement for Munge. I'm hoping this will solve some of the issues in the long run. Anyway, I'll probably also move on to building PMIx v3, v4, and v5 for Slurm manually.
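
From the auth/slurm docs [2], my understanding is that the single key setup boils down to roughly the following (key path, ownership, and permissions are what I take from the guide, so treat this as a sketch); the same key then gets distributed to all nodes:

dd if=/dev/random of=/etc/slurm/slurm.key bs=1024 count=1
chown slurm:slurm /etc/slurm/slurm.key
chmod 600 /etc/slurm/slurm.key

# slurm.conf:
AuthType=auth/slurm
CredType=cred/slurm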

I think I know enough to get started on that. You can close this ticket. I'll create a new ticket if I run into any new issues/questions.

Best regards,
Michael