Summary: | Munge PMIx srun psec "requested component was not found" | ||
---|---|---|---|
Product: | Slurm | Reporter: | Michael Janczyk <michael.janczyk> |
Component: | PMIx | Assignee: | Joel Criado <joel> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | ||
Version: | 25.05.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | UFR | ||
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Description
Michael Janczyk
2025-03-25 11:51:24 MDT
Hi,

We recommend that any application compiled against PMIx use the same PMIx that Slurm is using. I recommend checking the PMIx section of our MPI guide [1]. There are some notes at the bottom that should be relevant to your case.

Kind regards,
Joel

[1] https://slurm.schedmd.com/mpi_guide.html#pmix

Hi Joel,

We use Rocky 9.5's PMIx v3 for all installations. As far as I know, they should all use Munge. I already tried psec=native, but then it hangs indefinitely. I think we'll stick with PMIX_MCA_psec=^munge until I get around to recompiling everything with PMIx v4.

Best regards,
Michael

Hi Joel,

Sorry, I have found the problem. I used an EasyBuild module, and that loaded a newer PMIx, v4.2.4. /usr/share/pmix/pmix_client and other software then use the newer PMIx (built without Munge).

This could be a general problem, as I can't be sure which PMIx is loaded by EasyBuild. And I can't compile Slurm against one specific version of PMIx, as it might be different for each piece of software. I could go for PMIx 4.2.4 and hope that at least 4.2.4+ works. I believe LUMI and CSCS also use EasyBuild. I will have a look at v4.2.4 and probably rebuild Slurm with this version.

Best regards,
Michael

Hi,

I'm glad you could find the root of the issue. I will proceed to close the ticket then.

Kind regards,
Joel

Joel,

Before you close this ticket: are there any recommendations from SchedMD? I could recompile all PMIx installations with Munge, but users would need to be told to do the same if they are using their own software or containers, conda, or EasyBuild. Most will probably just use the default configuration, which is without Munge. But if I understand it correctly, this is the only way, since we have to use “at least a PMIx with the same security domain”? In our case, this is Munge. Or should we move to one of the newer methods (Slurm/SACK/JWT)?

Best regards,
Michael

Hi,

You can compile Slurm against several PMIx versions. You could do that with every PMIx installation you offer system-wide. The guide [1] provides info about that. Of course, if a user decides to use a custom PMIx without Munge, the issue might resurface. If you think your users are likely to do that, pivoting to auth/slurm might be a good solution. You can find the info on how to configure auth/slurm here [2].

Kind regards,
Joel

[1] https://slurm.schedmd.com/mpi_guide.html#pmix
[2] https://slurm.schedmd.com/authentication.html#slurm

Hi Joel,

Thank you! I will try the "single key setup", as it looks like a simple 1:1 replacement for Munge. I'm hoping this will solve some of the issues in the long run. Either way, I'll probably move on to building PMIx v3, v4 and v5 for Slurm manually. I think I know enough to get started on that.

You can close this ticket. I'll create a new ticket if I run into any new issues or questions.

Best regards,
Michael
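
For anyone hitting the same symptom, the root cause found above (an environment module putting a second PMIx ahead of the system one) can be checked for with a small script. The following is only a rough sketch and not something from SchedMD or the reporter: the function name `find_pmix_libs` is made up for illustration, and it assumes the shadowing PMIx is reachable through `LD_LIBRARY_PATH`. Libraries resolved through RPATH or the system linker cache will not show up here.

```python
import os
import re
from pathlib import Path
from typing import List

# Matches libpmix.so, libpmix.so.2, libpmix.so.2.2.35, and so on.
_PMIX_LIB = re.compile(r"^libpmix\.so(\.\d+)*$")


def find_pmix_libs(search_path: str) -> List[str]:
    """Return libpmix shared objects found along a path list, in search order.

    `search_path` uses the platform path separator (":" on Linux), the same
    format as LD_LIBRARY_PATH. Empty and non-existent entries are skipped.
    """
    hits: List[str] = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue
        directory = Path(entry)
        if not directory.is_dir():
            continue
        # Sort within a directory so the output is deterministic.
        for candidate in sorted(directory.iterdir()):
            if _PMIX_LIB.match(candidate.name):
                hits.append(str(candidate))
    return hits


if __name__ == "__main__":
    libs = find_pmix_libs(os.environ.get("LD_LIBRARY_PATH", ""))
    if not libs:
        print("No libpmix found on LD_LIBRARY_PATH; the system default is likely used.")
    for lib in libs:
        print(lib)
    if len({str(Path(lib).parent) for lib in libs}) > 1:
        print("Warning: more than one PMIx installation is on LD_LIBRARY_PATH.")
```

Running it once in a clean shell and once after loading the module in question should show whether the module introduces a different PMIx from the one Slurm was built against.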