Hi,

We have recently switched to SLURM. I am also new to SLURM. When I run an MPI job with srun and PMIx, I get the following error:

=====
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that PMIx stopped checking at the first component that it did not find.

Host: node01
Framework: psec
Component: munge
=====

Munge is running on the nodes. I could build PMIx v4 myself if this is recommended anyway (and fixes the error).

Best regards,
Michael

System (configless):

-r-------- 1 munge munge 1024 Mar 19 19:31 /etc/munge/munge.key

# cat /etc/slurm/mpi.conf
PMIxDirectConnUCX=true
PMIxEnv=PMIX_MCA_psec=none

rpmbuild --clean --define "%_topdir $(pwd)/build" --define "_with_nvml --with-nvml=/usr/local/cuda-12.8" --with slurmrestd --with pmix --with ucx -ta "slurm-24.05.6.tar.bz2"

hwloc-2.4.1-5.el9.x86_64
pmix-3.2.3-5.el9.x86_64
munge-0.5.13-13.el9.x86_64
slurm-24.05.6-1.el9.x86_64
ucx-1.18.0-1.2410068.x86_64
Hi, We recommend that any application compiled against PMIx use the same PMIx that Slurm is using. I recommend checking the PMIx section of the MPI guide [1]. The notes at the bottom of that section should be relevant to your case. Kind regards, Joel [1] https://slurm.schedmd.com/mpi_guide.html#pmix
Hi Joel, We use the PMIx v3 packaged with Rocky 9.5 for all installations. As far as I know, they should all use Munge. I already tried psec=native, but then it hangs indefinitely. I think we'll stick with PMIX_MCA_psec=^munge until I get around to recompiling everything with PMIx v4. Best regards, Michael
Hi Joel, Sorry, I have found the problem. I used an EasyBuild module, and that loaded a newer PMIx, v4.2.4. /usr/share/pmix/pmix_client and other software then use the newer PMIx (built without Munge). This could be a general problem, as I can't be sure which PMIx is loaded by EasyBuild. And I can't compile Slurm against one specific version of PMIx, as it might be different for each piece of software. I could go for PMIx 4.2.4 and hope that at least 4.2.4 and later work. I believe LUMI and CSCS also use EasyBuild. I will have a look at v4.2.4 and probably rebuild Slurm with this version. Best regards, Michael
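A mismatch like the one described above can often be spotted by checking which libpmix a binary resolves once the modules are loaded. The following is a minimal sketch, not a definitive check: the helper name `pmix_lib_from_ldd` and the application name `my_mpi_app` are hypothetical, and it assumes the application pulls in libpmix as a shared library dependency visible to `ldd`, which will not hold for every MPI build.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the path
# that the dynamic linker resolves for libpmix, if any.
pmix_lib_from_ldd() {
    awk '/libpmix\.so/ && $2 == "=>" { print $3 }'
}

# Intended usage on a node, after loading the same modules as the job:
#   ldd "$(command -v my_mpi_app)" | pmix_lib_from_ldd
#
# A path under the EasyBuild software tree instead of the system library
# directory would suggest the application is not using the PMIx that
# Slurm was built against.
```

Running this before and after loading the EasyBuild module shows whether the module changes the resolved library.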
Hi, I'm glad you could find the root of the issue. I will proceed to close the ticket then. Kind regards, Joel
Joel, Before you close this ticket: are there any recommendations from SchedMD? I could recompile every PMIx with Munge support, but users would need to be told to do the same if they bring their own software, containers, conda environments, or EasyBuild installations. Most will probably just use the default configuration, which is without Munge. But if I understand it correctly, this is the only way, since we have to use “at least a PMIx with the same security domain”? In our case, this is Munge. Or should we move to one of the newer methods Slurm/SACK/JWT? Best regards, Michael
Hi, You can compile Slurm against several PMIx versions. You could do that for every PMIx installation you offer system-wide. The guide [1] provides information about that. Of course, if a user decides to use a custom PMIx without Munge, the issue might resurface. If you think your users are likely to do that, switching to auth/slurm might be a good solution. You can find information on how to configure auth/slurm here [2]. Kind regards, Joel [1] https://slurm.schedmd.com/mpi_guide.html#pmix [2] https://slurm.schedmd.com/authentication.html#slurm
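As a rough illustration of building against several PMIx versions: Slurm's configure script accepts a colon-separated list of PMIx prefixes in `--with-pmix`, as described in the MPI guide. The sketch below only assembles that flag; the `/opt/pmix/...` prefixes are hypothetical placeholders for wherever each PMIx is installed, and the commented `srun` lines assume a cluster where the resulting plugins are present.

```shell
# Hypothetical install prefixes, one per PMIx major version offered
PMIX_PREFIXES="/opt/pmix/v3 /opt/pmix/v4 /opt/pmix/v5"

# Join the prefixes with ':' to form the configure flag
PMIX_FLAG="--with-pmix=$(echo "$PMIX_PREFIXES" | tr ' ' ':')"

# Print the configure invocation instead of running it here
echo "./configure $PMIX_FLAG"

# After building and installing, on the cluster:
#   srun --mpi=list              # lists the available MPI plugins
#   srun --mpi=pmix_v4 ./my_app  # selects one version (my_app is hypothetical)
```

The per-version plugin name (for example `pmix_v4`) should be taken from the `srun --mpi=list` output rather than assumed.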
Hi Joel, Thank you! I will try the "single key setup" as it looks like a simple 1:1 replacement for Munge. I'm hoping this will solve some of the issues in the long run. Anyway, I'll probably move on to creating PMIx v3, v4 and v5 for Slurm manually. I think I know enough to get started on that. You can close this ticket. I'll create a new ticket if I run into any new issues/questions. Best regards, Michael
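For orientation, the "single key setup" for auth/slurm might look roughly like the following, based on the authentication documentation linked earlier in the thread. This is a sketch under assumptions: it writes the key to a scratch directory (`KEY_DIR` is a hypothetical variable) so it can be tried without root, whereas the real key would live at /etc/slurm/slurm.key, be owned by the Slurm user, and be identical on all nodes. The exact option names should be confirmed against the documentation for the installed Slurm version.

```shell
# Scratch location for trying this out; hypothetical, not the real path
KEY_DIR="${KEY_DIR:-$(mktemp -d)}"
KEY_FILE="$KEY_DIR/slurm.key"

# Generate a 1024-byte random key and restrict it to the owner
dd if=/dev/urandom of="$KEY_FILE" bs=1024 count=1 2>/dev/null
chmod 600 "$KEY_FILE"

# On the real system, as root:
#   chown slurm:slurm /etc/slurm/slurm.key
#
# Settings that would then replace auth/munge:
#   slurm.conf:    AuthType=auth/slurm
#                  CredType=cred/slurm
#   slurmdbd.conf: AuthType=auth/slurm
```

The daemons need to be restarted after such a change, and the key has to be distributed to every node by whatever mechanism the site already uses for munge.key.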