| Summary: | Intel MPI jobs started giving errors after upgrading Slurm | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Naveed Near-Ansari <naveed> |
| Component: | Other | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 22.05.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=13474, https://bugs.schedmd.com/show_bug.cgi?id=13495, https://bugs.schedmd.com/show_bug.cgi?id=14600 | | |
| Site: | Caltech | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | etc, config.log and job output | | |
Besides Jason's comments: Naveed Near-Ansari, have you compiled and installed libpmi2 from Slurm? Just go into the contribs/pmi2 directory and run "make -j install", then try again. Slurm's libpmi2 is necessary for Intel MPI to use srun's launch facilities. (I think you're using srun --mpi=pmi2, which in theory means you have it installed?)
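To make that concrete, a minimal sketch of the build step (the source-tree path is an assumption; the library installs under the prefix from the configure line quoted below):

cd /path/to/slurm-22.05.6/contribs/pmi2   # assumed location of the Slurm source tree
make -j install                           # builds and installs libpmi2 under the configured prefix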
What output do you get from 'srun --mpi=list'?
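For reference, on a build with PMI2 support the list should include pmi2; the output looks roughly like this (a hypothetical example; the exact header wording and plugin list depend on how Slurm was built):

srun: MPI types are...
srun: none
srun: pmi2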
Do you have enough shared memory in the system for the number of tasks you're running on the node?
[36] MPI startup(): shm segment size (555 MB per rank) * (4 local ranks) = 2221 MB total
[10] MPI startup(): shm segment size (452 MB per rank) * (5 local ranks) = 2263 MB total
[40] MPI startup(): shm segment size (555 MB per rank) * (4 local ranks) = 2221 MB total
[60] MPI startup(): shm segment size (555 MB per rank) * (4 local ranks) = 2221 MB total
[15] MPI startup(): shm segment size (452 MB per rank) * (5 local ranks) = 2263 MB total
[48] MPI startup(): shm segment size (555 MB per rank) * (4 local ranks) = 2221 MB total
[52] MPI startup(): shm segment size (555 MB per rank) * (4 local ranks) = 2221 MB total
[44] MPI startup(): shm segment size (555 MB per rank) * (4 local ranks) = 2221 MB total
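One quick way to check is on a compute node while a job is running; a sketch, assuming the default tmpfs mount backing POSIX shared memory:

df -h /dev/shm   # compare available space against the per-node totals above
ipcs -m          # look for stale shared-memory segments left over from failed runs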
I also see you're using Intel MPI Update 5.
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
Did you also upgrade the Intel MPI version during the Slurm upgrade process?
I see you have InfiniBand, not Omni-Path, right?
When you say this:
> This was not happening before rebuilding Slurm. One change I made on the rebuild was removing PMIx from the local system before building, based on advice I had gotten at SC, so that it would use internal libraries.
Who is supposed to use the "internal" libraries? Slurm doesn't ship its own PMIx, though OpenMPI >= 5 now does. I suspect that advice was for OpenMPI + Slurm; it doesn't apply here, because Intel MPI doesn't support PMIx.
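For completeness, the usual way to point Intel MPI at Slurm's PMI2 library under srun is the I_MPI_PMI_LIBRARY environment variable; a sketch, with the library path assumed from the install prefix in this report and a hypothetical binary name:

export I_MPI_PMI_LIBRARY=/central/slurm/install/22.05.6/lib/libpmi2.so   # assumed install path
srun --mpi=pmi2 ./mpi_app                                                # hypothetical application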
Thanks
Hi, for now I am marking this as timed out. Reopen if needed! Thanks
Created attachment 29116 [details] etc, config.log and job output

We upgraded Slurm in December and started getting complaints about Intel MPI jobs using srun. Other things happened at the same time, so we were focusing on OFED and the IB side of things, but all of that has been validated, and the failure seems to happen specifically with srun. Intel MPI launched with mpirun works fine; I am not sure why srun is failing with it. It gives errors like this:

Abort(1090575) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143)...:
MPID_Init(1286).........:
MPIDU_Init_shm_init(180): unable to allocate shared memory

Slurm was built with these options:

./configure --prefix=/central/slurm/install/22.05.6 --sysconfdir=/central/slurm/etc --enable-slurmrestd

I will attach output from the jobs, including one run with MPI debugging enabled.

This was not happening before rebuilding Slurm. One change I made on the rebuild was removing PMIx from the local system before building, based on advice I had gotten at SC, so that it would use internal libraries.
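A sketch of how such a debug run might be launched (I_MPI_DEBUG controls Intel MPI's startup verbosity and produces "MPI startup()" lines like those quoted above; the task count and binary name are hypothetical):

export I_MPI_DEBUG=5            # verbose Intel MPI startup diagnostics
srun --mpi=pmi2 -n 64 ./mpi_app # hypothetical task count and application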