| Summary: | Bad MPI job placement with IntelMPI on Slurm 21.08.5 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Francesco De Martino <fdm> |
| Component: | Scheduling | Assignee: | Director of Support <support> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | mmelato, nick, schedmd-contacts |
| Version: | 21.08.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=16173 | ||
| Site: | DS9 (PSLA) | | |
| Attachments: | Config and job output | | |
Description
Francesco De Martino
2022-02-18 12:05:12 MST
Jason Booth:

This is a duplicate of bug#13351. We are planning to address this in 21.08.6, which should be out sometime next week.

*** This ticket has been marked as a duplicate of ticket 13351 ***

Francesco De Martino:

Thanks Jason for your quick reply. Can you help us understand how to avoid running into this issue on Slurm 21.08.5 until the updated version is available? Is there a job submission approach we should prefer that is not affected by it?

I went through the duplicate bug you linked and I have the following questions:

* When was this issue introduced? Are all Slurm 21.08.x versions affected?
* What MPI libraries are affected?
* What MPI submission options are affected? Is this happening only when specifying threads per core?
* In the issue I reported I also face a different error. When submitting the job `sbatch -N 2 -n 4 osu_all_reduce.sh` (without pinning threads per core), I get the error below when the MPI application within the job is started with `srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce`. Can you help me understand what is going on?

```
+ srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce
[2] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
[0] MPI startup(): libfabric version: 1.13.0-impi
Abort(2664079) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138).........:
MPID_Init(1161)...............:
MPIDI_SHMI_mpi_init_hook(29)..:
MPIDI_POSIX_mpi_init_hook(131):
MPIDI_POSIX_eager_init(2523)..:
MPIDU_shm_seg_commit(296).....: unable to allocate shared memory
slurmstepd: error: *** STEP 26.0 ON queue1-dy-compute-resource1-1 CANCELLED AT 2022-02-21T11:58:38 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: queue1-dy-compute-resource1-2: task 3: Killed
srun: error: queue1-dy-compute-resource1-1: tasks 0-2: Killed
```

One more question: I'm also seeing an additional difference between 20.11 and 21.08, which might be expected, but I want to confirm. If I submit the job `sbatch -N 2 -n 4 job.sh`, where job.sh is:

```
#!/bin/bash
module load intelmpi
export I_MPI_DEBUG=10
mpirun -np 4 -rr intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce
```

I get two different behaviors:

* In 20.11 Slurm distributes 2 tasks on each node and MPI starts correctly.
* In 21.08 Slurm distributes 3 tasks on the first node and 1 on the second, while MPI starts 2 tasks on each node and the two tasks started on the second node end up on the same core.

Is this expected, both the change in default behavior and the way mpirun distributes the processes, which is not aligned with the Slurm reservation? Note that without the `-rr` flag the job runs just fine.
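For reference, osu_all_reduce.sh itself is not attached to this ticket. A minimal reconstruction from the commands quoted above might look like the following; the `module load` and `I_MPI_DEBUG` lines are assumptions carried over from job.sh, and only the `srun` line is taken verbatim from the log above.

```
#!/bin/bash
# Hypothetical reconstruction of osu_all_reduce.sh, submitted with:
#   sbatch -N 2 -n 4 osu_all_reduce.sh
module load intelmpi      # assumed, mirroring job.sh above
export I_MPI_DEBUG=10     # assumed, mirroring job.sh above
# Launch the benchmark through srun with the PMI2 plugin, as shown in the job output.
srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce
```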
Jason Booth:

All - just a quick update here regarding this issue. I will have an engineer follow up with you with more details a little later this week. The issue you reported here is very similar to bug#13351, and the errors you have reported do seem to indicate this:

> MPIDU_shm_seg_commit(296).....: unable to allocate shared memory

Barring any major issues, we are tentatively targeting Thursday for the release of 21.08.6, which will include the fix for this issue.

> * When was this issue introduced? Are all Slurm 21.08.x versions affected?

The issue itself has been around for some time, but it was not noticeable until we fixed another issue in 21.08.5. It was not until we landed the commit below that the problem we are now actively working on was exposed:

> https://github.com/SchedMD/slurm/commit/6e13352fc2
> "Fix srun -c and --threads-per-core imply --exact"

> * What MPI libraries are affected?

The issue has more to do with how Slurm handles --threads-per-core and --exact and allocates memory for each rank/step. Some more details are in Friday's update, bug#13351 comment#49. As mentioned above, I will have an engineer reach out to you with more details regarding your questions.

Would you be able to either revert commit 6e13352fc2 or try 21.08.4, to see if you can duplicate the issues you are seeing without that commit?
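For sites that build Slurm from source, reverting that commit on top of 21.08.5 might look roughly like the following. This is a sketch only: the checkout tag and configure options are assumptions to adapt to the local installation; the commit hash is the one referenced in this ticket.

```
# Sketch: rebuild 21.08.5 with commit 6e13352fc2 reverted (adjust tags/paths for your site).
git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-21-08-5-1        # assumed tag name for the deployed 21.08.5 release
git revert --no-edit 6e13352fc2     # back out "Fix srun -c and --threads-per-core imply --exact"
./configure --prefix=/opt/slurm --sysconfdir=/etc/slurm   # assumed install paths
make -j && sudo make install
```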
Francesco De Martino:

(In reply to Jason Booth from comment #7)
> Would you be able to either revert commit 6e13352fc2 or try 21.08.4, to see
> if you can duplicate the issues you are seeing without that commit?

I tested with Slurm 21.08.4 and here is what I found out:

1. The issue with multiple processes being pinned to the same core when using the -c option does not seem to reproduce.
2. I'm still facing the issue below:

> In the issue I reported I also face a different error. When submitting the
> job `sbatch -N 2 -n 4 osu_all_reduce.sh` (without pinning threads per core),
> I get the error below when the MPI application within the job is started with
> `srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce`.
> Can you help me understand what is going on?

```
+ srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce
[2] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
[0] MPI startup(): libfabric version: 1.13.0-impi
Abort(2664079) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138).........:
MPID_Init(1161)...............:
MPIDI_SHMI_mpi_init_hook(29)..:
MPIDI_POSIX_mpi_init_hook(131):
MPIDI_POSIX_eager_init(2523)..:
MPIDU_shm_seg_commit(296).....: unable to allocate shared memory
slurmstepd: error: *** STEP 26.0 ON queue1-dy-compute-resource1-1 CANCELLED AT 2022-02-21T11:58:38 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: queue1-dy-compute-resource1-2: task 3: Killed
srun: error: queue1-dy-compute-resource1-1: tasks 0-2: Killed
```

3. I'm still noticing the difference in task placement between 20.11 and 21.08 that I reported in comment 5 - https://bugs.schedmd.com/show_bug.cgi?id=13474#c5

Michael:

Hi Francesco,

(In reply to Francesco De Martino from comment #8)
> I tested with Slurm 21.08.4 and here is what I found out:
>
> 1. The issue with multiple processes being pinned to the same core
> when using the -c option does not seem to reproduce.

Ok, then it appears that this issue is a duplicate of bug 13351. This will be fixed in 21.08.6. As a workaround, revert commit 6e13352fc2.

> 2. I'm still facing the issue below:
> ...
> MPIDU_shm_seg_commit(296).....: unable to allocate shared memory

I am not quite sure what is going on here, but it appears to be a separate issue. Does this happen on both 21.08 and 20.11? If so, then this might be something to take up with Intel MPI directly. If not, could you please create an isolated reproducer on 21.08.4 and open a new ticket for it? This will help to not conflate the two issues here. Thanks!

For what it's worth, somebody on slurm-users hit this once with Intel MPI: https://groups.google.com/g/slurm-users/c/4pMPL-zQtzU/m/RUd78W8YBwAJ

"The MPI needs exclusive access to the interconnect. Cray once provided a workaround, but that was not worth to implement (terrible effort/gain for us). Conclusion: You might have to live with this limitation."

> 3. I'm still noticing the difference in task placement between 20.11 and
> 21.08 that I reported in comment 5 -
> https://bugs.schedmd.com/show_bug.cgi?id=13474#c5

Yes, task placement is different, but that is not necessarily a bug. Without constraining --ntasks-per-node, there is no guarantee that the number of tasks per node will always be 2.

Thanks,
-Michael

Francesco De Martino:

Opened a separate ticket for the memory allocation problem: https://bugs.schedmd.com/show_bug.cgi?id=13495

This problem does not occur on 20.11, or when submitting the same job with mpirun or mpiexec. However, srun is the recommended submission option according to https://slurm.schedmd.com/mpi_guide.html.

Michael:

Hi Francesco,

This issue has been resolved in bug 13351, and the fixes are included in 21.08.6, which was released today. See bug 13351 comment 70 and bug 13351 comment 76.

I'm going to go ahead and mark this as a duplicate of bug 13351. Let me know if you have any questions.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 13351 ***
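Following up on Michael's note about task placement: if matching the 20.11 layout (2 tasks per node) matters, the per-node task count can be pinned explicitly at submission time rather than left to the scheduler. A minimal sketch using the submission from this ticket; only the added --ntasks-per-node option is new, everything else is as reported above.

```
# Ask for exactly 2 tasks on each of the 2 nodes instead of letting
# Slurm choose a 3+1 split.
sbatch -N 2 -n 4 --ntasks-per-node=2 osu_all_reduce.sh
```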