| Summary: | Bad MPI job placement with IntelMPI on Slurm 21.08.5 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Francesco De Martino <fdm> |
| Component: | Scheduling | Assignee: | Director of Support <support> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | mmelato, nick, schedmd-contacts |
| Version: | 21.08.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=16173 | ||
| Site: | DS9 (PSLA) | | |
| Attachments: | Config and job output | | |
Description
Francesco De Martino
2022-02-18 12:05:12 MST
Jason Booth:

This is a duplicate of bug#13351. We are planning to address this in 21.08.6, which should be out sometime next week.

*** This ticket has been marked as a duplicate of ticket 13351 ***

Francesco De Martino:

Thanks Jason for your quick reply. Can you help us understand how to avoid running into this issue on Slurm 21.08.5 until the updated version is available? Is there a job submission approach we should prefer that is not affected by it?

I went through the duplicate bug you linked and I have the following questions:

* When was this issue introduced? Are all Slurm 21.08.x versions affected?
* What MPI libraries are affected?
* What MPI submission options are affected? Is this happening only when specifying threads per core?
* In the issue I reported I also face a different error. When submitting the job `sbatch -N 2 -n 4 osu_all_reduce.sh` (without pinning threads per core), I get the error below when the MPI application within the job is started with `srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce`. Can you help me understand what is going on?

```
+ srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce
[2] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
[0] MPI startup(): libfabric version: 1.13.0-impi
Abort(2664079) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138).........:
MPID_Init(1161)...............:
MPIDI_SHMI_mpi_init_hook(29)..:
MPIDI_POSIX_mpi_init_hook(131):
MPIDI_POSIX_eager_init(2523)..:
MPIDU_shm_seg_commit(296).....: unable to allocate shared memory
slurmstepd: error: *** STEP 26.0 ON queue1-dy-compute-resource1-1 CANCELLED AT 2022-02-21T11:58:38 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: queue1-dy-compute-resource1-2: task 3: Killed
srun: error: queue1-dy-compute-resource1-1: tasks 0-2: Killed
```

One more question: I'm also seeing an additional difference between 20.11 and 21.08, which might be expected, but I want to confirm. If I submit the job `sbatch -N 2 -n 4 job.sh`, where job.sh is:

```
#!/bin/bash
module load intelmpi
export I_MPI_DEBUG=10
mpirun -np 4 -rr intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce
```

I get two different behaviors:

* In 20.11 Slurm distributes 2 tasks on each node and MPI starts correctly.
* In 21.08 Slurm distributes 3 tasks on the first node and 1 on the second, while MPI starts 2 tasks on each node and the two tasks started on the second node end up on the same core.

Is this expected, both the change in default behavior and the way mpirun distributes the processes, which is not aligned with the Slurm reservation? Note that without the `-rr` flag the job runs just fine.
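For reference, osu_all_reduce.sh itself is not attached to this ticket. A minimal reconstruction from the commands quoted above might look like the following; the `module load` and `I_MPI_DEBUG` lines are assumptions carried over from job.sh, and only the `srun` line is taken verbatim from the log above.

```
#!/bin/bash
# Hypothetical reconstruction of osu_all_reduce.sh, submitted with:
#   sbatch -N 2 -n 4 osu_all_reduce.sh
module load intelmpi      # assumed, mirroring job.sh above
export I_MPI_DEBUG=10     # assumed, mirroring job.sh above
# Launch the benchmark through srun with the PMI2 plugin, as shown in the job output.
srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce
```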
Jason Booth:

All - just a quick update here regarding this issue. I will have an engineer follow up with you with more details a little later this week. The issue you reported here is very similar to bug#13351, and the errors you have reported do seem to indicate this:

> MPIDU_shm_seg_commit(296).....: unable to allocate shared memory

Barring any major issues, we are tentatively targeting Thursday for the release of 21.08.6, which will include the fix for this issue.

> * When was this issue introduced? Are all Slurm 21.08.x versions affected?

The issue itself has been around for some time, but it was not noticeable until we fixed another issue in 21.08.5. It was not until we landed the commit below that the problem we are now actively working on was exposed:

> https://github.com/SchedMD/slurm/commit/6e13352fc2
> "Fix srun -c and --threads-per-core imply --exact"

> * What MPI libraries are affected?

The issue has more to do with how Slurm handles --threads-per-core and --exact and allocates memory for each rank/step. Some more details are in Friday's update, bug#13351 comment#49. As mentioned above, I will have an engineer reach out to you with more details regarding your questions.

Would you be able to either revert commit 6e13352fc2 or try 21.08.4, to see if you can duplicate the issues you are seeing without that commit?
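For sites that build Slurm from source, reverting that commit on top of 21.08.5 might look roughly like the following. This is a sketch only: the checkout tag and configure options are assumptions to adapt to the local installation; the commit hash is the one referenced in this ticket.

```
# Sketch: rebuild 21.08.5 with commit 6e13352fc2 reverted (adjust tags/paths for your site).
git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-21-08-5-1        # assumed tag name for the deployed 21.08.5 release
git revert --no-edit 6e13352fc2     # back out "Fix srun -c and --threads-per-core imply --exact"
./configure --prefix=/opt/slurm --sysconfdir=/etc/slurm   # assumed install paths
make -j && sudo make install
```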
Francesco De Martino:

(In reply to Jason Booth from comment #7)
> Would you be able to either revert commit 6e13352fc2 or try 21.08.4, to see
> if you can duplicate the issues you are seeing without that commit?

I tested with Slurm 21.08.4 and here is what I found out:

1. The issue with multiple processes being pinned to the same core when using the -c option does not seem to reproduce.
2. I'm still facing the issue below:

> In the issue I reported I also face a different error. When submitting the
> job `sbatch -N 2 -n 4 osu_all_reduce.sh` (without pinning threads per core),
> I get the error below when the MPI application within the job is started with
> `srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce`.
> Can you help me understand what is going on?

```
+ srun --mpi=pmi2 intelmpi/osu-micro-benchmarks-5.7.1/mpi/collective/osu_allreduce
[2] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
[0] MPI startup(): libfabric version: 1.13.0-impi
Abort(2664079) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138).........:
MPID_Init(1161)...............:
MPIDI_SHMI_mpi_init_hook(29)..:
MPIDI_POSIX_mpi_init_hook(131):
MPIDI_POSIX_eager_init(2523)..:
MPIDU_shm_seg_commit(296).....: unable to allocate shared memory
slurmstepd: error: *** STEP 26.0 ON queue1-dy-compute-resource1-1 CANCELLED AT 2022-02-21T11:58:38 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: queue1-dy-compute-resource1-2: task 3: Killed
srun: error: queue1-dy-compute-resource1-1: tasks 0-2: Killed
```

3. I'm still noticing the difference in task placement between 20.11 and 21.08 that I reported in comment 5 - https://bugs.schedmd.com/show_bug.cgi?id=13474#c5

Michael:

Hi Francesco,

(In reply to Francesco De Martino from comment #8)
> I tested with Slurm 21.08.4 and here is what I found out:
>
> 1. The issue with multiple processes being pinned to the same core
> when using the -c option does not seem to reproduce.

Ok, then it appears that this issue is a duplicate of bug 13351. This will be fixed in 21.08.6. As a workaround, revert commit 6e13352fc2.

> 2. I'm still facing the issue below:
> ...
> MPIDU_shm_seg_commit(296).....: unable to allocate shared memory

I am not quite sure what is going on here, but it appears to be a separate issue. Does this happen on both 21.08 and 20.11? If so, then this might be something to take up with Intel MPI directly. If not, could you please create an isolated reproducer on 21.08.4 and open a new ticket for it? This will help to not conflate the two issues here. Thanks!

For what it's worth, somebody on slurm-users hit this once with Intel MPI: https://groups.google.com/g/slurm-users/c/4pMPL-zQtzU/m/RUd78W8YBwAJ

"The MPI needs exclusive access to the interconnect. Cray once provided a workaround, but that was not worth to implement (terrible effort/gain for us). Conclusion: You might have to live with this limitation."

> 3. I'm still noticing the difference in task placement between 20.11 and
> 21.08 that I reported in comment 5 -
> https://bugs.schedmd.com/show_bug.cgi?id=13474#c5

Yes, task placement is different, but that is not necessarily a bug. Without constraining --ntasks-per-node, there is no guarantee that the number of tasks per node will always be 2.

Thanks,
-Michael

Francesco De Martino:

Opened a separate ticket for the memory allocation problem: https://bugs.schedmd.com/show_bug.cgi?id=13495

This problem does not occur on 20.11, or when submitting the same job with mpirun or mpiexec. However, srun is the recommended submission option according to https://slurm.schedmd.com/mpi_guide.html.

Michael:

Hi Francesco,

This issue has been resolved in bug 13351, and the fixes are included in 21.08.6, which was released today. See bug 13351 comment 70 and bug 13351 comment 76.

I'm going to go ahead and mark this as a duplicate of bug 13351. Let me know if you have any questions.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 13351 ***
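Following up on Michael's note about task placement: if matching the 20.11 layout (2 tasks per node) matters, the per-node task count can be pinned explicitly at submission time rather than left to the scheduler. A minimal sketch using the submission from this ticket; only the added --ntasks-per-node option is new, everything else is as reported above.

```
# Ask for exactly 2 tasks on each of the 2 nodes instead of letting
# Slurm choose a 3+1 split.
sbatch -N 2 -n 4 --ntasks-per-node=2 osu_all_reduce.sh
```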