| Summary: | Slurm 20.11 breaks mpiexec bindings | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Maxime Boissonneault <maxime.boissonneault> |
| Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED DUPLICATE | Severity: | 2 - High Impact |
| Version: | 20.11.0 | CC: | cinek, kaizaad, siegert |
| Hardware: | Linux | OS: | Linux |
| Site: | Simon Fraser University | | |
More information: CPU bindings are all messed up with Slurm 20.11 and OpenMPI:

```
[mboisson@cedar1 def-mboisson]$ cat test.sh
#!/bin/bash
mpiexec --map-by ppr:12:socket --bind-to core:overload-allowed hostname | sort | uniq -c
mpiexec -n 64 --report-bindings numactl --show | grep physcpubind | sort | uniq -c
[mboisson@cedar1 def-mboisson]$ sbatch --nodes=2 --time=1:00:00 --account=def-mboisson --mem=0 --ntasks=64 test.sh
[mboisson@cedar1 def-mboisson]$ cat slurm-57436990.out
24 cdr1311.int.cedar.computecanada.ca
12 cdr1313.int.cedar.computecanada.ca
32 physcpubind: 0
32 physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```
For comparison, here is the same test with Slurm 20.02 on Graham:

```
$ cat slurm-41965236.out | grep -v "socket"
24 gra535
24 gra66
32 physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
32 physcpubind: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```
See other comments on the OpenMPI GitHub: https://github.com/open-mpi/ompi-www/pull/342

It appears that setting the environment variable SLURM_WHOLE=1 resolves the problem, but there does not appear to be any documentation available that explains the effects of setting this variable:

What are the effects of always setting SLURM_WHOLE=1?
What functionality is not available when SLURM_WHOLE=1 is set?
Should SLURM_WHOLE=1 be set only if --exclusive is specified?

Maxime,

Thanks for opening the ticket, but it's actually a duplicate of Bug 10383. We're aware of the discussion under the pull request to openmpi-www updating the Slurm FAQ there; in fact, I'm the author of it.

Martin,

> But there does not appear to be any documentation available that explains the effects of setting this variable

Yes, this was missing, but it got fixed in Bug 10430.

> What functionality is not available when SLURM_WHOLE=1 is set?
> Should SLURM_WHOLE=1 be set only if --exclusive is specified?

The brief answer is that the variable follows the convention for srun input variables, and setting it is equivalent to the `srun --whole` option. It does not disable any functionality; it creates a step with access to all job resources, not only those requested for the step. I think you may find the discussion under Bug 10383 interesting and more detailed.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 10383 ***

Mmm, Marcin, then bug 10383 is mistitled. We are not using UCX and we encounter problems. This is why I created this one.

> Mmm, Marcin, then bug 10383 is mistitled. We are not using UCX and we encounter problems. This is why I created this one.

Understood. I'm happy you reached out to us and we were able to match them. The initial "errors" experienced in Bug 10383 were in fact different, but since they have the same root cause we prefer to merge them. It's one issue, but with different symptoms depending on the case.
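A minimal sketch of the workaround discussed above, assuming OpenMPI's mpiexec launches its helper processes through srun inside the allocation: exporting SLURM_WHOLE=1 (the environment-variable form of `srun --whole` in 20.11) before calling mpiexec. The #SBATCH values simply mirror the job parameters used in this ticket.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --mem=0
#SBATCH --time=1:00:00
#SBATCH --account=def-mboisson

# SLURM_WHOLE=1 is the srun input-environment equivalent of `srun --whole`
# (Slurm 20.11): the steps srun creates on mpiexec's behalf get access to the
# whole job allocation rather than only the resources requested for the step.
export SLURM_WHOLE=1

mpiexec --map-by ppr:12:socket --bind-to core:overload-allowed hostname | sort | uniq -c
mpiexec -n 64 --report-bindings numactl --show | grep physcpubind | sort | uniq -c
```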
Hi,

A user has just discovered that the recent upgrade of Slurm to 20.11 broke his MPI code. That user uses mpiexec from OpenMPI because srun does not offer all of the same capabilities. He uses "--map-by ppr:12:socket" because his code works best when processes are distributed a certain number per socket.

The bug can be seen with this trivial example:

```
[mboisson@cedar1]$ salloc --exclusive --nodes=2 --time=1:00:00 --account=def-mboisson --mem=0
salloc: Pending job allocation 57435524
salloc: job 57435524 queued and waiting for resources
salloc: job 57435524 has been allocated resources
salloc: Granted job allocation 57435524
[mboisson@cdr768]$ mpiexec --map-by ppr:12:socket hostname | sort | uniq -c
24 cdr768.int.cedar.computecanada.ca
12 cdr774.int.cedar.computecanada.ca
```

This happens irrespective of the version of mpiexec that I use (tested OpenMPI 2.1.1 and 4.0.3). It goes without saying that I have 2 sockets per node, hence there should be 48 processes started (12 per socket × 2 sockets × 2 nodes), not 36.
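A quick sanity check of that assumption, sketched here for illustration (the node names are taken from the salloc example above; `lscpu` is assumed to be available on the compute nodes):

```bash
# Confirm that each node exposes 2 sockets to the OS: with
# `--map-by ppr:12:socket`, 2 sockets per node and 2 nodes should give
# 12 x 2 x 2 = 48 processes, not 36.
srun --nodes=2 --ntasks-per-node=1 lscpu | grep "Socket(s):"

# Confirm the socket/core/thread layout Slurm itself has configured
# for the two nodes from the example allocation.
sinfo -N -n cdr768,cdr774 -o "%N sockets=%X cores/socket=%Y threads/core=%Z"
```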