| Summary: | Slurm 20.11 breaks mpiexec bindings | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Maxime Boissonneault <maxime.boissonneault> |
| Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED DUPLICATE | Severity: | 2 - High Impact |
| Version: | 20.11.0 | CC: | cinek, kaizaad, siegert |
| Hardware: | Linux | OS: | Linux |
| Site: | Simon Fraser University | | |
More information: CPU bindings are all messed up with Slurm 20.11 and OpenMPI:

```
[mboisson@cedar1 def-mboisson]$ cat test.sh
#!/bin/bash
mpiexec --map-by ppr:12:socket --bind-to core:overload-allowed hostname | sort | uniq -c
mpiexec -n 64 --report-bindings numactl --show | grep physcpubind | sort | uniq -c
[mboisson@cedar1 def-mboisson]$ sbatch --nodes=2 --time=1:00:00 --account=def-mboisson --mem=0 --ntasks=64 test.sh
[mboisson@cedar1 def-mboisson]$ cat slurm-57436990.out
24 cdr1311.int.cedar.computecanada.ca
12 cdr1313.int.cedar.computecanada.ca
32 physcpubind: 0
32 physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```
For comparison, here is the same test with Slurm 20.02 on Graham:

```
$ cat slurm-41965236.out | grep -v "socket"
24 gra535
24 gra66
32 physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
32 physcpubind: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```
See other comments on the OpenMPI GitHub: https://github.com/open-mpi/ompi-www/pull/342

It appears that setting the environment variable SLURM_WHOLE=1 resolves the problem, but there does not appear to be any documentation available that explains the effects of setting this variable:

What are the effects of always setting SLURM_WHOLE=1?
What functionality is not available when SLURM_WHOLE=1 is set?
Should SLURM_WHOLE=1 be set only if --exclusive is specified?

Maxime,

Thanks for opening the ticket, but it's actually a duplicate of Bug 10383. We're aware of the discussion under the pull request to openmpi-www updating the Slurm FAQ there; in fact, I'm the author of it.

Martin,

> But there does not appear to be any documentation available that explains the effects of setting this variable

Yes, this was missing, but it got fixed in Bug 10430.

> What functionality is not available when SLURM_WHOLE=1 is set?
> Should SLURM_WHOLE=1 be set only if --exclusive is specified?

The brief answer is that the variable follows the convention for srun input variables, and setting it is equivalent to the `srun --whole` option. It does not disable any functionality; it creates a step with access to all job resources, not only those requested for the step. I think you may find the discussion under Bug 10383 interesting and more detailed.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 10383 ***

Mmm, Marcin, then bug 10383 is mistitled. We are not using UCX and we encounter problems. This is why I created this one.

> Mmm, Marcin, then bug 10383 is mistitled. We are not using UCX and we encounter problems. This is why I created this one.

Understood. I'm happy you reached out to us and we were able to match them. The initial "errors" experienced in Bug 10383 were in fact different, but since they have the same root cause we prefer to merge them. It's one issue, but with different symptoms depending on the case.
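A minimal sketch of the workaround discussed above, assuming OpenMPI's mpiexec launches its helper processes through srun inside the allocation: exporting SLURM_WHOLE=1 (the environment-variable form of `srun --whole` in 20.11) before calling mpiexec. The #SBATCH values simply mirror the job parameters used in this ticket.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --mem=0
#SBATCH --time=1:00:00
#SBATCH --account=def-mboisson

# SLURM_WHOLE=1 is the srun input-environment equivalent of `srun --whole`
# (Slurm 20.11): the steps srun creates on mpiexec's behalf get access to the
# whole job allocation rather than only the resources requested for the step.
export SLURM_WHOLE=1

mpiexec --map-by ppr:12:socket --bind-to core:overload-allowed hostname | sort | uniq -c
mpiexec -n 64 --report-bindings numactl --show | grep physcpubind | sort | uniq -c
```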
Hi,

A user has just discovered that the recent upgrade of Slurm to 20.11 broke his MPI code. That user uses mpiexec from OpenMPI because srun does not offer all of the same capabilities. He uses "--map-by ppr:12:socket" because his code works best when processes are distributed a certain number per socket.

The bug can be seen with this trivial example:

```
[mboisson@cedar1]$ salloc --exclusive --nodes=2 --time=1:00:00 --account=def-mboisson --mem=0
salloc: Pending job allocation 57435524
salloc: job 57435524 queued and waiting for resources
salloc: job 57435524 has been allocated resources
salloc: Granted job allocation 57435524
[mboisson@cdr768]$ mpiexec --map-by ppr:12:socket hostname | sort | uniq -c
24 cdr768.int.cedar.computecanada.ca
12 cdr774.int.cedar.computecanada.ca
```

This happens irrespective of the version of mpiexec that I use (tested OpenMPI 2.1.1 and 4.0.3). It goes without saying that I have 2 sockets per node, hence there should be 48 processes started (12 per socket × 2 sockets × 2 nodes), not 36.
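A quick sanity check of that assumption, sketched here for illustration (the node names are taken from the salloc example above; `lscpu` is assumed to be available on the compute nodes):

```bash
# Confirm that each node exposes 2 sockets to the OS: with
# `--map-by ppr:12:socket`, 2 sockets per node and 2 nodes should give
# 12 x 2 x 2 = 48 processes, not 36.
srun --nodes=2 --ntasks-per-node=1 lscpu | grep "Socket(s):"

# Confirm the socket/core/thread layout Slurm itself has configured
# for the two nodes from the example allocation.
sinfo -N -n cdr768,cdr774 -o "%N sockets=%X cores/socket=%Y threads/core=%Z"
```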