| Summary: | MPI jobs only run on the batch host - little CPU usage on other hosts | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Greg Wickham <greg.wickham> |
| Component: | slurmctld | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 1 - System not usable | ||
| Priority: | --- | CC: | cinek |
| Version: | 20.11.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | KAUST | Slinky Site: | --- |
| Attachments: | Slurm configuration. | ||
Description
Greg Wickham
2020-12-17 04:26:06 MST
Created attachment 17195
Slurm configuration.
Environment: Slurm 20.11.1, PMIX 3.2.2, CentOS 7.9.
sbatch file:
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --tasks-per-node=32
#SBATCH --cpus-per-task=4
#SBATCH --partition=batch
#SBATCH -J hpl
#SBATCH -o hpl-NPS4-32threads.%N.%J.out
#SBATCH -e hpl-NPS4-32threads.%N.%J.err
#SBATCH --time=04:10:00
#SBATCH --mem=0
#SBATCH --reservation=IBEX_CS
#run the application:
module load intelstack-default
module load openmpi/4.0.1/.gnu-6.4.0
mpirun -np 64 --mca btl self,vader --report-bindings --map-by l3cache -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores ./xhpl
$ squeue -j 13329509
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13329509 batch hpl wickhagj R 0:08 2 cn506-02-l,cn506-03-l
cn506-02-l:
top - 14:57:00 up 22:32, 1 user, load average: 25.54, 9.20, 3.39
Tasks: 1463 total, 33 running, 1430 sleeping, 0 stopped, 0 zombie
%Cpu(s): 25.0 us, 0.0 sy, 0.0 ni, 75.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 52820464+total, 41650988+free, 10474582+used, 6948924 buff/cache
KiB Swap: 31457276 total, 31447756 free, 9520 used. 41612816+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
80613 wickhagj 20 0 3125448 2.6g 13544 R 100.3 0.5 1:45.12 xhpl
80600 wickhagj 20 0 3064572 2.5g 13604 R 100.0 0.5 1:45.04 xhpl
80601 wickhagj 20 0 3085320 2.5g 13556 R 100.0 0.5 1:45.03 xhpl
80602 wickhagj 20 0 3098100 2.5g 13868 R 100.0 0.5 1:45.04 xhpl
80603 wickhagj 20 0 3064572 2.5g 13484 R 100.0 0.5 1:45.04 xhpl
80604 wickhagj 20 0 3133088 2.6g 13524 R 100.0 0.5 1:45.08 xhpl
80605 wickhagj 20 0 3166268 2.6g 13628 R 100.0 0.5 1:45.09 xhpl
80606 wickhagj 20 0 3154176 2.6g 13832 R 100.0 0.5 1:45.08 xhpl
cn506-03-l:
top - 14:57:24 up 22:33, 1 user, load average: 2.05, 0.89, 0.43
Tasks: 1409 total, 6 running, 1403 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.5 sy, 0.0 ni, 99.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 52820464+total, 50044947+free, 20958684 used, 6796484 buff/cache
KiB Swap: 31457276 total, 31457020 free, 256 used. 49999168+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
80416 wickhagj 20 0 474080 68496 11316 S 4.6 0.0 0:05.88 xhpl
80419 wickhagj 20 0 474080 70472 11316 S 4.6 0.0 0:05.92 xhpl
80420 wickhagj 20 0 474080 68508 11324 S 4.6 0.0 0:05.88 xhpl
80424 wickhagj 20 0 474080 68488 11312 S 4.6 0.0 0:05.90 xhpl
80425 wickhagj 20 0 474080 68504 11324 S 4.6 0.0 0:05.91 xhpl
80427 wickhagj 20 0 474080 68492 11312 S 4.6 0.0 0:05.88 xhpl
Bump.

Greg,
Sorry for the delay. Could you please try:
>export SLURM_WHOLE=1
before the mpirun call? This is very likely a duplicate of Bug 10383, where you can find more details.
cheers,
Marcin

Dear Marcin,
Thanks for the workaround. Confirming it works for us. Please resolve this ticket.
With thanks,
-Greg

Resolving as duplicate

*** This ticket has been marked as a duplicate of ticket 10383 ***
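
For readers landing on this ticket, the workaround from the comments (exporting `SLURM_WHOLE=1` before the `mpirun` call) might be applied to the batch script roughly as sketched below. This is an illustration under assumptions, not a verified fix for every site: `SLURM_WHOLE` is, as far as the srun documentation describes it, the environment counterpart of srun's `--whole` option; the `#SBATCH` directives and `module load` lines from the original script are omitted for brevity; and the `command -v` guard is an addition that is not in the original script, included only so the snippet is harmless outside a job allocation.

```shell
#!/bin/bash
# Sketch only: the #SBATCH directives and module loads from the original
# script are assumed to precede this point unchanged.

# Workaround suggested in this ticket: intended to let the step that mpirun
# launches through Slurm use all resources allocated to the job on each node.
export SLURM_WHOLE=1

# Guard added for illustration (not in the original script): only launch
# when mpirun and the benchmark binary are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x ./xhpl ]; then
    mpirun -np 64 --mca btl self,vader --report-bindings --map-by l3cache \
        -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores ./xhpl
else
    echo "mpirun or ./xhpl not found; skipping launch" >&2
fi
```

If the workaround takes effect, `top` on the second node should show the `xhpl` processes busy rather than mostly idle, as the reporter confirmed for this site. Bug 10383 has the background on why the variable is needed.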