We have detected a systematic and quite striking difference in the amount of time it takes to launch a simple bash script (with just some echo commands) with srun and mpirun (OpenMPI) on our cluster with Slurm 24.05.4. While mpirun can launch all MPI ranks within a few seconds (typically <10 s) regardless of the number of tasks per node, srun shows a slowdown that depends on the number of tasks per node and quickly increases to minutes.

This issue already manifests in small jobs with 16 tasks per node:

1 node / 16 tasks:
- mpirun: 1 sec
- srun: 42 sec

4 nodes / 64 tasks:
- mpirun: 4 sec
- srun: 1 min 44 sec

This issue happens on both TCP and InfiniBand networks:

1 node / 16 tasks / InfiniBand:
- mpirun: 1 sec
- srun: 42 sec

1 node / 16 tasks / TCP:
- mpirun: 1 sec
- srun: 1 min 28 sec

This behaviour is systematic and we get consistently slow starts with srun. This is the data we have gathered so far on the TCP system:

Launcher  Nodes  Tasks/node  Tasks  Duration
srun      2      32          64     00:02:08
srun      4      16          64     00:01:43
srun      8      8           64     00:00:13
mpirun    2      32          64     00:00:27
mpirun    4      16          64     00:00:26
mpirun    8      8           64     00:00:01
----
srun      1      32          32     00:01:43
srun      2      16          32     00:00:45
srun      4      8           32     00:00:06
srun      8      4           32     00:00:01
mpirun    1      32          32     00:00:02
mpirun    2      16          32     00:00:01
mpirun    4      8           32     00:00:01
mpirun    8      4           32     00:00:00
----
srun      1      16          16     00:01:28
srun      2      8           16     00:00:13
srun      4      4           16     00:00:01
srun      8      2           16     00:00:00
mpirun    1      16          16     00:00:01
mpirun    2      8           16     00:00:01
mpirun    4      4           16     00:00:02
mpirun    8      2           16     00:00:00

We have built Slurm with support for PMIx following the standard procedure in the documentation, and PMIx is enabled by default (SLURM_MPI_TYPE=pmix). The PMIx version on the system, which is also the one used to build Slurm, is v4.2.6. The OS of the system is Rocky Linux 8.10.

The jobs are submitted with sbatch. The job script loads OpenMPI v4.1.5 and launches a minimal script with srun/mpirun that prints the current time. Each job uses either srun or mpirun, but not both. The startup time measured is simply the time it takes for all tasks in the job to start processing, i.e. the delay between the first task starting and the last one starting.

This is the very minimal script:

-----------
#!/bin/bash
echo "== [$(hostname)] Start: $(date)"
-----------

As you can see, there is no need to even start a proper MPI application to trigger this issue, so we suspect that it stems from srun/PMIx itself.

This is our "/etc/slurm/mpi.conf":

-----------
PMIxDirectConn=YES
PMIxDirectConnEarly=YES
PMIxDirectConnUCX=YES
PMIxDirectSameArch=YES
-----------

Can you help us troubleshoot and fix this issue?
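For reference, the startup delay reported above is computed from the job output roughly as follows. This is only a sketch (it assumes GNU date and the output format of the minimal script shown above; the helper script itself is not something we run in production):

-----------
#!/bin/bash
# startup-spread.sh <job-output-file>
# Prints the delay in seconds between the first and the last task start line.
out="$1"

# Keep only the per-task lines, strip everything up to "Start: ",
# convert the timestamps to epoch seconds and sort them.
mapfile -t epochs < <(grep '^== \[' "$out" \
    | sed 's/.*Start: //' \
    | while read -r ts; do date -d "$ts" +%s; done \
    | sort -n)

first=${epochs[0]}
last=${epochs[-1]}
echo "startup spread: $(( last - first )) seconds"
-----------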
Hi,

Is this something that you have seen only with those particular OpenMPI + PMIx versions? Or is this something you have seen with other versions too?

Can you perform the same tests adding verbosity to srun ("srun -vvvvvv") and send us the output? No need to run them all, we can start evaluating a few of them for now.

Kind regards,
Joel
Created attachment 40153 [details] Output of srun -vvvvvv with OpenMPI v4 and PMIx v4.1
Hi,

I also checked Intel MPI 2021.9.0 with PMI2 and the behavior is the opposite: in that case srun is consistently faster than mpirun. In the following I ran the minimal test as described in the first post on a single node with 32 tasks. Each result for Startup Time is the average of 3 runs:

MPI      PMI   Launcher  Nodes  Tasks  StartupTime
---------------------------------------------------
OpenMPI  PMIx  srun      1      32     00:00:52
OpenMPI  PMIx  mpirun    1      32     00:00:01
Intel    PMI2  srun      1      32     00:00:19
Intel    PMI2  mpirun    1      32     00:01:02

I have not tested any other versions of OpenMPI; we currently only serve OpenMPI v4.1 with PMIx 4.2. We had already noticed this issue with Slurm 23.02, which was built in the same way as Slurm 24.05, both with PMIx 4.1. However, we no longer have any system running Slurm 23.02.

I attached the output of srun -vvvvvv for the smallest test showing this issue, 1 node and 16 tasks with OpenMPI v4.1.
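For completeness, this is roughly how the PMI flavour is selected per step in these tests. The exact invocations below are only a sketch (by default SLURM_MPI_TYPE=pmix applies, and the script name refers to the minimal test script from the first post):

-----------
# List the PMI plugins available in this Slurm installation
srun --mpi=list

# Force a specific plugin for a single step, overriding SLURM_MPI_TYPE
srun --mpi=pmix app-minimal.sh   # OpenMPI test
srun --mpi=pmi2 app-minimal.sh   # Intel MPI test
-----------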
Hi,

Thanks for sharing this info. Can you also attach the slurmd log from the node where that srun ran?

About this parameter:

> PMIxDirectConnUCX=YES

The default value is false. What is your reason to set it to true? Could you try setting it to false and see if the timings change?

Kind regards,
Joel
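For clarity, the mpi.conf for that test would look roughly like this. This is just a sketch assuming only that one parameter is changed; depending on your setup, a reconfigure or restart of the Slurm daemons may be needed for the change to take effect before the next run:

-----------
PMIxDirectConn=YES
PMIxDirectConnEarly=YES
PMIxDirectConnUCX=NO
PMIxDirectSameArch=YES
-----------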
Outcome of the test with logs from slurmd. The following are the start times of each task launched with srun (16 in total on 1 node):

Test: Startup time (srun)
---------------------
== Start: Thu Dec 19 16:41:00 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:02 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:02 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:04 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:04 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:07 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:14 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:27 CET 2024

As seen in other tests, the first batch of tasks starts fast, but the last few take a long time to start. The slurmd logs for this job are attached below.
Created attachment 40212 [details] slurmd logs of MPI startup test with 16 tasks in 1 node
Regarding your question about PMIxDirectConnUCX=YES: we have a heterogeneous cluster with partitions that use different interconnects (some with InfiniBand, some with Ethernet). We rely on UCX (through UCX_TLS) to automatically use the best transport on each partition. That's why we also enable PMIxDirectConnUCX, so that PMIx can use InfiniBand if available.
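To illustrate, the per-partition selection is expressed roughly like this (the transport lists below are only an example, not our exact production values):

-----------
# InfiniBand partitions
export UCX_TLS=rc,ud,sm,self

# Ethernet partitions
export UCX_TLS=tcp,sm,self
-----------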
I ran the same tests with PMIxDirectConnUCX disabled as requested and it did not have any effect.
Hi,

First of all, sorry for the long silence. I plan to focus more on this ticket in the following days.

Are you still experiencing the delay in your executions?

Can you also share your slurm.conf? Do you have a TaskProlog? If so, can you share it as well?

When you launch the srun, which exact command do you use? Are you adding any additional flags?

Finally, can you share a new set of slurmd logs? I am interested in logs with debug3 and with the Steps DebugFlags [1] active. You can change the debug level and DebugFlags with scontrol [2], without restarting the controller. After the test, you can revert the debug level and DebugFlags to your usual values.

Kind regards,
Joel

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags
[2] https://slurm.schedmd.com/scontrol.html#OPT_setdebug
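For reference, a minimal sketch of the scontrol commands involved (the "info" level used to revert at the end is only an example; use whatever your usual settings are):

-----------
# Raise the debug level and enable the Steps debug flag
scontrol setdebug debug3
scontrol setdebugflags +Steps

# ... run the test job and collect the slurmd log ...

# Revert afterwards
scontrol setdebug info
scontrol setdebugflags -Steps
-----------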
Hi,

Did you have time to check my last comment?

Kind regards,
Joel
Hi,

Do you have any update about this issue?

Kind regards,
Joel
Created attachment 40887 [details] MPI startup test - job 10506249 - single node with infiniband - slurmd debug3 logs
Created attachment 40888 [details] MPI startup test - single node with ethernet - slurmd debug3 logs
Hi Joel,

Thanks for coming back to this issue. I generated the requested slurmd logs for two test jobs that reproduce this issue. In both cases we submitted the following job script:

===============
#!/bin/bash
#SBATCH --time=10:00
#SBATCH --job-name=mpi-test-app
#SBATCH --output="%x.%j.out"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1

WORKDIR="$SCRATCH/$SLURM_JOB_NAME.$SLURM_JOB_ID"
mkdir -p $WORKDIR
cd $WORKDIR

module load foss/2023a

srun app-minimal.sh
===============

where "app-minimal.sh" is the following script:

===============
#!/bin/bash
echo "== [$(hostname)] Start: $(date)"
===============

The difference between the two tests is the hardware of the node:

- job 10506249: Intel Skylake with InfiniBand network
- job 10506257: AMD Genoa-X with Ethernet network

Output of job 10506249:
---------------
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:19 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:19 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:21 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:24 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:31 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:44 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:00 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:15 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:30 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:46 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:38:01 CET 2025
---------------

Output of job 10506257:
---------------
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:46 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:46 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:46 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:47 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:47 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:49 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:52 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:59 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:07:12 CET 2025
---------------

Hope this helps, let me know if you need anything else.
Created attachment 40889 [details] MPI startup test - job 10506257 - single node with ethernet - slurmd debug3 logs
Hi,

Can you share your slurm.conf and your task prolog?

Kind regards,
Joel
Hi,

Just a reminder, please attach the slurm.conf and task prolog to the ticket.

Kind regards,
Joel
Created attachment 41331 [details] slurm.conf
Created attachment 41332 [details] task prolog
I uploaded the requested config files.

In the meantime, we upgraded to Slurm 24.05.7 and this issue is still present. As a reminder, this happens on single-node jobs, so network type, network configuration, or settings like PMIxDirectConn should be irrelevant.

The following are up-to-date results:

Test: Startup time of 16 tasks on 1 node (srun)
---------------------
== Start: Tue Apr 1 10:09:26 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:28 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:28 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:30 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:33 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:33 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:40 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:10:08 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:10:24 CEST 2025

Test: Startup time of 16 tasks on 1 node (mpirun)
---------------------
== Start: Tue Apr 1 10:19:52 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:52 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
Hi,

From the last logs you shared, the tasks are all started at the same time:

[2025-02-21T09:36:18.158] [10506249.0] debug3: Unblocking StepId=10506249.0 task 0, writefd = 137
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 1, writefd = 138
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 2, writefd = 139
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 3, writefd = 140
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 4, writefd = 141
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 5, writefd = 142
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 6, writefd = 143
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 7, writefd = 144
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 8, writefd = 145
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 9, writefd = 146
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 10, writefd = 147
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 11, writefd = 148
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 12, writefd = 149
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 13, writefd = 150
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 14, writefd = 151
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 15, writefd = 152

My first suspicion right now is that the time difference you observe might be coming from the TaskProlog. Could you try executing the same without the TaskProlog in place?

Kind regards,
Joel
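A sketch of one way to test that (the TaskProlog path below is only a placeholder for your own, and depending on your setup a restart of the daemons may be needed instead of a reconfigure):

-----------
# In slurm.conf, comment out the TaskProlog line:
#TaskProlog=/path/to/your/task_prolog.sh

# Then make the daemons re-read the configuration:
scontrol reconfigure
-----------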