We have detected a systematic and quite striking difference in the amount of time it takes to launch a simple bash script (with just some echo commands) with srun and mpirun (OpenMPI) on our cluster with Slurm 24.05.4. While mpirun can launch all MPI ranks within a few seconds (typically <10 s) regardless of the number of tasks per node, srun shows a slowdown that depends on the number of tasks per node and quickly increases to minutes.

This issue already manifests in small jobs with 16 tasks per node:

1 node / 16 tasks:
- mpirun: 1 sec
- srun: 42 sec

4 nodes / 64 tasks:
- mpirun: 4 sec
- srun: 1 min 44 sec

This issue happens on both TCP and InfiniBand networks:

1 node / 16 tasks / InfiniBand:
- mpirun: 1 sec
- srun: 42 sec

1 node / 16 tasks / TCP:
- mpirun: 1 sec
- srun: 1 min 28 sec

This behaviour is systematic and we get consistently slow starts with srun. This is the data we have gathered so far on the TCP system:

Launcher  Nodes  Tasks/node  Tasks  Duration
srun      2      32          64     00:02:08
srun      4      16          64     00:01:43
srun      8      8           64     00:00:13
mpirun    2      32          64     00:00:27
mpirun    4      16          64     00:00:26
mpirun    8      8           64     00:00:01
----
srun      1      32          32     00:01:43
srun      2      16          32     00:00:45
srun      4      8           32     00:00:06
srun      8      4           32     00:00:01
mpirun    1      32          32     00:00:02
mpirun    2      16          32     00:00:01
mpirun    4      8           32     00:00:01
mpirun    8      4           32     00:00:00
----
srun      1      16          16     00:01:28
srun      2      8           16     00:00:13
srun      4      4           16     00:00:01
srun      8      2           16     00:00:00
mpirun    1      16          16     00:00:01
mpirun    2      8           16     00:00:01
mpirun    4      4           16     00:00:02
mpirun    8      2           16     00:00:00

We have built Slurm with support for PMIx following the standard procedure in the documentation, and PMIx is enabled by default (SLURM_MPI_TYPE=pmix). The PMIx version on the system, which is also the one used to build Slurm, is v4.2.6. The OS of the system is Rocky Linux 8.10.

The jobs are submitted with sbatch. The job script loads OpenMPI v4.1.5 and launches a minimal script with srun/mpirun that prints the current time. Each job uses either srun or mpirun, but not both. The startup time measured is simply the time it takes for all tasks in the job to start processing, i.e. the delay between the first task starting and the last one starting.

This is the very minimal script:

-----------
#!/bin/bash
echo "== [$(hostname)] Start: $(date)"
-----------

As you can see, there is no need to even start a proper MPI application to trigger this issue, so we suspect that it stems from srun/PMIx itself.

This is our "/etc/slurm/mpi.conf":

-----------
PMIxDirectConn=YES
PMIxDirectConnEarly=YES
PMIxDirectConnUCX=YES
PMIxDirectSameArch=YES
-----------

Can you help us troubleshoot and fix this issue?
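For reference, the startup delay reported above is computed from the job output roughly as follows. This is only a sketch (it assumes GNU date and the output format of the minimal script shown above; the helper script itself is not something we run in production):

-----------
#!/bin/bash
# startup-spread.sh <job-output-file>
# Prints the delay in seconds between the first and the last task start line.
out="$1"

# Keep only the per-task lines, strip everything up to "Start: ",
# convert the timestamps to epoch seconds and sort them.
mapfile -t epochs < <(grep '^== \[' "$out" \
    | sed 's/.*Start: //' \
    | while read -r ts; do date -d "$ts" +%s; done \
    | sort -n)

first=${epochs[0]}
last=${epochs[-1]}
echo "startup spread: $(( last - first )) seconds"
-----------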
Hi,

Is this something that you have seen only with those particular OpenMPI + PMIx versions? Or is this something you have seen with other versions too?

Can you perform the same tests adding verbosity to srun ("srun -vvvvvv") and send us the output? No need to run them all, we can start evaluating a few of them for now.

Kind regards,
Joel
Created attachment 40153 [details] Output of srun -vvvvvv with OpenMPI v4 and PMIx v4.1
Hi,

I also checked Intel MPI 2021.9.0 with PMI2 and the behavior is the opposite: in that case srun is consistently faster than mpirun. In the following I ran the minimal test as described in the first post on a single node with 32 tasks. Each result for Startup Time is the average of 3 runs:

MPI      PMI   Launcher  Nodes  Tasks  StartupTime
---------------------------------------------------
OpenMPI  PMIx  srun      1      32     00:00:52
OpenMPI  PMIx  mpirun    1      32     00:00:01
Intel    PMI2  srun      1      32     00:00:19
Intel    PMI2  mpirun    1      32     00:01:02

I have not tested any other versions of OpenMPI; we currently only serve OpenMPI v4.1 with PMIx 4.2. We had already noticed this issue with Slurm 23.02, which was built in the same way as Slurm 24.05, both with PMIx 4.1. However, we no longer have any system running Slurm 23.02.

I attached the output of srun -vvvvvv for the smallest test showing this issue, 1 node and 16 tasks with OpenMPI v4.1.
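For completeness, this is roughly how the PMI flavour is selected per step in these tests. The exact invocations below are only a sketch (by default SLURM_MPI_TYPE=pmix applies, and the script name refers to the minimal test script from the first post):

-----------
# List the PMI plugins available in this Slurm installation
srun --mpi=list

# Force a specific plugin for a single step, overriding SLURM_MPI_TYPE
srun --mpi=pmix app-minimal.sh   # OpenMPI test
srun --mpi=pmi2 app-minimal.sh   # Intel MPI test
-----------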
Hi,

Thanks for sharing this info. Can you also attach the slurmd log from the node where that srun ran?

About this parameter:

> PMIxDirectConnUCX=YES

The default value is false. What is your reason to set it to true? Could you try setting it to false and see if the timings change?

Kind regards,
Joel
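For clarity, the mpi.conf for that test would look roughly like this. This is just a sketch assuming only that one parameter is changed; depending on your setup, a reconfigure or restart of the Slurm daemons may be needed for the change to take effect before the next run:

-----------
PMIxDirectConn=YES
PMIxDirectConnEarly=YES
PMIxDirectConnUCX=NO
PMIxDirectSameArch=YES
-----------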
Outcome of the test with logs from slurmd. The following are the start times of each task launched with srun (16 in total on 1 node):

Test: Startup time (srun)
---------------------
== Start: Thu Dec 19 16:41:00 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:02 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:02 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:04 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:04 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:07 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:14 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:27 CET 2024

As seen in other tests, the first batch of tasks starts fast, but the last few take a long time to start. The slurmd logs for this job are attached below.
Created attachment 40212 [details] slurmd logs of MPI startup test with 16 tasks in 1 node
Regarding your question about PMIxDirectConnUCX=YES: we have a heterogeneous cluster with partitions that use different interconnects (some with InfiniBand, some with Ethernet). We rely on UCX (through UCX_TLS) to automatically use the best transport on each partition. That's why we also enable PMIxDirectConnUCX, so that PMIx can use InfiniBand if available.
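To illustrate, the per-partition selection is expressed roughly like this (the transport lists below are only an example, not our exact production values):

-----------
# InfiniBand partitions
export UCX_TLS=rc,ud,sm,self

# Ethernet partitions
export UCX_TLS=tcp,sm,self
-----------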
I ran the same tests with PMIxDirectConnUCX disabled as requested and it did not have any effect.
Hi,

First of all, sorry for the long silence. I plan to focus more on this ticket in the following days.

Are you still experiencing the delay in your executions?

Can you also share your slurm.conf? Do you have a TaskProlog? If so, can you share it as well?

When you launch the srun, which exact command do you use? Are you adding any additional flags?

Finally, can you share a new set of slurmd logs? I am interested in logs with debug3 and with the Steps DebugFlags [1] active. You can change the debug level and DebugFlags with scontrol [2], without restarting the controller. After the test, you can revert the debug level and DebugFlags to your usual values.

Kind regards,
Joel

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags
[2] https://slurm.schedmd.com/scontrol.html#OPT_setdebug
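For reference, a minimal sketch of the scontrol commands involved (the "info" level used to revert at the end is only an example; use whatever your usual settings are):

-----------
# Raise the debug level and enable the Steps debug flag
scontrol setdebug debug3
scontrol setdebugflags +Steps

# ... run the test job and collect the slurmd log ...

# Revert afterwards
scontrol setdebug info
scontrol setdebugflags -Steps
-----------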
Hi,

Did you have time to check my last comment?

Kind regards,
Joel
Hi,

Do you have any update about this issue?

Kind regards,
Joel
Created attachment 40887 [details] MPI startup test - job 10506249 - single node with infiniband - slurmd debug3 logs
Created attachment 40888 [details] MPI startup test - single node with ethernet - slurmd debug3 logs
Hi Joel,

Thanks for coming back to this issue. I generated the requested slurmd logs for two test jobs that reproduce this issue. In both cases we submitted the following job script:

===============
#!/bin/bash
#SBATCH --time=10:00
#SBATCH --job-name=mpi-test-app
#SBATCH --output="%x.%j.out"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1

WORKDIR="$SCRATCH/$SLURM_JOB_NAME.$SLURM_JOB_ID"
mkdir -p $WORKDIR
cd $WORKDIR

module load foss/2023a

srun app-minimal.sh
===============

where "app-minimal.sh" is the following script:

===============
#!/bin/bash
echo "== [$(hostname)] Start: $(date)"
===============

The difference between the two tests is the hardware of the node:

- job 10506249: Intel Skylake with InfiniBand network
- job 10506257: AMD Genoa-X with Ethernet network

Output of job 10506249:
---------------
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:19 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:19 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:21 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:24 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:31 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:44 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:00 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:15 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:30 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:46 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:38:01 CET 2025
---------------

Output of job 10506257:
---------------
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:46 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:46 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:46 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:47 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:47 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:49 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:52 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:59 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:07:12 CET 2025
---------------

Hope this helps, let me know if you need anything else.
Created attachment 40889 [details] MPI startup test - job 10506257 - single node with ethernet - slurmd debug3 logs
Hi,

Can you share your slurm.conf and your task prolog?

Kind regards,
Joel
Hi,

Just a reminder, please attach the slurm.conf and task prolog to the ticket.

Kind regards,
Joel
Created attachment 41331 [details] slurm.conf
Created attachment 41332 [details] task prolog
I uploaded the requested config files.

In the meantime, we upgraded to Slurm 24.05.7 and this issue is still present. As a reminder, this happens on single-node jobs, so network type, network configuration, or settings like PMIxDirectConn should be irrelevant.

The following are up-to-date results:

Test: Startup time of 16 tasks on 1 node (srun)
---------------------
== Start: Tue Apr 1 10:09:26 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:28 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:28 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:30 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:33 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:33 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:40 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:10:08 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:10:24 CEST 2025

Test: Startup time of 16 tasks on 1 node (mpirun)
---------------------
== Start: Tue Apr 1 10:19:52 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:52 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
Hi,

From the last logs you shared, the tasks are all started at the same time:

[2025-02-21T09:36:18.158] [10506249.0] debug3: Unblocking StepId=10506249.0 task 0, writefd = 137
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 1, writefd = 138
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 2, writefd = 139
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 3, writefd = 140
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 4, writefd = 141
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 5, writefd = 142
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 6, writefd = 143
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 7, writefd = 144
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 8, writefd = 145
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 9, writefd = 146
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 10, writefd = 147
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 11, writefd = 148
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 12, writefd = 149
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 13, writefd = 150
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 14, writefd = 151
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 15, writefd = 152

My first suspicion right now is that the time difference you observe might be coming from the TaskProlog. Could you try executing the same without the TaskProlog in place?

Kind regards,
Joel
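A sketch of one way to test that (the TaskProlog path below is only a placeholder for your own, and depending on your setup a restart of the daemons may be needed instead of a reconfigure):

-----------
# In slurm.conf, comment out the TaskProlog line:
#TaskProlog=/path/to/your/task_prolog.sh

# Then make the daemons re-read the configuration:
scontrol reconfigure
-----------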