Description
VUB HPC
2024-12-09 08:58:17 MST
Hi,
Is this something that you have seen only with those particular OpenMPI + PMIx versions? Or is this something you have seen with other versions too?
Can you perform the same tests adding verbosity to srun ("srun -vvvvvv") and send us the output? No need to run them all, we can start evaluating a few of them for now.
Kind regards,
Joel
Created attachment 40153 [details]
Output of srun -vvvvvv with OpenMPI v4 and PMIx v4.1
Hi,
I also checked Intel MPI 2021.9.0 with PMI2 and the behavior is the opposite: in that case srun is consistently faster than mpirun. In the following I ran the minimal test as described in the first post on a single node with 32 tasks. Each result for Startup Time is the average of 3 runs:

MPI      PMI   Launcher  Nodes  Tasks  StartupTime
-----------------------------------------------------
OpenMPI  PMIx  srun      1      32     00:00:52
OpenMPI  PMIx  mpirun    1      32     00:00:01
Intel    PMI2  srun      1      32     00:00:19
Intel    PMI2  mpirun    1      32     00:01:02

I have not tested any other versions of OpenMPI. We currently only serve OpenMPI v4.1 with PMIx 4.2. We already noticed this issue with Slurm 23.02, though; it was built in the same way as Slurm 24.05, and both were built with PMIx 4.1. However, we no longer have any system running Slurm 23.02.
I attached the output of srun -vvvvvv for the smallest test showing this issue: 1 node and 16 tasks with OpenMPI v4.1.
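[Editorial note: the Startup Time figures in the table above could be derived from a single run's output roughly as sketched below. This is only an illustration, not part of the original test: the output file name is a placeholder, GNU date is assumed for timestamp parsing, and the startup time is taken as the gap between the launcher's "== Start:" line and the last task's start line.]
===============
#!/bin/bash
# Hypothetical helper: compute the startup time of one run from its output file.
# The file name "mpi-test-app.out" is illustrative, not from the ticket.
out=mpi-test-app.out
# Timestamp printed just before srun/mpirun is invoked
first=$(grep -m1 '^== Start:' "$out" | sed 's/^== Start: //')
# Timestamp of the last task to start
last=$(grep '^== \[' "$out" | sed 's/.*Start: //' | tail -n 1)
# Elapsed seconds between the two (GNU date assumed for -d parsing)
echo "Startup time: $(( $(date -d "$last" +%s) - $(date -d "$first" +%s) )) seconds"
===============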
Hi,
Thanks for sharing this info.
Can you also attach the slurmd log from the node where that srun ran?
About this parameter:
> PMIxDirectConnUCX=YES
The default value is false. What is your reason to set it to true?
Could you try setting it to false and see if the timings change?
Kind regards,
Joel
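[Editorial note: PMIxDirectConnUCX is one of the PMIx plugin options read from mpi.conf. A test with it disabled could look roughly like the sketch below; the remark about propagating the change is a general assumption about the site's procedure, not something stated in this ticket.]
===============
# mpi.conf (kept in the same directory as slurm.conf) -- sketch for this test
# Disable UCX for PMIx direct connections, as suggested above.
PMIxDirectConnUCX=false
===============
After editing mpi.conf, the daemons would need to pick up the new value (via the site's usual reconfigure/restart procedure) before repeating the timing test.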
Outcome of the test with logs from slurmd. The following is the starting time of each task launched with srun (16 in total on 1 node):

Test: Startup time (srun)
---------------------
== Start: Thu Dec 19 16:41:00 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:01 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:02 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:02 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:04 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:04 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:07 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:14 CET 2024
== [node240.hydra.os] Start: Thu Dec 19 16:41:27 CET 2024

As seen in other tests, the first batch of tasks starts quickly, but the last few take a long time to start. The slurmd logs for this job are attached below.
Created attachment 40212 [details]
slurmd logs of MPI startup test with 16 tasks in 1 node
Regarding your question about PMIxDirectConnUCX=YES: we have a heterogeneous cluster with partitions that use different interconnects (some with IB, some with Ethernet). We rely on UCX (through UCX_TLS) to automatically use the best transport on each partition. That is why we also enable PMIxDirectConnUCX, so that PMIx can use IB if available. I ran the same tests with PMIxDirectConnUCX disabled as requested and it does not have any effect.

Hi,
First of all, sorry for the long silence. I plan to focus more on this ticket in the following days.
Are you still experiencing the delay in your executions?
Can you also share your slurm.conf? Do you have a TaskProlog? If so, can you share that as well?
When you launch srun, which exact command do you use? Are you adding any additional flags?
Finally, can you share a new set of slurmd logs? I am interested in logs with debug3 and with the Steps DebugFlags [1] active. You can change the debug level and DebugFlags with scontrol [2], without restarting the controller. After the test, you can revert the debug level and DebugFlags to your usual values.
Kind regards,
Joel
[1] https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags
[2] https://slurm.schedmd.com/scontrol.html#OPT_setdebug

Hi,
Did you have time to check my last comment?
Kind regards,
Joel

Hi,
Do you have any update about this issue?
Kind regards,
Joel
Created attachment 40887 [details]
MPI startup test - job 10506249 - single node with infiniband - slurmd debug3 logs
Created attachment 40888 [details]
MPI startup test - single node with ethernet - slurmd debug3 logs
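[Editorial note: the runtime debug changes requested above could be applied and reverted roughly as follows. This is only a sketch: the "info" level used when reverting is just an example of a typical default, and the nodes= form for adjusting slurmd assumes a recent Slurm release.]
===============
# Raise the debug level and enable the Steps debug flag at runtime
scontrol setdebug debug3          # controller; add nodes=<nodelist> for slurmd on recent Slurm
scontrol setdebugflags +Steps

# ... reproduce the issue (submit the 16-task single-node test job) ...

# Revert to the usual settings afterwards (example values)
scontrol setdebug info
scontrol setdebugflags -Steps
===============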
Hi Joel,
Thanks for coming back to this issue. I generated the requested slurmd logs for two test jobs that reproduce this issue. In both cases we submitted the following job script:
===============
#!/bin/bash
#SBATCH --time=10:00
#SBATCH --job-name=mpi-test-app
#SBATCH --output="%x.%j.out"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1

WORKDIR="$SCRATCH/$SLURM_JOB_NAME.$SLURM_JOB_ID"
mkdir -p $WORKDIR
cd $WORKDIR

module load foss/2023a

srun app-minimal.sh
===============
where "app-minimal.sh" is the following script:
===============
#!/bin/bash
echo "== [$(hostname)] Start: $(date)"
===============
The difference between the two tests is the hardware of the node:
- job 10506249: Intel Skylake with InfiniBand network
- job 10506257: AMD Genoa-X with Ethernet network

Output of job 10506249:
---------------
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:18 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:19 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:19 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:21 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:24 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:31 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:36:44 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:00 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:15 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:30 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:37:46 CET 2025
== [node398.hydra.os] Start: Fri Feb 21 09:38:01 CET 2025
---------------

Output of job 10506257:
---------------
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:45 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:46 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:46 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:46 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:47 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:47 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:49 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:52 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:06:59 CET 2025
== [node600.hydra.os] Start: Fri Feb 21 10:07:12 CET 2025
---------------

Hope this helps, let me know if you need anything else.
Created attachment 40889 [details]
MPI startup test - job 10506257 - single node with ethernet - slurmd debug3 logs
Hi,
Can you share your slurm.conf and your task prolog?
Kind regards,
Joel

Hi,
Just a reminder: please attach the slurm.conf and task prolog to the ticket.
Kind regards,
Joel
Created attachment 41331 [details]
slurm.conf
Created attachment 41332 [details]
task prolog
I uploaded the requested config files. In the meantime, we upgraded to Slurm 24.05.7 and this issue still persists. As a reminder, this happens on single-node jobs, so network type, network configuration, and settings like PMIxDirectConn should be irrelevant. The following are up-to-date results:

Test: Startup time of 16 tasks on 1 node (srun)
---------------------
== Start: Tue Apr 1 10:09:26 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:27 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:28 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:28 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:30 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:33 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:33 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:40 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:09:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:10:08 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:10:24 CEST 2025

Test: Startup time of 16 tasks on 1 node (mpirun)
---------------------
== Start: Tue Apr 1 10:19:52 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:52 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025
== [node314.hydra.os] Start: Tue Apr 1 10:19:53 CEST 2025

Hi,
From the last logs you shared, the tasks are all started at the same time:

[2025-02-21T09:36:18.158] [10506249.0] debug3: Unblocking StepId=10506249.0 task 0, writefd = 137
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 1, writefd = 138
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 2, writefd = 139
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 3, writefd = 140
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 4, writefd = 141
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 5, writefd = 142
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 6, writefd = 143
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 7, writefd = 144
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 8, writefd = 145
[2025-02-21T09:36:18.159] [10506249.0] debug3: Unblocking StepId=10506249.0 task 9, writefd = 146
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 10, writefd = 147
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 11, writefd = 148
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 12, writefd = 149
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 13, writefd = 150
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 14, writefd = 151
[2025-02-21T09:36:18.160] [10506249.0] debug3: Unblocking StepId=10506249.0 task 15, writefd = 152

My first suspicion right now is that the time difference you observe might be coming from the TaskProlog. Could you try executing the same test without the TaskProlog in place?
Kind regards,
Joel
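[Editorial note: one way to run the check Joel asks for could be to temporarily take the TaskProlog out of slurm.conf and repeat the timing test. This is only a sketch: the prolog path shown is a placeholder, and it assumes that "scontrol reconfigure" is sufficient to propagate the change on this site.]
===============
# Sketch: re-run the startup test without the TaskProlog in place.
# 1. Comment out the TaskProlog line in slurm.conf, e.g.:
#      #TaskProlog=/path/to/task_prolog     (path is a placeholder)
# 2. Propagate the change without restarting the daemons:
scontrol reconfigure
# 3. Re-submit the 16-task single-node job and compare the per-task start times.
# 4. Restore the TaskProlog line and run "scontrol reconfigure" again.
===============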