Two users reported that when they submit an MPI job, they get this error:

    srun: error: mpi/pmi2: failed to send temp kvs to compute nodes

This does not happen with most of the software we use; so far it is confirmed with the software named Dalton2015. That software is compiled with openmpi 1.8.4 and gcc 4.4.7; slurm and openmpi are compiled with PMI support. In slurm.conf I have set MpiDefault=pmi2 and MpiParams=ports=12000-12999. Is that port range too small? We have ~150 nodes and ~2200 cores.
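For reference, the relevant slurm.conf lines as described above (this just restates the reported configuration; the comments are annotations, not from the actual config file):

    # Default MPI plugin used when srun is not given --mpi explicitly
    MpiDefault=pmi2
    # Port range reserved for MPI use on the compute nodes
    MpiParams=ports=12000-12999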
What seems to happen is that srun fails to send the pmi2 data to all components of the parallel job upon the PMI2_KVS_Fence() call inside the pmi2 client code. Does this happen always, or is it just a transient error? Do you happen to have the slurmd.log from the hosts where this failed? It would also be interesting to see the log/output of the Dalton2015 application when this happens. A good example of how pmi2 works can be found here: slurm/contribs/pmi2/testpmi2.c In the case of this error, not all ranks will get past the PMI2_KVS_Fence() and the job will abort.

David
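For anyone not familiar with the pmi2 flow, here is a stripped-down sketch of the put/fence/get pattern that testpmi2.c exercises (illustrative only, not the actual test program; the key/value names are made up). The PMI2_KVS_Fence() call below is the one that never completes on all ranks when srun cannot deliver the aggregated KVS data:

    #include <stdio.h>
    #include "pmi2.h"   /* PMI2 client header shipped with Slurm */

    int main(void)
    {
        int spawned, size, rank, appnum, peer, vallen;
        char jobid[64], key[64], val[128], buf[128];

        /* Register this task with the local slurmstepd's PMI2 agent */
        PMI2_Init(&spawned, &size, &rank, &appnum);
        PMI2_Job_GetId(jobid, sizeof(jobid));

        /* Every rank publishes a key/value pair... */
        snprintf(key, sizeof(key), "key-%d", rank);
        snprintf(val, sizeof(val), "value-from-rank-%d", rank);
        PMI2_KVS_Put(key, val);

        /* ...then all ranks synchronize.  srun collects the KVS data and
         * pushes the aggregate back out to every slurmstepd here; this is
         * where "failed to send temp kvs to compute nodes" shows up. */
        PMI2_KVS_Fence();

        /* After the fence each rank can read any other rank's value */
        peer = (rank + 1) % size;
        snprintf(key, sizeof(key), "key-%d", peer);
        PMI2_KVS_Get(jobid, peer, key, buf, sizeof(buf), &vallen);
        printf("rank %d of %d read: %s\n", rank, size, buf);

        PMI2_Finalize();
        return 0;
    }

Built against libpmi2 and launched with something like srun --mpi=pmi2 -n <ntasks> ./a.out.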
Dave,

We are seeing the same error with a "hello world" application on a Cray CS:

    srun -N 1500 -n 50000 hello

The application works below 50K PEs and fails above 50K PEs.
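For completeness, the reproducer is nothing exotic; a hello-world along these lines (the exact source may differ, but it does nothing beyond MPI_Init/print/MPI_Finalize) is enough to hit the error at that scale:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        /* The PMI2 wire-up (including the KVS fence) happens inside MPI_Init */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        printf("hello from rank %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }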
Hi David,

WRT your question "Does this happen always or is it just a transient error?": on our CS system (what Brian references in comment #2), the problem is consistent and reproducible.

Best regards,
We were seeing this error in the slurmd.log files:

    [2015-03-26T19:24:56.942] error: forward_thread: slurm_msg_sendto: Socket timed out on send/recv operation

So we increased MessageTimeout from 10 to 20 and we were able to run on all of the nodes.
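For anyone hitting the same thing, the change amounts to this one line in slurm.conf (plus pushing the new configuration out to the cluster); 10 seconds is the default:

    # Time permitted for a round-trip communication to complete, in seconds
    MessageTimeout=20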
If increasing the MessageTimeout helped, this may indicate some congestion and retransmission which eventually cleared up. With 50000 ranks there is quite a bit of pmi2 data to be sent in one shot from srun to the slurmstepds. Yann, do you have more information about this problem?

David
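To put rough, purely illustrative numbers on "quite a bit of data": if each rank publishes on the order of 1 KB of KVS data (the real per-rank payload depends on the MPI library), a 50,000-rank fence means roughly 50,000 x 1 KB = ~50 MB of aggregated key/value data has to be fanned out to the slurmstepds in one shot, which is a lot to push through within a single MessageTimeout window.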
(In reply to David Bigagli from comment #5)
> If increasing the MessageTimeout helped, this may indicate some congestion
> and retransmission which eventually cleared up. With 50000 ranks there is
> quite a bit of pmi2 data to be sent in one shot from srun to the
> slurmstepds.

David,

Does the Slurm logic try to space out these communications over time for large jobs? If not, that might improve performance over the TCP retry logic. I know there is logic of that sort in a couple of places, where sleep() calls are added based upon the task ID/rank.
It retries with a sleep interval only if the send/receive API fails. We don't know enough about this issue yet.

David
(In reply to David Bigagli from comment #7)
> It retries with a sleep interval only if the send/receive API fails.
> We don't know enough about this issue yet.

I'm not sure what is happening either, but my concern is that the TCP retransmit logic uses exponential backoff and all of the messages (including retransmits) might be happening at the same time. So there is a big data storm at time 0, then again at time 1 second, then again at time 3 seconds, then again at 7 seconds, etc. Each time there is a data storm, some messages do get through, but then the network is idle until the next round of retransmissions (which again all fire at the same time).

There is logic in src/slurmd/slurmd/req.c, _delay_rpc() (see below), that spaces communications out over time and demonstrates much better scalability for some message traffic. This type of logic _might_ also be needed in the pmi2 plugin for highly parallel applications.

/* Delay a message based upon the host index, total host count and RPC_TIME.
 * This logic depends upon synchronized clocks across the cluster. */
static void _delay_rpc(int host_inx, int host_cnt, int usec_per_rpc)
{
        struct timeval tv1;
        uint32_t cur_time;      /* current time in usec (just 9 digits) */
        uint32_t tot_time;      /* total time expected for all RPCs */
        uint32_t offset_time;   /* relative time within tot_time */
        uint32_t target_time;   /* desired time to issue the RPC */
        uint32_t delta_time;

again:  if (gettimeofday(&tv1, NULL)) {
                usleep(host_inx * usec_per_rpc);
                return;
        }

        cur_time = ((tv1.tv_sec % 1000) * 1000000) + tv1.tv_usec;
        tot_time = host_cnt * usec_per_rpc;
        offset_time = cur_time % tot_time;
        target_time = host_inx * usec_per_rpc;
        if (target_time < offset_time)
                delta_time = target_time - offset_time + tot_time;
        else
                delta_time = target_time - offset_time;
        if (usleep(delta_time)) {
                if (errno == EINVAL)    /* usleep for more than 1 sec */
                        usleep(900000);
                /* errno == EINTR */
                goto again;
        }
}
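As a purely hypothetical sketch of what that could look like on the pmi2 side (the function and structure names below are made up, not the real plugin code), each sender thread would delay in proportion to its target's index, the same spreading idea _delay_rpc() uses:

    #include <stdio.h>
    #include <unistd.h>
    #include <pthread.h>

    /* Illustrative names only -- not the actual pmi2 plugin code. */
    typedef struct {
        int node_inx;       /* index of the target node within the step */
        int usec_per_rpc;   /* desired spacing between consecutive sends */
    } kvs_send_arg_t;

    /* Stand-in for the real per-node KVS transmit routine */
    static void send_kvs_to_node(int node_inx)
    {
        printf("sending temp kvs data to node %d\n", node_inx);
    }

    /* Each sender thread sleeps in proportion to its target's index, so the
     * sends are spread over node_cnt * usec_per_rpc instead of all starting
     * at time zero and colliding. */
    static void *kvs_send_thread(void *arg)
    {
        kvs_send_arg_t *a = arg;

        usleep((useconds_t)(a->node_inx * a->usec_per_rpc));
        send_kvs_to_node(a->node_inx);
        return NULL;
    }

    int main(void)
    {
        enum { NODE_CNT = 8, USEC_PER_RPC = 100000 };   /* toy values */
        pthread_t tid[NODE_CNT];
        kvs_send_arg_t args[NODE_CNT];
        int i;

        for (i = 0; i < NODE_CNT; i++) {
            args[i].node_inx = i;
            args[i].usec_per_rpc = USEC_PER_RPC;
            pthread_create(&tid[i], NULL, kvs_send_thread, &args[i]);
        }
        for (i = 0; i < NODE_CNT; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

The real plugin would of course derive the node index and spacing from the step layout rather than toy constants.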
_delay_rpc() actually caused unnecessary delays at startup of large parallel applications and had to be disabled for rank 0. There is simply not enough information yet about what is causing the problem this time.

David
Information given. Please reopen if necessary.

David
On second thought, let's keep this open, as we need to investigate the scalability of the protocols involved.

David
Need to investigate pmi2 scalability. We are currently working on a new algorithm to bootstrap an MPI job faster than pmi2.

David
I am fairly sure this can be closed.