Two users reported that when they submit an MPI job, they get this error:

    srun: error: mpi/pmi2: failed to send temp kvs to compute nodes

This does not happen with most of the software we use; so far it is confirmed with the software named Dalton2015. That software is compiled with openmpi 1.8.4 and gcc 4.4.7; slurm and openmpi are compiled with PMI support. In slurm.conf I have set MpiDefault=pmi2 and MpiParams=ports=12000-12999. Is that port range too small? We have ~150 nodes and ~2200 cores.
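For reference, the relevant slurm.conf lines as described above (this just restates the reported configuration; the comments are annotations, not from the actual config file):

    # Default MPI plugin used when srun is not given --mpi explicitly
    MpiDefault=pmi2
    # Port range reserved for MPI use on the compute nodes
    MpiParams=ports=12000-12999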
What seems to happen is that srun fails to send the pmi2 data to all components of the parallel job upon the PMI2_KVS_Fence() call inside the pmi2 client code. Does this happen always, or is it just a transient error? Do you happen to have the slurmd.log from the hosts where this failed? It would also be interesting to see the log/output of the Dalton2015 application when this happens. A good example of how pmi2 works can be found here: slurm/contribs/pmi2/testpmi2.c In the case of this error, not all ranks will get past the PMI2_KVS_Fence() and the job will abort.

David
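For anyone not familiar with the pmi2 flow, here is a stripped-down sketch of the put/fence/get pattern that testpmi2.c exercises (illustrative only, not the actual test program; the key/value names are made up). The PMI2_KVS_Fence() call below is the one that never completes on all ranks when srun cannot deliver the aggregated KVS data:

    #include <stdio.h>
    #include "pmi2.h"   /* PMI2 client header shipped with Slurm */

    int main(void)
    {
        int spawned, size, rank, appnum, peer, vallen;
        char jobid[64], key[64], val[128], buf[128];

        /* Register this task with the local slurmstepd's PMI2 agent */
        PMI2_Init(&spawned, &size, &rank, &appnum);
        PMI2_Job_GetId(jobid, sizeof(jobid));

        /* Every rank publishes a key/value pair... */
        snprintf(key, sizeof(key), "key-%d", rank);
        snprintf(val, sizeof(val), "value-from-rank-%d", rank);
        PMI2_KVS_Put(key, val);

        /* ...then all ranks synchronize.  srun collects the KVS data and
         * pushes the aggregate back out to every slurmstepd here; this is
         * where "failed to send temp kvs to compute nodes" shows up. */
        PMI2_KVS_Fence();

        /* After the fence each rank can read any other rank's value */
        peer = (rank + 1) % size;
        snprintf(key, sizeof(key), "key-%d", peer);
        PMI2_KVS_Get(jobid, peer, key, buf, sizeof(buf), &vallen);
        printf("rank %d of %d read: %s\n", rank, size, buf);

        PMI2_Finalize();
        return 0;
    }

Built against libpmi2 and launched with something like srun --mpi=pmi2 -n <ntasks> ./a.out.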
Dave,

We are seeing the same error with a "hello world" application on a Cray CS:

    srun -N 1500 -n 50000 hello

The application works below 50K PEs and fails above 50K PEs.
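For completeness, the reproducer is nothing exotic; a hello-world along these lines (the exact source may differ, but it does nothing beyond MPI_Init/print/MPI_Finalize) is enough to hit the error at that scale:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        /* The PMI2 wire-up (including the KVS fence) happens inside MPI_Init */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        printf("hello from rank %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }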
Hi David,

WRT your question "Does this happen always or is it just a transient error?": on our CS system (what Brian references in comment #2), the problem is consistent and reproducible.

Best regards,
We were seeing this error in the slurmd.log files:

    [2015-03-26T19:24:56.942] error: forward_thread: slurm_msg_sendto: Socket timed out on send/recv operation

So we increased MessageTimeout from 10 to 20 and we were able to run on all of the nodes.
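For anyone hitting the same thing, the change amounts to this one line in slurm.conf (plus pushing the new configuration out to the cluster); 10 seconds is the default:

    # Time permitted for a round-trip communication to complete, in seconds
    MessageTimeout=20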
If increasing the MessageTimeout helped, this may indicate some congestion and retransmission which eventually cleared up. With 50000 ranks there is quite a bit of pmi2 data to be sent in one shot from srun to the slurmstepds. Yann, do you have more information about this problem?

David
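To put rough, purely illustrative numbers on "quite a bit of data": if each rank publishes on the order of 1 KB of KVS data (the real per-rank payload depends on the MPI library), a 50,000-rank fence means roughly 50,000 x 1 KB = ~50 MB of aggregated key/value data has to be fanned out to the slurmstepds in one shot, which is a lot to push through within a single MessageTimeout window.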
(In reply to David Bigagli from comment #5)
> If increasing the MessageTimeout helped, this may indicate some congestion
> and retransmission which eventually cleared up. With 50000 ranks there is
> quite a bit of pmi2 data to be sent in one shot from srun to the
> slurmstepds.

David,

Does the Slurm logic try to space out these communications over time for large jobs? If not, that might improve performance over the TCP retry logic. I know there is logic of that sort in a couple of places, where sleep() calls are added based upon the task ID/rank.
It retries with a sleep interval only if the send/receive API fails. We don't know enough about this issue yet.

David
(In reply to David Bigagli from comment #7)
> It retries with a sleep interval only if the send/receive API fails.
> We don't know enough about this issue yet.

I'm not sure what is happening either, but my concern is that the TCP retransmit logic uses exponential backoff and all of the messages (including retransmits) might be happening at the same time. So there is a big data storm at time 0, then again at time 1 second, then again at time 3 seconds, then again at 7 seconds, etc. Each time there is a data storm, some messages do get through, but then the network is idle until the next round of retransmissions (which again all fire at the same time).

There is logic in src/slurmd/slurmd/req.c, _delay_rpc() (see below), that spaces communications out over time and demonstrates much better scalability for some message traffic. This type of logic _might_ also be needed in the pmi2 plugin for highly parallel applications.

/* Delay a message based upon the host index, total host count and RPC_TIME.
 * This logic depends upon synchronized clocks across the cluster. */
static void _delay_rpc(int host_inx, int host_cnt, int usec_per_rpc)
{
        struct timeval tv1;
        uint32_t cur_time;      /* current time in usec (just 9 digits) */
        uint32_t tot_time;      /* total time expected for all RPCs */
        uint32_t offset_time;   /* relative time within tot_time */
        uint32_t target_time;   /* desired time to issue the RPC */
        uint32_t delta_time;

again:  if (gettimeofday(&tv1, NULL)) {
                usleep(host_inx * usec_per_rpc);
                return;
        }

        cur_time = ((tv1.tv_sec % 1000) * 1000000) + tv1.tv_usec;
        tot_time = host_cnt * usec_per_rpc;
        offset_time = cur_time % tot_time;
        target_time = host_inx * usec_per_rpc;
        if (target_time < offset_time)
                delta_time = target_time - offset_time + tot_time;
        else
                delta_time = target_time - offset_time;
        if (usleep(delta_time)) {
                if (errno == EINVAL)    /* usleep for more than 1 sec */
                        usleep(900000);
                /* errno == EINTR */
                goto again;
        }
}
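As a purely hypothetical sketch of what that could look like on the pmi2 side (the function and structure names below are made up, not the real plugin code), each sender thread would delay in proportion to its target's index, the same spreading idea _delay_rpc() uses:

    #include <stdio.h>
    #include <unistd.h>
    #include <pthread.h>

    /* Illustrative names only -- not the actual pmi2 plugin code. */
    typedef struct {
        int node_inx;       /* index of the target node within the step */
        int usec_per_rpc;   /* desired spacing between consecutive sends */
    } kvs_send_arg_t;

    /* Stand-in for the real per-node KVS transmit routine */
    static void send_kvs_to_node(int node_inx)
    {
        printf("sending temp kvs data to node %d\n", node_inx);
    }

    /* Each sender thread sleeps in proportion to its target's index, so the
     * sends are spread over node_cnt * usec_per_rpc instead of all starting
     * at time zero and colliding. */
    static void *kvs_send_thread(void *arg)
    {
        kvs_send_arg_t *a = arg;

        usleep((useconds_t)(a->node_inx * a->usec_per_rpc));
        send_kvs_to_node(a->node_inx);
        return NULL;
    }

    int main(void)
    {
        enum { NODE_CNT = 8, USEC_PER_RPC = 100000 };   /* toy values */
        pthread_t tid[NODE_CNT];
        kvs_send_arg_t args[NODE_CNT];
        int i;

        for (i = 0; i < NODE_CNT; i++) {
            args[i].node_inx = i;
            args[i].usec_per_rpc = USEC_PER_RPC;
            pthread_create(&tid[i], NULL, kvs_send_thread, &args[i]);
        }
        for (i = 0; i < NODE_CNT; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

The real plugin would of course derive the node index and spacing from the step layout rather than toy constants.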
_delay_rpc() actually caused unnecessary delays at startup of large parallel applications and had to be disabled for rank 0. There is simply not enough information yet about what is causing the problem this time.

David
Information given. Please reopen if necessary.

David
On second thought, let's keep this open, as we need to investigate the scalability of the protocols involved.

David
Need to investigate pmi2 scalability. We are currently working on a new algorithm to bootstrap an MPI job faster than pmi2.

David
I am fairly sure this can be closed.