Ticket 4242

Summary: PMI2 process mapping exceeds PMI2_MAX_VALLEN on PPC
Product: Slurm Reporter: Artem Polyakov <artpol84>
Component: slurmstepd Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN --- QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: artpol84
Version: 17.11.x   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Artem Polyakov 2017-10-10 05:44:01 MDT
We observe this for "cyclic" distribution on our PPC machine, which has 128 hardware threads.
When we launch one proc per hw thread, the process mapping with cyclic distribution looks like this:
mapping="(vector,(0,1,1),(0,1,1),...,(0,1,1))"
"(0,1,1)" repeats 128 times.

The strlen(mapping) = 1032:
 * strlen("(vector)") = 8 
 * 128 * strlen(",(0,1,1)") = 128 * 8 = 1024

PMI2_MAX_VALLEN = 1024 < 1032

So when a client requests the "PMI_process_mapping" key (https://github.com/SchedMD/slurm/blob/master/src/plugins/mpi/pmi2/setup.c#L362)
it fails the following check:
https://github.com/SchedMD/slurm/blob/master/contribs/pmi2/pmi2_api.c#L1278
Comment 1 Moe Jette 2017-10-10 09:43:24 MDT
Artem,
I am not sure exactly what the environment variables for PMI look like.
Would you like to suggest a specific limit for the largest size systems?
32k or perhaps something a bit smaller?
Comment 2 Artem Polyakov 2017-10-10 14:34:08 MDT
Moe,

I talked to Danny today. Posting my thoughts on this.

* The problem I'm reporting here was observed on one PPC node with 128 hw threads. 

* For this configuration (and for any number of fully-packed 128-proc PPC nodes), cyclic proc distribution will result in a process mapping string containing 128 tuples, which overflows the 1K limit on PMI2 value length (PMI2_MAX_VALLEN) as I described. 

* Increasing PMI2_MAX_VALLEN (x2, for example) will solve this particular problem, but has the following side effects:
  * It may theoretically break some implementations relying on the 1K value length. This is just a concern; I don't have any particular examples. But PMI2_MAX_VALLEN is a macro, so anything built against the old value has it hardcoded, and we may have a backward compatibility issue.

  * Newer systems with more procs per node may overflow the increased limit in the future.

  * (needs to be verified) I believe that PMI2_MAX_VALLEN is used in some places for buffer packing. Increasing it may therefore increase message size and memory footprint, and thus cause performance issues.

* As a closing thought on the topic:
Block distribution (as opposed to cyclic) packs very compactly in this notation. E.g., for my case (1 node, 128 procs per node):
   * cyclic = (vector,(0,1,1),(0,1,1),... <126 more times>)
   * block = (vector,(0,1,128))
If we consider the transposed version of the process mapping, we can get much better packing by introducing a new packing type (for example, "vector_t"):
   * cyclic = (vector_t,(0,128,1))
Note that only the cyclic distribution would benefit from this; also, every MPI implementation that relies on "PMI_process_mapping" would need to support the new type, which raises backward compatibility issues.
Comment 4 Alejandro Sanchez 2019-07-15 03:58:11 MDT
Hi Artem,

Moe is currently not available and I will be following up on this.

I'm not very familiar with how PMI variables look either. Reading through the code (correct me if I'm wrong), it seems like _setup_stepd_kvs() gets the info from PMI2_PREPUT_CNT_ENV to build the clique information associated with the distribution of the tasks.

The tuples are currently composed of these values:

(vector, (x0, y0, z0), ... (xn, yn, zn))

it seems like y represents the number of allocated nodes, while z is the number of cpus per node.

I'm not sure what the x and vector components represent. I can see, though, how the current notation of repeating a tuple per task can exceed PMI2_MAX_VALLEN.

Increasing PMI2_MAX_VALLEN to e.g. 8192 or 16384 looks like the more straightforward solution, despite the potential concerns you mentioned. But since you have no examples of implementations relying on the current 1k, I think I'd just go with it.

I'm also open to exploring the other solution you mentioned, i.e. transposing the notation of the clique info. I'm not sure whether you could come up with a patch to address this, since I don't have much background on this notation, the vector_t type you propose to incorporate into the tuple, or the possible consequences of such a change.
Comment 5 Alejandro Sanchez 2019-12-20 10:15:39 MST
Hey Artem,

do you have any updates on this? 

thanks.
Comment 6 Artem Polyakov 2019-12-20 12:05:10 MST
Hello,

Unfortunately I don't. We are focusing on PMIx, which doesn't have this problem.
IIRC OMPI has already abandoned PMI2 support, so this may not be very important for us now.
But the issue remains.
Comment 9 Alejandro Sanchez 2020-01-06 08:36:33 MST
Hi Artem,

Nobody outside this bug has ever complained about this PMI2-specific issue. From your last comment, it doesn't seem to be an important matter on your side either, since you are focusing on PMIx and, as you recall, OMPI abandoned PMI2 support. As you mentioned, increasing the macro might have side effects, break some implementations, or run into backward compatibility issues. 

For all these reasons, I'm inclined to leave this as-is for now. If anyone complains about this problem in the future, we can resume the discussion.