Ticket 1282 - PMI2 scalability issues
Summary: PMI2 scalability issues
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 15.08.x
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-12-01 01:48 MST by Artem Polyakov
Modified: 2021-10-18 00:53 MDT

See Also:
Site: -Other-
Version Fixed: 15.08.0pre1


Attachments
PMI2 scalability issues (24.04 KB, patch)
2014-12-01 01:48 MST, Artem Polyakov
PMI2 verification program (2.48 KB, text/x-csrc)
2014-12-01 01:50 MST, Artem Polyakov
Verification batch script (870 bytes, application/octet-stream)
2014-12-01 01:50 MST, Artem Polyakov
Patch to reduce pack limit size for testing purposes (474 bytes, patch)
2014-12-02 11:50 MST, Artem Polyakov
Batch script to demonstrate original PMI2 problem (477 bytes, application/octet-stream)
2014-12-02 11:51 MST, Artem Polyakov

Description Artem Polyakov 2014-12-01 01:48:24 MST
Created attachment 1469 [details]
PMI2 scalability issues

As discussed in the slurm-dev thread (https://groups.google.com/forum/#!searchin/slurm-devel/polyakov/slurm-devel/ZK7puI8Syaw/0xGtdX0CyPcJ), I implemented PMI2 database splitting. This patch fixes three scalability issues:
1. The one discovered by Andy Riebs and discussed in the mentioned thread.
2. The one noted by Hongjia Cao in setup.c, in _setup_srun_tree_info and _setup_stepd_job_info, related to reducing the size of tree_info.children_kvs_seq. This was implemented using reverse_tree_direct_children, which I introduced previously for the PMIx plugin. This function calculates the exact node IDs of this node's children, so the size of children_kvs_seq is now no more than tree_width (see the sketch after this list).
3. The one I discovered myself under stress testing with heavy resource contention. The problem was a corner case under the following conditions:
a. the broadcast of the PMI2 DB from srun fails for some of the nodes, while others receive the DB successfully.
b. the nodes that received the DB successfully continue, enter the next fence, and start waiting for the next DB broadcast (with wait_kvs_resp == 1).
c. srun tries to broadcast the previous DB again; the nodes that are one step ahead receive it and call slurm_kill_job_step(job_info.jobid, job_info.stepid, SIGKILL).
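
To illustrate item 2 above: a minimal sketch of computing a node's direct children, assuming a simple breadth-first k-ary numbering (children of node i are i*k+1 .. i*k+k). This is an illustration only; the actual reverse_tree_direct_children() follows Slurm's reverse-tree layout, which may number nodes differently. The key point is that a node has at most tree_width direct children, so children_kvs_seq needs only tree_width entries.

#include <stdio.h>

/* Illustrative only: direct children of node_id in a complete k-ary tree
 * with breadth-first numbering.  Returns the number of children written. */
static int direct_children(int node_id, int nnodes, int tree_width, int *children)
{
	int n = 0;
	for (int i = 1; i <= tree_width; i++) {
		int child = node_id * tree_width + i;
		if (child >= nnodes)
			break;
		children[n++] = child;
	}
	return n;
}

int main(void)
{
	int children[4];
	int n = direct_children(1, 20, 4, children);
	for (int i = 0; i < n; i++)
		printf("child: %d\n", children[i]);	/* prints 5 6 7 8 */
	return 0;
}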

Since the changes are fairly large, I performed testing using the attached program and batch script. However, I have no access to a SLURM-driven cluster that is not in production, so I was only able to test on my laptop. Additional testing on a real system is needed, with the following notes:
- the attached program can overflow any DB limit, but to reach the limit quickly the max_msg_size variable in kvs.c may be decreased to 1 MB during testing (see the sketch below);
- in case of problems, it would be helpful for me to have access to that system for analysis.
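
For reference, a hedged sketch of what such a verification program can look like (the actual attached program may differ): every rank puts a batch of values into the PMI2 KVS and fences, so the database that srun has to broadcast grows with the job size. The calls below are the standard PMI2 API exposed by Slurm's libpmi2.

#include <stdio.h>
#include <string.h>
#include <slurm/pmi2.h>

int main(void)
{
	int spawned, size, rank, appnum;
	char key[64], val[256];

	if (PMI2_Init(&spawned, &size, &rank, &appnum) != PMI2_SUCCESS) {
		fprintf(stderr, "PMI2_Init failed\n");
		return 1;
	}

	/* Fill a dummy value and publish many keys per rank so the
	 * aggregated KVS exceeds the (possibly reduced) pack limit. */
	memset(val, 'x', sizeof(val) - 1);
	val[sizeof(val) - 1] = '\0';

	for (int i = 0; i < 64; i++) {
		snprintf(key, sizeof(key), "stress-%d-%d", rank, i);
		if (PMI2_KVS_Put(key, val) != PMI2_SUCCESS)
			fprintf(stderr, "rank %d: PMI2_KVS_Put failed\n", rank);
	}
	if (PMI2_KVS_Fence() != PMI2_SUCCESS)
		fprintf(stderr, "rank %d: PMI2_KVS_Fence failed\n", rank);

	PMI2_Finalize();
	return 0;
}

Built against Slurm's PMI2 library (for example, cc pmi2_stress.c -lpmi2) and launched with srun --mpi=pmi2, this is the kind of workload the attached batch script is assumed to drive.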
Comment 1 Artem Polyakov 2014-12-01 01:50:15 MST
Created attachment 1470 [details]
PMI2 verification program
Comment 2 Artem Polyakov 2014-12-01 01:50:57 MST
Created attachment 1471 [details]
Verification batch script
Comment 3 David Bigagli 2014-12-02 09:33:25 MST
Hi Artem,
        I studied the original problem and I have a question for you.
If I understand things correctly, the amount of data that srun is pushing
to the slurmstepd is greater than MAX_PACK_MEM_LEN, so packmem()
fails when packing the data. However, packmem() does not return
anything, so even if I force the failure I don't see the same error.
How were you able to reproduce their scenario?

Thanks,

David
Comment 4 Artem Polyakov 2014-12-02 11:50:59 MST
Created attachment 1478 [details]
Patch to reduce pack limit size for testing purposes
Comment 5 Artem Polyakov 2014-12-02 11:51:46 MST
Created attachment 1479 [details]
Batch script to demonstrate original PMI2 problem
Comment 6 Artem Polyakov 2014-12-02 11:54:05 MST
(In reply to David Bigagli from comment #3)
> [...]
> How were you able to reproduce their scenario?

Hello, David.

To reproduce the problem:
1. Apply the attached patch that reduces the pack limit to 1 MB (illustrated just below); this is convenient for small systems.
2. Use the attached demonstration batch script to launch the job.
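
For illustration, the effect of that testing patch is presumably a one-line change along these lines (hypothetical; the actual attached patch may change a different constant or location):

/* Hypothetical illustration only: lower the pack limit so the overflow
 * can be triggered on a small test system. */
#define MAX_PACK_MEM_LEN (1024 * 1024)	/* 1 MB instead of the default */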

With this and the current SLURM master I get the following output:

srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
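
For context, the messages above come from packmem() in src/common/pack.c. A rough paraphrase of the check (from memory, not an exact copy of the Slurm source) shows why the caller cannot detect the failure:

/* Paraphrase of the size check in packmem(); Buf, error() and
 * MAX_PACK_MEM_LEN are Slurm internals. */
void packmem(char *valp, uint32_t size_val, Buf buffer)
{
	if (size_val > MAX_PACK_MEM_LEN) {
		error("packmem: Buffer to be packed is too large (%u > %u)",
		      size_val, MAX_PACK_MEM_LEN);
		return;	/* void return: the caller never learns of the failure */
	}
	/* ...otherwise append the 32-bit length and the data to the buffer... */
}

The truncated message is then sent anyway; the receiver detects the incomplete packet, which is the flow described later in comment 8.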
Comment 7 Artem Polyakov 2014-12-02 11:55:32 MST
(In reply to Artem Polyakov from comment #6)
> To reproduce the problem:
> 1. Apply the attached patch that reduces the pack limit to 1 MB.
> 2. Use the attached demonstration batch script to launch the job.
P.S. The batch script also refers to my demo program attached earlier.

Comment 8 David Bigagli 2014-12-03 09:45:54 MST
Hi Artem,
  I now understand the entire flow better. The encode fails, but the message is
still sent to the slurmstepd, which discovers that the message is only partial,
returns the error code ESLURM_PROTOCOL_INCOMPLETE_PACKET, and srun kills the job.

I have installed and tested your patch. It works fine. However, it is a complex
solution when a simpler one exists: increase the buffer size.
After all, 36 MB is not much, and if you look at MAX_BUF_SIZE in pack.h
it is nearly 4 GB (0xffff0000). Writing big buffers into TCP/IP is not a problem anyway,
as the protocol fragments the messages and reassembles them at the receiver,
so why would we want to emulate the same behavior in user space?

The second option is to use a different message-length limit for the
REQUEST_FORWARD_DATA protocol message, since a large amount of data is
legitimate in this case, as opposed to the usual short messages between
daemons or the job-start messages.

I think for now we can use the larger buffer and later implement the new
limit for REQUEST_FORWARD_DATA.
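
A purely hypothetical sketch of that second option (placeholder names and values, not Slurm code): key the pack limit off the message type so only REQUEST_FORWARD_DATA, whose payload is the aggregated PMI2 KVS, gets the relaxed limit:

#include <stdint.h>

/* Hypothetical sketch; the constants and the message-type value are
 * placeholders, not the real Slurm definitions. */
#define DEFAULT_PACK_LIMIT	(16u * 1024 * 1024)	/* placeholder value */
#define FORWARD_DATA_LIMIT	(1024u * 1024 * 1024)	/* placeholder value */
#define REQUEST_FORWARD_DATA	0x1000			/* placeholder id    */

static uint32_t pack_limit_for_msg(uint16_t msg_type)
{
	return (msg_type == REQUEST_FORWARD_DATA) ?
		FORWARD_DATA_LIMIT : DEFAULT_PACK_LIMIT;
}

This only makes the trade-off concrete: ordinary RPCs keep the tight limit while the known-large forward-data payload does not.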

Thank you for your effort!

David
Comment 9 Artem Polyakov 2014-12-03 11:11:20 MST
(In reply to David Bigagli from comment #8)
> I think for now we can use the larger buffer and later implement the new
> limit for REQUEST_FORWARD_DATA.
>
> Thank you for your effort!

Thank you, David.

I suggest we have an additional email discussion of this solution with Andy and Moe. I estimated its complexity correctly and anticipated the rejection; however, I had a green light from Moe and support from Andy, and thus decided to do it.
I will prepare my arguments and start the discussion later today. I need to check a few things first. OK?
Comment 10 David Bigagli 2014-12-03 11:16:06 MST
Sure. We actually discussed it internally quite a bit. 
There are essentially two arguments: first, if a simple solution exists we should pursue it; second, it is risky to disrupt existing functionality. It is my understanding that PMIx is going to be superior to PMI2 in many ways, so we think it is better to focus effort in that direction.

But this is a free world! :-) So please speak out.

David
Comment 11 David Bigagli 2014-12-08 05:16:21 MST
Buffer enlarged.

David