Ticket 1282 - PMI2 scalability issues
Summary: PMI2 scalability issues
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 15.08.x
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-12-01 01:48 MST by Artem Polyakov
Modified: 2021-10-18 00:53 MDT

See Also:
Site: -Other-
Version Fixed: 15.08.0pre1


Attachments
PMI2 scalability issues (24.04 KB, patch)
2014-12-01 01:48 MST, Artem Polyakov
PMI2 verification program (2.48 KB, text/x-csrc)
2014-12-01 01:50 MST, Artem Polyakov
Verification batch script (870 bytes, application/octet-stream)
2014-12-01 01:50 MST, Artem Polyakov
Patch to reduce pack limit size for testing purposes (474 bytes, patch)
2014-12-02 11:50 MST, Artem Polyakov
Batch script to demonstrate original PMI2 problem (477 bytes, application/octet-stream)
2014-12-02 11:51 MST, Artem Polyakov

Description Artem Polyakov 2014-12-01 01:48:24 MST
Created attachment 1469 [details]
PMI2 scalability issues

As discussed in the slurm-dev thread (https://groups.google.com/forum/#!searchin/slurm-devel/polyakov/slurm-devel/ZK7puI8Syaw/0xGtdX0CyPcJ), I implemented PMI2 database splitting. This patch fixes three scalability issues:
1. The one discovered by Andy Riebs and discussed in the mentioned thread.
2. The one noted by Hongjia Cao in setup.c, in _setup_srun_tree_info and _setup_stepd_job_info, related to reducing the size of tree_info.children_kvs_seq. This was implemented using reverse_tree_direct_children, which I introduced previously for the PMIx plugin. This function calculates the exact node IDs of this node's children, so the size of children_kvs_seq is now no more than tree_width (see the sketch after this list).
3. The one I discovered myself under stress testing with heavy resource contention. The problem was a corner case under the following conditions:
a. the broadcast of the PMI2 DB from srun fails for some of the nodes, while others receive the DB successfully.
b. the nodes that received the DB successfully continue, enter the next fence, and start waiting for the next DB broadcast (with wait_kvs_resp == 1).
c. srun tries to broadcast the previous DB again; the nodes that are one step ahead receive it and call slurm_kill_job_step(job_info.jobid, job_info.stepid, SIGKILL).
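
To illustrate item 2 above: a minimal sketch of computing a node's direct children, assuming a simple breadth-first k-ary numbering (children of node i are i*k+1 .. i*k+k). This is an illustration only; the actual reverse_tree_direct_children() follows Slurm's reverse-tree layout, which may number nodes differently. The key point is that a node has at most tree_width direct children, so children_kvs_seq needs only tree_width entries.

#include <stdio.h>

/* Illustrative only: direct children of node_id in a complete k-ary tree
 * with breadth-first numbering.  Returns the number of children written. */
static int direct_children(int node_id, int nnodes, int tree_width, int *children)
{
	int n = 0;
	for (int i = 1; i <= tree_width; i++) {
		int child = node_id * tree_width + i;
		if (child >= nnodes)
			break;
		children[n++] = child;
	}
	return n;
}

int main(void)
{
	int children[4];
	int n = direct_children(1, 20, 4, children);
	for (int i = 0; i < n; i++)
		printf("child: %d\n", children[i]);	/* prints 5 6 7 8 */
	return 0;
}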

Since the changes are fairly large, I performed testing using the attached program and batch script. However, I have no access to a SLURM-driven cluster that is not in production, so I was only able to test on my laptop. Additional testing on a real system is needed, with the following notes:
- the attached program can overflow any DB limit, but to reach the limit quickly the max_msg_size variable in kvs.c may be decreased to 1 MB during testing (see the sketch below);
- in case of problems, it would be helpful for me to have access to that system for analysis.
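
For reference, a hedged sketch of what such a verification program can look like (the actual attached program may differ): every rank puts a batch of values into the PMI2 KVS and fences, so the database that srun has to broadcast grows with the job size. The calls below are the standard PMI2 API exposed by Slurm's libpmi2.

#include <stdio.h>
#include <string.h>
#include <slurm/pmi2.h>

int main(void)
{
	int spawned, size, rank, appnum;
	char key[64], val[256];

	if (PMI2_Init(&spawned, &size, &rank, &appnum) != PMI2_SUCCESS) {
		fprintf(stderr, "PMI2_Init failed\n");
		return 1;
	}

	/* Fill a dummy value and publish many keys per rank so the
	 * aggregated KVS exceeds the (possibly reduced) pack limit. */
	memset(val, 'x', sizeof(val) - 1);
	val[sizeof(val) - 1] = '\0';

	for (int i = 0; i < 64; i++) {
		snprintf(key, sizeof(key), "stress-%d-%d", rank, i);
		if (PMI2_KVS_Put(key, val) != PMI2_SUCCESS)
			fprintf(stderr, "rank %d: PMI2_KVS_Put failed\n", rank);
	}
	if (PMI2_KVS_Fence() != PMI2_SUCCESS)
		fprintf(stderr, "rank %d: PMI2_KVS_Fence failed\n", rank);

	PMI2_Finalize();
	return 0;
}

Built against Slurm's PMI2 library (for example, cc pmi2_stress.c -lpmi2) and launched with srun --mpi=pmi2, this is the kind of workload the attached batch script is assumed to drive.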
Comment 1 Artem Polyakov 2014-12-01 01:50:15 MST
Created attachment 1470 [details]
PMI2 verification program
Comment 2 Artem Polyakov 2014-12-01 01:50:57 MST
Created attachment 1471 [details]
Verification batch script
Comment 3 David Bigagli 2014-12-02 09:33:25 MST
Hi Artem,
        I studied the original problem and I have a question for you.
If I understand things correctly, the amount of data that srun is pushing
to the slurmstepd is greater than MAX_PACK_MEM_LEN, so packmem()
fails when packing the data. However, packmem() does not return
anything, so even if I force the failure I don't see the same error.
How were you able to reproduce their scenario?

Thanks,

David
Comment 4 Artem Polyakov 2014-12-02 11:50:59 MST
Created attachment 1478 [details]
Patch to reduce pack limit size for testing purposes
Comment 5 Artem Polyakov 2014-12-02 11:51:46 MST
Created attachment 1479 [details]
Batch script to demonstrate original PMI2 problem
Comment 6 Artem Polyakov 2014-12-02 11:54:05 MST
(In reply to David Bigagli from comment #3)
> [...]
> How were you able to reproduce their scenario?

Hello, David.

To reproduce the problem:
1. Apply the attached patch that reduces the pack limit to 1 MB (illustrated just below); this is convenient for small systems.
2. Use the attached demonstration batch script to launch the job.
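
For illustration, the effect of that testing patch is presumably a one-line change along these lines (hypothetical; the actual attached patch may change a different constant or location):

/* Hypothetical illustration only: lower the pack limit so the overflow
 * can be triggered on a small test system. */
#define MAX_PACK_MEM_LEN (1024 * 1024)	/* 1 MB instead of the default */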

With this and the current SLURM master I get the following output:

srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
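
For context, the messages above come from packmem() in src/common/pack.c. A rough paraphrase of the check (from memory, not an exact copy of the Slurm source) shows why the caller cannot detect the failure:

/* Paraphrase of the size check in packmem(); Buf, error() and
 * MAX_PACK_MEM_LEN are Slurm internals. */
void packmem(char *valp, uint32_t size_val, Buf buffer)
{
	if (size_val > MAX_PACK_MEM_LEN) {
		error("packmem: Buffer to be packed is too large (%u > %u)",
		      size_val, MAX_PACK_MEM_LEN);
		return;	/* void return: the caller never learns of the failure */
	}
	/* ...otherwise append the 32-bit length and the data to the buffer... */
}

The truncated message is then sent anyway; the receiver detects the incomplete packet, which is the flow described later in comment 8.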
Comment 7 Artem Polyakov 2014-12-02 11:55:32 MST
(In reply to Artem Polyakov from comment #6)
> To reproduce the problem:
> 1. Apply the attached patch that reduces the pack limit to 1 MB.
> 2. Use the attached demonstration batch script to launch the job.
P.S. The batch script also refers to my demo program attached earlier.

Comment 8 David Bigagli 2014-12-03 09:45:54 MST
Hi Artem,
  I now understand the entire flow better. The encode fails, but the message is
still sent to the slurmstepd, which discovers that the message is only partial,
returns the error code ESLURM_PROTOCOL_INCOMPLETE_PACKET, and srun kills the job.

I have installed and tested your patch. It works fine. However, it is a complex
solution when a simpler one exists: increase the buffer size.
After all, 36 MB is not much, and if you look at MAX_BUF_SIZE in pack.h
it is nearly 4 GB (0xffff0000). Writing big buffers into TCP/IP is not a problem anyway,
as the protocol fragments the messages and reassembles them at the receiver,
so why would we want to emulate the same behavior in user space?

The second option is to use a different message-length limit for the
REQUEST_FORWARD_DATA protocol message, since a large amount of data is
legitimate in this case, as opposed to the usual short messages between
daemons or the job-start messages.

I think for now we can use the larger buffer and later implement the new
limit for REQUEST_FORWARD_DATA.
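
A purely hypothetical sketch of that second option (placeholder names and values, not Slurm code): key the pack limit off the message type so only REQUEST_FORWARD_DATA, whose payload is the aggregated PMI2 KVS, gets the relaxed limit:

#include <stdint.h>

/* Hypothetical sketch; the constants and the message-type value are
 * placeholders, not the real Slurm definitions. */
#define DEFAULT_PACK_LIMIT	(16u * 1024 * 1024)	/* placeholder value */
#define FORWARD_DATA_LIMIT	(1024u * 1024 * 1024)	/* placeholder value */
#define REQUEST_FORWARD_DATA	0x1000			/* placeholder id    */

static uint32_t pack_limit_for_msg(uint16_t msg_type)
{
	return (msg_type == REQUEST_FORWARD_DATA) ?
		FORWARD_DATA_LIMIT : DEFAULT_PACK_LIMIT;
}

This only makes the trade-off concrete: ordinary RPCs keep the tight limit while the known-large forward-data payload does not.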

Thank you for your effort!

David
Comment 9 Artem Polyakov 2014-12-03 11:11:20 MST
(In reply to David Bigagli from comment #8)
> I think for now we can use the larger buffer and later implement the new
> limit for REQUEST_FORWARD_DATA.
>
> Thank you for your effort!

Thank you, David.

I suggest we have an additional email discussion of this solution with Andy and Moe. I estimated its complexity correctly and anticipated the rejection; however, I had a green light from Moe and support from Andy, and thus decided to do it.
I will prepare my arguments and start the discussion later today. I need to check a few things first. OK?
Comment 10 David Bigagli 2014-12-03 11:16:06 MST
Sure. We actually discussed it internally quite a bit. 
There are essentially two arguments: first, if a simple solution exists we should pursue it; second, it is risky to disrupt existing functionality. It is my understanding that PMIx is going to be superior to PMI2 in many ways, so we think it is better to focus effort in that direction.

But this is a free world! :-) So please speak out.

David
Comment 11 David Bigagli 2014-12-08 05:16:21 MST
Buffer enlarged.

David