Description
Artem Polyakov
2014-12-01 01:48:24 MST
Created attachment 1470 [details]
PMI2 verification program
Created attachment 1471 [details]
Verification batch script
Hi Artem,
I have studied the original problem and I have a question for you. If I understand things correctly, the amount of data that srun is pushing to the slurmstepd is greater than MAX_PACK_MEM_LEN, so packmem() fails when packing the data. However, packmem() does not return anything, so even if I force the failure I don't see the same error. How were you able to reproduce their scenario?

Thanks,
David

Created attachment 1478 [details]
Patch to reduce pack limit size for testing purposes
Created attachment 1479 [details]
Batch script to demonstrate original PMI2 problem
(In reply to David Bigagli from comment #3)
> How were you able to reproduce their scenario?

Hello, David.

To reproduce the problem:
1. Apply the attached patch, which reduces the pack limit to 1MB. This is convenient for small systems.
2. Use the attached demonstration batch script to launch the job.

With this and current SLURM master I have the following output:

srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: packmem: Buffer to be packed is too large (2045786 > 1048576)
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes

(In reply to Artem Polyakov from comment #6)
> 2. Use the demonstration batch script attached to launch the job.

P.S. In that batch script I also refer to my demo program attached earlier.

Hi Artem,
I now understand the entire flow better. The encode fails, but the message is still sent out to the slurmstepd, which discovers that the message is only partial and returns the error code ESLURM_PROTOCOL_INCOMPLETE_PACKET, and srun kills the job.

I have installed and tested your patch. It works fine. However, it is a complex solution when there is a simpler one: increase the buffer size. After all, 36MB is not much, and if you look at MAX_BUF_SIZE in pack.h it is 4GB (0xffff0000). Writing big buffers into TCP/IP is not a problem anyway, as the protocol fragments the messages and recomposes them at the receiver, so why would we want to emulate the same behavior in user space?

The second option is to use a different message length limit for the REQUEST_FORWARD_DATA protocol message, since the large amount of data is legitimate in this case, as opposed to the usual short messages between daemons or the job start message.

I think for now we can use the larger buffer and later implement the new limit for REQUEST_FORWARD_DATA.

Thank you for your effort!

David

(In reply to David Bigagli from comment #8)
> I think for now we can use the larger buffer and later implement the new
> limit for REQUEST_FORWARD_DATA.

Thank you, David.
I suggest having an additional email discussion of this solution with Andy and Moe. I had correctly estimated its complexity and anticipated the rejection. However, I had a green light from Moe and support from Andy, and thus decided to do it. I will prepare my arguments and start the discussion later today. I need to check a few things. OK?

Sure. We actually discussed it internally quite a bit. There are essentially two arguments: first, if a simple solution exists we should pursue it; second, it is risky to disrupt existing functionality. It is my understanding that pmix is going to be superior to pmi2 in many ways, so we think it is better to focus effort in that direction. But this is a free world! :-) So please speak out.

David

Buffer enlarged.

David
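For readers following the thread, the failure mode discussed above can be sketched in a few lines of C. This is not Slurm's code. The names demo_buf_t, demo_packmem(), demo_unpackmem(), demo_roundtrip() and DEMO_PACK_LIMIT are hypothetical, the 1MB limit only mirrors the reduced limit used for testing, and the -1 return value merely stands in for an "incomplete packet" error. The sketch assumes a pack routine that returns nothing, so the sender cannot notice that an oversized field was skipped, and the receiver only finds out when the buffer ends early.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical limit for this sketch (1MB, as in the test setup above). */
#define DEMO_PACK_LIMIT (1024u * 1024u)

/* Hypothetical growable-free buffer: fixed capacity, "used" bytes are valid. */
typedef struct {
    unsigned char *data;
    size_t used;
    size_t cap;
} demo_buf_t;

/* Packs a length-prefixed field. Returns nothing, so an oversized
 * field is logged and silently skipped from the caller's point of view. */
static void demo_packmem(demo_buf_t *b, const void *src, uint32_t len)
{
    assert(b != NULL && src != NULL);
    if (len > DEMO_PACK_LIMIT) {
        fprintf(stderr, "demo_packmem: Buffer to be packed is too large (%u > %u)\n",
                (unsigned)len, (unsigned)DEMO_PACK_LIMIT);
        return;
    }
    if (b->used + sizeof(len) + len > b->cap)
        return; /* no room left in this fixed-size sketch buffer */
    memcpy(b->data + b->used, &len, sizeof(len));
    b->used += sizeof(len);
    memcpy(b->data + b->used, src, len);
    b->used += len;
}

/* Unpacks one length-prefixed field starting at *off.
 * Returns 0 on success, -1 if the buffer ends before the field does. */
static int demo_unpackmem(const demo_buf_t *b, size_t *off,
                          const unsigned char **out, uint32_t *len)
{
    if (*off + sizeof(*len) > b->used)
        return -1;
    memcpy(len, b->data + *off, sizeof(*len));
    *off += sizeof(*len);
    if (*len > b->used - *off)
        return -1;
    *out = b->data + *off;
    *off += *len;
    return 0;
}

/* Sender packs a small tag plus a payload of payload_len bytes; the
 * receiver expects both fields. Returns 0 if both arrive, -1 if the
 * message is incomplete, -2 on allocation failure. */
int demo_roundtrip(uint32_t payload_len)
{
    const char tag[] = "temp-kvs";
    demo_buf_t b;
    size_t off = 0;
    const unsigned char *field;
    uint32_t len;
    int rc = -1;

    unsigned char *payload = calloc(payload_len ? payload_len : 1, 1);
    if (payload == NULL)
        return -2;
    b.cap = sizeof(tag) + (size_t)payload_len + 2 * sizeof(uint32_t);
    b.used = 0;
    b.data = malloc(b.cap);
    if (b.data == NULL) {
        free(payload);
        return -2;
    }

    /* Sender side: nothing tells us whether the second call packed anything. */
    demo_packmem(&b, tag, (uint32_t)sizeof(tag));
    demo_packmem(&b, payload, payload_len);

    /* Receiver side: the missing field shows up only here. */
    if (demo_unpackmem(&b, &off, &field, &len) == 0 &&
        demo_unpackmem(&b, &off, &field, &len) == 0)
        rc = 0;

    free(b.data);
    free(payload);
    return rc;
}
```

With these definitions, demo_roundtrip(1000) returns 0, while demo_roundtrip(2045786) prints the "too large" message on the sender side and returns -1 on the receiver side. Raising DEMO_PACK_LIMIT, the analogue of the "larger buffer" resolution, makes the second call succeed without changing any other code.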