We have recently upgraded Slurm from 18.08.3 to 19.05.2 and since then we haven't been able to run jobs larger than 128 nodes. Here are some of the error messages we are seeing on the compute nodes:

./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: unpackmem_xmalloc: Buffer to be unpacked is too large (61543492 > 10000000)
./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: slurm_receive_msg_and_forward: Header lengths are longer than data received
./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: service_connection: slurm_receive_msg: Header lengths are longer than data received
./r5i2n18:Oct 21 12:29:32 r5i2n18 slurmd[3566]: error: unpackmem_xmalloc: Buffer to be unpacked is too large (61543492 > 10000000)
./r5i2n18:Oct 21 12:29:32 r5i2n18 slurmd[3566]: error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
./r5i2n18:Oct 21 12:29:32 r5i2n18 slurmd[3566]: error: slurm_receive_msg_and_forward: Header lengths are longer than data received
./r5i2n18:Oct 21 12:29:32 r5i2n18 slurmd[3566]: error: service_connection: slurm_receive_msg: Header lengths are longer than data received
./r5i2n18:Oct 21 12:29:34 r5i2n18 slurmd[3566]: error: unpackmem_xmalloc: Buffer to be unpacked is too large (61543492 > 10000000)
./r5i2n18:Oct 21 12:29:34 r5i2n18 slurmd[3566]: error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
./r5i2n18:Oct 21 12:29:34 r5i2n18 slurmd[3566]: error: slurm_receive_msg_and_forward: Header lengths are longer than data received

./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key in req
Please attach an updated slurm.conf file. Which MPI is being used by the application?
Given the logs presented, this patch should correct the issue: https://github.com/SchedMD/slurm/commit/f7bed728b5b63633079829da21274032187a5d0e
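If it helps, a minimal sketch of one way to pull that commit into a 19.05.2 source tree before rebuilding (the checkout path below is only an example; GitHub serves a patch-formatted version of a commit when ".patch" is appended to its URL):

cd /path/to/slurm-19.05.2
curl -L https://github.com/SchedMD/slurm/commit/f7bed728b5b63633079829da21274032187a5d0e.patch | patch -p1

Then rebuild and reinstall slurmd/slurmctld on the affected nodes.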
Created attachment 12046 [details] slurm.conf
A 128-node OSU micro-benchmark job runs fine with Intel-MPI-2018 but fails when using OpenMPI or MPICH. The same job ran fine on the previous version of Slurm with MPICH and OpenMPI on 500 nodes.
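For reference, these runs are launched through srun with PMI2; a representative launch line looks roughly like this (the node/task counts and benchmark binary path are illustrative, not the exact values from every run):

srun --mpi=pmi2 -N 128 --ntasks-per-node=32 ./osu_alltoall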
(In reply to surendra from comment #5)
> A 128-node OSU micro-benchmark job runs fine with Intel-MPI-2018 but fails
> when using OpenMPI or MPICH. The same job ran fine on the previous version
> of Slurm with MPICH and OpenMPI on 500 nodes.

Please give the patch in comment #3 a try.
The provided patch will probably address the "unpackmem_xmalloc: Buffer to be unpacked is too large" issue. We have also been getting "failed to send temp kvs to $NODE" messages:

r6i6n25:Oct 21 14:48:42 r6i6n25 slurmstepd[355744]: error: mpi/pmi2: failed to send temp kvs to r6i2n27
r6i6n26:Oct 21 14:48:42 r6i6n26 slurmstepd[402458]: error: mpi/pmi2: failed to send temp kvs to r6i2n27
r6i6n27:Oct 21 14:48:42 r6i6n27 slurmstepd[272849]: error: mpi/pmi2: failed to send temp kvs to r6i2n27
r7i5n31:Oct 21 12:30:00 r7i5n31 slurmstepd[223147]: error: mpi/pmi2: failed to send temp kvs to r7i3n18
r7i5n32:Oct 21 12:30:00 r7i5n32 slurmstepd[370062]: error: mpi/pmi2: failed to send temp kvs to r7i3n18
r7i5n33:Oct 21 12:29:59 r7i5n33 slurmstepd[329728]: error: mpi/pmi2: failed to send temp kvs to r7i3n18
(In reply to surendra from comment #7)
> The provided patch will probably address the "unpackmem_xmalloc: Buffer to
> be unpacked is too large" issue. We have also been getting "failed to send
> temp kvs to $NODE" messages.

Please give the patch a try. These errors are dependent on the max size issue.
Surendra,

Any updates?

Thanks,
--Nate
(In reply to Nate Rini from comment #9)
> Any updates?

We haven't applied the patch yet. Can we put this on hold for now? I will get back to you on this next week.
(In reply to surendra from comment #10)
> We haven't applied the patch yet. Can we put this on hold for now? I will
> get back to you on this next week.

I'm going to lower the severity of this ticket per your response. We can continue this testing whenever convenient for your site.

Thanks,
--Nate
Surendra,

I'm going to time this ticket out. Please respond when you're ready to continue working on the issue or if you have any questions.

Thanks,
--Nate
As discussed with Jason/Jess on the srun scaling issue, here are the logs of the hpgmg job. I have attached the job script and the output for Intel MPI as well as for OpenMPI. I can provide the slurmctld logs, but they would be without debugging enabled.
Created attachment 12842 [details] impi_srun_scaling.tar.gz
Created attachment 12843 [details] openmpi_srun_scaling.tar.gz
Surendra,

We've found the most probable cause of the issue. The real error is this one:

./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: unpackmem_xmalloc: Buffer to be unpacked is too large (61543492 > 10000000)

To work around it temporarily you can apply this patch:

diff --git a/src/common/pack.c b/src/common/pack.c
index 2d03b9702c..923495a05f 100644
--- a/src/common/pack.c
+++ b/src/common/pack.c
@@ -64,7 +64,7 @@
 #define MAX_ARRAY_LEN_SMALL 10000
 #define MAX_ARRAY_LEN_MEDIUM 1000000
-#define MAX_ARRAY_LEN_LARGE 100000000
+#define MAX_ARRAY_LEN_LARGE 1000000000

We will inform you when the issue is finally fixed and committed.
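Note that because pack.c is compiled into the daemons, the larger limit only takes effect after Slurm is rebuilt, reinstalled, and the daemons are restarted everywhere. A minimal sketch, assuming an already-configured autotools build tree and systemd-managed daemons (adjust paths and service management to your site):

cd /path/to/slurm-19.05.2
make -j && make install        # rebuild with the patched src/common/pack.c
systemctl restart slurmctld    # on the controller
systemctl restart slurmd       # on every compute node (e.g. via pdsh/clush)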
We will apply the patch during the system time next week.
Surendra,

Timing this ticket out. Please respond when convenient and it will be re-opened.

Thanks,
--Nate