Ticket 7973 - slurmstepd[74271]: error: mpi/pmi2: no value for key
Summary: slurmstepd[74271]: error: mpi/pmi2: no value for key
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 19.05.2
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-10-22 11:04 MDT by surendra
Modified: 2021-01-29 05:38 MST

Site: NREL


Attachments
slurm.conf (6.39 KB, text/plain), 2019-10-22 12:52 MDT, surendra
impi_srun_scaling.tar.gz (1.53 MB, application/x-gzip), 2020-01-24 16:06 MST, surendra
openmpi_srun_scaling.tar.gz (2.13 MB, application/x-gzip), 2020-01-24 16:07 MST, surendra

Description surendra 2019-10-22 11:04:30 MDT
We have recently upgraded Slurm from 18.08.3 to 19.05.2, and since then we haven't been able to run jobs larger than 128 nodes.

Here are some of the error messages that we are seeing on the compute nodes.

./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: unpackmem_xmalloc: Buffer to be unpacked is too large (61543492 > 10000000)
./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: slurm_receive_msg_and_forward: Header lengths are longer than data received
./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: service_connection: slurm_receive_msg: Header lengths are longer than data received
./r5i2n18:Oct 21 12:29:32 r5i2n18 slurmd[3566]: error: unpackmem_xmalloc: Buffer to be unpacked is too large (61543492 > 10000000)
./r5i2n18:Oct 21 12:29:32 r5i2n18 slurmd[3566]: error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
./r5i2n18:Oct 21 12:29:32 r5i2n18 slurmd[3566]: error: slurm_receive_msg_and_forward: Header lengths are longer than data received
./r5i2n18:Oct 21 12:29:32 r5i2n18 slurmd[3566]: error: service_connection: slurm_receive_msg: Header lengths are longer than data received
./r5i2n18:Oct 21 12:29:34 r5i2n18 slurmd[3566]: error: unpackmem_xmalloc: Buffer to be unpacked is too large (61543492 > 10000000)
./r5i2n18:Oct 21 12:29:34 r5i2n18 slurmd[3566]: error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
./r5i2n18:Oct 21 12:29:34 r5i2n18 slurmd[3566]: error: slurm_receive_msg_and_forward: Header lengths are longer than data received


./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key  in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key  in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key  in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key  in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key  in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key  in req
./r3i1n0:Oct 20 11:30:44 r3i1n0 slurmstepd[272064]: error: mpi/pmi2: no value for key  in req
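
[Editor's note] For context on the last group of messages: the PMI2 wire protocol is built from semicolon-delimited "key=value;" pairs, and the mpi/pmi2 plugin logs "no value for key ... in req" when a pair it expects is missing from a request; the blank key name above suggests the requests arrived truncated or corrupted. Below is a minimal illustrative sketch of that kind of parsing failure. It is not the actual Slurm plugin code; the file name, function names, and request contents are made up for illustration.

/* pmi2_req_sketch.c - illustrative only, NOT the Slurm mpi/pmi2 plugin source.
 * Parses a semicolon-delimited "key=value;" request and reports a missing key
 * in the same style as the stepd log lines above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd copy of the value for "key" in a request such as
 * "cmd=kvs-put;name=rank0-addr;value=10.0.0.1;", or NULL if it is absent. */
static char *req_get_value(const char *req, const char *key)
{
	char *copy = strdup(req), *save = NULL, *val = NULL;

	for (char *tok = strtok_r(copy, ";", &save); tok;
	     tok = strtok_r(NULL, ";", &save)) {
		char *eq = strchr(tok, '=');
		if (!eq)
			continue;            /* malformed pair, skip it */
		*eq = '\0';
		if (!strcmp(tok, key)) {
			val = strdup(eq + 1);
			break;
		}
	}
	free(copy);
	return val;
}

int main(void)
{
	/* Pretend the request arrived truncated: the "value" pair is gone. */
	const char *req = "cmd=kvs-put;name=rank0-addr;";
	char *val = req_get_value(req, "value");

	if (!val)
		fprintf(stderr, "error: mpi/pmi2: no value for key %s in req\n",
			"value");
	else
		printf("value=%s\n", val);
	free(val);
	return 0;
}

Repeated bursts of this message with an empty key name, as in the logs above, point to the request payload itself being damaged, which is consistent with the RPC unpack and forwarding errors reported on r5i2n18.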
Comment 2 Nate Rini 2019-10-22 12:01:18 MDT
Please attach an updated slurm.conf file. Which MPI is being used by the application?
Comment 3 Nate Rini 2019-10-22 12:48:31 MDT
Given the logs presented, this patch should correct the issue:
https://github.com/SchedMD/slurm/commit/f7bed728b5b63633079829da21274032187a5d0e
Comment 4 surendra 2019-10-22 12:52:15 MDT
Created attachment 12046
slurm.conf
Comment 5 surendra 2019-10-22 13:03:34 MDT
With Intel MPI 2018, a 128-node OSU micro-benchmark job runs fine, but it fails with OpenMPI or MPICH. The same job ran fine on the previous version of Slurm with MPICH and OpenMPI on 500 nodes.
Comment 6 Nate Rini 2019-10-22 13:12:26 MDT
(In reply to surendra from comment #5)
> With Intel MPI 2018, a 128-node OSU micro-benchmark job runs fine, but it
> fails with OpenMPI or MPICH. The same job ran fine on the previous version
> of Slurm with MPICH and OpenMPI on 500 nodes.

Please give the patch in comment #3 a try.
Comment 7 surendra 2019-10-22 13:32:55 MDT
The provided patch will probably address the "unpackmem_xmalloc: Buffer to be unpacked is too large" issue. We have also been getting "failed to send temp kvs to $NODE" messages:

r6i6n25:Oct 21 14:48:42 r6i6n25 slurmstepd[355744]: error: mpi/pmi2: failed to send temp kvs to r6i2n27
r6i6n26:Oct 21 14:48:42 r6i6n26 slurmstepd[402458]: error: mpi/pmi2: failed to send temp kvs to r6i2n27
r6i6n27:Oct 21 14:48:42 r6i6n27 slurmstepd[272849]: error: mpi/pmi2: failed to send temp kvs to r6i2n27

r7i5n31:Oct 21 12:30:00 r7i5n31 slurmstepd[223147]: error: mpi/pmi2: failed to send temp kvs to r7i3n18
r7i5n32:Oct 21 12:30:00 r7i5n32 slurmstepd[370062]: error: mpi/pmi2: failed to send temp kvs to r7i3n18
r7i5n33:Oct 21 12:29:59 r7i5n33 slurmstepd[329728]: error: mpi/pmi2: failed to send temp kvs to r7i3n18
Comment 8 Nate Rini 2019-10-22 13:39:26 MDT
(In reply to surendra from comment #7)
> The provided patch will probably address the "unpackmem_xmalloc: Buffer to
> be unpacked is too large" issue. We have also been getting "failed to send
> temp kvs to $NODE" messages

Please give the patch a try. The errors are dependent on the max size issue.
Comment 9 Nate Rini 2019-10-23 09:40:57 MDT
Surendra,

Any updates?

Thanks,
--Nate
Comment 10 surendra 2019-10-25 13:39:39 MDT
(In reply to Nate Rini from comment #9)
> Surendra,
> 
> Any updates?
> 
> Thanks,
> --Nate

We haven't applied the patch yet. Can we put this on hold for now? I will get back to you on this next week.
Comment 11 Nate Rini 2019-10-25 13:41:39 MDT
(In reply to surendra from comment #10)
> We haven't applied the patch yet. Can we put this on hold for now? I will
> get back to you on this next week.

I'm going to lower the severity of this ticket per your response. We can continue this testing whenever convenient for your site.

Thanks,
--Nate
Comment 12 Nate Rini 2019-10-31 12:41:47 MDT
Surendra,

I'm going to time this ticket out. Please respond when you're ready to continue working on the issue or if you have any questions.

Thanks,
--Nate
Comment 14 surendra 2020-01-24 16:05:56 MST
As discussed with Jason/Jess on the srun scaling issue, here are the logs of the hpgmg job.

I have attached the job script and the output for Intel MPI and also for OpenMPI.

I can provide the slurmctld logs, but they would be without debugging enabled.
Comment 15 surendra 2020-01-24 16:06:33 MST
Created attachment 12842
impi_srun_scaling.tar.gz
Comment 16 surendra 2020-01-24 16:07:17 MST
Created attachment 12843
openmpi_srun_scaling.tar.gz
Comment 18 Felip Moll 2020-01-27 09:36:08 MST
Surendra,

We've found the most probable cause of the issue.

The real error is this one:

./r5i2n18:Oct 21 12:29:30 r5i2n18 slurmd[3566]: error: unpackmem_xmalloc: Buffer to be unpacked is too large (61543492 > 10000000)

To work around it temporarily, you can apply this patch:

diff --git a/src/common/pack.c b/src/common/pack.c
index 2d03b9702c..923495a05f 100644
--- a/src/common/pack.c
+++ b/src/common/pack.c
@@ -64,7 +64,7 @@
 
 #define MAX_ARRAY_LEN_SMALL    10000
 #define MAX_ARRAY_LEN_MEDIUM   1000000
-#define MAX_ARRAY_LEN_LARGE    100000000
+#define MAX_ARRAY_LEN_LARGE    1000000000
 

We will inform you when the issue is finally fixed and committed.
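
[Editor's note] For context on why that constant matters: when slurmd receives an RPC it unpacks a 32-bit length prefix and refuses anything above a compile-time cap, which is what produces the "Buffer to be unpacked is too large (61543492 > 10000000)" and the follow-on "Malformed RPC" lines in the logs above. The sketch below shows the shape of that check; it is not the actual src/common/pack.c code, and the cap value and names are placeholders.

/* unpack_sketch.c - simplified illustration, NOT Slurm's src/common/pack.c.
 * A length-prefixed field is unpacked only if its advertised size fits both
 * a hard cap and the number of bytes actually received.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_UNPACK_LEN 10000000u   /* placeholder cap for this sketch */

/* Unpack a uint32 length followed by that many bytes; return a malloc'd
 * copy, or NULL (with an error) when the length is implausible. */
static void *sketch_unpackmem(const uint8_t *buf, size_t buf_len,
			      uint32_t *out_len)
{
	uint32_t len;

	if (buf_len < sizeof(len))
		return NULL;
	memcpy(&len, buf, sizeof(len));

	if (len > SKETCH_MAX_UNPACK_LEN) {
		fprintf(stderr,
			"error: %s: Buffer to be unpacked is too large (%u > %u)\n",
			__func__, len, SKETCH_MAX_UNPACK_LEN);
		return NULL;
	}
	if (len > buf_len - sizeof(len)) {
		fprintf(stderr,
			"error: header claims more data (%u bytes) than received\n",
			len);
		return NULL;
	}

	void *out = malloc(len);
	if (!out)
		return NULL;
	memcpy(out, buf + sizeof(len), len);
	*out_len = len;
	return out;
}

int main(void)
{
	/* Mimic the failing RPC: the length prefix claims ~61.5 MB, far above
	 * the cap, so the unpack is rejected and the RPC treated as malformed. */
	uint8_t hdr[sizeof(uint32_t)];
	uint32_t claimed = 61543492u, got = 0;

	memcpy(hdr, &claimed, sizeof(claimed));
	if (!sketch_unpackmem(hdr, sizeof(hdr), &got))
		fprintf(stderr, "RPC rejected as malformed\n");
	return 0;
}

Raising MAX_ARRAY_LEN_LARGE as in the diff above only widens the cap so the large forwarded PMI2 payload is accepted again; as Felip notes, it is a temporary workaround until the proper fix lands upstream.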
Comment 19 surendra 2020-01-28 16:18:16 MST
We will apply the patch during the system time next week.
Comment 20 Nate Rini 2020-03-09 15:56:06 MDT
Surendra,

Timing this ticket out. Please respond when convenient and it will be re-opened.

Thanks,
--Nate