Description
issp2020support
2021-01-28 22:34:10 MST
Hi,

Can you please send some slurmd logs? I am interested in the batch node and maybe another compute node. If you can identify which nodes fail, one slurmd log from there would be useful. I am looking for a string starting like this:

error: slurm_receive_msg_and_forward

Thanks

Created attachment 17668 [details]
error_log

I attached the logs.

The job ran on the following nodes, and BatchHost was c09u01n1.
c09u01n[1-4],c09u03n[1,3],c09u04n3,c09u05n[2-3],c09u07n[1,3],c09u08n1,
c09u09n[1,3-4],c09u10n[1,4],c09u11n3,c09u12n3,c09u13n3,c09u14n[2,4],
c09u15n[1,3-4],c09u16n[1-2],c10u01n4,c10u03n[1,4],c10u05n1,c10u11n4,
c10u14n[1,3],c10u16n1,c10u18n[1,4],c13u01n[1-4],c13u02n[1-4],c13u03n[1-4],
c13u04n[1-4],c13u05n[1-4],c13u06n[1-4],c13u07n[1-4],c13u08n[1-4],c13u09n[1-4],
c13u10n[1-4],c13u11n[1-4],c13u12n[1-4],c13u13n[1-4],c13u14n[1-4],c13u15n[1-4],
c13u16n[1-4],c13u17n[1-4],c13u18n[1-4],c14u09n[3-4],c14u10n[1-4],c14u11n[1-4],
c14u12n[1-4],c14u13n[1-4],c14u14n[1-4],c14u15n[1-4],c14u16n[1-4],c14u17n[1-4],
c14u18n1

I found many lines like the following:
error: slurm_receive_msg_and_forward: Header lengths are longer than data received

and also
error: unpackmem_xmalloc: Buffer to be unpacked is too large (117260348 > 100000000)

Is MAX_ARRAY_LEN_LARGE too small, as seen in Bug 7781?

(In reply to issp2020support from comment #2)
> I found many lines like the following:
> error: slurm_receive_msg_and_forward: Header lengths are longer than data received
>
> and also
> error: unpackmem_xmalloc: Buffer to be unpacked is too large (117260348 > 100000000)
>
> Is MAX_ARRAY_LEN_LARGE too small, as seen in Bug 7781?

Yes. This is a limitation we haven't corrected yet but should have in bug 7973. It all comes from an internal bug where we needed to restrict the number of bytes being packed/unpacked for security reasons; if unlimited, an attacker could crash the controller. The proposed patch simply increases the array length limits, e.g.:

diff --git a/src/common/pack.c b/src/common/pack.c
index 2d03b9702c4..60fe72c3206 100644
--- a/src/common/pack.c
+++ b/src/common/pack.c
@@ -62,9 +62,9 @@
 #include "src/common/xassert.h"
 #include "src/slurmdbd/read_config.h"

-#define MAX_ARRAY_LEN_SMALL 10000
-#define MAX_ARRAY_LEN_MEDIUM 1000000
-#define MAX_ARRAY_LEN_LARGE 100000000
+#define MAX_ARRAY_LEN_SMALL 100000
+#define MAX_ARRAY_LEN_MEDIUM 10000000
+#define MAX_ARRAY_LEN_LARGE 1000000000

I have already submitted the patch to our QA team for review and inclusion and will let you know about the final outcome. Feel free to apply this patch if you need it urgently; it shouldn't harm anything.

Thanks
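For context on the error itself: the limits in the patch above are the caps against which an incoming length field is checked before memory is allocated, and a value above the cap makes the unpack fail with the message seen in the logs. The following is a minimal, hypothetical sketch of that kind of bounds check; it is illustrative only and is not the actual Slurm source (the real check lives in the pack/unpack routines of src/common/pack.c, and the function name and return convention here are simplified).

/* Hypothetical, simplified sketch of the bounds check discussed above
 * (not the actual Slurm code). */
#include <stdint.h>
#include <stdio.h>

#define MAX_ARRAY_LEN_LARGE 100000000	/* default cap; the patch raises it to 1000000000 */

/* Returns 0 if an advertised buffer length is acceptable, -1 if it
 * exceeds the cap (the message is then rejected). */
static int check_unpack_len(uint32_t len)
{
	if (len > MAX_ARRAY_LEN_LARGE) {
		fprintf(stderr,
			"error: Buffer to be unpacked is too large (%u > %u)\n",
			(unsigned) len, (unsigned) MAX_ARRAY_LEN_LARGE);
		return -1;
	}
	return 0;
}

int main(void)
{
	/* The 117260348-byte payload from the logs exceeds the default cap. */
	return check_unpack_len(117260348u) ? 1 : 0;
}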
Thank you.

I will apply the patch in the next maintenance, then check whether it's resolved or not.

(In reply to issp2020support from comment #7)
> I will apply the patch in the next maintenance, then check whether it's
> resolved or not.

Hi, can you tell me more or less when this will happen?

Also, I have a feeling that such a large buffer is not a usual thing even for 150-node jobs; it is ~112 MB. At the very least I need to understand why the buffer is so large, so besides the proposed patch I want to let you know I am studying whether this is expected or not.

c09u01n1: [2021-01-13T19:52:33.506] error: unpackmem_xmalloc: Buffer to be unpacked is too large (117260348 > 100000000)

thanks

(In reply to Felip Moll from comment #10)
> Hi, can you tell me more or less when this will happen?

Can you please send me your latest slurm.conf and topology.conf?

Thanks

Created attachment 17726 [details]
Config Files
I attached slurm.conf and topology.conf
The maintenance will be held Feb. 12 JST.
(In reply to issp2020support from comment #13)
> I attached slurm.conf and topology.conf
>
> The maintenance will be held Feb. 12 JST.

Hi,

Would it be possible to set SlurmdDebug to debug2, reconfigure, and try to reproduce the issue? Then please send me back the slurmd logs.

I won't have much time during the maintenance. I will get debug logs if I have time.

(In reply to issp2020support from comment #15)
> I won't have much time during the maintenance. I will get debug logs if I
> have time.

Actually, I was asking you to grab this info before you apply the patch and do the maintenance. Is that possible?

I can't get the debug logs while the system is in production because it is busy running users' jobs. Of course I have time to apply the patch and to check whether the problem is resolved, but it's not certain whether I can get the logs before applying the patch.

(In reply to issp2020support from comment #17)
> I can't get the debug logs while the system is in production because it is
> busy running users' jobs.

Ok, please try to get the logs before applying the patch and send them back to me. Then try applying it, and revert if it doesn't fix the issue.

Does the issue happen very often?

I will stand by until the 12th and then wait for any news.

(In reply to Felip Moll from comment #18)
> I will stand by until the 12th and then wait for any news.

Hi, were you able to get the requested information? Thanks!

Created attachment 18043 [details]
debug logs
I collected debug logs on the BatchHost (c09u01n1) and on another compute node (c09u01n2).
Could you take a look at them?
Unfortunately, I could not update Slurm during the February maintenance.
I will try again in the maintenance held on Mar. 12.
Hi,

This is to let you know that I am still working on this. Although the workaround I proposed will probably fix your issue, I am doing some profiling with OpenMPI and IntelMPI to understand how this ~112 MB buffer originates on your system. That is the size of the buffer that exceeds the limits according to your logs.

Will keep you posted.

The problem is fixed by applying the workaround.
Please keep investigating the root cause of the overly large buffer size.

(In reply to issp2020support from comment #23)
> The problem is fixed by applying the workaround.
> Please keep investigating the root cause of the overly large buffer size.

Thank you. I understand you're in production and limited in applying debug patches, and I don't have such a large system to test on, though I will try. In any case, would you be willing to help me with the diagnosis and show me how the user submitted the job and which software they were running? I have the feeling the issue is a scalability problem with Intel MPI.

I don't think this problem depends on the program.
I used IMB (Intel MPI Benchmarks) for getting debug logs, but other programs also caused the same problem.

The job was submitted as follows:
$ sbatch -n XXXXX -c 1 -N 144 jobscript

(In reply to issp2020support from comment #25)
> The job was submitted as follows:
> $ sbatch -n XXXXX -c 1 -N 144 jobscript

I was interested in the "jobscript"; do you have a test I can run which causes this?

Also, I was referring to IntelMPI combined with PMI2. Does it also happen with OpenMPI? Maybe I made a wrong assumption from comment 0, and it runs just fine with OpenMPI's mpirun but not with srun and OpenMPI:

> I can run them when we use "mpirun" command provided by OpenMPI 4.0.4.

You are correct.
NG: srun (pmi2) under the OpenMPI environment
OK: mpirun of OpenMPI

In the jobscript, I just run:
srun ./IMB-MPI1 AlltoAll -npmin XXXXX

The following parameters are set for the InfiniBand environment:
pml = ucx
osc = ucx
spml = ucx
btl = ^openib
coll_hcoll_enable = 0

(In reply to issp2020support from comment #27)
> NG: srun (pmi2) under the OpenMPI environment
> OK: mpirun of OpenMPI

I have been working a bit more on this issue.

Your last example is a run of IMB-MPI1. This is software compiled with the IntelMPI libraries, so while you say it is failing under OpenMPI, in this case it also fails in an IntelMPI environment. Does that make sense?

I run it like this, depending on which PMI I want:

PMI-1: export I_MPI_PMI_LIBRARY=/nfs/slurm/20.11/inst/lib/libpmi.so
PMI-2: export I_MPI_PMI_LIBRARY=/nfs/slurm/20.11/inst/lib/libpmi2.so

Then I source the Intel vars and run it:

]$ . ./bin/intelmpi/compilers_and_libraries/linux/bin/compilervars.sh intel64
]$ export I_MPI_PMI_LIBRARY=/nfs/slurm/20.11/inst/lib/libpmi2.so
]$ export I_MPI_PMI2=yes
]$ salloc -N20
]$ srun --mpi=pmi2 ./bin/intelmpi/compilers_and_libraries_2018.2.199/linux/mpi_2019/intel64/bin/IMB-MPI1 AlltoAll -npmin 10

What I see is that as the number of ranks increases, the buffer size of the KVS-Fence operation at startup grows. I even crashed my machine, but I do not have a way to test this and see the growth with more than 3000 ranks.
I wanted to make a graph and see whether it makes sense that in your case it is using a buffer of 117260348 bytes.

My guess is that when using mpirun, the pmi_proxy manages the buffers differently than srun/libpmi2 does, but I have no conclusion yet, and it is proving difficult to figure out the exact differences that make srun+libpmi2 scale differently. I think it is a bit outside the scope of the Slurm code itself and more of a design issue. Some ideas I am investigating may be related to what was studied in this 'old' (2015) paper, which is the reason PMIx exists:

"Non-blocking PMI Extensions for Fast MPI Startup" - S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda.
https://souravzzz.github.io/papers/ccgrid_15.pdf

Let me know if you observe anything relevant from your side.

Thanks
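For background on why the KVS-Fence payload grows with the rank count: during wire-up, each rank publishes its endpoint information into the PMI key-value store, and the fence makes all of those entries retrievable by every other rank, so the aggregated database scales with the total number of ranks. Below is a minimal sketch of that pattern against the PMI2 API. It is illustrative only: the header path is assumed, the key/value contents are made up, error handling is omitted, and this is not how IMB or the MPI libraries actually implement their startup.

/* Minimal PMI2 wire-up sketch (link against Slurm's libpmi2). */
#include <stdio.h>
#include <slurm/pmi2.h>	/* assumed install path of the PMI2 header */

int main(void)
{
	int spawned, size, rank, appnum;
	char key[64], value[1024];

	PMI2_Init(&spawned, &size, &rank, &appnum);

	/* Each rank publishes its own endpoint ("business card") data. */
	snprintf(key, sizeof(key), "endpoint-%d", rank);
	snprintf(value, sizeof(value), "transport-address-of-rank-%d", rank);
	PMI2_KVS_Put(key, value);

	/* Collective synchronization: after the fence, every rank's Put is
	 * retrievable by every other rank via PMI2_KVS_Get, so the exchanged
	 * KVS database grows with the total number of ranks. */
	PMI2_KVS_Fence();

	PMI2_Finalize();
	printf("rank %d of %d finished wire-up\n", rank, size);
	return 0;
}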
> Your last example is a run of IMB-MPI1. This is software compiled with the
> IntelMPI libraries, so while you say it is failing under OpenMPI, in this
> case it also fails in an IntelMPI environment. Does that make sense?

I compiled IMB from source with the Intel Compiler and OpenMPI.
The location of the source (2020u4) is as follows:
/InstallDir/compilers_and_libraries_2020.4.304/linux/mpi/benchmarks/imb/src_c
Hi,

I apologize for the late response. Here are my conclusions:

The test used is the IMB-MPI1 AlltoAll collective. I've found that this test explicitly uses PMI-1. PMI-1 doesn't scale, and one of its major handicaps is the memory footprint and having to pass a lot of information between processes. This could explain the 'large' ~120 MB buffer we're seeing here. Each node requires O(number of processes * processes per node) memory for storing remote endpoint information. For example, with SHMEM PMI this can be reduced to O(ppn). With PMIx or PMI-2 this is improved in other ways.

Solution 1: Keep the limits as they are, and document that PMI-1 is deprecated and has a high memory footprint which, in collective all-to-all operations, can exceed our network buffer limits. Recommend people not to use PMI-1 (which we already do). I think this is the better option.

Solution 2: Increase our limits, but by how much? I think the current values are large enough for expected workloads.

Note that during the investigation of the issue I was focused on pmi2, since the pmi2-related errors and MpiDefault=pmi2 confused me. But the test internally uses PMI-1 methods; since the pmi2 plugin is compatible with pmi1, it still works.

Besides other important limitations, the KVS database of PMI-1 doesn't have a way to limit the scope of the information it puts into the local KVS db. This means that all information is global, and when another process requires this information, *everything* must be passed. The information stored in the KVS db can be of any sort, such as system info (processor distribution, shared-memory keys, SMP, NUMA, and so on), among other related MPI/program info. As a result, every process (rank) can hold several KiB of data. In an all-gather operation, all nodes get the info of all other processes, and with a large number of processes the cost grows very quickly.

Besides that, the test that is failing is:
./IMB-MPI1 AlltoAll -npmin XXXXX

This test is classified under:
--------
Collective Benchmarks
Collective benchmarks measure MPI collective operations. Each benchmark is run with varying message lengths. The timing is averaged over multiple samples. The basic MPI data type for all messages is MPI_BYTE for pure data movement functions and MPI_FLOAT for reductions. Collective benchmarks show bare timings.
--------

This introduces 'varying message lengths', which, given a large number of procs, I guess can also lead to a ~120 MB buffer.

https://software.intel.com/content/www/us/en/develop/documentation/imb-user-guide/top/mpi-1-benchmarks/classification-of-mpi-1-benchmarks.html

So my suggestion is to just ignore this error and recommend that users NOT use PMI-1. Only if you really need non-scalable software to run with PMI-1 should you increase the limits.

Does that make sense to you?

Note,
You can see how the test is using PMI-1 under the hood:
> [2021-02-12T10:18:20.490] [189177.0] mpi/pmi2: got client PMI1 init, version=2.0 <--------------
> [2021-02-12T10:18:20.490] [189177.0] debug2: mpi/pmi2: _tree_listen_readable
> [2021-02-12T10:18:20.490] [189177.0] debug2: mpi/pmi2: _task_readable
> [2021-02-12T10:18:20.490] [189177.0] debug2: mpi/pmi2: _task_readable
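To see how a payload on the order of the 117260348 bytes from the logs can arise from this full-exchange behavior, here is a back-of-envelope sketch. The tasks-per-node and per-rank KVS sizes are assumptions chosen purely for illustration (the job's actual -n value is not shown in this report); only the 144-node count comes from the sbatch line above.

/* Back-of-envelope model of how a full PMI KVS exchange can reach ~112 MB.
 * The per-rank KVS size and tasks-per-node are assumptions for illustration
 * only; just the node count (-N 144) comes from this report. */
#include <stdio.h>

int main(void)
{
	const long nodes = 144;                    /* from: sbatch -n XXXXX -c 1 -N 144 */
	const long tasks_per_node = 40;            /* assumed */
	const long kvs_bytes_per_rank = 20 * 1024; /* assumed ~20 KiB of KVS data per rank */

	long ranks = nodes * tasks_per_node;
	/* With a global KVS exchange, each node ends up holding every rank's data. */
	long payload = ranks * kvs_bytes_per_rank;

	printf("%ld ranks -> ~%ld MB of KVS data per node\n",
	       ranks, payload / (1024 * 1024));
	return 0;
}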
Thank you for the update.

I understand that PMI-1 should not be used, but I still don't know how to avoid it. Does this code (or users' codes) request PMI-1 explicitly? Could you tell me where and how it should be rewritten?

In addition, is there any system setting to force PMI-2 regardless of the user's request?

(In reply to issp2020support from comment #35)
> Does this code (or users' codes) request PMI-1 explicitly?
> Could you tell me where and how it should be rewritten?

Hi,

I found that the log was misleading. It said PMI-1 was being used, but that is incorrect; this must be changed in the code. So we're back at the starting point, though I still think this is a scalability issue.

Hi,

This is just an update to confirm that our problem is a scalability issue with PMI2, 100% confirmed. To this extent it is out of our (Slurm) control. We are now internally discussing whether we have any better way to mitigate the issue than just increasing the max buffer length.

We'll keep you posted ASAP. Sorry for the delay in replying.

Regards,
Carlos.

Thank you for the update. I'll wait for info about the mitigation.

Hi,

Sorry that we haven't updated you before; this doesn't mean we weren't working on the issue. We have been internally developing a patch for this. We are, I hope, close to the end of the review process, and I'll let you know once the commits have been pushed.

Cheers,
Carlos.

Hi,

Fixes for this error have been pushed to the master branch (for the future major release 22.05):

* 3a4bf5e7f5 (HEAD -> master, origin/master, origin/HEAD) Merge branch 'bug10735'
|\
| * 01977768f2 Remove size check from unpackstr_array
| * 5104405728 Revert "Add unpack bounds check to eliminate Coverity errors"
| * bd9f9b466e Revert "Add size checks on all unpack array functions"
|/

I'm going to close this bug as fixed now.

Have a good day,
Carlos.

Thank you for the update. Will the same fix be included in 21.08.X?

I'm afraid not. The changes made are important enough that we are keeping them only in master. As always, you can derive a patch from those commits, but bear in mind we don't provide official support for ad-hoc patches applied outside the branches in which we have published those commits.

Best regards,
Carlos.

I've discussed your needs internally. Give us a while; I hope to see these changes in 21.08 as well. I'll be back ASAP.

Cheers,
Carlos.

Here it is.

* 2ea5aaad66 (HEAD -> slurm-21.08, origin/slurm-21.08) Merge branch 'bug10735' into slurm-21.08
|\
| * 11f38c7933 (bug10735) Update 21.08 NEWS
| * 5f8a025226 Remove size check from unpackstr_array
| * 1d5a6cdf5a Revert "Add unpack bounds check to eliminate Coverity errors"
| * cfb600cc6c Revert "Add size checks on all unpack array functions"
|/

We expect the next 21.08 release to happen in the next few weeks, but you can apply these commits as a patch now if you want.

Cheers,
Carlos.

Thank you. I'll wait for the 21.08 release that includes the fix.