Description
issp2020support
2021-01-28 22:34:10 MST
Hi,

Can you please send some slurmd logs? I am interested in the batch node and maybe another compute node. If you can identify which nodes fail, one slurmd log from there would be useful. I am looking for a string starting like this:

error: slurm_receive_msg_and_forward

Thanks

Created attachment 17668 [details]
error_log

I attached the logs.

The job ran on the following nodes, and BatchHost was c09u01n1.
c09u01n[1-4],c09u03n[1,3],c09u04n3,c09u05n[2-3],c09u07n[1,3],c09u08n1,
c09u09n[1,3-4],c09u10n[1,4],c09u11n3,c09u12n3,c09u13n3,c09u14n[2,4],
c09u15n[1,3-4],c09u16n[1-2],c10u01n4,c10u03n[1,4],c10u05n1,c10u11n4,
c10u14n[1,3],c10u16n1,c10u18n[1,4],c13u01n[1-4],c13u02n[1-4],c13u03n[1-4],
c13u04n[1-4],c13u05n[1-4],c13u06n[1-4],c13u07n[1-4],c13u08n[1-4],c13u09n[1-4],
c13u10n[1-4],c13u11n[1-4],c13u12n[1-4],c13u13n[1-4],c13u14n[1-4],c13u15n[1-4],
c13u16n[1-4],c13u17n[1-4],c13u18n[1-4],c14u09n[3-4],c14u10n[1-4],c14u11n[1-4],
c14u12n[1-4],c14u13n[1-4],c14u14n[1-4],c14u15n[1-4],c14u16n[1-4],c14u17n[1-4],
c14u18n1

I found many lines like the following:
error: slurm_receive_msg_and_forward: Header lengths are longer than data received

and also
error: unpackmem_xmalloc: Buffer to be unpacked is too large (117260348 > 100000000)

Is MAX_ARRAY_LEN_LARGE too small, as seen in Bug 7781?

(In reply to issp2020support from comment #2)
> I found many lines like the following:
> error: slurm_receive_msg_and_forward: Header lengths are longer than data received
>
> and also
> error: unpackmem_xmalloc: Buffer to be unpacked is too large (117260348 > 100000000)
>
> Is MAX_ARRAY_LEN_LARGE too small, as seen in Bug 7781?

Yes. This is a limitation we haven't corrected yet but should have in bug 7973. It all comes from an internal bug where we needed to restrict the number of bytes being packed/unpacked for security reasons; if unlimited, an attacker could crash the controller. The proposed patch simply increases the array length limits, e.g.:

diff --git a/src/common/pack.c b/src/common/pack.c
index 2d03b9702c4..60fe72c3206 100644
--- a/src/common/pack.c
+++ b/src/common/pack.c
@@ -62,9 +62,9 @@
 #include "src/common/xassert.h"
 #include "src/slurmdbd/read_config.h"

-#define MAX_ARRAY_LEN_SMALL 10000
-#define MAX_ARRAY_LEN_MEDIUM 1000000
-#define MAX_ARRAY_LEN_LARGE 100000000
+#define MAX_ARRAY_LEN_SMALL 100000
+#define MAX_ARRAY_LEN_MEDIUM 10000000
+#define MAX_ARRAY_LEN_LARGE 1000000000

I have already submitted the patch to our QA team for review and inclusion and will let you know about the final outcome. Feel free to apply this patch if you need it urgently; it shouldn't harm anything.

Thanks
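For context on the error itself: the limits in the patch above are the caps against which an incoming length field is checked before memory is allocated, and a value above the cap makes the unpack fail with the message seen in the logs. The following is a minimal, hypothetical sketch of that kind of bounds check; it is illustrative only and is not the actual Slurm source (the real check lives in the pack/unpack routines of src/common/pack.c, and the function name and return convention here are simplified).

/* Hypothetical, simplified sketch of the bounds check discussed above
 * (not the actual Slurm code). */
#include <stdint.h>
#include <stdio.h>

#define MAX_ARRAY_LEN_LARGE 100000000	/* default cap; the patch raises it to 1000000000 */

/* Returns 0 if an advertised buffer length is acceptable, -1 if it
 * exceeds the cap (the message is then rejected). */
static int check_unpack_len(uint32_t len)
{
	if (len > MAX_ARRAY_LEN_LARGE) {
		fprintf(stderr,
			"error: Buffer to be unpacked is too large (%u > %u)\n",
			(unsigned) len, (unsigned) MAX_ARRAY_LEN_LARGE);
		return -1;
	}
	return 0;
}

int main(void)
{
	/* The 117260348-byte payload from the logs exceeds the default cap. */
	return check_unpack_len(117260348u) ? 1 : 0;
}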
Thank you.

I will apply the patch in the next maintenance, then check whether it's resolved or not.

(In reply to issp2020support from comment #7)
> I will apply the patch in the next maintenance, then check whether it's
> resolved or not.

Hi, can you tell me more or less when this will happen?

Also, I have a feeling that such a large buffer is not a usual thing even for 150-node jobs; it is ~112 MB. At the very least I need to understand why the buffer is so large, so besides the proposed patch I want to let you know I am studying whether this is expected or not.

c09u01n1: [2021-01-13T19:52:33.506] error: unpackmem_xmalloc: Buffer to be unpacked is too large (117260348 > 100000000)

thanks

(In reply to Felip Moll from comment #10)
> Hi, can you tell me more or less when this will happen?

Can you please send me your latest slurm.conf and topology.conf?

Thanks

Created attachment 17726 [details]
Config Files
I attached slurm.conf and topology.conf
The maintenance will be held Feb. 12 JST.
(In reply to issp2020support from comment #13)
> I attached slurm.conf and topology.conf
>
> The maintenance will be held Feb. 12 JST.

Hi,

Would it be possible to set SlurmdDebug to debug2, reconfigure, and try to reproduce the issue? Then please send me back the slurmd logs.

I won't have much time during the maintenance. I will get debug logs if I have time.

(In reply to issp2020support from comment #15)
> I won't have much time during the maintenance. I will get debug logs if I
> have time.

Actually, I was asking you to grab this info before you apply the patch and do the maintenance. Is that possible?

I can't get the debug logs while the system is in production because it is busy running users' jobs. Of course I have time to apply the patch and to check whether the problem is resolved, but it's not certain whether I can get the logs before applying the patch.

(In reply to issp2020support from comment #17)
> I can't get the debug logs while the system is in production because it is
> busy running users' jobs.

Ok, please try to get the logs before applying the patch and send them back to me. Then try applying it, and revert if it doesn't fix the issue.

Does the issue happen very often?

I will stand by until the 12th and then wait for any news.

(In reply to Felip Moll from comment #18)
> I will stand by until the 12th and then wait for any news.

Hi, were you able to get the requested information? Thanks!

Created attachment 18043 [details]
debug logs
I collected debug logs on the BatchHost (c09u01n1) and on another compute node (c09u01n2).
Could you take a look at them?
Unfortunately, I could not update Slurm during the February maintenance.
I will try again in the maintenance held on Mar. 12.
Hi,

This is to let you know that I am still working on this. Although the workaround I proposed will probably fix your issue, I am doing some profiling with OpenMPI and IntelMPI to understand how this ~112 MB buffer originates on your system. That is the size of the buffer that exceeds the limits according to your logs.

Will keep you posted.

The problem is fixed by applying the workaround.
Please keep investigating the root cause of the overly large buffer size.

(In reply to issp2020support from comment #23)
> The problem is fixed by applying the workaround.
> Please keep investigating the root cause of the overly large buffer size.

Thank you. I understand you're in production and limited in applying debug patches, and I don't have such a large system to test on, though I will try. In any case, would you be willing to help me with the diagnosis and show me how the user submitted the job and which software they were running? I have the feeling the issue is a scalability problem with Intel MPI.

I don't think this problem depends on the program.
I used IMB (Intel MPI Benchmarks) for getting debug logs, but other programs also caused the same problem.

The job was submitted as follows:
$ sbatch -n XXXXX -c 1 -N 144 jobscript

(In reply to issp2020support from comment #25)
> The job was submitted as follows:
> $ sbatch -n XXXXX -c 1 -N 144 jobscript

I was interested in the "jobscript"; do you have a test I can run which causes this?

Also, I was referring to IntelMPI combined with PMI2. Does it also happen with OpenMPI? Maybe I made a wrong assumption from comment 0, and it runs just fine with OpenMPI's mpirun but not with srun and OpenMPI:

> I can run them when we use "mpirun" command provided by OpenMPI 4.0.4.

You are correct.
NG: srun (pmi2) under the OpenMPI environment
OK: mpirun of OpenMPI

In the jobscript, I just run:
srun ./IMB-MPI1 AlltoAll -npmin XXXXX

The following parameters are set for the InfiniBand environment:
pml = ucx
osc = ucx
spml = ucx
btl = ^openib
coll_hcoll_enable = 0

(In reply to issp2020support from comment #27)
> NG: srun (pmi2) under the OpenMPI environment
> OK: mpirun of OpenMPI

I have been working a bit more on this issue.

Your last example is a run of IMB-MPI1. This is software compiled with the IntelMPI libraries, so while you say it is failing under OpenMPI, in this case it also fails in an IntelMPI environment. Does that make sense?

I run it like this, depending on which PMI I want:

PMI-1: export I_MPI_PMI_LIBRARY=/nfs/slurm/20.11/inst/lib/libpmi.so
PMI-2: export I_MPI_PMI_LIBRARY=/nfs/slurm/20.11/inst/lib/libpmi2.so

Then I source the Intel vars and run it:

]$ . ./bin/intelmpi/compilers_and_libraries/linux/bin/compilervars.sh intel64
]$ export I_MPI_PMI_LIBRARY=/nfs/slurm/20.11/inst/lib/libpmi2.so
]$ export I_MPI_PMI2=yes
]$ salloc -N20
]$ srun --mpi=pmi2 ./bin/intelmpi/compilers_and_libraries_2018.2.199/linux/mpi_2019/intel64/bin/IMB-MPI1 AlltoAll -npmin 10

What I see is that as the number of ranks increases, the buffer size of the KVS-Fence operation at startup grows. I even crashed my machine, but I do not have a way to test this and see the growth with more than 3000 ranks.
I wanted to make a graph and see whether it makes sense that in your case it is using a buffer of 117260348 bytes.

My guess is that when using mpirun, the pmi_proxy manages the buffers differently than srun/libpmi2 does, but I have no conclusion yet, and it is proving difficult to figure out the exact differences that make srun+libpmi2 scale differently. I think it is a bit outside the scope of the Slurm code itself and more of a design issue. Some ideas I am investigating may be related to what was studied in this 'old' (2015) paper, which is the reason PMIx exists:

"Non-blocking PMI Extensions for Fast MPI Startup" - S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda.
https://souravzzz.github.io/papers/ccgrid_15.pdf

Let me know if you observe anything relevant from your side.

Thanks
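For background on why the KVS-Fence payload grows with the rank count: during wire-up, each rank publishes its endpoint information into the PMI key-value store, and the fence makes all of those entries retrievable by every other rank, so the aggregated database scales with the total number of ranks. Below is a minimal sketch of that pattern against the PMI2 API. It is illustrative only: the header path is assumed, the key/value contents are made up, error handling is omitted, and this is not how IMB or the MPI libraries actually implement their startup.

/* Minimal PMI2 wire-up sketch (link against Slurm's libpmi2). */
#include <stdio.h>
#include <slurm/pmi2.h>	/* assumed install path of the PMI2 header */

int main(void)
{
	int spawned, size, rank, appnum;
	char key[64], value[1024];

	PMI2_Init(&spawned, &size, &rank, &appnum);

	/* Each rank publishes its own endpoint ("business card") data. */
	snprintf(key, sizeof(key), "endpoint-%d", rank);
	snprintf(value, sizeof(value), "transport-address-of-rank-%d", rank);
	PMI2_KVS_Put(key, value);

	/* Collective synchronization: after the fence, every rank's Put is
	 * retrievable by every other rank via PMI2_KVS_Get, so the exchanged
	 * KVS database grows with the total number of ranks. */
	PMI2_KVS_Fence();

	PMI2_Finalize();
	printf("rank %d of %d finished wire-up\n", rank, size);
	return 0;
}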
> Your last example is a run of IMB-MPI1. This is software compiled with the
> IntelMPI libraries, so while you say it is failing under OpenMPI, in this
> case it also fails in an IntelMPI environment. Does that make sense?

I compiled IMB from source with the Intel Compiler and OpenMPI.
The location of the source (2020u4) is as follows:
/InstallDir/compilers_and_libraries_2020.4.304/linux/mpi/benchmarks/imb/src_c
Hi,

I apologize for the late response. Here are my conclusions:

The test used is the IMB-MPI1 AlltoAll collective. I've found that this test explicitly uses PMI-1. PMI-1 doesn't scale, and one of its major handicaps is the memory footprint and having to pass a lot of information between processes. This could explain the 'large' ~120 MB buffer we're seeing here. Each node requires O(number of processes * processes per node) memory for storing remote endpoint information. For example, with SHMEM PMI this can be reduced to O(ppn). With PMIx or PMI-2 this is improved in other ways.

Solution 1: Keep the limits as they are, and document that PMI-1 is deprecated and has a high memory footprint which, in collective all-to-all operations, can exceed our network buffer limits. Recommend people not to use PMI-1 (which we already do). I think this is the better option.

Solution 2: Increase our limits, but by how much? I think the current values are large enough for expected workloads.

Note that during the investigation of the issue I was focused on pmi2, since the pmi2-related errors and MpiDefault=pmi2 confused me. But the test internally uses PMI-1 methods; since the pmi2 plugin is compatible with pmi1, it still works.

Besides other important limitations, the KVS database of PMI-1 doesn't have a way to limit the scope of the information it puts into the local KVS db. This means that all information is global, and when another process requires this information, *everything* must be passed. The information stored in the KVS db can be of any sort, such as system info (processor distribution, shared-memory keys, SMP, NUMA, and so on), among other related MPI/program info. As a result, every process (rank) can hold several KiB of data. In an all-gather operation, all nodes get the info of all other processes, and with a large number of processes the cost grows very quickly.

Besides that, the test that is failing is:
./IMB-MPI1 AlltoAll -npmin XXXXX

This test is classified under:
--------
Collective Benchmarks
Collective benchmarks measure MPI collective operations. Each benchmark is run with varying message lengths. The timing is averaged over multiple samples. The basic MPI data type for all messages is MPI_BYTE for pure data movement functions and MPI_FLOAT for reductions. Collective benchmarks show bare timings.
--------

This introduces 'varying message lengths', which, given a large number of procs, I guess can also lead to a ~120 MB buffer.

https://software.intel.com/content/www/us/en/develop/documentation/imb-user-guide/top/mpi-1-benchmarks/classification-of-mpi-1-benchmarks.html

So my suggestion is to just ignore this error and recommend that users NOT use PMI-1. Only if you really need non-scalable software to run with PMI-1 should you increase the limits.

Does that make sense to you?

Note,
You can see how the test is using PMI-1 under the hood:
> [2021-02-12T10:18:20.490] [189177.0] mpi/pmi2: got client PMI1 init, version=2.0 <--------------
> [2021-02-12T10:18:20.490] [189177.0] debug2: mpi/pmi2: _tree_listen_readable
> [2021-02-12T10:18:20.490] [189177.0] debug2: mpi/pmi2: _task_readable
> [2021-02-12T10:18:20.490] [189177.0] debug2: mpi/pmi2: _task_readable
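To see how a payload on the order of the 117260348 bytes from the logs can arise from this full-exchange behavior, here is a back-of-envelope sketch. The tasks-per-node and per-rank KVS sizes are assumptions chosen purely for illustration (the job's actual -n value is not shown in this report); only the 144-node count comes from the sbatch line above.

/* Back-of-envelope model of how a full PMI KVS exchange can reach ~112 MB.
 * The per-rank KVS size and tasks-per-node are assumptions for illustration
 * only; just the node count (-N 144) comes from this report. */
#include <stdio.h>

int main(void)
{
	const long nodes = 144;                    /* from: sbatch -n XXXXX -c 1 -N 144 */
	const long tasks_per_node = 40;            /* assumed */
	const long kvs_bytes_per_rank = 20 * 1024; /* assumed ~20 KiB of KVS data per rank */

	long ranks = nodes * tasks_per_node;
	/* With a global KVS exchange, each node ends up holding every rank's data. */
	long payload = ranks * kvs_bytes_per_rank;

	printf("%ld ranks -> ~%ld MB of KVS data per node\n",
	       ranks, payload / (1024 * 1024));
	return 0;
}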
Thank you for the update.

I understand that PMI-1 should not be used, but I still don't know how to avoid it. Does this code (or users' codes) request PMI-1 explicitly? Could you tell me where and how it should be rewritten?

In addition, is there any system setting to force PMI-2 regardless of the user's request?

(In reply to issp2020support from comment #35)
> Does this code (or users' codes) request PMI-1 explicitly?
> Could you tell me where and how it should be rewritten?

Hi,

I found that the log was misleading. It said PMI-1 was being used, but that is incorrect; this must be changed in the code. So we're back at the starting point, though I still think this is a scalability issue.

Hi,

This is just an update to confirm that our problem is a scalability issue with PMI2, 100% confirmed. To this extent it is out of our (Slurm) control. We are now internally discussing whether we have any better way to mitigate the issue than just increasing the max buffer length.

We'll keep you posted ASAP. Sorry for the delay in replying.

Regards,
Carlos.

Thank you for the update. I'll wait for info about the mitigation.

Hi,

Sorry that we haven't updated you before; this doesn't mean we weren't working on the issue. We have been internally developing a patch for this. We are, I hope, close to the end of the review process, and I'll let you know once the commits have been pushed.

Cheers,
Carlos.

Hi,

Fixes for this error have been pushed to the master branch (for the future major release 22.05):

* 3a4bf5e7f5 (HEAD -> master, origin/master, origin/HEAD) Merge branch 'bug10735'
|\
| * 01977768f2 Remove size check from unpackstr_array
| * 5104405728 Revert "Add unpack bounds check to eliminate Coverity errors"
| * bd9f9b466e Revert "Add size checks on all unpack array functions"
|/

I'm going to close this bug as fixed now.

Have a good day,
Carlos.

Thank you for the update. Will the same fix be included in 21.08.X?

I'm afraid not. The changes made are important enough that we are keeping them only in master. As always, you can derive a patch from those commits, but bear in mind we don't provide official support for ad-hoc patches applied outside the branches in which we have published those commits.

Best regards,
Carlos.

I've discussed your needs internally. Give us a while; I hope to see these changes in 21.08 as well. I'll be back ASAP.

Cheers,
Carlos.

Here it is.

* 2ea5aaad66 (HEAD -> slurm-21.08, origin/slurm-21.08) Merge branch 'bug10735' into slurm-21.08
|\
| * 11f38c7933 (bug10735) Update 21.08 NEWS
| * 5f8a025226 Remove size check from unpackstr_array
| * 1d5a6cdf5a Revert "Add unpack bounds check to eliminate Coverity errors"
| * cfb600cc6c Revert "Add size checks on all unpack array functions"
|/

We expect the next 21.08 release to happen in the next few weeks, but you can apply these commits as a patch now if you want.

Cheers,
Carlos.

Thank you. I'll wait for the 21.08 release that includes the fix.