Ticket 7781

Summary: failed startup of large jobs, error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
Product: Slurm Reporter: lenovosng
Component: Other    Assignee: Nate Rini <nate>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: mxander, nate, victor.gamayunov
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=1546
https://bugs.schedmd.com/show_bug.cgi?id=7815
https://bugs.schedmd.com/show_bug.cgi?id=7495
https://bugs.schedmd.com/show_bug.cgi?id=10735
Site: LRZ
Ticket Depends on: 7495    
Ticket Blocks:    
Attachments: slurm.conf

Description lenovosng 2019-09-20 08:10:30 MDT
Yesterday our users tried to launch a job on 6250 nodes with 48 tasks per node.
The job failed with the following error message:

> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
> slurmstepd: error: *** STEP 194947.0 ON i01r01c01s01 CANCELLED AT 
> 2019-09-19T21:19:42 ***

The slurmd log on the batch node shows the following:
[2019-09-19T21:21:33.051] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:33.051] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:33.051] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:33.061] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:34.617] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:34.617] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:34.617] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:34.627] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:37.180] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:37.180] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:37.180] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:37.190] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:41.701] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:41.701] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:41.701] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:41.711] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:50.233] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:50.233] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:50.233] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:50.243] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:50.778] [194949.0] error: *** STEP 194949.0 ON i01r01c01s01 CANCELLED AT 2019-09-19T21:21:50 ***
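
For reference, this error comes from a sanity check on the length prefix of an incoming RPC payload: slurmd refuses to allocate a buffer whose declared size exceeds a hard cap (10000000 bytes here) and treats the message as malformed, which is why the REQUEST_FORWARD_DATA RPC carrying the ~11 MB forwarded temp kvs is rejected. Below is a minimal sketch of that kind of guard, for illustration only; the constant, function names, and wire format are assumptions, not Slurm's actual source.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed cap, matching the value printed in the log; the real
 * limit lives inside Slurm's pack/unpack code. */
#define MAX_UNPACK_LEN 10000000

/* Unpack a length-prefixed blob from 'buf'. Sets *err and returns
 * NULL if the declared length exceeds the cap, mirroring the
 * "Buffer to be unpacked is too large" failure mode. */
static void *unpack_blob(const unsigned char *buf, size_t buflen,
                         uint32_t *out_len, int *err)
{
    uint32_t declared;

    *err = 0;
    *out_len = 0;
    if (buflen < sizeof(declared)) {
        *err = 1;
        return NULL;
    }
    memcpy(&declared, buf, sizeof(declared));

    if (declared > MAX_UNPACK_LEN) {
        fprintf(stderr,
                "error: buffer to be unpacked is too large (%u > %u)\n",
                (unsigned) declared, (unsigned) MAX_UNPACK_LEN);
        *err = 1;
        return NULL;    /* caller treats the RPC as malformed */
    }

    if (buflen - sizeof(declared) < declared) {
        *err = 1;       /* header claims more data than was received */
        return NULL;
    }

    void *data = malloc(declared);
    if (data)
        memcpy(data, buf + sizeof(declared), declared);
    *out_len = declared;
    return data;
}

int main(void)
{
    /* Fake message whose length prefix matches the size from the log. */
    unsigned char msg[8] = {0};
    uint32_t too_big = 11138290;
    memcpy(msg, &too_big, sizeof(too_big));

    uint32_t len;
    int err;
    void *p = unpack_blob(msg, sizeof(msg), &len, &err);
    if (err)
        puts("payload rejected, as with REQUEST_FORWARD_DATA above");
    free(p);
    return 0;
}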
Comment 1 Nate Rini 2019-09-20 10:52:57 MDT
Can we please get the output of the following:
> scontrol show job 194947
> scontrol show job 194949
Comment 2 Nate Rini 2019-09-20 12:39:00 MDT
(In reply to Nate Rini from comment #1)
> Can we please get the output of the following:
> > scontrol show job 194947
> > scontrol show job 194949

If the jobs are no longer known to slurmctld, please provide this output:
> sacct -plj 194947
> sacct -plj 194949

Can you also please verify whether the job was using pmi2 or pmix? A current slurm.conf would also be nice to have.
Comment 3 Victor Gamayunov 2019-09-23 10:54:10 MDT
di52jap@login08:/hppfs/work/pr27ca/di52jap/tests/mpi_affinity/for_schedmd> sacct -plj 194947
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
194947|194947|HP_scan|tmp2||||||||||||||||||600000|00:02:48|FAILED|2:0||Unknown|Unknown|Unknown|80Gn|110.59M|||||||||||billing=300000,cpu=300000,node=6250|billing=600000,cpu=600000,energy=110589956,mem=500000G,node=6250||||||||||||||
194947.batch|194947.batch|batch||1915596K|i01r01c01s01|0|1915596K|457768K|i01r01c01s01|0|457768K|0|i01r01c01s01|0|0|00:00:04|i01r01c01s01|0|00:00:04|1|96|00:02:48|FAILED|2:0|316.18M|0|0|0|80Gn|17.48K|49.31M|i01r01c01s01|0|49.31M|2.84M|i01r01c01s01|0|2.84M||||cpu=96,mem=80G,node=1|cpu=00:00:04,energy=17482,fs/disk=51704977,mem=457768K,pages=0,vmem=1915596K|cpu=00:00:04,energy=15759,fs/disk=51704977,mem=457768K,pages=0,vmem=1915596K|cpu=i01r01c01s01,energy=i01r01c01s01,fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=i01r01c01s01,vmem=i01r01c01s01|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:04,energy=15759,fs/disk=51704977,mem=457768K,pages=0,vmem=1915596K|cpu=i01r01c01s01,energy=i01r01c01s01,fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=i01r01c01s01,vmem=i01r01c01s01|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:04,energy=17482,fs/disk=51704977,mem=457768K,pages=0,vmem=1915596K|energy=143,fs/disk=2977641|energy=i01r01c01s01,fs/disk=i01r01c01s01|fs/disk=0|energy=123,fs/disk=2977641|energy=123,fs/disk=2977641|
194947.extern|194947.extern|extern||175400K||6023|177169415|6416K|i04r03c02s12|2532|4512457|0||6023|0|00:00:00||6023|00:00:00|6250|600000|00:02:49|COMPLETED|0:0|1.49T|0|0|0|80Gn|110.59M|0.00M||6023|0.00M|0||6023|0||||billing=600000,cpu=600000,mem=500000G,node=6250|cpu=00:00:00,energy=17694,fs/disk=1948,mem=4512457,pages=0,vmem=177169415|cpu=00:00:00,energy=21485,fs/disk=1948,mem=6416K,pages=0,vmem=175400K|cpu=(null),energy=i02r04c03s01,fs/disk=(null),mem=i04r03c02s12,pages=(null),vmem=(null)|cpu=6023,fs/disk=6023,mem=2532,pages=6023,vmem=6023|cpu=00:00:00,energy=14826,fs/disk=1948,mem=660K,pages=0,vmem=4208K|cpu=(null),energy=i02r10c02s09,fs/disk=(null),mem=(null),pages=(null),vmem=(null)|cpu=6023,fs/disk=6023,mem=2750,pages=6023,vmem=6080|cpu=00:00:00,energy=110589956,fs/disk=12175000,mem=27541856K,pages=0,vmem=1081356296K|energy=266,fs/disk=0|energy=(null),fs/disk=(null)|fs/disk=6023|energy=113,fs/disk=0|energy=708731,fs/disk=0|
194947.0|194947.0|hemepure||4631472K|i03r06c01s02|92928|4732900785|68824K||198611|63908483|0||296544|0|00:00:00|i03r04c05s09|88674|00:00:00|300000|300000|00:03:35|CANCELLED by 3954087|0:9|9.35M|Unknown|Unknown|Unknown|80Gn|104.11M|2.20M||147982|2.20M|0.09M||296544|0.09M||||cpu=300000,mem=500000G,node=6250|cpu=00:00:00,energy=347,fs/disk=2311068,mem=63908483,pages=0,vmem=4732900785|cpu=00:00:00,energy=22632,fs/disk=2311198,mem=68824K,pages=0,vmem=4631472K|cpu=(null),energy=i02r04c03s01,fs/disk=(null),mem=(null),pages=(null),vmem=i03r06c01s02|cpu=241056,fs/disk=147982,mem=198611,pages=296544,vmem=92928|cpu=00:00:00,energy=15421,fs/disk=2294849,mem=58840K,pages=0,vmem=4620632K|cpu=i03r04c05s09,energy=(null),fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=(null),vmem=(null)|cpu=88674,fs/disk=0,mem=27,pages=296544,vmem=296608|cpu=1-09:30:23,energy=104106869,fs/disk=693320622447,mem=18723188584K,pages=0,vmem=1386592026980K|energy=288,fs/disk=90205|energy=i02r04c03s01,fs/disk=(null)|fs/disk=296544|energy=2,fs/disk=90051|energy=641246,fs/disk=27015422592|
di52jap@login08:/hppfs/work/pr27ca/di52jap/tests/mpi_affinity/for_schedmd> sacct -plj 194949
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
194949|194949|HP_scan|tmp2||||||||||||||||||393216|00:01:04|FAILED|2:0||Unknown|Unknown|Unknown|80Gn|27.07M|||||||||||billing=196608,cpu=196608,node=4096|billing=393216,cpu=393216,energy=27069414,mem=320T,node=4096||||||||||||||
194949.batch|194949.batch|batch||1886548K|i01r01c01s01|0|1886548K|295100K|i01r01c01s01|0|295100K|0|i01r01c01s01|0|0|00:00:05|i01r01c01s01|0|00:00:05|1|96|00:01:04|FAILED|2:0|612.39M|0|0|0|80Gn|6.38K|47.47M|i01r01c01s01|0|47.47M|3.29M|i01r01c01s01|0|3.29M||||cpu=96,mem=80G,node=1|cpu=00:00:05,energy=6384,fs/disk=49777068,mem=295100K,pages=0,vmem=1886548K|cpu=00:00:05,energy=6384,fs/disk=49777068,mem=295100K,pages=0,vmem=1886548K|cpu=i01r01c01s01,energy=i01r01c01s01,fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=i01r01c01s01,vmem=i01r01c01s01|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:05,energy=6384,fs/disk=49777068,mem=295100K,pages=0,vmem=1886548K|cpu=i01r01c01s01,energy=i01r01c01s01,fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=i01r01c01s01,vmem=i01r01c01s01|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:05,energy=6384,fs/disk=49777068,mem=295100K,pages=0,vmem=1886548K|energy=117,fs/disk=3448396|energy=i01r01c01s01,fs/disk=i01r01c01s01|fs/disk=0|energy=117,fs/disk=3448396|energy=117,fs/disk=3448396|
194949.extern|194949.extern|extern||175308K||4089|175308K|6216K||2721|4364992|0||4089|0|00:00:00||4089|00:00:00|4096|393216|00:01:04|COMPLETED|0:0|830.87G|0|0|0|80Gn|27.07M|0.00M||4089|0.00M|0||4089|0||||billing=393216,cpu=393216,mem=320T,node=4096|cpu=00:00:00,energy=6608,fs/disk=1948,mem=4364992,pages=0,vmem=175308K|cpu=00:00:00,energy=6893,fs/disk=1948,mem=6216K,pages=0,vmem=175308K|cpu=(null),energy=i04r01c01s01,fs/disk=(null),mem=(null),pages=(null),vmem=(null)|cpu=4089,fs/disk=4089,mem=2721,pages=4089,vmem=4089|cpu=00:00:00,energy=5698,fs/disk=1948,mem=4M,pages=0,vmem=175308K|cpu=(null),energy=i03r01c04s08,fs/disk=(null),mem=i04r10c01s11,pages=(null),vmem=(null)|cpu=4089,fs/disk=4089,mem=2232,pages=4089,vmem=4089|cpu=00:00:00,energy=27069414,fs/disk=7792K,mem=17459968K,pages=0,vmem=701232M|energy=217,fs/disk=0|energy=i01r01c01s10,fs/disk=(null)|fs/disk=4089|energy=113,fs/disk=0|energy=464217,fs/disk=0|
194949.0|194949.0|hemepure||4621660K|i04r06c02s11|94032|4638251747|62544K|i04r06c02s11|94079|29515197|0||167137|0|00:00:00||149963|00:00:00|196608|196608|00:01:01|CANCELLED by 3954087|0:9|6.24M|Unknown|Unknown|Unknown|80Gn|23.13M|2.18M||165022|2.18M|0.07M||167136|0.07M||||cpu=196608,mem=320T,node=4096|cpu=00:00:00,energy=117,fs/disk=2284014,mem=29515197,pages=0,vmem=4638251747|cpu=00:00:00,energy=6142,fs/disk=2285063,mem=62544K,pages=0,vmem=4621660K|cpu=(null),energy=(null),fs/disk=(null),mem=i04r06c02s11,pages=(null),vmem=i04r06c02s11|cpu=196347,fs/disk=165022,mem=94079,pages=167137,vmem=94032|cpu=00:00:00,energy=5016,fs/disk=2259930,mem=26824K,pages=0,vmem=4528600K|cpu=(null),energy=i01r06c01s12,fs/disk=i01r01c01s02,mem=i03r04c04s12,pages=(null),vmem=(null)|cpu=149963,fs/disk=52,mem=50521,pages=167137,vmem=167142|cpu=19:19:52,energy=23132174,fs/disk=449055497029,mem=5666917964K,pages=0,vmem=890544335532K|energy=186,fs/disk=70189|energy=(null),fs/disk=(null)|fs/disk=167136|energy=2,fs/disk=69964|energy=421421,fs/disk=13755576836|
Comment 4 Victor Gamayunov 2019-09-23 11:01:26 MDT
> Can you also please verify if the job was using pmi2 or pmix? A current
> slurm.conf would also be nice to have.

slurm.conf:

MpiDefault=pmi2


di52jap@login08:/hppfs/work/pr27ca/di52jap/tests/mpi_affinity/for_schedmd> ls -l /usr/lib64/libpmi*
lrwxrwxrwx 1 root root     16 Jul 12 18:24 /usr/lib64/libpmi2.so -> libpmi2.so.0.0.0
lrwxrwxrwx 1 root root     16 Jul 12 18:24 /usr/lib64/libpmi2.so.0 -> libpmi2.so.0.0.0
-rwxr-xr-x 1 root root 226688 Dec 20  2018 /usr/lib64/libpmi2.so.0.0.0
lrwxrwxrwx 1 root root     15 Jul 12 18:24 /usr/lib64/libpmi.so -> libpmi.so.0.0.0
lrwxrwxrwx 1 root root     15 Jul 12 18:24 /usr/lib64/libpmi.so.0 -> libpmi.so.0.0.0
-rwxr-xr-x 1 root root 226328 Dec 20  2018 /usr/lib64/libpmi.so.0.0.0
Comment 6 Nate Rini 2019-09-23 11:49:00 MDT
Please provide a copy of your slurm.conf. Is this a consistent error, or is it only transient? Have only 2 jobs been affected?
Comment 9 Máni Maríus Xander 2019-09-24 02:28:47 MDT
Created attachment 11671 [details]
slurm.conf
Comment 10 Nate Rini 2019-09-24 10:35:20 MDT
(In reply to Máni Maríus Xander from comment #9)
> Created attachment 11671 [details]
> slurm.conf

Is this a consistent error, or is it only transient? Have only 2 jobs been affected?
Comment 11 Máni Maríus Xander 2019-09-25 08:54:06 MDT
These are the only two jobs that have been reported with this error, which is probably related to these being unusually large jobs for the cluster.
Comment 12 Victor Gamayunov 2019-09-25 09:39:18 MDT
I believe only two such full-machine jobs were submitted after the last maintenance, and both failed.
Comment 15 Nate Rini 2019-09-25 18:58:09 MDT
Does the application call PMI2_KVS_Put() directly, or is this done via an MPI layer?
Comment 17 Victor Gamayunov 2019-09-26 05:44:17 MDT
(In reply to Nate Rini from comment #15)
> Does the application call PMI2_KVS_Put() directly, or is this done via an MPI
> layer?

The user application was scalasca-1.5 (binary name was 'scan').
It is probably using only MPI calls.

A quick check of the source:

[vgamayun@vgamayun-mac05 scalasca-2.5]$ grep pmi2 . -ri

... returns nothing
Comment 18 Nate Rini 2019-09-26 09:47:07 MDT
I believe I have recreated the issue based on your description. Calls to PMI2_KVS_Put() have a total limit for the dictionary that is being hit at these job scales. Looking now at how best to fix the issue.

--Nate
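
For context on how that dictionary grows: with MpiDefault=pmi2, the MPI library (not the application) publishes one or more key/value pairs per rank via PMI2_KVS_Put() during wireup and exchanges them at PMI2_KVS_Fence(), so the aggregated payload scales roughly linearly with task count; a few dozen bytes per rank across a couple hundred thousand ranks is already on the order of the ~11 MB buffer rejected in the slurmd log above. The following is a minimal sketch of that pattern against the PMI2 C API; the key/value contents are invented for illustration, and the application in this ticket only made MPI calls.

/* Sketch of PMI2-based wireup; the header path and link flags depend
 * on the installation (e.g. gcc wireup.c -lpmi2). */
#include <stdio.h>
#include <slurm/pmi2.h>

int main(void)
{
    int spawned, size, rank, appnum, vallen;
    char jobid[64], key[64], val[128];

    if (PMI2_Init(&spawned, &size, &rank, &appnum) != PMI2_SUCCESS)
        return 1;
    PMI2_Job_GetId(jobid, (int) sizeof(jobid));

    /* Each rank publishes a small record; the aggregated KVS grows by
     * roughly strlen(key) + strlen(val) per rank, so at ~200000 ranks
     * the dictionary reaches the size rejected above.
     * (Key/value names here are invented.) */
    snprintf(key, sizeof(key), "ep-%d", rank);
    snprintf(val, sizeof(val), "endpoint-address-for-rank-%d", rank);
    PMI2_KVS_Put(key, val);

    /* Collective exchange; the aggregated temp kvs is what srun
     * forwards to the compute nodes and what the stepds unpack. */
    PMI2_KVS_Fence();

    /* Read a neighbour's record back. */
    int peer = (rank + 1) % size;
    snprintf(key, sizeof(key), "ep-%d", peer);
    PMI2_KVS_Get(jobid, peer, key, val, (int) sizeof(val), &vallen);

    PMI2_Finalize();
    return 0;
}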
Comment 26 Nate Rini 2019-09-30 20:59:30 MDT
Victor et al.

This patch should correct the issue:
https://github.com/SchedMD/slurm/commit/f7bed728b5b63633079829da21274032187a5d0e

Please reply to this ticket if you have any issues or questions.

Thanks,
--Nate

*** This ticket has been marked as a duplicate of ticket 7495 ***