Yesterday our users tried to launch a job on 6250 nodes with 48 tasks per node. The job failed with the following error message:

> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
> slurmstepd: error: *** STEP 194947.0 ON i01r01c01s01 CANCELLED AT
> 2019-09-19T21:19:42 ***

The slurmd log on the batch node shows the following:

[2019-09-19T21:21:33.051] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:33.051] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:33.051] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:33.061] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:34.617] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:34.617] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:34.617] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:34.627] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:37.180] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:37.180] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:37.180] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:37.190] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:41.701] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:41.701] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:41.701] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:41.711] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:50.233] error: unpackmem_xmalloc: Buffer to be unpacked is too large (11138290 > 10000000)
[2019-09-19T21:21:50.233] error: Malformed RPC of type REQUEST_FORWARD_DATA(5029) received
[2019-09-19T21:21:50.233] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2019-09-19T21:21:50.243] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2019-09-19T21:21:50.778] [194949.0] error: *** STEP 194949.0 ON i01r01c01s01 CANCELLED AT 2019-09-19T21:21:50 ***
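The slurmd errors above suggest that the forwarded PMI2 key-value payload (11,138,290 bytes) exceeds a hard unpack cap of 10,000,000 bytes on the receiving node, so the whole RPC is rejected as malformed. Below is a minimal, illustrative C sketch of that kind of guard; the constant and function names are assumptions chosen to mirror the log message, not Slurm's actual source.

/* Illustrative sketch only -- not Slurm's actual code. It mimics the kind
 * of guard implied by "unpackmem_xmalloc: Buffer to be unpacked is too
 * large (11138290 > 10000000)": the receiver refuses to allocate a packed
 * buffer whose declared length exceeds a hard cap, and the RPC is then
 * treated as malformed. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PACK_MEM_LEN 10000000u   /* assumed cap, matching the log */

/* Unpack a length-prefixed byte buffer; return NULL if the declared
 * length is implausibly large or longer than the data received. */
static void *unpack_mem_guarded(const uint8_t *buf, size_t buf_len,
                                size_t *out_len)
{
    uint32_t declared;

    if (buf_len < sizeof(declared))
        return NULL;
    memcpy(&declared, buf, sizeof(declared));

    if (declared > MAX_PACK_MEM_LEN) {
        fprintf(stderr,
                "unpack: buffer to be unpacked is too large (%u > %u)\n",
                (unsigned) declared, MAX_PACK_MEM_LEN);
        return NULL;              /* caller reports a malformed RPC */
    }
    if (buf_len - sizeof(declared) < declared)
        return NULL;              /* header longer than data received */

    void *data = malloc(declared);
    if (!data)
        return NULL;
    memcpy(data, buf + sizeof(declared), declared);
    *out_len = declared;
    return data;
}

int main(void)
{
    /* A 4-byte length prefix claiming ~11.1 MB, like the failing RPC. */
    uint8_t hdr[4];
    uint32_t claimed = 11138290u;
    size_t len = 0;

    memcpy(hdr, &claimed, sizeof(hdr));
    return unpack_mem_guarded(hdr, sizeof(hdr), &len) ? 1 : 0;
}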
Can we please get the output of the following:
> scontrol show job 194947
> scontrol show job 194949
(In reply to Nate Rini from comment #1)
> Can we please get the output of the following:
> > scontrol show job 194947
> > scontrol show job 194949

If the jobs are no longer known to slurmctld, please provide this output:
> sacct -plj 194947
> sacct -plj 194949

Can you also please verify if the job was using pmi2 or pmix? A current slurm.conf would also be nice to have.
di52jap@login08:/hppfs/work/pr27ca/di52jap/tests/mpi_affinity/for_schedmd> sacct -plj 194947 JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot| 194947|194947|HP_scan|tmp2||||||||||||||||||600000|00:02:48|FAILED|2:0||Unknown|Unknown|Unknown|80Gn|110.59M|||||||||||billing=300000,cpu=300000,node=6250|billing=600000,cpu=600000,energy=110589956,mem=500000G,node=6250|||||||||||||| 194947.batch|194947.batch|batch||1915596K|i01r01c01s01|0|1915596K|457768K|i01r01c01s01|0|457768K|0|i01r01c01s01|0|0|00:00:04|i01r01c01s01|0|00:00:04|1|96|00:02:48|FAILED|2:0|316.18M|0|0|0|80Gn|17.48K|49.31M|i01r01c01s01|0|49.31M|2.84M|i01r01c01s01|0|2.84M||||cpu=96,mem=80G,node=1|cpu=00:00:04,energy=17482,fs/disk=51704977,mem=457768K,pages=0,vmem=1915596K|cpu=00:00:04,energy=15759,fs/disk=51704977,mem=457768K,pages=0,vmem=1915596K|cpu=i01r01c01s01,energy=i01r01c01s01,fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=i01r01c01s01,vmem=i01r01c01s01|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:04,energy=15759,fs/disk=51704977,mem=457768K,pages=0,vmem=1915596K|cpu=i01r01c01s01,energy=i01r01c01s01,fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=i01r01c01s01,vmem=i01r01c01s01|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:04,energy=17482,fs/disk=51704977,mem=457768K,pages=0,vmem=1915596K|energy=143,fs/disk=2977641|energy=i01r01c01s01,fs/disk=i01r01c01s01|fs/disk=0|energy=123,fs/disk=2977641|energy=123,fs/disk=2977641| 194947.extern|194947.extern|extern||175400K||6023|177169415|6416K|i04r03c02s12|2532|4512457|0||6023|0|00:00:00||6023|00:00:00|6250|600000|00:02:49|COMPLETED|0:0|1.49T|0|0|0|80Gn|110.59M|0.00M||6023|0.00M|0||6023|0||||billing=600000,cpu=600000,mem=500000G,node=6250|cpu=00:00:00,energy=17694,fs/disk=1948,mem=4512457,pages=0,vmem=177169415|cpu=00:00:00,energy=21485,fs/disk=1948,mem=6416K,pages=0,vmem=175400K|cpu=(null),energy=i02r04c03s01,fs/disk=(null),mem=i04r03c02s12,pages=(null),vmem=(null)|cpu=6023,fs/disk=6023,mem=2532,pages=6023,vmem=6023|cpu=00:00:00,energy=14826,fs/disk=1948,mem=660K,pages=0,vmem=4208K|cpu=(null),energy=i02r10c02s09,fs/disk=(null),mem=(null),pages=(null),vmem=(null)|cpu=6023,fs/disk=6023,mem=2750,pages=6023,vmem=6080|cpu=00:00:00,energy=110589956,fs/disk=12175000,mem=27541856K,pages=0,vmem=1081356296K|energy=266,fs/disk=0|energy=(null),fs/disk=(null)|fs/disk=6023|energy=113,fs/disk=0|energy=708731,fs/disk=0| 194947.0|194947.0|hemepure||4631472K|i03r06c01s02|92928|4732900785|68824K||198611|63908483|0||296544|0|00:00:00|i03r04c05s09|88674|00:00:00|300000|300000|00:03:35|CANCELLED by 
3954087|0:9|9.35M|Unknown|Unknown|Unknown|80Gn|104.11M|2.20M||147982|2.20M|0.09M||296544|0.09M||||cpu=300000,mem=500000G,node=6250|cpu=00:00:00,energy=347,fs/disk=2311068,mem=63908483,pages=0,vmem=4732900785|cpu=00:00:00,energy=22632,fs/disk=2311198,mem=68824K,pages=0,vmem=4631472K|cpu=(null),energy=i02r04c03s01,fs/disk=(null),mem=(null),pages=(null),vmem=i03r06c01s02|cpu=241056,fs/disk=147982,mem=198611,pages=296544,vmem=92928|cpu=00:00:00,energy=15421,fs/disk=2294849,mem=58840K,pages=0,vmem=4620632K|cpu=i03r04c05s09,energy=(null),fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=(null),vmem=(null)|cpu=88674,fs/disk=0,mem=27,pages=296544,vmem=296608|cpu=1-09:30:23,energy=104106869,fs/disk=693320622447,mem=18723188584K,pages=0,vmem=1386592026980K|energy=288,fs/disk=90205|energy=i02r04c03s01,fs/disk=(null)|fs/disk=296544|energy=2,fs/disk=90051|energy=641246,fs/disk=27015422592| di52jap@login08:/hppfs/work/pr27ca/di52jap/tests/mpi_affinity/for_schedmd> sacct -plj 194949 JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot| 194949|194949|HP_scan|tmp2||||||||||||||||||393216|00:01:04|FAILED|2:0||Unknown|Unknown|Unknown|80Gn|27.07M|||||||||||billing=196608,cpu=196608,node=4096|billing=393216,cpu=393216,energy=27069414,mem=320T,node=4096|||||||||||||| 194949.batch|194949.batch|batch||1886548K|i01r01c01s01|0|1886548K|295100K|i01r01c01s01|0|295100K|0|i01r01c01s01|0|0|00:00:05|i01r01c01s01|0|00:00:05|1|96|00:01:04|FAILED|2:0|612.39M|0|0|0|80Gn|6.38K|47.47M|i01r01c01s01|0|47.47M|3.29M|i01r01c01s01|0|3.29M||||cpu=96,mem=80G,node=1|cpu=00:00:05,energy=6384,fs/disk=49777068,mem=295100K,pages=0,vmem=1886548K|cpu=00:00:05,energy=6384,fs/disk=49777068,mem=295100K,pages=0,vmem=1886548K|cpu=i01r01c01s01,energy=i01r01c01s01,fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=i01r01c01s01,vmem=i01r01c01s01|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:05,energy=6384,fs/disk=49777068,mem=295100K,pages=0,vmem=1886548K|cpu=i01r01c01s01,energy=i01r01c01s01,fs/disk=i01r01c01s01,mem=i01r01c01s01,pages=i01r01c01s01,vmem=i01r01c01s01|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:05,energy=6384,fs/disk=49777068,mem=295100K,pages=0,vmem=1886548K|energy=117,fs/disk=3448396|energy=i01r01c01s01,fs/disk=i01r01c01s01|fs/disk=0|energy=117,fs/disk=3448396|energy=117,fs/disk=3448396| 
194949.extern|194949.extern|extern||175308K||4089|175308K|6216K||2721|4364992|0||4089|0|00:00:00||4089|00:00:00|4096|393216|00:01:04|COMPLETED|0:0|830.87G|0|0|0|80Gn|27.07M|0.00M||4089|0.00M|0||4089|0||||billing=393216,cpu=393216,mem=320T,node=4096|cpu=00:00:00,energy=6608,fs/disk=1948,mem=4364992,pages=0,vmem=175308K|cpu=00:00:00,energy=6893,fs/disk=1948,mem=6216K,pages=0,vmem=175308K|cpu=(null),energy=i04r01c01s01,fs/disk=(null),mem=(null),pages=(null),vmem=(null)|cpu=4089,fs/disk=4089,mem=2721,pages=4089,vmem=4089|cpu=00:00:00,energy=5698,fs/disk=1948,mem=4M,pages=0,vmem=175308K|cpu=(null),energy=i03r01c04s08,fs/disk=(null),mem=i04r10c01s11,pages=(null),vmem=(null)|cpu=4089,fs/disk=4089,mem=2232,pages=4089,vmem=4089|cpu=00:00:00,energy=27069414,fs/disk=7792K,mem=17459968K,pages=0,vmem=701232M|energy=217,fs/disk=0|energy=i01r01c01s10,fs/disk=(null)|fs/disk=4089|energy=113,fs/disk=0|energy=464217,fs/disk=0| 194949.0|194949.0|hemepure||4621660K|i04r06c02s11|94032|4638251747|62544K|i04r06c02s11|94079|29515197|0||167137|0|00:00:00||149963|00:00:00|196608|196608|00:01:01|CANCELLED by 3954087|0:9|6.24M|Unknown|Unknown|Unknown|80Gn|23.13M|2.18M||165022|2.18M|0.07M||167136|0.07M||||cpu=196608,mem=320T,node=4096|cpu=00:00:00,energy=117,fs/disk=2284014,mem=29515197,pages=0,vmem=4638251747|cpu=00:00:00,energy=6142,fs/disk=2285063,mem=62544K,pages=0,vmem=4621660K|cpu=(null),energy=(null),fs/disk=(null),mem=i04r06c02s11,pages=(null),vmem=i04r06c02s11|cpu=196347,fs/disk=165022,mem=94079,pages=167137,vmem=94032|cpu=00:00:00,energy=5016,fs/disk=2259930,mem=26824K,pages=0,vmem=4528600K|cpu=(null),energy=i01r06c01s12,fs/disk=i01r01c01s02,mem=i03r04c04s12,pages=(null),vmem=(null)|cpu=149963,fs/disk=52,mem=50521,pages=167137,vmem=167142|cpu=19:19:52,energy=23132174,fs/disk=449055497029,mem=5666917964K,pages=0,vmem=890544335532K|energy=186,fs/disk=70189|energy=(null),fs/disk=(null)|fs/disk=167136|energy=2,fs/disk=69964|energy=421421,fs/disk=13755576836|
> Can you also please verify if the job was using pmi2 or pmix? A current
> slurm.conf would also be nice to have.

slurm.conf:
MpiDefault=pmi2

di52jap@login08:/hppfs/work/pr27ca/di52jap/tests/mpi_affinity/for_schedmd> ls -l /usr/lib64/libpmi*
lrwxrwxrwx 1 root root     16 Jul 12 18:24 /usr/lib64/libpmi2.so -> libpmi2.so.0.0.0
lrwxrwxrwx 1 root root     16 Jul 12 18:24 /usr/lib64/libpmi2.so.0 -> libpmi2.so.0.0.0
-rwxr-xr-x 1 root root 226688 Dec 20  2018 /usr/lib64/libpmi2.so.0.0.0
lrwxrwxrwx 1 root root     15 Jul 12 18:24 /usr/lib64/libpmi.so -> libpmi.so.0.0.0
lrwxrwxrwx 1 root root     15 Jul 12 18:24 /usr/lib64/libpmi.so.0 -> libpmi.so.0.0.0
-rwxr-xr-x 1 root root 226328 Dec 20  2018 /usr/lib64/libpmi.so.0.0.0
Please provide a copy of your slurm.conf. Is this a consistent error or is it only transient? Have only 2 jobs been affected?
Created attachment 11671 [details]
slurm.conf
(In reply to Máni Maríus Xander from comment #9)
> Created attachment 11671 [details]
> slurm.conf

Is this a consistent error or is it only transient? Have only 2 jobs been affected?
These are the only two jobs that have been reported with this error, which is probably related to these being unusually large jobs for the cluster.
I believe only two such full-machine jobs were submitted after the last maintenance, and both failed.
Does the application call PMI2_KVS_Put() directly, or is this done via an MPI layer?
(In reply to Nate Rini from comment #15)
> Does the application call PMI2_KVS_Put() directly, or is this done via an
> MPI layer?

The user application was scalasca-1.5 (the binary name was 'scan'), probably using only MPI calls. A quick check of the source:

[vgamayun@vgamayun-mac05 scalasca-2.5]$ grep pmi2 . -ri
...

returns nothing.
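Since the grep finds no direct PMI2 usage, the puts presumably come from the MPI library's own startup wire-up. For reference, a minimal sketch of what a direct PMI2_KVS_Put() call against Slurm's libpmi2 (slurm/pmi2.h) looks like; the key/value names below are made up for illustration and are not what any particular MPI implementation publishes.

/* Minimal sketch of a direct PMI2 key-value exchange, the path an MPI
 * library normally takes internally. Compile against Slurm's libpmi2,
 * e.g. "cc pmi2_put.c -lpmi2", and launch with "srun --mpi=pmi2 ./a.out". */
#include <stdio.h>
#include <slurm/pmi2.h>

int main(void)
{
    int spawned = 0, size = 0, rank = 0, appnum = 0;
    char key[64], val[64];

    if (PMI2_Init(&spawned, &size, &rank, &appnum) != PMI2_SUCCESS)
        return 1;

    /* Each task publishes one entry; at 300,000 tasks even small
     * entries add up to megabytes of aggregated KVS data. */
    snprintf(key, sizeof(key), "task-%d-addr", rank);
    snprintf(val, sizeof(val), "node-endpoint-for-rank-%d", rank);
    PMI2_KVS_Put(key, val);

    /* Collective commit/synchronization of all puts. */
    PMI2_KVS_Fence();

    PMI2_Finalize();
    return 0;
}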
I believe I have recreated the issue based on your description. There is a total size limit on the KVS dictionary populated by PMI2_KVS_Put(), and it is being hit at these job scales. I am looking now at how best to fix the issue.

--Nate
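A back-of-envelope check shows why these job sizes trip the 10,000,000-byte cap seen in the slurmd log. The ~37-byte per-task KVS footprint below is an assumption picked to line up with the observed 11,138,290-byte payload, not a measured value.

/* Back-of-envelope sketch only: the per-task byte count is assumed. */
#include <stdio.h>

int main(void)
{
    const long nodes = 6250;
    const long tasks_per_node = 48;
    const long tasks = nodes * tasks_per_node;        /* 300000 */
    const long bytes_per_task = 37;                   /* assumed */
    const long unpack_cap = 10000000;                 /* from the log */

    long total = tasks * bytes_per_task;              /* ~11.1 MB */
    printf("estimated KVS payload: %ld bytes (cap %ld)\n", total, unpack_cap);
    printf("over the cap: %s\n", total > unpack_cap ? "yes" : "no");
    return 0;
}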
Victor et al.

This patch should correct the issue:
https://github.com/SchedMD/slurm/commit/f7bed728b5b63633079829da21274032187a5d0e

Please reply to this ticket if you have any issues or questions.

Thanks,
--Nate

*** This ticket has been marked as a duplicate of ticket 7495 ***