Ticket 13614

Summary: error: unpackmem_xmalloc: Buffer to be unpacked is too large (1735288180 > 100000000)
Product: Slurm Reporter: Thomas Hoeffel <hoeffet1>
Component: slurmctld Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: tripiana
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: Genentech (Roche)
Attachments: slurmctld log
slurm.conf
slurmdbd log

Description Thomas Hoeffel 2022-03-14 17:51:40 MDT
Created attachment 23863 [details]
slurmctld log

Over the weekend, slurmctld crashed with this in the logs.

"error: unpackmem_xmalloc: Buffer to be unpacked is too large (1735288180 > 100000000)"

The cluster load was fairly pedestrian (see logs).

Our conditions don't, at first glance, appear related to other bugs reporting the same error. 

Please check whether there is something we are overlooking.
Comment 1 Thomas Hoeffel 2022-03-14 17:52:11 MDT
Created attachment 23864 [details]
slurm.conf
Comment 2 Thomas Hoeffel 2022-03-14 17:52:32 MDT
Created attachment 23865 [details]
slurmdbd log
Comment 3 Carlos Tripiana Montes 2022-03-15 03:20:03 MDT
Hi Thomas,

I don't see a correlation between slurmdbd.log and slurmctld.log. The last line of the dbd log is older than:

[2022-03-13T11:45:02.545] error: unpackmem_xmalloc: Buffer to be unpacked is too large (1735288180 > 100000000)

Additionally, this is the only error of this type I see in slurmctld.log, and it doesn't appear to be connected to any other log lines.

It's therefore very difficult to give you an explanation for this. I know of some scenarios where this could be triggered. All of them involve the ctld unpacking an incoming RPC and realizing it is so big that it can't be a sensible request.
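The check being described can be sketched as follows. This is purely illustrative, not Slurm's actual source; the two numbers are taken from the log message in this ticket.

```shell
# Illustrative sketch only, not Slurm code: the controller compares the
# declared length of an incoming RPC buffer against a hard cap and
# discards anything larger. Values are from the log line in this ticket.
declared=1735288180
limit=100000000
if [ "$declared" -gt "$limit" ]; then
  echo "error: unpackmem_xmalloc: Buffer to be unpacked is too large ($declared > $limit)"
fi
```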

We then need to move our focus away from the ctld, which is behaving as expected, and look for whoever is sending that big message. A user submitting an sbatch job with a very big script is just one example.

It would be a good idea to have a look at slurmd.log as well. I know there are plenty of them, and we don't really know which nodes to look at, but a pdsh command searching the logs for the number 1735288180 would be a good starting point.
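Such a search could look like the following. The slurmd log path and the pdsh `-a` (all nodes) flag are assumptions about the local setup; a local grep against a sample line is shown as a runnable stand-in.

```shell
# Assumed log path and node selection; adjust to the local setup.
# pdsh -a 'grep -H 1735288180 /var/log/slurm/slurmd.log'

# Runnable stand-in: the same grep against a sample log line.
log=$(mktemp)
echo 'error: unpackmem_xmalloc: Buffer to be unpacked is too large (1735288180 > 100000000)' > "$log"
grep -c 1735288180 "$log"   # prints 1
rm -f "$log"
```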

Finally, let's lower the issue to sev-3. I don't think this is having a continuous impact on cluster operations. If I'm wrong, please extend the description of the issue with information on how it's affecting the cluster.

Tell us if you find something else.

Regards,
Carlos.
Comment 4 Thomas Hoeffel 2022-03-17 16:28:42 MDT
I'll check the slurmd logs and report back.
Comment 5 Thomas Hoeffel 2022-03-21 12:32:31 MDT
Hi Carlos. Sadly, while we did find a few OOM errors like this, they were in much older logs. Nothing corresponds to the death of slurmctld last week. Any other ideas?
Comment 6 Carlos Tripiana Montes 2022-03-22 02:38:01 MDT
> we did find a few OOM errors like this

I don't expect an OOM to cause this particular error. As I said, this is not a real error in the controller (it doesn't break anything). The controller is only reporting that it tried to unpack something that is too big to be a correct RPC, so it discards that RPC.

> Nothing to correspond to the death of slurmctld last week.
> Any other ideas?

I can imagine a user or script with a bug. Something like:

sbatch [big file]

I know it seems quite illogical, but think of HPC applications like AMBER, GROMACS, WRF, etc. They sometimes use quite big input files. If they're invoked like:

sbatch --wrap '[APP] [params]'

I can imagine someone making a mistake by chance, with sbatch ending up treating a big input file as [APP].

Another option is sbatch reading the script from STDIN inside a script, with some unbounded loop writing to it.
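That failure mode could be sketched as below. The sbatch line is commented out and hypothetical; the runnable part only measures how large the would-be job script gets, capped at 1 MB here instead of unbounded.

```shell
# Hypothetical: a buggy generator piping an oversized "script" into sbatch.
# some_buggy_generator | sbatch        # do not run; generator is made up

# Runnable stand-in: measure a generated stream far larger than any sane
# job script (capped at 1 MB for the demo).
head -c 1048576 /dev/zero | wc -c
```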

If it's just one error, once in the past, don't pay much attention to it, especially if you don't see any corresponding problem in any of the slurmd logs across the cluster.

Cheers,
Carlos.
Comment 7 Carlos Tripiana Montes 2022-04-04 02:25:24 MDT
Closing now as infogiven. Please reopen if needed.

Cheers,
Carlos.