Ticket 13614

Summary: error: unpackmem_xmalloc: Buffer to be unpacked is too large (1735288180 > 100000000)
Product: Slurm Reporter: Thomas Hoeffel <hoeffet1>
Component: slurmctld Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: tripiana
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: Genentech (Roche)
Attachments: slurmctld log
slurm.conf
slurmdbd log

Description Thomas Hoeffel 2022-03-14 17:51:40 MDT
Created attachment 23863 [details]
slurmctld log

Over the weekend, slurmctld crashed with this in the logs.

"error: unpackmem_xmalloc: Buffer to be unpacked is too large (1735288180 > 100000000)"

The cluster load was fairly pedestrian (see logs).

Our conditions don't, at first glance, appear related to other bugs reporting the same error. 

Please check whether there is something we are overlooking.
Comment 1 Thomas Hoeffel 2022-03-14 17:52:11 MDT
Created attachment 23864 [details]
slurm.conf
Comment 2 Thomas Hoeffel 2022-03-14 17:52:32 MDT
Created attachment 23865 [details]
slurmdbd log
Comment 3 Carlos Tripiana Montes 2022-03-15 03:20:03 MDT
Hi Thomas,

I don't see a correlation between slurmdbd.log and slurmctld.log. The last line of the dbd log is older than:

[2022-03-13T11:45:02.545] error: unpackmem_xmalloc: Buffer to be unpacked is too large (1735288180 > 100000000)

Additionally, this is the only error of this type I see in slurmctld.log, and it doesn't appear to be connected to any other log lines.

It's therefore very difficult to give you an explanation for this. I know of some scenarios where this could be triggered. All of them involve the ctld unpacking an incoming RPC and realizing it is so big that it can't be a sensible request.
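The check being described can be sketched as follows. This is purely illustrative, not Slurm's actual source; the two numbers are taken from the log message in this ticket.

```shell
# Illustrative sketch only, not Slurm code: the controller compares the
# declared length of an incoming RPC buffer against a hard cap and
# discards anything larger. Values are from the log line in this ticket.
declared=1735288180
limit=100000000
if [ "$declared" -gt "$limit" ]; then
  echo "error: unpackmem_xmalloc: Buffer to be unpacked is too large ($declared > $limit)"
fi
```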

We then need to move our focus away from the ctld, which is behaving as expected, and look for whoever is sending that big message. A user submitting an sbatch job with a very big script is just one example.

It would be a good idea to have a look at slurmd.log as well. I know there are plenty of them, and we don't really know which nodes to look at, but a pdsh command searching the logs for the number 1735288180 would be a good starting point.
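Such a search could look like the following. The slurmd log path and the pdsh `-a` (all nodes) flag are assumptions about the local setup; a local grep against a sample line is shown as a runnable stand-in.

```shell
# Assumed log path and node selection; adjust to the local setup.
# pdsh -a 'grep -H 1735288180 /var/log/slurm/slurmd.log'

# Runnable stand-in: the same grep against a sample log line.
log=$(mktemp)
echo 'error: unpackmem_xmalloc: Buffer to be unpacked is too large (1735288180 > 100000000)' > "$log"
grep -c 1735288180 "$log"   # prints 1
rm -f "$log"
```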

Finally, let's lower the issue to sev-3. I don't think this is having a continuous impact on cluster operations. If I'm wrong, please extend the description of the issue with information on how it's affecting the cluster.

Tell us if you find something else.

Regards,
Carlos.
Comment 4 Thomas Hoeffel 2022-03-17 16:28:42 MDT
I'll check the slurmd logs and report back.
Comment 5 Thomas Hoeffel 2022-03-21 12:32:31 MDT
Hi Carlos. Sadly, while we did find a few OOM errors like this, they were in much older logs. Nothing corresponds to the death of slurmctld last week. Any other ideas?
Comment 6 Carlos Tripiana Montes 2022-03-22 02:38:01 MDT
> we did find a few OOM errors like this

I don't expect an OOM to cause this particular error. As I said, this is not a real error in the controller (it doesn't break anything). The controller is only reporting that it tried to unpack something that is too big to be a correct RPC, so it discards that RPC.

> Nothing to correspond to the death of slurmctld last week.
> Any other ideas?

I can imagine a user or script with a bug. Something like:

sbatch [big file]

I know it seems quite illogical, but think of HPC applications like AMBER, GROMACS, WRF, etc. They sometimes use quite big input files. If they're invoked like:

sbatch --wrap '[APP] [params]'

I can imagine someone making a mistake by chance, with sbatch ending up treating a big input file as [APP].

Another option is sbatch reading the script from STDIN inside a script, with some unbounded loop writing to it.
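That failure mode could be sketched as below. The sbatch line is commented out and hypothetical; the runnable part only measures how large the would-be job script gets, capped at 1 MB here instead of unbounded.

```shell
# Hypothetical: a buggy generator piping an oversized "script" into sbatch.
# some_buggy_generator | sbatch        # do not run; generator is made up

# Runnable stand-in: measure a generated stream far larger than any sane
# job script (capped at 1 MB for the demo).
head -c 1048576 /dev/zero | wc -c
```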

If it's just one error, once in the past, don't pay much attention to it, especially if you don't see any corresponding problem in any of the slurmd logs across the cluster.

Cheers,
Carlos.
Comment 7 Carlos Tripiana Montes 2022-04-04 02:25:24 MDT
Closing now as infogiven. Please reopen if needed.

Cheers,
Carlos.