Ticket 2066

Summary:	Buffer size limit exceeded
Product:	Slurm	Reporter:	Josko Plazonic <plazonic>
Component:	Database	Assignee:	Brian Christiansen <brian>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	alex, brian, da, tim
Version:	15.08.1
Hardware:	Linux
OS:	Linux
Site:	Princeton (PICSciE)	Slinky Site:	---

Description Josko Plazonic 2015-10-27 00:26:31 MDT

For some reason our slurmdbd went nuts with messages like:

[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack_time: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: packmem: Buffer size limit exceeded (4294913080 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)

which ended up generating some 32 GB of logs in less than an hour.  Any hints on what is going on here?
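
Editorial sketch: with a log this large it can help to collapse the flood into counts before reading anything else. The snippet below is a minimal sketch that assumes the log lines have exactly the shape quoted above; the names `LINE_RE` and `summarize` are hypothetical and not part of Slurm.

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted in this ticket, for example:
# [2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: (?P<func>\w+): Buffer size limit exceeded "
    r"\((?P<size>\d+) > (?P<limit>\d+)\)"
)


def summarize(lines):
    """Count buffer-limit errors per (function name, reported size)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("func"), int(match.group("size")))] += 1
    return counts
```

Passing an open file object (for instance the slurmdbd log, whose path depends on the site's configuration) to `summarize` streams the file line by line, so the 32 GB never has to fit in memory.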

Comment 1 David Bigagli 2015-10-27 00:37:16 MDT

Hi Josko,
         when doing network I/O, all Slurm messages are packed and unpacked
in network byte order to cope with hosts of different endianness.
The buffer into which the data is packed has a maximum size of 4294901760
bytes, which for some reason has been exceeded. Any hint as to what the
slurmdbd was doing? Can we have a look at some of the log around these
error messages? If it happens again, can you use gstack to print
the stack or, better, attach gdb and print the stack?

David
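
Editorial sketch: the idea described in this comment, packing values in network byte order into a buffer with a hard size cap, can be illustrated as follows. This is a simplified sketch and not Slurm's actual implementation: the class `PackBuffer` and the function `unpack32` are hypothetical, and the cap is simply taken from the number printed in the log (4294901760, which equals 0xFFFF0000).

```python
import struct

# Assumed cap, taken from the log message: 4294901760 == 0xFFFF0000.
MAX_BUF_SIZE = 0xFFFF0000


class PackBuffer:
    """Hypothetical stand-in for a pack buffer; not Slurm's real API."""

    def __init__(self, max_size=MAX_BUF_SIZE):
        self.data = bytearray()
        self.max_size = max_size

    def pack32(self, value):
        # "!I" means network byte order (big-endian), unsigned 32-bit integer.
        encoded = struct.pack("!I", value)
        needed = len(self.data) + len(encoded)
        if needed > self.max_size:
            raise OverflowError(
                f"pack32: Buffer size limit exceeded ({needed} > {self.max_size})"
            )
        self.data += encoded


def unpack32(raw, offset=0):
    # Decode one unsigned 32-bit integer stored in network byte order.
    return struct.unpack_from("!I", raw, offset)[0]
```

With a tiny cap such as `PackBuffer(max_size=8)`, two calls to `pack32` succeed and a third raises, which mirrors the shape of the error in the log. In the log itself the reported size exceeds the limit by 4294913067 - 4294901760 = 11307.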

Comment 2 Josko Plazonic 2015-10-27 00:44:22 MDT

There isn't much there; this is what we have right before the flood started:

[2015-10-19T10:00:00.201] error: We have more time than is possible (37382400+250048+0)(37632448) > 37382400 for cluster tiger(10384) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 1
[2015-10-19T10:00:00.201] error: We have more allocated time than is possible (575039403600 > 172339200000) for cluster tiger(47872000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2
[2015-10-19T10:00:00.201] error: We have more time than is possible (172339200000+1000192000+0)(173339392000) > 172339200000 for cluster tiger(47872000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2
[2015-10-19T10:00:04.624] error: We have more allocated time than is possible (294831035 > 12441600) for cluster della(3456) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 1
[2015-10-19T10:00:04.624] error: We have more allocated time than is possible (1039157963292 > 77414400000) for cluster della(21504000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)

Yes, we have questions about those other messages as well, but that will be for separate tickets. Sadly, I restarted it in the meantime, so there is not much else I can offer at this time.
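
Editorial sketch: the numbers in the "more time than is possible" lines are internally consistent if the value in parentheses after the cluster name is a resource count and the limit is that count multiplied by the 3600 seconds of the one-hour window. That reading is an assumption made here, as is the usual Slurm convention that TRES 1 is CPU and TRES 2 is memory; the figures themselves are copied from the log above.

```python
SECONDS_PER_HOUR = 3600

# (cluster, tres id, count in parentheses, limit, figure reported as too large)
ROWS = [
    ("tiger", 1, 10384, 37382400, 37632448),
    ("tiger", 2, 47872000, 172339200000, 575039403600),
    ("tiger", 2, 47872000, 172339200000, 173339392000),
    ("della", 1, 3456, 12441600, 294831035),
    ("della", 2, 21504000, 77414400000, 1039157963292),
]


def capacity(count, seconds=SECONDS_PER_HOUR):
    """Resource-seconds available in one window (assumed: count * seconds)."""
    return count * seconds


def overshoot_ratio(row):
    """How many times larger the reported figure is than the limit."""
    _cluster, _tres, _count, limit, reported = row
    return reported / limit


def limits_match(rows=ROWS):
    """True if every logged limit equals count * 3600."""
    return all(capacity(count) == limit for _c, _t, count, limit, _r in rows)
```

Under that assumption `limits_match()` holds for all five lines. The tiger totals are only slightly over their limits, while the della allocated figures are roughly 24 and 13 times over.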

Comment 3 Josko Plazonic 2015-10-27 00:52:40 MDT

BTW, we did see anomalies in accounting (ticket 2067). Not sure whether they are related to this issue, but it is probably best to mention them.

Comment 4 Brian Christiansen 2015-10-27 10:52:27 MDT

Josko,

It looks like it's cleared itself up, and from the logs it's difficult to pinpoint what happened. If you see it again, will you attach gdb to the slurmdbd process and get a backtrace from all of the threads?

gdb -p <pid>
thread apply all bt
quit

Are you OK with closing the ticket and reopening it if you see it again?

Thanks,
Brian
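
Editorial sketch: because the flood filled the disk with logs within an hour, it may be useful to have the backtrace capture scripted ahead of time. The snippet below is a minimal sketch that wraps gdb's batch mode; the function names and the output path are hypothetical, and it assumes gdb is installed and the caller has permission to attach to the slurmdbd process.

```python
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a non-interactive gdb command that dumps every thread's stack."""
    return [
        "gdb",
        "-p", str(pid),                 # attach to the running process
        "-batch",                       # run the commands below, then exit
        "-ex", "thread apply all bt",   # backtrace of all threads
    ]


def capture_backtrace(pid, out_path):
    """Run gdb and write its output to out_path; returns gdb's exit code."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_backtrace_cmd(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode
```

Calling `capture_backtrace(pid, "slurmdbd-bt.txt")` with the slurmdbd PID would leave a file that can be attached to the ticket when it is reopened.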

Comment 5 Josko Plazonic 2015-10-27 13:04:46 MDT

Yup, we'll try to keep an eye on it and see if we can catch it.  I'll reopen if it happens again.

Comment 6 Brian Christiansen 2015-10-27 13:35:09 MDT

Cool. Let us know.