For some reason our slurmdbd went nuts with messages like:

[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack_time: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: packmem: Buffer size limit exceeded (4294913080 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.973] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)

which ended up generating some 32GB of logs in less than an hour. Any hints on what is going on here?
Hi Josko,

When doing network I/O, all Slurm messages are packed/unpacked in network byte order to cope with possible endianness differences between hosts. The buffer into which data are packed has a maximum size of 4294901760 bytes, which for some reason has been exceeded. Any hint at what the slurmdbd was doing? Can we have a look at some of the log around these error messages? If it happens again, can you use gstack to print the stack, or better, attach gdb and print the stack?

David
There isn't much there; this is what we have right before the flood started:

[2015-10-19T10:00:00.201] error: We have more time than is possible (37382400+250048+0)(37632448) > 37382400 for cluster tiger(10384) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 1
[2015-10-19T10:00:00.201] error: We have more allocated time than is possible (575039403600 > 172339200000) for cluster tiger(47872000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2
[2015-10-19T10:00:00.201] error: We have more time than is possible (172339200000+1000192000+0)(173339392000) > 172339200000 for cluster tiger(47872000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2
[2015-10-19T10:00:04.624] error: We have more allocated time than is possible (294831035 > 12441600) for cluster della(3456) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 1
[2015-10-19T10:00:04.624] error: We have more allocated time than is possible (1039157963292 > 77414400000) for cluster della(21504000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)

Yes, we have questions about those other messages as well, but that will be for separate tickets... Sadly I restarted it in the meantime, so not much else I can offer at this time.
BTW, we did see anomalies in accounting (ticket 2067); not sure if that could be related to this issue, but it's probably best mentioned.
Josko,

It looks like it has cleared itself up, and from the logs it's difficult to pinpoint what happened. If you see it again, will you attach gdb to the slurmdbd process and get a backtrace from all of the threads?

gdb attach <pid>
thread apply all bt
quit

Are you ok closing the ticket and reopening it if you see it again?

Thanks,
Brian
Yup, we'll try to keep an eye on it and see if we can catch it. I'll reopen if it happens again.
Cool. Let us know.