| Summary: | Buffer size limit exceeded | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Josko Plazonic <plazonic> |
| Component: | Database | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | alex, brian, da, tim |
| Version: | 15.08.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Princeton (PICSciE) | Slinky Site: | --- |
|
Description
Josko Plazonic
2015-10-27 00:26:31 MDT
Hi Josko,
when doing network I/O, all Slurm messages are packed/unpacked in
network byte order to cope with hosts of differing endianness.
The buffer into which the data is packed has a maximum size of 4294901760,
which for some reason has been exceeded. Any hint as to what
slurmdbd was doing at the time? Can we have a look at the log around these
error messages? If it happens again, can you use gstack to print
the stack or, better, attach gdb and print the stack?
David
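As a rough illustration of the packing described above, the sketch below appends 32-bit integers to a growable buffer in network byte order and refuses to grow past a fixed cap. It is written in Python rather than Slurm's C, and the class and method names are hypothetical; only the cap value 4294901760 (which equals 0xFFFF0000) and the wording of the error message are taken from the log in this ticket.

```python
import struct

# Cap reported in the slurmdbd log: 4294901760 == 0xFFFF0000.
MAX_BUF_SIZE = 0xFFFF0000


class PackBuffer:
    """Hypothetical growable buffer; not Slurm's actual implementation."""

    def __init__(self, max_size=MAX_BUF_SIZE):
        self.data = bytearray()
        self.max_size = max_size

    def pack32(self, value):
        # "!I" selects network byte order (big-endian), unsigned 32-bit.
        encoded = struct.pack("!I", value)
        needed = len(self.data) + len(encoded)
        if needed > self.max_size:
            raise OverflowError(
                f"pack32: Buffer size limit exceeded ({needed} > {self.max_size})"
            )
        self.data += encoded

    def unpack32(self, offset):
        # Decode the 4 bytes at `offset` back into a host integer.
        return struct.unpack_from("!I", self.data, offset)[0]


# Tiny cap so the demonstration does not need to allocate 4 GiB.
buf = PackBuffer(max_size=8)
buf.pack32(0x01020304)
buf.pack32(7)
try:
    buf.pack32(9)  # a third value would need 12 bytes, above the cap of 8
except OverflowError as exc:
    message = str(exc)
```

With the real cap, the figure in the logged error (4294913067) is 11307 above the limit, so whatever was being packed had grown to roughly 4 GiB before the check tripped.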
There isn't much there, this is what we have right before the flood started:

[2015-10-19T10:00:00.201] error: We have more time than is possible (37382400+250048+0)(37632448) > 37382400 for cluster tiger(10384) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 1
[2015-10-19T10:00:00.201] error: We have more allocated time than is possible (575039403600 > 172339200000) for cluster tiger(47872000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2
[2015-10-19T10:00:00.201] error: We have more time than is possible (172339200000+1000192000+0)(173339392000) > 172339200000 for cluster tiger(47872000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2
[2015-10-19T10:00:04.624] error: We have more allocated time than is possible (294831035 > 12441600) for cluster della(3456) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 1
[2015-10-19T10:00:04.624] error: We have more allocated time than is possible (1039157963292 > 77414400000) for cluster della(21504000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)
[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)

Yes, we have questions about those other messages as well, but that will be for separate tickets... Sadly, I restarted it in the meantime, so there is not much else I can offer at this time.

BTW, we did see anomalies in accounting (ticket 2067). Not sure if that could be related to this issue, but it is probably best mentioned.

Josko,
It looks like it's cleared itself up, and from the logs it's difficult to pinpoint what happened. If you see it again, will you attach gdb to the slurmdbd process and get a backtrace from all of the threads?

gdb attach <pid>
thread apply all bt
quit

Are you ok closing the ticket and reopening if you see it again?
Thanks,
Brian

Yup, we'll try to keep an eye on it and see if we can catch it. I'll reopen if it happens again.

Cool. Let us know. |
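The figures in the log excerpt quoted in this ticket can be sanity-checked with a short script. The sketch below is a hypothetical helper, not part of Slurm: it parses the "more allocated time than is possible" lines and the pack32 line, checks whether each limit equals the parenthesized per-cluster count multiplied by 3600 (the one-hour window, in seconds, is an assumption inferred from the 09:00:00 to 10:00:00 timestamps), and computes how far the pack32 size overshot the cap.

```python
import re

# Log lines copied from the excerpt in this ticket.
LOG_LINES = [
    "[2015-10-19T10:00:00.201] error: We have more allocated time than is possible (575039403600 > 172339200000) for cluster tiger(47872000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2",
    "[2015-10-19T10:00:04.624] error: We have more allocated time than is possible (294831035 > 12441600) for cluster della(3456) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 1",
    "[2015-10-19T10:00:04.624] error: We have more allocated time than is possible (1039157963292 > 77414400000) for cluster della(21504000) from 2015-10-19T09:00:00 - 2015-10-19T10:00:00 tres 2",
    "[2015-10-19T10:28:55.915] error: pack32: Buffer size limit exceeded (4294913067 > 4294901760)",
]

ALLOC_RE = re.compile(
    r"more allocated time than is possible \((\d+) > (\d+)\) "
    r"for cluster (\w+)\((\d+)\)"
)
PACK_RE = re.compile(r"pack32: Buffer size limit exceeded \((\d+) > (\d+)\)")

SECONDS_PER_HOUR = 3600  # assumption: the rollup window is one hour


def summarize(lines):
    """Return (allocation checks, pack32 overshoots) for the given lines."""
    alloc, overshoot = [], []
    for line in lines:
        m = ALLOC_RE.search(line)
        if m:
            used, limit, count = int(m.group(1)), int(m.group(2)), int(m.group(4))
            # Does the limit equal count * one hour? How many times over is it?
            alloc.append((m.group(3), count * SECONDS_PER_HOUR == limit, used / limit))
            continue
        m = PACK_RE.search(line)
        if m:
            overshoot.append(int(m.group(1)) - int(m.group(2)))
    return alloc, overshoot


alloc, overshoot = summarize(LOG_LINES)
for cluster, limit_matches, ratio in alloc:
    print(f"{cluster}: limit == count * 3600? {limit_matches}; used/limit = {ratio:.1f}")
print("pack32 overshoot:", overshoot)
```

For these lines the limit does equal the count times 3600 in every case, which suggests the reported allocations, rather than the computed limits, are what is out of range.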