Ticket 3624 - spammed with "pack32: Buffer size limit exceeded"
Summary: spammed with "pack32: Buffer size limit exceeded"
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 17.02.1
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
Duplicates: 3488
Reported: 2017-03-27 12:50 MDT by James Botts
Modified: 2017-08-22 23:49 MDT
Site: NERSC
Version Fixed: 17.11.0-pre2

Description James Botts 2017-03-27 12:50:40 MDT
Hi - 

Twice in the last few days NERSC has run into the same issue as described in bugs 2346 and 3488.

I believe that someone is submitting a large query that causes one of the functions in pack.c to fail, causing slurmdbd.log to fill with entries

[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)
[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)
[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)
[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)
[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)

These entries then fill the underlying partition, causing slurmdbd to try to core.  This has happened on both our edison and cori systems, both of which are running 17.02.1.  We move the log file out of the way, restart slurmdbd and everything returns to normal.
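
For reference, the two numbers in those log entries can be pulled apart with a short script. This is only an illustrative sketch: the regular expression and the helper name parse_overflow are my own and not part of Slurm. It shows that the reported limit is 0xffff0000 (64 KiB short of 2^32) and that the attempted size overshoots it by about 11 KB.

```python
import re

# One of the log lines quoted above, copied verbatim.
LINE = ("[2017-03-25T10:47:48.109] error: pack32: "
        "Buffer size limit exceeded (4294912770 > 4294901760)")

# Hypothetical pattern for this one message format; not taken from Slurm.
PATTERN = re.compile(r"Buffer size limit exceeded \((\d+) > (\d+)\)")


def parse_overflow(line):
    """Return (requested, limit) from a matching log line, else None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


requested, limit = parse_overflow(LINE)
overshoot = requested - limit   # bytes past the limit
headroom = 2**32 - limit        # gap between the limit and 2^32

print(hex(limit), overshoot, headroom)
```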

If you have any advice for mitigating this sort of behavior, it would be greatly appreciated.  Thanks
Comment 1 Tim Wickberg 2017-03-27 16:41:17 MDT
I don't have anything at the moment unfortunately; we've been debating internally on how best to solve this long-term and a solution hasn't emerged yet.

I'll revisit this again internally - at the very least we should make sure that function bails out early instead of spamming the log with garbage.
Comment 3 Tim Wickberg 2017-04-04 15:45:24 MDT
*** Ticket 3488 has been marked as a duplicate of this ticket. ***
Comment 9 Tim Wickberg 2017-05-03 11:34:49 MDT
*** Ticket 2346 has been marked as a duplicate of this ticket. ***
Comment 11 Tim Wickberg 2017-08-07 18:52:14 MDT
Commit 390da8cf963291 addresses this on master, and will be in 17.11-pre2 and up.

That should apply cleanly to 17.02 if you'd like to back port it.

This does not prevent excessively large queries from being constructed; other bugs track that issue. Rather, it checks that the response message being generated is not >3GB; if it is, it discards the results and returns a new error code to the client, rather than filling the log file up.

One caveat - this can potentially block queries that used to work, namely those returning between 3 and 4GB.
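
The general shape of that check can be sketched as follows. This is not the code from the commit: the function name, the error code value, and the exact cap (3 GiB here) are assumptions made purely for illustration. The point is that the size is tested while the response is assembled, and an oversized result is dropped in favor of a single error code.

```python
# Illustrative sketch only; none of these names come from Slurm.
RESPONSE_CAP_BYTES = 3 * 1024**3   # assumed reading of "3GB"
ERR_RESULT_TOO_LARGE = 1           # hypothetical non-zero error code
OK = 0


def build_response(record_sizes):
    """Accumulate record sizes; bail out once the cap is exceeded.

    Returns (OK, sizes_kept) on success, or (ERR_RESULT_TOO_LARGE, [])
    with the partial results discarded.
    """
    total = 0
    kept = []
    for size in record_sizes:
        total += size
        if total > RESPONSE_CAP_BYTES:
            # Discard everything and report one error to the caller,
            # instead of logging a failure per packed field.
            return ERR_RESULT_TOO_LARGE, []
        kept.append(size)
    return OK, kept
```

Under these assumptions a response totalling 2 GiB passes, while one totalling 3.5 GiB is rejected even though it is below the 4294901760-byte limit seen in the log, which is the caveat described above.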

- Tim
Comment 13 Moe Jette 2017-08-09 16:14:44 MDT
The original patch appears to be flawed. I believe that I have a fix for that, but we're reviewing it now.
Comment 19 Tim Wickberg 2017-08-22 23:49:25 MDT
Commit 6d15591f2cde5 handles the failed return in a different manner. It ensures that an error message gets printed and that sacct returns an appropriate non-zero return code, and it avoids the problems introduced in the original commit. (Commit 8cf1835c6ad was a partial reversion of the original.)

You're welcome to back-port the results of all that, or I can make a consolidated patch available if you'd rather not run through all the work.

- Tim