Hi - Twice in the last few days NERSC has run into the same issue as described in bugs 2346 and 3488. I believe that someone is submitting a large query that causes one of the functions in pack.c to fail, which fills slurmdbd.log with entries like these:
[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)
[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)
[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)
[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)
[2017-03-25T10:47:48.109] error: pack32: Buffer size limit exceeded (4294912770 > 4294901760)
The entries then fill the underlying partition, which causes slurmdbd to try to dump core. This has happened on both our edison and cori systems, and both are running 17.02.1. We move the log file out of the way, restart slurmdbd, and everything returns to normal. If you have any advice for mitigating this sort of behavior, it would be greatly appreciated. Thanks
I don't have anything at the moment, unfortunately; we've been debating internally on how best to solve this long-term, and a solution hasn't emerged yet. I'll revisit this again internally - at the very least we should make sure that function bails out early instead of spamming the log with garbage.
*** Ticket 3488 has been marked as a duplicate of this ticket. ***
*** Ticket 2346 has been marked as a duplicate of this ticket. ***
Commit 390da8cf963291 addresses this on master, and will be in 17.11-pre2 and up. That should apply cleanly to 17.02 if you'd like to back port it. This does not prevent excessively large queries from being constructed; other bugs track that issue. Rather, it checks that the response message being generated is not larger than 3GB; if it is, it discards the results and returns a new error code to the client, rather than filling up the log file. One caveat - this can potentially block queries that used to work, namely those returning between 3 and 4GB. - Tim
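For readers following along, here is a rough sketch of the kind of check described above: track the size of the reply while it is being built, and once it would pass the limit, discard what has been packed, log once, and hand back an error code. This is not the code from the commit. Every name here (REPLY_SIZE_LIMIT, REPLY_ERR_TOO_LARGE, struct reply, reply_reserve) is hypothetical, and treating "3GB" as 3 * 1024^3 bytes is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names throughout; this is not Slurm's actual code. */
#define REPLY_SIZE_LIMIT ((uint64_t)3 * 1024 * 1024 * 1024) /* assumed: 3GB as 3 * 1024^3 */
#define REPLY_OK            0
#define REPLY_ERR_TOO_LARGE 1

struct reply {
    uint64_t packed_bytes; /* running total of bytes packed so far */
    int      rc;           /* sticky error state for this reply */
};

/*
 * Account for `len` more bytes of reply data. Once the total would exceed
 * the limit, drop what has been accumulated, log a single message, and
 * return an error code on this and every later call.
 */
static int reply_reserve(struct reply *r, uint64_t len)
{
    if (r->rc != REPLY_OK)
        return r->rc; /* already failed: bail out early, no further logging */

    /* Written as a subtraction so the comparison cannot overflow. */
    if (len > REPLY_SIZE_LIMIT - r->packed_bytes) {
        r->packed_bytes = 0; /* discard the partial results */
        r->rc = REPLY_ERR_TOO_LARGE;
        fprintf(stderr, "error: reply would exceed %llu bytes, discarding results\n",
                (unsigned long long)REPLY_SIZE_LIMIT);
        return r->rc;
    }

    r->packed_bytes += len;
    return REPLY_OK;
}
```

The sticky rc field is what keeps the log quiet: the first oversized request logs one line, and later calls return the same error without printing anything.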
The original patch appears to be flawed. I believe that I have a fix for that, but we're reviewing it now.
Commit 6d15591f2cde5 handles the failed return in a different manner: it ensures an error message gets printed and an appropriate non-zero return code comes from sacct, and it avoids the problems introduced in the original commit. (Commit 8cf1835c6ad was a partial reversion of the original.) You're welcome to back-port the results of all that, or I can make a consolidated patch available if you'd rather not run through all the work. - Tim