Ticket 3488 - Handling of "buffer size limit exceeded" errors
Summary: Handling of "buffer size limit exceeded" errors
Status: RESOLVED DUPLICATE of ticket 3624
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 15.08.13
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2017-02-21 13:15 MST by Goran Pocina
Modified: 2017-04-04 15:45 MDT
Site: D E Shaw Research


Attachments
slurmdbd.conf (2.37 KB, text/plain)
2017-02-21 13:19 MST, Goran Pocina

Description Goran Pocina 2017-02-21 13:15:51 MST
We suspect that our daily utilization reports kicked off a query more than 4GB in size.   I'll try to determine what the query was doing, but this ticket is to ask whether this error condition can be handled more gracefully.   

Our config file is attached.  

Notes:

slurmdbd version is 15.08.13.10

Currently, slurmdbd logs about 79K lines, or 10 MB of errors, per second, continuously, until all space on the filesystem holding /var/log/slurmdbd.log is consumed.

Here are some statistics on the messages logged for a 1 second period:

drdslurm0001:log$ sed -e 's/(.*>/( LARGENUMBER >/' /tmp/slurmdbd.log | sort | uniq -c
   1267 Feb 20 09:52:20 drdslurm0001.en.desres.deshaw.com slurmdbd[13469]: error: pack16: Buffer size limit exceeded ( LARGENUMBER > 4294901760)
  46211 Feb 20 09:52:20 drdslurm0001.en.desres.deshaw.com slurmdbd[13469]: error: pack32: Buffer size limit exceeded ( LARGENUMBER > 4294901760)
   3798 Feb 20 09:52:20 drdslurm0001.en.desres.deshaw.com slurmdbd[13469]: error: pack64: Buffer size limit exceeded ( LARGENUMBER > 4294901760)
  12660 Feb 20 09:52:20 drdslurm0001.en.desres.deshaw.com slurmdbd[13469]: error: packdouble: Buffer size limit exceeded ( LARGENUMBER > 4294901760)
  11395 Feb 20 09:52:20 drdslurm0001.en.desres.deshaw.com slurmdbd[13469]: error: packmem: Buffer size limit exceeded ( LARGENUMBER > 4294901760)
   3798 Feb 20 09:52:20 drdslurm0001.en.desres.deshaw.com slurmdbd[13469]: error: pack_time: Buffer size limit exceeded ( LARGENUMBER > 4294901760)
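
For anyone who wants to repeat the tally above without sed, here is a minimal Python sketch of the same count. It assumes the log lines have the shape shown above (function name, then "Buffer size limit exceeded", then a parenthesised comparison); the names `PATTERN`, `summarize`, and `REPORTED` are made up for this illustration and are not part of Slurm. It also checks two figures from this report: the per-second counts add up to the "79K lines" quoted earlier, and the limit printed in every message equals 2^32 - 2^16, just under 4 GiB.

```python
import re
from collections import Counter

# Matches both raw lines "(<bytes> > <limit>)" and the sed-normalised
# form "( LARGENUMBER > <limit>)" shown in the tally above.
PATTERN = re.compile(
    r"error: (\w+): Buffer size limit exceeded \(\s*\S+\s*>\s*(\d+)\)"
)


def summarize(lines):
    """Count 'Buffer size limit exceeded' errors per pack function."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Per-second counts copied from the tally in this ticket.
REPORTED = {
    "pack16": 1267,
    "pack32": 46211,
    "pack64": 3798,
    "packdouble": 12660,
    "packmem": 11395,
    "pack_time": 3798,
}

total_lines = sum(REPORTED.values())  # 79129 lines in that one second
limit = 4294901760                    # limit printed in every message
assert limit == 2**32 - 2**16         # i.e. 64 KiB short of 4 GiB
```

With a log file at hand, something like `summarize(open("/tmp/slurmdbd.log"))` should give the same per-function breakdown as the sed pipeline.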
Comment 1 Goran Pocina 2017-02-21 13:19:06 MST
Created attachment 4080
slurmdbd.conf
Comment 2 Tim Wickberg 2017-02-21 13:35:29 MST
I'm assuming you meant "15.08.13"? That maintenance release wasn't in the list previously; that's been corrected now.

Is this query being constantly re-run, or does this recur after restarting slurmdbd?
Comment 3 Goran Pocina 2017-02-21 14:38:06 MST
The problem has not recurred since restarting slurmdbd; however, I've also not yet confirmed that our daily utilization query kicked it off.

Looking at old slurmdbd log files, it seems to have happened 1 week and 1 hour prior to yesterday's event, so I have a pretty good chance of finding the script responsible. I'll update here once I do.
Comment 4 Tim Wickberg 2017-02-21 14:40:40 MST
Okay, I just wanted to make sure this wasn't blocking slurmdbd from normal service.

There's an enhancement bug 2346 open that covers adding some configuration options to help prevent these from triggering, although we haven't made any commitment to addressing this just yet.
Comment 5 Tim Wickberg 2017-04-04 15:45:24 MDT
Goran -

I'm marking this closed as a duplicate of 3624. We'll try to get some mitigation in place to keep the log level spam to a minimum. As mentioned, bug 2346 discusses longer-term plans to mitigate this issue with some configuration options to limit the queries directly.

- Tim

*** This ticket has been marked as a duplicate of ticket 3624 ***
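
Comment 5 mentions mitigation to keep the log spam to a minimum. As an illustration only, and not the fix tracked in ticket 3624 or anything in Slurm's C code, here is a minimal Python sketch of one common approach: rate-limiting repeated messages with a `logging.Filter`. The class name `RepeatLimiter` and its `burst`, `interval`, and `clock` parameters are hypothetical.

```python
import logging
import time


class RepeatLimiter(logging.Filter):
    """Pass at most `burst` records per message template per `interval` seconds.

    Hypothetical illustration; this is not how slurmdbd handles logging.
    """

    def __init__(self, burst=5, interval=60.0, clock=time.monotonic):
        super().__init__()
        self.burst = burst
        self.interval = interval
        self.clock = clock
        self._windows = {}  # message template -> (window start, count)

    def filter(self, record):
        # record.msg is the unformatted template, so messages that differ
        # only in the byte count share a single key.
        key = record.msg
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.interval:
            start, count = now, 0  # window expired: start a new one
        count += 1
        self._windows[key] = (start, count)
        return count <= self.burst
```

Attached with `logger.addFilter(RepeatLimiter())`, a filter like this would let the first few "Buffer size limit exceeded" messages through in each window and drop the rest, so a runaway query could not fill the log filesystem at 10 MB per second. A fuller version would probably also log how many messages were suppressed.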