Ticket 5817

Summary: Explore slurmdbd designs to avoid memory usage peaks due to overly-aggressive queries
Product: Slurm Reporter: Alejandro Sanchez <alex>
Component: slurmdbd    Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN --- QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: cheung.nevin, csc-slurm-tickets, marshall
Version: 19.05.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7317
https://bugs.schedmd.com/show_bug.cgi?id=9453
https://bugs.schedmd.com/show_bug.cgi?id=19051
Site: SchedMD

Description Alejandro Sanchez 2018-10-05 06:06:46 MDT
As tracked in bug 5632, overly-aggressive queries (e.g. using sacct on large job/step tables and retrieving huge numbers of records) can cause slurmdbd memory usage to grow up to ~80GiB, potentially triggering the oom-killer depending on the node's available memory and system limits.

While MaxQueryTimeRange and/or REASONABLE_BUF_SIZE (currently 3GiB) can limit the size of the response sent back to the requesting client, the preceding step, where slurmdbd queries the underlying storage database for huge amounts of data, can still trigger the memory usage peak.

This bug is intended to explore ideas / architectural changes to avoid the slurmdbd memory usage peak in this scenario.

Suggestions so far:

1. The mysql_query() API function doesn't accept a maximum result size as a parameter. Discarded.

2. The LIMIT clause can be used to constrain the number of rows returned by the SELECT statement. We could estimate the size of a job/step row (I've seen this done in another bug), then divide 3GiB by that estimate to derive a LIMIT value and append it to the query used for sacct, although that would impact any caller of jobacct_storage_p_get_jobs_cond().

3. Instead of receiving all the data and only then checking MaxQueryTimeRange/3GiB, start packing rows as soon as they are available in slurmdbd, in a producer/consumer fashion. Not sure if this is feasible at all.

4. Create a stored procedure and then use CALL to retrieve jobs in chunks (as separate result sets).

5. Use a loop to seek over the result set with mysql_data_seek(), perhaps combined with option 3, since mysql_fetch_row() might be too slow.

6. Perhaps there's some benefit from partitioning[1] the job/step tables? LIST or RANGE partitioning seems feasible; we could then use the PARTITION clause when SELECTing jobs/steps.

7. Other ideas?

[1] https://dev.mysql.com/doc/refman/8.0/en/partitioning.html
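A minimal sketch of item 2's LIMIT math. The per-row estimate, table name, and column name are hypothetical placeholders for illustration, not slurmdbd's actual schema:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Existing 3GiB response cap; the per-row estimate is a rough
 * placeholder, not a measured job/step record size. */
#define REASONABLE_BUF_SIZE ((uint64_t)3 << 30)  /* 3GiB */
#define EST_ROW_BYTES       2048

/* How many rows fit in the buffer given an estimated row size. */
uint64_t limit_rows(uint64_t buf_size, uint64_t est_row_bytes)
{
	return buf_size / est_row_bytes;
}

/* Build a sacct-style SELECT with the derived LIMIT appended.
 * "job_table" and "time_submit" are illustrative names. */
int build_query(char *buf, size_t buflen, uint32_t since)
{
	return snprintf(buf, buflen,
			"SELECT * FROM job_table WHERE time_submit >= %u "
			"LIMIT %" PRIu64,
			since,
			limit_rows(REASONABLE_BUF_SIZE, EST_ROW_BYTES));
}
```

With a 2048-byte estimate this yields LIMIT 1572864; the estimate would need tuning, since an undersized guess re-introduces the oversized result set.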
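Item 3's pack-as-you-go idea could look roughly like the following, with row fetching simulated (in slurmdbd this would wrap mysql_fetch_row() on an unbuffered result). The buffer size and flush_to_client() are hypothetical; the point is that resident memory stays bounded by the flush threshold rather than the result-set size:

```c
#include <stddef.h>
#include <string.h>

#define FLUSH_THRESHOLD 4096	/* send a chunk once this much is packed */

struct pack_buf {
	char data[FLUSH_THRESHOLD];
	size_t used;
	size_t flushes;		/* chunks sent so far */
};

/* Hypothetical "send this chunk to the requesting client" step. */
static void flush_to_client(struct pack_buf *b)
{
	if (b->used) {
		b->flushes++;
		b->used = 0;
	}
}

/* Append one packed row, flushing first if it would overflow.
 * Assumes each row is smaller than FLUSH_THRESHOLD. */
static void pack_row(struct pack_buf *b, const char *row, size_t len)
{
	if (b->used + len > sizeof(b->data))
		flush_to_client(b);
	memcpy(b->data + b->used, row, len);
	b->used += len;
}

/* Simulated fetch loop; returns how many chunks were sent. */
size_t stream_rows(size_t nrows, size_t row_len)
{
	struct pack_buf b = {0};
	char row[256];

	memset(row, 'x', sizeof(row));
	for (size_t i = 0; i < nrows; i++)
		pack_row(&b, row, row_len);	/* memory stays O(FLUSH_THRESHOLD) */
	flush_to_client(&b);			/* final partial chunk */
	return b.flushes;
}
```

The open question from the ticket remains whether the dbd protocol can deliver a reply in multiple chunks like this.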
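For item 6, a hypothetical RANGE-partitioning setup plus a partition-pruned SELECT, written as C string literals the way slurmdbd embeds its queries. Table names, column names, and the epoch boundaries are illustrative only:

```c
/* Partition a job table by submit time (boundaries are example Unix
 * epochs for 2018-10-01 and 2019-01-01 UTC). */
static const char *partition_ddl =
	"ALTER TABLE cluster_job_table "
	"PARTITION BY RANGE (time_submit) ("
	"  PARTITION p2018q3 VALUES LESS THAN (1538352000),"
	"  PARTITION p2018q4 VALUES LESS THAN (1546300800),"
	"  PARTITION pmax VALUES LESS THAN MAXVALUE"
	")";

/* Explicit partition selection bounds the scan (and the result set)
 * to the requested time range. */
static const char *partition_select =
	"SELECT * FROM cluster_job_table PARTITION (p2018q4) "
	"WHERE time_submit >= 1540000000";
```

Note this only bounds memory when combined with MaxQueryTimeRange-style limits on the requested range, and MySQL requires the partitioning column to be part of every unique key on the table.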