16603 – Job table size is causing DBD problems

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Database
Version:	21.08.8
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Scott Hilton

Reported:	2023-04-27 10:35 MDT by Paul Peltz
Modified:	2023-05-18 09:46 MDT
CC List:	4 users

See Also:	16746
Site:	ORNL-OLCF
Machine Name:	AFW Miller/Fawbush

Description Paul Peltz 2023-04-27 10:35:42 MDT

Since I enabled the job_script and job_env options to save this data to the database, our database has grown to almost 1 TB over the last couple of years. We were planning our update to 22.05 and are now looking to go to 23.02 instead, but the dbd upgrade would take 24 hours, even with only 6 months' worth of job data. We ran this test on a backup of the db to estimate the time required, and we can't afford a downtime that long because this is a federated cluster system.

We enabled the purge/archive of job tables on our production database and set it to 180 days, but the prod database doesn't seem able to handle the purge while also writing current data. It keeps timing out and restarting every 15 minutes, which is the InnoDB timeout setting. So now we are having production issues and need to disable the archiving and purging, but we can't do an update with the full database as it is now, given that the test with 180 days of data took 24 hours.

Would it be possible to just manually purge the job_script and job_env information from the table without any negative consequences? I'm trying to get out of this mess, but the size of the database is keeping us from being able to update. If there is anything else we could do, that would be helpful to know as well. BTW, the slurmdbd and mysql are running on the same server, but the db itself resides on an NFS share.

Comment 1 Scott Hilton 2023-04-27 11:36:33 MDT

Paul,

Good news, job_script and job_env are stored in a much more efficient way in 22.05 and 23.02.

Also, this issue has happened before. See bug 14514.

The only negative consequence of purging the job_script and job_env information would be the loss of that information.

See https://bugs.schedmd.com/show_bug.cgi?id=14514#c9 on how to do it.

Let me know if you have any more questions or run into any issues.

-Scott
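
For readers who cannot open the linked comment, the general idea of clearing stored scripts and environments in small batches can be sketched as follows. This is only an illustration under stated assumptions, not the procedure from bug 14514: it uses an in-memory SQLite database as a stand-in for MySQL/MariaDB, the table name `mycluster_job_table` is hypothetical, and the column names `job_db_inx`, `batch_script`, and `env_vars` are assumed to match the 21.08 job table layout. Working in primary-key batches keeps each transaction short, which is one way to avoid the kind of timeouts described in this ticket.

```python
import sqlite3


def clear_script_and_env(conn, table, batch_size=1000):
    """Set batch_script/env_vars to NULL in primary-key batches.

    Returns the number of rows cleared. Table and column names are
    assumptions for illustration; check them against the real schema.
    """
    # Table names cannot be bound as SQL parameters, so validate first.
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected table name")
    has_data = "(batch_script IS NOT NULL OR env_vars IS NOT NULL)"
    cur = conn.cursor()
    total = 0
    last = 0
    while True:
        # Find the next batch of rows that still hold script/env data.
        rows = cur.execute(
            f"SELECT job_db_inx FROM {table} "
            f"WHERE job_db_inx > ? AND {has_data} "
            "ORDER BY job_db_inx LIMIT ?",
            (last, batch_size),
        ).fetchall()
        if not rows:
            break
        first, last = rows[0][0], rows[-1][0]
        cur.execute(
            f"UPDATE {table} SET batch_script = NULL, env_vars = NULL "
            f"WHERE job_db_inx BETWEEN ? AND ? AND {has_data}",
            (first, last),
        )
        total += cur.rowcount
        conn.commit()  # one short transaction per batch
    return total


def demo():
    """Build a tiny stand-in job table and clear it in batches of 10."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mycluster_job_table ("
        "job_db_inx INTEGER PRIMARY KEY, id_job INTEGER, "
        "batch_script TEXT, env_vars TEXT)"
    )
    conn.executemany(
        "INSERT INTO mycluster_job_table (id_job, batch_script, env_vars) "
        "VALUES (?, ?, ?)",
        [(100 + i, "#!/bin/bash\nsrun hostname\n", "PATH=/usr/bin")
         for i in range(25)],
    )
    conn.commit()
    cleared = clear_script_and_env(conn, "mycluster_job_table", batch_size=10)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM mycluster_job_table "
        "WHERE batch_script IS NOT NULL OR env_vars IS NOT NULL"
    ).fetchone()[0]
    jobs = conn.execute("SELECT COUNT(*) FROM mycluster_job_table").fetchone()[0]
    return cleared, remaining, jobs
```

The job rows themselves stay in place; only the two large columns are emptied. On a real MySQL/MariaDB server, freed InnoDB space is generally not returned to the filesystem until the table is rebuilt, so the on-disk size may not shrink immediately after such an update.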

Comment 2 Paul Peltz 2023-04-27 11:51:40 MDT

Thanks! I had tried some searching but couldn't find that bug. We'll give that a try.

Comment 3 Scott Hilton 2023-05-01 16:16:02 MDT

Paul,

Did that work out for you? Any questions?

-Scott

Comment 4 Paul Peltz 2023-05-02 13:48:46 MDT

We just started the purge of the script and env data this afternoon and are waiting for it to complete to see whether we can continue purging. In our test instance, I believe it reduced the db size by about 25%, which is less than we expected.

Comment 5 Scott Hilton 2023-05-11 09:26:55 MDT

Paul,

How did the upgrade go?

-Scott

Comment 6 Paul Peltz 2023-05-18 08:59:05 MDT

We enabled the purge last night after dropping the job_env and job_script rows from the DB, and it was able to finish the purge down to 180 days within 5 hours. Previously it was taking hours for just a single day to complete, and most of the time it failed due to timeouts. So this has helped drastically. We are going to reduce the purge to 90 days and test the upgrade again to time how long it will take. I think we can resolve this issue, as the db purge and archive actually work now and we can effectively plan for the update to 23.02.2. Thanks!

Comment 7 Scott Hilton 2023-05-18 09:46:50 MDT

Paul,

I'm glad we could help.

-Scott