Since I enabled the job_script and job_env options to save this data to the database, our database has grown to almost 1TB over the last couple of years. We had been planning an update to 22.05 and are now looking to go to 23.02 instead, but the dbd upgrade takes 24 hours, even with only 6 months' worth of job data. We ran this test on a backup of the db to estimate the time required, and we can't afford a downtime that long because this is a federated cluster system. We enabled the purge/archive of job tables on our production database and set it to 180 days, but the prod database can't seem to handle the purge while also writing current data. It keeps timing out and restarting every 15 minutes, which is the innodb timeout setting. So now we are having production issues and need to disable the archiving and purging, but we can't do an update with the full database as it is now if the 180-day test took 24 hours. Would it be possible to manually purge just the job_script and job_env information from the table without any negative consequences? I'm trying to get out of this mess, but the size of the database is keeping us from updating. If there is anything else we could do, that would be helpful to know as well. BTW, slurmdbd and mysql are running on the same server, but the db itself resides on an NFS share.
Paul, Good news: job_script and job_env are stored in a much more efficient way in 22.05 and 23.02. Also, this issue has happened before. See bug 14514. The only negative consequence of purging the job_script and job_env information would be the loss of that information. See https://bugs.schedmd.com/show_bug.cgi?id=14514#c9 for how to do it. Let me know if you have any more questions or run into any issues. -Scott
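For readers without access to the linked comment, here is a minimal sketch of how such a manual purge could be split into small batches so that no single statement runs into the timeout described above. It is not the procedure from bug 14514. The table name `<cluster>_job_table`, the columns `batch_script` and `env_vars`, and the key `job_db_inx` are assumptions about the pre-22.05 schema and should be checked against the actual database first. The helper name and batch size are hypothetical.

```python
def null_script_env_statements(cluster, max_inx, batch_size=100_000):
    """Build UPDATE statements that clear script/env data in id ranges.

    Assumed (not confirmed by the thread): the job table is named
    `<cluster>_job_table`, the data lives in columns `batch_script`
    and `env_vars`, and rows are keyed by `job_db_inx`.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    table = f"`{cluster}_job_table`"
    statements = []
    # Walk the key space in fixed-size, inclusive ranges.
    for start in range(0, max_inx + 1, batch_size):
        end = min(start + batch_size - 1, max_inx)
        statements.append(
            f"UPDATE {table} SET batch_script = NULL, env_vars = NULL "
            f"WHERE job_db_inx BETWEEN {start} AND {end};"
        )
    return statements


if __name__ == "__main__":
    # "mycluster" and 250_000 are placeholder values for illustration.
    for stmt in null_script_env_statements("mycluster", 250_000):
        print(stmt)
```

The generated statements can be reviewed and then run one at a time against a backup copy before touching production. Keeping each range small is meant to keep each transaction short. The right batch size depends on the server.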
Thanks! I had tried some searching but couldn't find that bug. We'll give that a try.
Paul, Did that work out for you? Any questions? -Scott
We just started the purge of the script and env data this afternoon and are waiting for it to complete to see whether we can continue purging. I believe it reduced the db size by about 25%, which is less than we expected based on our test instance.
Paul, How did the upgrade go? -Scott
We just enabled the purge last night after dropping the job_env and job_script rows from the DB, and it was able to finish the purge down to 180 days within 5 hours. Before, it was taking hours for just a single day to complete, and most of the time it failed due to timeouts. So this has reduced the purge time drastically. We are going to reduce the purge to 90 days and test the upgrade again to time how long it will take. I think we can resolve this issue, since the db purge and archive actually work now and we can effectively plan for the update to 23.02.2. Thanks!
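As an illustration of the 90-day setting mentioned above, a slurmdbd.conf fragment might look like the following sketch. `PurgeJobAfter`, `PurgeStepAfter`, `ArchiveJobs`, `ArchiveSteps`, and `ArchiveDir` are standard slurmdbd.conf options. The step settings and the archive path are assumptions, not values taken from this site's configuration.

```
# Purge job and step records older than 90 days, archiving them first.
PurgeJobAfter=90days
PurgeStepAfter=90days
ArchiveJobs=yes
ArchiveSteps=yes
# Hypothetical location for the archive files.
ArchiveDir=/var/spool/slurm/archive
```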
Paul, I'm glad we could help. -Scott