Ticket 14929

Summary: Archiving and purging old jobs.
Product: Slurm    Reporter: lhuang
Component: Database    Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---    CC: cblack
Version: - Unsupported Older Versions
Hardware: Linux
OS: Linux
Site: NY Genome

Description lhuang 2022-09-09 11:40:08 MDT
Hi,

We just started archiving/purging records from a testing database that contains 27 million job records. Initially we tested purging around 2 months of records from jobs that completed in 2019. This took around 1-2 hours to complete. Looks like archiving / purging works.

I've now increased the archive/purge to any jobs older than December 2020. I'm unsure how many records there are but it's 1 year worth of jobs. It's been running for close to 24 hours but it's slowly chugging along. I can see that it's completed around 6 months of job records.

Is there anything we can do to speed it up? These are our InnoDB settings.

[root@dev-slurm01 ~]# cat /etc/my.cnf.d/innodb.cnf 
[mysqld]
innodb_buffer_pool_size=2048M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

I'm also a little concerned as we will need to do this from the production cluster soon. From my testing, it does look like we can still continue to use the slurm cluster including any sacct commands. Do you foresee any issues while we archive/purge the records?

Regards,
Luis
Comment 1 Albert Gil 2022-09-12 04:49:56 MDT
Hi Luis,

> We just started archiving/purging records from a testing database that
> contains 27 million job records. Initially we tested purging around 2 months
> of records from jobs that completed in 2019. This took around 1-2 hours to
> complete. Looks like archiving / purging works.
> 
> I've now increased the archive/purge to any jobs older than December 2020.
> I'm unsure how many records there are but it's 1 year worth of jobs. It's
> been running for close to 24 hours but it's slowly chugging along. I can see
> that it's completed around 6 months of job records.

I think you are doing a great job by doing the archiving/purging incrementally and monitoring it closely!

> Is there anything we can do to speed it up? These are our InnoDB settings.
> 
> [root@dev-slurm01 ~]# cat /etc/my.cnf.d/innodb.cnf 
> [mysqld]
> innodb_buffer_pool_size=2048M
> innodb_log_file_size=64M
> innodb_lock_wait_timeout=900

We actually recommend innodb_buffer_pool_size=4096M.
See https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
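For reference, a minimal innodb.cnf following that recommendation might look like the sketch below; the buffer pool size is the key change, and the other values are simply carried over from your current file:

```ini
[mysqld]
# Recommended minimum buffer pool for slurmdbd workloads
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
```

Note that a restart of the MariaDB server is typically needed for the log file size change to take effect.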

> I'm also a little concerned as we will need to do this from the production
> cluster soon. From my testing, it does look like we can still continue to
> use the slurm cluster including any sacct commands. Do you foresee any
> issues while we archive/purge the records?

No, you shouldn't face any issues, but here are some notes to take into account:
- Keep doing it incrementally, so you have better control/monitoring
- Keep an eye on possible runaway jobs, to avoid them getting too close to the purge time
  - Note that fixing very old runaways triggers an internal rollup operation that takes longer the older the runaway is
- Similarly, keep a purge window large enough to avoid trying to purge jobs still in the system
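As a sketch of the incremental approach, the archive/purge behavior is driven by slurmdbd.conf settings along these lines (the retention periods and archive directory below are illustrative placeholders, not recommendations):

```
# slurmdbd.conf excerpt (illustrative values)
ArchiveDir=/var/spool/slurm/archive
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=24months
PurgeStepAfter=24months
```

Runaway jobs can be listed (and optionally fixed) with `sacctmgr show runaway` before you lower the purge window.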

And finally, although you should be able to restore the archived records into a newer DB/slurmdbd for future inspection, I would recommend also keeping a SQL backup.
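A plain logical backup can be taken with mysqldump before purging, e.g. (assuming the default accounting database name `slurm_acct_db`; adjust credentials to your setup):

```shell
# One-off logical backup of the Slurm accounting DB
mysqldump --single-transaction slurm_acct_db > slurm_acct_db_$(date +%F).sql
```

The `--single-transaction` flag avoids locking InnoDB tables while the dump runs.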

Regards,
Albert
Comment 2 lhuang 2022-09-12 10:11:53 MDT
Hi Albert,

Looks like it completed successfully, and thank you for the tips and suggestions. Although we can see the db has shrunk in size, the ibdata1 file did not decrease. Is it supposed to reduce in size?
 

MariaDB [(none)]> SELECT table_schema "DB Name",
    ->         ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB" 
    -> FROM information_schema.tables 
    -> GROUP BY table_schema; 
+--------------------+---------------+
| DB Name            | DB Size in MB |
+--------------------+---------------+
| information_schema |           0.1 |
| mysql              |           0.6 |
| performance_schema |           0.0 |
| slurm_acct_db      |       30676.8 |
+--------------------+---------------+


[root@dev-slurm01 ~]# du -skh /var/lib/mysql/ibdata1 
49G	/var/lib/mysql/ibdata1

Regards,
Luis
Comment 3 Albert Gil 2022-09-12 10:37:34 MDT
Hi Luis,

> Looks like it completed successfully

Great!

> Although we can see the db has shrunk in size, the ibdata1
> file did not decrease. Is it supposed to reduce in size?

Well, this is something in the realm of MariaDB.
AFAIK, internally MariaDB reasons along the lines of: "ok, the DB has fewer records now, but it had more in the past, so let's keep the disk space, because we'll most probably need it again soon, and access will be faster if the file is already that big".
There are some ways to reduce the disk usage, though, but I won't recommend any.

Regards,
Albert
Comment 4 Albert Gil 2022-09-21 09:40:04 MDT
Hi Luis,

If this is ok for you I'm closing this ticket as infogiven, but please don't hesitate to reopen it if you need further related support.

Regards,
Albert