| Summary: | jobs not purged | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Gutteridge <mrg> |
| Component: | Database | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | felip.moll |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | FHCRC - Fred Hutchinson Cancer Research Center | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave Sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmdbd configuration file, slurm configuration file, Latest dbd log | | |
Created attachment 6381 [details]
slurm configuration file
Before doing anything by hand, let's try purging the database bit by bit and see if that works. It sounds like the purging is taking too long, so it isn't purging at all.

Try setting the `Purge*After` values to a few months less than the age of the oldest records. For example, if you have records dating back 36 months, set all the `Purge*After` values to 33 months, restart the slurmdbd, force the rollup, and see if it purges. Then set the `Purge*After` values to 30, restart slurmdbd, rollup/purge. Rinse and repeat. You might be able to get away with more than 3-month increments, or might have to use smaller increments, depending on how many jobs fall in that time period. Can you let us know how that goes?

On upgrading, there are two very important points:

1. We've discovered a problem with MySQL 5.1 that makes the upgrade to 17.11 EXTREMELY long. If you're running MySQL 5.1, please upgrade it before upgrading Slurm. We've been running an upgrade on MySQL 5.1 for a particular site's database for 5 or 6 days now; I'm not sure whether it has finished yet. But it finished very quickly on an upgraded (>5.1) version of MySQL.

2. The database cannot be upgraded more than 2 versions at a time. Since you're running 15.08, you cannot upgrade directly to 17.11, which is 3 versions higher. You'll need to upgrade to either 16.05 or 17.02 first, then upgrade to 17.11.

(In reply to Marshall Garey from comment #3)
> Before doing anything by hand, let's try purging the database bit by bit and
> see if that works. It sounds like the purging is taking too long, so it
> isn't purging at all.
>
> Try setting the Purge*After values to a few months less than the number of
> months of the oldest records. For example, say you have records dating back
> 36 months, set all the Purge*After values to 33 months, restart the
> slurmdbd, force the rollup, and see if it purges. Then set the Purge*After
> values to 30, restart slurmdbd, rollup/purge. Rinse and repeat.
> You might be able to get away with more than 3-month increments, or might
> have to do less than 3-month increments, depending on how many jobs are in
> that time period.
>
> Can you let us know how that goes?

Ah, did not consider that. Sounds like a great plan, I'll get started and update you on how that goes.

> On upgrading, there are two very important points:
>
> 1. We've discovered a problem with mySQL 5.1 that makes the upgrade to 17.11
> EXTREMELY long. If you're running mySQL 5.1, please upgrade it before
> upgrading Slurm. We've been running an upgrade on mySQL 5.1 for a particular
> site's database for 5 or 6 days now, I think? I'm not sure if it has
> finished or not. But it finished very quickly on an upgraded (>5.1) version
> of mySQL.

Thanks for the heads-up. I'll make sure we have a more current version of MySQL on the new controller (we're migrating to new hardware and updating to Ubuntu 16.04 as well).

> 2. The database cannot be upgraded more than 2 versions at a time. Since
> you're running 15.08, you cannot upgrade directly to 17.11, which is 3
> versions higher. You'll need to upgrade to either 16.05 or 17.02 first, then
> upgrade to 17.11

Yup... we really let things go 8-/. I'd planned on going from 15.08 to 16.05 before going to 17... but if 17.02 is an option, that might be a better intermediate step. Thanks for the advice; I'll be in touch.

Michael

(In reply to Marshall Garey from comment #3)
> Before doing anything by hand, let's try purging the database bit by bit and
> see if that works. It sounds like the purging is taking too long, so it
> isn't purging at all.
>
> Try setting the Purge*After values to a few months less than the number of
> months of the oldest records. For example, say you have records dating back
> 36 months, set all the Purge*After values to 33 months, restart the
> slurmdbd, force the rollup, and see if it purges. Then set the Purge*After
> values to 30, restart slurmdbd, rollup/purge. Rinse and repeat.
Not having good luck, I'm afraid. I'm using `sacctmgr rollup`, but getting pretty consistent errors:

```
gadget[~]: sudo sacctmgr rollup 3/30/15 8/1/15 && date
sacctmgr: error: slurmdbd: Getting response to message type 1440
sacctmgr: SUCCESS
Wed Mar 14 18:00:08 PDT 2018
gadget[~]: sudo sacctmgr rollup 3/30/15 5/1/15 && date
sacctmgr: error: slurmdbd: Getting response to message type 1440
sacctmgr: SUCCESS
Wed Mar 14 18:15:48 PDT 2018
```

Querying the database shows that there are still jobs within those date ranges. Have I done it wrong?

Michael

Ah, I didn't realize you need to change your `<cluster>_last_ran_table` times in order to force the archive/purge. You don't need to change it all the way back, then, however. I got it to work by changing the times to yesterday morning:
```sql
UPDATE <clustername>_last_ran_table
SET hourly_rollup  = UNIX_TIMESTAMP('2018-03-14 00:00:00'),
    daily_rollup   = UNIX_TIMESTAMP('2018-03-14 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2018-03-14 00:00:00');
```
Then you'll need to restart the slurmdbd.
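For anyone adapting that UPDATE to a different cutoff, the epoch value can be sanity-checked from the shell first. This is only a sketch: it assumes GNU `date`, and note that MySQL's `UNIX_TIMESTAMP()` interprets the string in the session time zone, while the command below shows the UTC interpretation.

```shell
# Epoch seconds for the cutoff used in the UPDATE above,
# interpreted as UTC. Substitute whatever date you want the
# rollups rewound to.
cutoff='2018-03-14 00:00:00'
ts=$(date -u -d "$cutoff" +%s)
echo "$ts"
```

The printed number can then be compared against what `SELECT UNIX_TIMESTAMP('...')` returns on the database server to catch time-zone surprises before running the UPDATE.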
I thought sacctmgr rollup would work, but it appears it doesn't.
If that doesn't work, let's try the following. It's ideas from bug 4847 - they were having very similar problems as you.
1. Tuning. What are the values of the following variables?
innodb_buffer_pool_size
innodb_log_file_size
innodb_lock_wait_timeout
2. How much free space is on your disk? And what kind of storage device is it using (HDD, SATA or M.2 or NVMe SSD, ...)?
3. Can you upload a slurmdbd log file? I'd like to see if there are any errors besides just the one you posted in comment 5.
4. Your database hasn't crashed at all, correct? You're just trying to trim it down to help with the upgrade time?
5. You could try purging during downtime. Quoting Felip from 4847 comment 11:
"Waiting until the downtime is something I was gonna suggest also. Reducing the possible calls and interaction with the database is worth to do.
There's another option but less safe and is to change the slurmdbd port to avoid communication while it is archiving."
Also, `sacctmgr archive dump` is actually buggy at the moment. We have an internal ticket open to fix it; until then, don't use that option.
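For reference, the three variables from item 1 are server-side MySQL settings. A hypothetical my.cnf fragment, using the values this site reports later in the ticket, would look like the following (the section name and value formats are standard MySQL; the specific numbers are this site's, not a general recommendation):

```ini
[mysqld]
innodb_buffer_pool_size  = 2147483648   # 2 GiB
innodb_log_file_size     = 536870912    # 512 MiB
innodb_lock_wait_timeout = 900          # seconds
```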
> I thought sacctmgr rollup would work, but it appears it doesn't.

I was trying it again this afternoon, with even smaller intervals, and was at least getting "success" without the error messages. However, it doesn't seem to be removing any entries from the job table:

```
mysql> select count(*) from gizmo_job_table where time_start between
    unix_timestamp('2016-01-04T00:00:00') and unix_timestamp('2016-01-04T23:59:59');
+----------+
| count(*) |
+----------+
|    19881 |
+----------+
1 row in set (6.66 sec)

gadget[~]: sudo sacctmgr --immediate rollup 1/3/16 1/5/16
sacctmgr: SUCCESS

mysql> select count(*) from gizmo_job_table where time_start between
    unix_timestamp('2016-01-04T00:00:00') and unix_timestamp('2016-01-04T23:59:59');
+----------+
| count(*) |
+----------+
|    19881 |
+----------+
1 row in set (4.30 sec)
```

(In reply to Marshall Garey from comment #7)
> Ah, I didn't realize you need to change your <cluster>_last_ran_table times
> in order to force the archive/purge. You don't need to change it to all the
> way back, then, however. I got it to work by changing the times to yesterday
> morning:
>
> update <clustername>_last_ran_table SET hourly_rollup =
> UNIX_TIMESTAMP('2018-03-14 00:00:00'), daily_rollup =
> UNIX_TIMESTAMP('2018-03-14 00:00:00'), monthly_rollup =
> UNIX_TIMESTAMP('2018-03-14 00:00:00');
>
> Then you'll need to restart the slurmdbd.
>
> I thought sacctmgr rollup would work, but it appears it doesn't.
>
> If that doesn't work, let's try the following. It's ideas from bug 4847 -
> they were having very similar problems as you.
>
> 1. Tuning. What are the values of the following variables?
>
> innodb_buffer_pool_size
> innodb_log_file_size
> innodb_lock_wait_timeout

```
innodb_buffer_pool_size  | 2147483648
innodb_log_file_size     | 536870912
innodb_lock_wait_timeout | 900
```

> 2. How much free space is on your disk? And what kind of storage device is
> it using (HDD, Sata or M.2 or NVMe SSD, ...)?
```
/dev/sdb1       259G   36G  211G  15% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.8G  4.0K  7.8G   1% /dev
tmpfs           1.6G  1.6M  1.6G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.8G  4.0K  7.8G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/sda1       184G   42G  141G  23% /var/lib/mysql
```

It appears to be SATA.

> 3. Can you upload a slurmdbd log file? I'd like to see if there are any
> errors besides just the one you posted in comment 5.

Will do.

> 4. Your database hasn't crashed at all, correct? You're just trying to trim
> it down to help with the upgrade time?

I don't believe it has. And yes, just to help with the upgrade.

> 5. You could try purging during downtime. Quoting Felip from 4847 comment 11:
> "Waiting until the downtime is something I was gonna suggest also. Reducing
> the possible calls and interaction with the database is worth to do.
> There's another option but less safe and is to change the slurmdbd port to
> avoid communication while it is archiving."

I may try the latter just to see if I can actually get it to delete records, but until I can confirm that, I'm not super hopeful that waiting for downtime is a good idea.

> Also, sacctmgr archive dump is actually buggy at the moment. We have an
> internal ticket open to fix it, but don't use that option.

Roger that. I planned on doing a SQL dump/load.

Thanks

Created attachment 6399 [details]
Latest dbd log
> > 2. How much free space is on your disk? And what kind of storage device is
> > it using (HDD, Sata or M.2 or NVMe SSD, ...)?
>
> /dev/sdb1 259G 36G 211G 15% /
> none 4.0K 0 4.0K 0%
> /sys/fs/cgroup
> udev 7.8G 4.0K 7.8G 1% /dev
> tmpfs 1.6G 1.6M 1.6G 1% /run
> none 5.0M 0 5.0M 0% /run/lock
> none 7.8G 4.0K 7.8G 1% /run/shm
> none 100M 0 100M 0% /run/user
> /dev/sda1 184G 42G 141G 23%
> /var/lib/mysql
>
> It appears to be sata
Correction: it is an SSD drive.
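As an aside, HDD vs SSD can be confirmed on Linux from the kernel's rotational flag rather than guessed from the interface name. This is a sketch assuming a sysfs layout as on typical Linux hosts; the device names (sda, sdb) come from the df output above.

```shell
# Print the rotational flag for each block device:
# 0 = non-rotational (SSD/NVMe), 1 = rotational (HDD).
for f in /sys/block/*/queue/rotational; do
  [ -e "$f" ] || continue
  printf '%s: %s\n' "$f" "$(cat "$f")"
done
```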
Thanks for the info, I'll look through it and see what I can find.

Just to clarify, did you try changing the times in the `<cluster>_last_ran_table` and restarting the slurmdbd?

(In reply to Marshall Garey from comment #11)
> Thanks for the info, I'll look through it and see what I can find.
>
> Just to clarify, did you try changing the times in the
> <cluster>_last_ran_table and restarting the slurmdbd?

No. I will do that next.

> innodb_buffer_pool_size | 2147483648
> innodb_log_file_size | 536870912
> innodb_lock_wait_timeout | 900

I think these should be fine.

> /dev/sdb1 259G 36G 211G 15% /
> none 4.0K 0 4.0K 0% /sys/fs/cgroup
> udev 7.8G 4.0K 7.8G 1% /dev
> tmpfs 1.6G 1.6M 1.6G 1% /run
> none 5.0M 0 5.0M 0% /run/lock
> none 7.8G 4.0K 7.8G 1% /run/shm
> none 100M 0 100M 0% /run/user
> /dev/sda1 184G 42G 141G 23% /var/lib/mysql
>
> correction- it is a ssd drive.

Looks great - plenty of space and fast storage.

Looking through the slurmdbd log file, I see lots of these:

```
[2018-03-15T13:43:24.307] error: We have more time than is possible (2102400+64800+0)(2167200) > 2102400 for cluster gizmo(584) from 2016-05-12T21:00:00 - 2016-05-12T22:00:00 tres 2
```

Can you run `sacctmgr show runawayjobs`? (But don't "fix" them just yet.) Actually, I don't even remember whether `sacctmgr show runawayjobs` is in 15.08 or not - you can try it and find out.

I also saw these:

```
[2018-03-15T13:49:06.587] Warning: Note very large processing time from hourly_rollup for gizmo: usec=7861678 began=13:48:58.726
[2018-03-15T13:49:06.587] error: Cluster gizmo rollup failed
[2018-03-15T13:49:06.588] error: Processing last message from connection 18(127.0.0.1) uid(0)
[2018-03-15T13:49:06.588] error: Connection 18 experienced an error
[2018-03-15T13:49:14.195] error: mysql_query failed: 1317 Query execution was interrupted
```

I wonder if that's why your sacctmgr rollup commands were failing - other queries are getting in the way.
When purging, definitely try small time frames first, just to make sure the purge won't get interrupted and that records are actually being removed. I think it will work.

As an aside, even very large databases should upgrade fine on a MySQL version greater than 5.1. I don't think we've tested anything as large as 40 GB, but based on the ones we have tested, we expect it to take about 15 minutes for a fairly large database. I'll get some actual numbers so you can have a good idea of how long it should take. The conversion to 16.05 or 17.02 should be faster than the one to 17.11 as well.

Hi Michael,

How are things going? Have you been able to successfully purge and/or upgrade?

(In reply to Marshall Garey from comment #14)
> Hi Michael,
>
> How are things going? Have you been able to successfully purge and/or
> upgrade?

Hi again- I've been trying the approach recommended: reducing the archive times, updating the last_ran table, then restarting slurmdbd. These jobs _appear_ to finish, but I'm not seeing a reduction in rows.

I've got the replacement controller/dbd host running and have been trying a few things on it, hoping that the improved performance will help. It looks OK- about 2 hours to do the database update, with the jobs table taking 90 minutes or so (33 million rows).

At this point I am planning to move the database, upgrade, and then work on cleaning up. I think with the newer version and the "runaway" subcommand I will be in good shape for trimming the database after the upgrade. So let's close this issue for now; if I run into trouble later, we'll be on a supported version and in a much better place for resolution.

Thanks for the help

Michael

Sounds good. I'm glad you're able to update without too much trouble in a somewhat reasonable amount of time. Closing as resolved/infogiven.
Created attachment 6380 [details]
slurmdbd configuration file

We're way back on 15.08.7, getting ready to upgrade to 17.11 toward the end of the month. To that end, I've been trying to get the database cleaned up so that the upgrade doesn't take forever and a day. Currently ibdata1 is 40G, though I suspect a dump/reload will help that.

The problem I'm running into is that some records I would expect to have been purged don't seem to be getting purged. This may be due to records that have "0" for the start or eligible times. slurmdbd.conf has:

```
ArchiveEvents=yes
PurgeEventAfter=3
ArchiveJobs=yes
PurgeJobAfter=3
ArchiveResvs=yes
PurgeResvAfter=3
ArchiveSteps=yes
PurgeStepAfter=3
ArchiveSuspend=yes
PurgeSuspendAfter=3
```

The process ran this morning, but I still have job records dating back nearly three years:

```
$ sacct --allusers -S 2016-01-01 -E 2016-01-15 -o JobID,State,Submit,Start,Elapsed
JobID             State              Submit               Start     Elapsed
------------ ---------- ------------------- ------------------- -----------
28856186      NODE_FAIL 2015-12-19T18:52:02 2015-12-19T18:52:32 23-09:42:59
28985551        TIMEOUT 2015-12-24T14:47:20 2015-12-24T14:48:47  8-12:00:29
29088106        TIMEOUT 2015-12-27T16:12:24 2015-12-27T16:12:25  5-12:00:24
29099180        TIMEOUT 2015-12-28T10:49:13 2015-12-28T10:49:45  4-12:00:00
29099180.ba+  CANCELLED 2015-12-28T10:49:45 2015-12-28T10:49:45  4-12:00:01
29099180.0    CANCELLED 2015-12-28T10:49:55 2015-12-28T10:49:55  4-11:59:51
```

According to MySQL, there are 29 million rows with an end time before December 2017.
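One note on the Purge*After lines above: in slurmdbd.conf an unqualified purge value is interpreted as a number of months (hours or days require an explicit suffix), so the settings above are equivalent to this more explicit form (shown only for clarity, not as a suggested change):

```
ArchiveEvents=yes
PurgeEventAfter=3months
ArchiveJobs=yes
PurgeJobAfter=3months
ArchiveResvs=yes
PurgeResvAfter=3months
ArchiveSteps=yes
PurgeStepAfter=3months
ArchiveSuspend=yes
PurgeSuspendAfter=3months
```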
I am getting files created by the archive process in ArchiveDir, so I think that part is working OK, though the slurmdbd log did flag this error:

```
[2018-03-14T09:51:42.603] Warning: Note very large processing time from daily_rollup for gizmo: usec=29034794 began=09:51:13.568
[2018-03-14T10:09:15.561] error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
update "gizmo_event_table" set time_end=1521046454 where time_end=0 and state=1 and node_name='';
[2018-03-14T10:09:15.561] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
```

Given the size of the database this would seem to be expected. I'm hoping you have some advice on how to clean this up. At this point, I'd be fine removing old records by hand, but I'm not sure what all needs to be cleaned up. Minimally I'd think the job table and step table, but is there anything else?

Thank you

Michael