Ticket 4925

Summary: jobs not purged
Product: Slurm    Reporter: Michael Gutteridge <mrg>
Component: Database    Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN    QA Contact:
Severity: 3 - Medium Impact    Priority: ---
CC: felip.moll
Version: - Unsupported Older Versions
Hardware: Linux    OS: Linux
Site: FHCRC - Fred Hutchinson Cancer Research Center
Attachments: slurmdbd configuration file
slurm configuration file
Latest dbd log

Description Michael Gutteridge 2018-03-14 12:28:15 MDT
Created attachment 6380
slurmdbd configuration file

We're way back on 15.08.7, getting ready to upgrade to 17.11 towards the end of the month.  To that end, I've been trying to get the database cleaned up so that the upgrade doesn't take forever and a day.  Currently ibdata1 is 40G, though I suspect a dump/reload will help that.

The problem I'm running into is that some records don't seem to be getting purged that I would expect should have been.  This may be due to records that have "0" for the start or eligible times.

slurmdbd.conf has:

ArchiveEvents=yes
PurgeEventAfter=3

ArchiveJobs=yes
PurgeJobAfter=3

ArchiveResvs=yes
PurgeResvAfter=3

ArchiveSteps=yes
PurgeStepAfter=3

ArchiveSuspend=yes
PurgeSuspendAfter=3
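For reference, unsuffixed Purge*After values in slurmdbd.conf are interpreted as months, so the settings above keep roughly three months of data. The same policy with the units spelled out would look like:

```
# slurmdbd.conf sketch: identical policy with explicit units
# (a bare number such as "3" is read as months)
ArchiveJobs=yes
PurgeJobAfter=3months

ArchiveSteps=yes
PurgeStepAfter=3months
```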

The process ran this morning, but I still have job records dating back to nearly three years ago:

$ sacct --allusers -S 2016-01-01 -E 2016-01-15 -o JobID,State,Submit,Start,Elapsed
       JobID      State              Submit               Start    Elapsed
------------ ---------- ------------------- ------------------- ----------
28856186      NODE_FAIL 2015-12-19T18:52:02 2015-12-19T18:52:32 23-09:42:59
28985551        TIMEOUT 2015-12-24T14:47:20 2015-12-24T14:48:47 8-12:00:29
29088106        TIMEOUT 2015-12-27T16:12:24 2015-12-27T16:12:25 5-12:00:24
29099180        TIMEOUT 2015-12-28T10:49:13 2015-12-28T10:49:45 4-12:00:00
29099180.ba+  CANCELLED 2015-12-28T10:49:45 2015-12-28T10:49:45 4-12:00:01
29099180.0    CANCELLED 2015-12-28T10:49:55 2015-12-28T10:49:55 4-11:59:51

According to mysql, there are 29 million rows with an end time before December 2017.  I am getting files created by the archive process in ArchiveDir, so I think that's working OK, though the log for slurmdbd did flag this error:

[2018-03-14T09:51:42.603] Warning: Note very large processing time from daily_rollup for gizmo: usec=29034794 began=09:51:13.568
[2018-03-14T10:09:15.561] error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
update "gizmo_event_table" set time_end=1521046454 where time_end=0 and state=1 and node_name='';
[2018-03-14T10:09:15.561] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

Given the size of the database this would seem to be expected.
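A count like the 29-million figure above can be reproduced with a query along these lines (table name taken from this site's cluster name, "gizmo"; the cutoff date is illustrative):

```sql
-- Count job records whose end time predates the cutoff.
-- time_end is a Unix timestamp; 0 means the job never recorded an end.
SELECT COUNT(*)
FROM gizmo_job_table
WHERE time_end > 0
  AND time_end < UNIX_TIMESTAMP('2017-12-01 00:00:00');
```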

I'm hoping you have some advice on how to clean this up.  At this point, I'd be fine removing old records by hand, but I'm not sure what all needs to be cleaned up - minimally I'd think the job table and step table, but is there anything else?

Thank you

Michael
Comment 1 Michael Gutteridge 2018-03-14 12:28:40 MDT
Created attachment 6381
slurm configuration file
Comment 3 Marshall Garey 2018-03-14 15:25:14 MDT
Before doing anything by hand, let's try purging the database bit by bit and see if that works. It sounds like the purging is taking too long, so it isn't purging at all.

Try setting the Purge*After values to a few months less than the number of months of the oldest records. For example, say you have records dating back 36 months, set all the Purge*After values to 33 months, restart the slurmdbd, force the rollup, and see if it purges. Then set the Purge*After values to 30, restart slurmdbd, rollup/purge. Rinse and repeat.

You might be able to get away with more than 3-month increments, or might have to do less than 3-month increments, depending on how many jobs are in that time period.
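As a sketch, the first pass of that schedule would look like this in slurmdbd.conf (33 months is hypothetical here; pick it from the age of your own oldest records):

```
# Pass 1: keep 33 months, so only the oldest ~3 months get purged
PurgeEventAfter=33months
PurgeJobAfter=33months
PurgeResvAfter=33months
PurgeStepAfter=33months
PurgeSuspendAfter=33months
# ...restart slurmdbd, let the rollup/purge run, then lower to 30months and repeat
```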

Can you let us know how that goes?


On upgrading, there are two very important points:

1. We've discovered a problem with mySQL 5.1 that makes the upgrade to 17.11 EXTREMELY long. If you're running mySQL 5.1, please upgrade it before upgrading Slurm. We've been running an upgrade on mySQL 5.1 for a particular site's database for 5 or 6 days now, I think? I'm not sure if it has finished or not. But it finished very quickly on an upgraded (>5.1) version of mySQL.

2. The database cannot be upgraded more than 2 versions at a time. Since you're running 15.08, you cannot upgrade directly to 17.11, which is 3 versions higher. You'll need to upgrade to either 16.05 or 17.02 first, then upgrade to 17.11.
Comment 4 Michael Gutteridge 2018-03-14 15:37:51 MDT
(In reply to Marshall Garey from comment #3)
> Before doing anything by hand, let's try purging the database bit by bit and
> see if that works. It sounds like the purging is taking too long, so it
> isn't purging at all.
> 
> Try setting the Purge*After values to a few months less than the number of
> months of the oldest records. For example, say you have records dating back
> 36 months, set all the Purge*After values to 33 months, restart the
> slurmdbd, force the rollup, and see if it purges. Then set the Purge*After
> values to 30, restart slurmdbd, rollup/purge. Rinse and repeat.
> 
> You might be able to get away with more than 3-month increments, or might
> have to do less than 3-month increments, depending on how many jobs are in
> that time period.
> 
> Can you let us know how that goes?

Ah, did not consider that.  Sounds like a great plan, I'll get started and update you on how that goes.

> On upgrading, there are two very important points:
> 
> 1. We've discovered a problem with mySQL 5.1 that makes the upgrade to 17.11
> EXTREMELY long. If you're running mySQL 5.1, please upgrade it before
> upgrading Slurm. We've been running an upgrade on mySQL 5.1 for a particular
> site's database for 5 or 6 days now, I think? I'm not sure if it has
> finished or not. But it finished very quickly on an upgraded (>5.1) version
> of mySQL.

Thanks for the heads up.  I'll make sure we have a more current version of MySQL on the new controller (we're migrating to new hardware and updating to Ubuntu 16.04 as well)

> 2. The database cannot be upgraded more than 2 versions at a time. Since
> you're running 15.08, you cannot upgrade directly to 17.11, which is 3
> versions higher. You'll need to upgrade to either 16.05 or 17.02 first, then
> upgrade to 17.11

Yup... we really let things go 8-/.  I'd planned on going from 15.08 to 16.05 before going to 17... but if 17.02 is an option that might be a better intermediate step.

Thanks for the advice... I'll be in touch.

Michael
Comment 5 Michael Gutteridge 2018-03-14 19:30:20 MDT
(In reply to Marshall Garey from comment #3)
> Before doing anything by hand, let's try purging the database bit by bit and
> see if that works. It sounds like the purging is taking too long, so it
> isn't purging at all.
> 
> Try setting the Purge*After values to a few months less than the number of
> months of the oldest records. For example, say you have records dating back
> 36 months, set all the Purge*After values to 33 months, restart the
> slurmdbd, force the rollup, and see if it purges. Then set the Purge*After
> values to 30, restart slurmdbd, rollup/purge. Rinse and repeat.

Not having good luck, I'm afraid.  I'm using `sacctmgr rollup`, but getting pretty consistent errors:

gadget[~]: sudo sacctmgr rollup 3/30/15 8/1/15 && date
sacctmgr: error: slurmdbd: Getting response to message type 1440
sacctmgr: SUCCESS
Wed Mar 14 18:00:08 PDT 2018
gadget[~]: sudo sacctmgr rollup 3/30/15 5/1/15 && date
sacctmgr: error: slurmdbd: Getting response to message type 1440
sacctmgr: SUCCESS
Wed Mar 14 18:15:48 PDT 2018

Querying the database shows that there are still jobs within those date ranges.  Have I done it wrong?

Michael
Comment 7 Marshall Garey 2018-03-15 15:14:05 MDT
Ah, I didn't realize you need to change your <cluster>_last_ran_table times in order to force the archive/purge. You don't need to change it all the way back, though. I got it to work by changing the times to yesterday morning:

UPDATE <clustername>_last_ran_table
   SET hourly_rollup  = UNIX_TIMESTAMP('2018-03-14 00:00:00'),
       daily_rollup   = UNIX_TIMESTAMP('2018-03-14 00:00:00'),
       monthly_rollup = UNIX_TIMESTAMP('2018-03-14 00:00:00');

Then you'll need to restart the slurmdbd.

I thought sacctmgr rollup would work, but it appears it doesn't.



If that doesn't work, let's try the following. It's ideas from bug 4847 - they were having very similar problems as you.

1. Tuning. What are the values of the following variables?

innodb_buffer_pool_size
innodb_log_file_size
innodb_lock_wait_timeout
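These can be read from the mysql client with something like:

```sql
-- Show current InnoDB tuning values
-- (sizes are in bytes, the lock wait timeout is in seconds)
SHOW VARIABLES WHERE Variable_name IN
  ('innodb_buffer_pool_size', 'innodb_log_file_size', 'innodb_lock_wait_timeout');
```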

2. How much free space is on your disk? And what kind of storage device is it using (HDD, Sata or M.2 or NVMe SSD, ...)?

3. Can you upload a slurmdbd log file? I'd like to see if there are any errors besides just the one you posted in comment 5.

4. Your database hasn't crashed at all, correct? You're just trying to trim it down to help with the upgrade time?

5. You could try purging during downtime. Quoting Felip from 4847 comment 11:
"Waiting until the downtime is something I was gonna suggest also. Reducing the possible calls and interaction with the database is worth to do.
There's another option but less safe and is to change the slurmdbd port to avoid communication while it is archiving."


Also, sacctmgr archive dump is actually buggy at the moment. We have an internal ticket open to fix it, so don't use that option.
Comment 8 Michael Gutteridge 2018-03-15 16:08:20 MDT
> I thought sacctmgr rollup would work, but it appears it doesn't.

I was trying it again this afternoon, with even smaller intervals and was at least getting "success" without the error messages.  However, it doesn't seem to be removing any entries from the job table:

mysql> select count(*) from gizmo_job_table where time_start between unix_timestamp('2016-01-04T00:00:00') and unix_timestamp('2016-01-04T23:59:59');
+----------+
| count(*) |
+----------+
|    19881 |
+----------+
1 row in set (6.66 sec)

gadget[~]: sudo sacctmgr --immediate rollup 1/3/16 1/5/16
sacctmgr: SUCCESS

mysql> select count(*) from gizmo_job_table where time_start between unix_timestamp('2016-01-04T00:00:00') and unix_timestamp('2016-01-04T23:59:59');
+----------+
| count(*) |
+----------+
|    19881 |
+----------+
1 row in set (4.30 sec)


(In reply to Marshall Garey from comment #7)
> Ah, I didn't realize you need to change your <cluster>_last_ran_table times
> in order to force the archive/purge. You don't need to change it to all the
> way back, then, however. I got it to work by changing the times to yesterday
> morning:
> 
> update <clustername>_last_ran_table SET     hourly_rollup =
> UNIX_TIMESTAMP('2018-03-14 00:00:00'),      daily_rollup =
> UNIX_TIMESTAMP('2018-03-14 00:00:00'),     monthly_rollup =
> UNIX_TIMESTAMP('2018-03-14 00:00:00');
> 
> Then you'll need to restart the slurmdbd.
> 
> I thought sacctmgr rollup would work, but it appears it doesn't.
> 
> 
> 
> If that doesn't work, let's try the following. It's ideas from bug 4847 -
> they were having very similar problems as you.
> 
> 1. Tuning. What are the values of the following variables?
> 
> innodb_buffer_pool_size
> innodb_log_file_size
> innodb_lock_wait_timeout

  innodb_buffer_pool_size   | 2147483648                                                                                                              
  innodb_log_file_size      | 536870912                                                                                                               
  innodb_lock_wait_timeout  | 900
                                                                                                                    
> 
> 2. How much free space is on your disk? And what kind of storage device is
> it using (HDD, Sata or M.2 or NVMe SSD, ...)?

/dev/sdb1                                     259G   36G  211G  15% /
none                                          4.0K     0  4.0K   0% /sys/fs/cgroup
udev                                          7.8G  4.0K  7.8G   1% /dev
tmpfs                                         1.6G  1.6M  1.6G   1% /run
none                                          5.0M     0  5.0M   0% /run/lock
none                                          7.8G  4.0K  7.8G   1% /run/shm
none                                          100M     0  100M   0% /run/user
/dev/sda1                                     184G   42G  141G  23% /var/lib/mysql

It appears to be sata

> 3. Can you upload a slurmdbd log file? I'd like to see if there are any
> errors besides just the one you posted in comment 5.

will do

> 4. Your database hasn't crashed at all, correct? You're just trying to trim
> it down to help with the upgrade time?

I don't believe it has.  And yes, just to help with the upgrade.

> 5. You could try purging during downtime. Quoting Felip from 4847 comment 11:
> "Waiting until the downtime is something I was gonna suggest also. Reducing
> the possible calls and interaction with the database is worth to do.
> There's another option but less safe and is to change the slurmdbd port to
> avoid communication while it is archiving."

I may try the latter just to see if I can actually get it to delete records, but until I can confirm that, I'm not super hopeful that waiting for downtime is a good idea.

> Also, sacctmgr archive dump is actually buggy at the moment. We have an
> internal ticket open to fix it, but don't use that option.

Roger that.  I planned on doing a sql dump/load.

Thanks
Comment 9 Michael Gutteridge 2018-03-15 16:09:14 MDT
Created attachment 6399
Latest dbd log
Comment 10 Michael Gutteridge 2018-03-15 16:21:27 MDT
> > 2. How much free space is on your disk? And what kind of storage device is
> > it using (HDD, Sata or M.2 or NVMe SSD, ...)?
> 
> /dev/sdb1                                     259G   36G  211G  15% /
> none                                          4.0K     0  4.0K   0%
> /sys/fs/cgroup
> udev                                          7.8G  4.0K  7.8G   1% /dev
> tmpfs                                         1.6G  1.6M  1.6G   1% /run
> none                                          5.0M     0  5.0M   0% /run/lock
> none                                          7.8G  4.0K  7.8G   1% /run/shm
> none                                          100M     0  100M   0% /run/user
> /dev/sda1                                     184G   42G  141G  23%
> /var/lib/mysql
> 
> It appears to be sata

Correction - it is an SSD drive.
Comment 11 Marshall Garey 2018-03-15 16:23:58 MDT
Thanks for the info, I'll look through it and see what I can find.

Just to clarify, did you try changing the times in the <cluster>_last_ran_table and restarting the slurmdbd?
Comment 12 Michael Gutteridge 2018-03-15 16:25:34 MDT
(In reply to Marshall Garey from comment #11)
> Thanks for the info, I'll look through it and see what I can find.
> 
> Just to clarify, did you try changing the times int the
> <cluster>_last_ran_table and restarting the slurmdbd?

No.  I will do that next.
Comment 13 Marshall Garey 2018-03-15 17:26:48 MDT
>   innodb_buffer_pool_size   | 2147483648                                    
> 
>   innodb_log_file_size      | 536870912                                     
> 
>   innodb_lock_wait_timeout  | 900
I think these should be fine.


> /dev/sdb1                                     259G   36G  211G  15% /
> none                                          4.0K     0  4.0K   0%
> /sys/fs/cgroup
> udev                                          7.8G  4.0K  7.8G   1% /dev
> tmpfs                                         1.6G  1.6M  1.6G   1% /run
> none                                          5.0M     0  5.0M   0% /run/lock
> none                                          7.8G  4.0K  7.8G   1% /run/shm
> none                                          100M     0  100M   0% /run/user
> /dev/sda1                                     184G   42G  141G  23%
> /var/lib/mysql
> correction- it is a ssd drive.
Looks great - plenty of space and fast storage.



Looking through the slurmdbd log file:

I see lots of these:

[2018-03-15T13:43:24.307] error: We have more time than is possible (2102400+64800+0)(2167200) > 2102400 for cluster gizmo(584) from 2016-05-12T21:00:00 - 2016-05-12T22:00:00 tres 2

Can you run sacctmgr show runawayjobs? (But don’t “fix” them just yet.) Actually, I don't even remember if sacctmgr show runawayjobs is in 15.08 or not - you can try it and find out.

I also saw these:

[2018-03-15T13:49:06.587] Warning: Note very large processing time from hourly_rollup for gizmo: usec=7861678 began=13:48:58.726
[2018-03-15T13:49:06.587] error: Cluster gizmo rollup failed
[2018-03-15T13:49:06.588] error: Processing last message from connection 18(127.0.0.1) uid(0)
[2018-03-15T13:49:06.588] error: Connection 18 experienced an error
[2018-03-15T13:49:14.195] error: mysql_query failed: 1317 Query execution was interrupted

I wonder if that’s why your sacctmgr rollup commands were failing - other queries are getting in the way.

When purging, definitely try small time frames first just to make sure it won't get interrupted and records are getting purged. I think it will work.




As an aside, even very large databases should upgrade fine on a mySQL version greater than 5.1. I don't think we've tested anything as large as 40 GB, but based on the ones we have tested, we expect it to take about 15 minutes for a fairly large database. I'll get some actual numbers so you can have a good idea of how long it should take. The conversion to 16.05 or 17.02 should be faster than the one to 17.11 as well.
Comment 14 Marshall Garey 2018-03-23 09:48:40 MDT
Hi Michael,

How are things going? Have you been able to successfully purge and/or upgrade?
Comment 15 Michael Gutteridge 2018-03-23 11:02:11 MDT
(In reply to Marshall Garey from comment #14)
> Hi Michael,
> 
> How are things going? Have you been able to successfully purge and/or
> upgrade?

Hi again-

I've been trying the approach recommended: reducing the archive times, updating the last_ran table, then restarting slurmdbd.  These jobs _appear_ to finish, but I'm not seeing a reduction in rows.

I've got the replacement controller/dbd host running and have been trying a few things on this host with the hope that the improved performance will help.  Looks OK - about 2 hours to do the database update, with the jobs table taking 90 minutes or so (33 million rows).  At this point I am planning to move the database, upgrade, and then work on cleaning up.

I think with the newer version and the "runaway" subcommand I will be in good shape for trimming the database after the upgrade.  So let's close this issue for now and if I run into trouble later, we'll be on a supported version and in a much better place for resolution.

Thanks for the help

Michael
Comment 16 Marshall Garey 2018-03-23 11:04:11 MDT
Sounds good. I'm glad you're able to update without too much trouble in a somewhat reasonable amount of time. Closing as resolved/infogiven.