Ticket 5689

Summary: slurmdbd hang on ER_LOCK_WAIT_TIMEOUT
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: slurmdbd Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact    
Priority: --- CC: alex
Version: 17.11.8   
Hardware: Linux   
OS: Linux   
Site: Stanford
Attachments: primary slurmdbd log
backup slurmdbd log
mysql innodb status
slurmdbd logs
db query output

Description Kilian Cavalotti 2018-09-08 15:49:36 MDT
Hi, 

We just hit a new slurmdbd error since we converted the Purge intervals to hours instead of months. Slurmdbd logs an error about a lock wait timeout, then exits without anything else (no core dump). Here's the whole log, from start to end:

Sep  8 08:52:51 sh-sl01 slurmdbd[173470]: Accounting storage MYSQL plugin loaded
Sep  8 08:52:51 sh-sl01 slurmdbd[173502]: chdir to /var/tmp
Sep  8 08:52:55 sh-sl01 slurmdbd[173502]: slurmdbd version 17.11.8 started
Sep  8 08:52:58 sh-sl01 slurmdbd[173502]: DBD_JOB_START: cluster not registered
Sep  8 08:58:38 sh-sl01 slurmdbd[173502]: Warning: Note very large processing time from hourly_rollup for sherlock: usec=342369413 began=08:52:55.932
Sep  8 09:00:00 sh-sl01 slurmdbd[173502]: error: We have more time than is possible (82560259+3225600+0)(85785859) > 85521600 for cluster sherlock(23756) from 2018-09-08T08:00:00 - 2018-09-08T09:00:00 tres 1
Sep  8 09:05:35 sh-sl01 slurmdbd[173502]: Warning: Note very large processing time from hourly_rollup for sherlock: usec=335822388 began=09:00:00.146
Sep  8 10:05:38 sh-sl01 slurmdbd[173502]: Warning: Note very large processing time from hourly_rollup for sherlock: usec=337914644 began=10:00:00.973
Sep  8 11:00:11 sh-sl01 slurmdbd[173502]: error: We have more time than is possible (82471752+3225952+0)(85697704) > 85521600 for cluster sherlock(23756) from 2018-09-08T05:00:00 - 2018-09-08T06:00:00 tres 1
Sep  8 11:00:11 sh-sl01 slurmdbd[173502]: error: We have more time than is possible (82588212+3225600+0)(85813812) > 85521600 for cluster sherlock(23756) from 2018-09-08T06:00:00 - 2018-09-08T07:00:00 tres 1
Sep  8 11:00:11 sh-sl01 slurmdbd[173502]: error: We have more time than is possible (82666578+3225600+0)(85892178) > 85521600 for cluster sherlock(23756) from 2018-09-08T07:00:00 - 2018-09-08T08:00:00 tres 1
Sep  8 11:00:12 sh-sl01 slurmdbd[173502]: error: We have more time than is possible (82560259+3225600+0)(85785859) > 85521600 for cluster sherlock(23756) from 2018-09-08T08:00:00 - 2018-09-08T09:00:00 tres 1
Sep  8 11:19:49 sh-sl01 slurmdbd[173502]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_job_table" where time_submit <= 1504893599 && time_end != 0 LIMIT 50000
Sep  8 11:19:49 sh-sl01 slurmdbd[173502]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

It happened twice today already, but never before, as far as I can remember.

I changed the Purge intervals back to 365 days instead of hours, since that's the only recent change in our slurmdbd.conf (made based on a recommendation from bug 5662).
And since the failing transaction is a DELETE statement, I assume the timeouts are related to purges.

Do you have any recommendation to avoid this problem?

Thanks,
-- 
Kilian
Comment 1 Alejandro Sanchez 2018-09-10 02:15:57 MDT
A few ideas come to mind at first:

1. Temporarily increase innodb_lock_wait_timeout from 900 to 3600 or higher, at least until after a first hourly purge completes.

2. I'm not so convinced about this second idea, but perhaps set the purge to hourly while starting with a value equivalent to roughly 36 months, then, after a first purge completes, reduce it progressively until it reaches 8760 hours (12 months). In any case, I think the DELETE statement will traverse all rows to check whether their time falls into the range selected for deletion, which is why I'm not so convinced this would speed up the purge operation.

3. I can discuss internally whether there's a safe way to manually purge old jobs if neither of the first two options can be accomplished through Slurm, and/or whether it's possible to do something in the Slurm code.
Comment 3 Alejandro Sanchez 2018-09-10 10:34:45 MDT
Hi Kilian. After some internal discussion, we wouldn't expect the delete to take anywhere close to the 900 seconds. We'd like to ask you, though, to increase the log level to debug2 and attach the logs. Also, if you see any MariaDB traces that could help us pinpoint lock problems or the like, that'd be great:

MariaDB [(none)]> show engine innodb status\G

I think this reports information on semaphores, the latest detected deadlocks, transactions, who holds which locks, who is waiting for a lock to be granted, and things like that.
Comment 4 Kilian Cavalotti 2018-09-10 18:52:09 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #3)
> Hi Kilian. After some internal discussion, we wouldn't expect the delete to
> take anywhere close to the 900 seconds. We'd like though to ask you to
> increase to debug2 level and attach the logs. Also if you see any potential
> MariaDB traces that could help us pinpoint lock problems or such that'd be
> great:
> 
> MariaDB [(none)]> show engine innodb status\G
> 
> I think this reports information on semaphores, latest deadlocks detected,
> transactions, who holds what locks and who is waiting for a lock to be
> granted and things like that.


Thanks for the suggestions. I re-converted the purge intervals in hours, and enabled debug2. I'll attach logs if the timeout happens again, and post the output of "show engine innodb status" too.

Thanks
-- 
Kilian
Comment 5 Kilian Cavalotti 2018-09-12 10:22:25 MDT
> Thanks for the suggestions. I re-converted the purge intervals in hours, and
> enabled debug2. I'll attach logs if the timeout happens again, and post the
> output of "show engine innodb status" too.

It happened again. I'm attaching the logs for the primary slurmdbd (sh-sl01), which hung at 02:17am on this ER_LOCK_WAIT_TIMEOUT, and for the backup controller (sh-sl02), which took over and then stopped the same way about 6 hours later, while processing the same purge transaction (delete from _step_table).

I'm also attaching the output of "show engine innodb status\G" from MySQL on the backup controller, since it's the last one that was active.

Cheers,
-- 
Kilian
Comment 6 Kilian Cavalotti 2018-09-12 10:22:55 MDT
Created attachment 7813 [details]
primary slurmdbd log
Comment 7 Kilian Cavalotti 2018-09-12 10:23:15 MDT
Created attachment 7814 [details]
backup slurmdbd log
Comment 8 Kilian Cavalotti 2018-09-12 10:23:34 MDT
Created attachment 7815 [details]
mysql innodb status
Comment 9 Kilian Cavalotti 2018-09-13 08:44:09 MDT
Hi all, 

Both slurmdbd failed again last night. 
I changed back the Purge intervals to days (365days) for now.

Cheers,
-- 
Kilian
Comment 10 Alejandro Sanchez 2018-09-13 08:55:57 MDT
Do you have logs from the newest fatal?
Comment 11 Alejandro Sanchez 2018-09-17 10:20:00 MDT
Hi Kilian,

I'm still looking at any potential cause for the deadlock/timeout. In any case any further relevant logs/info would be appreciated.

Thanks.
Comment 12 Kilian Cavalotti 2018-09-17 10:34:54 MDT
Created attachment 7842 [details]
slurmdbd logs

Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #11)
> I'm still looking at any potential cause for the deadlock/timeout. In any
> case any further relevant logs/info would be appreciated.

Thanks! We changed our slurmdbd config back to use days instead of hours, so we haven't seen the issue again. Maybe because the purges haven't kicked in yet?

Which is a bit confusing to me, by the way: when we had those purge intervals in hours in slurmdbd.conf, the timeouts (and so, I assume, the purges too) were occurring more or less on a daily basis, not an hourly one.

Now that we have them set in days, they don't seem to have happened at all over the weekend. Is that expected? Shouldn't those purges happen every hour when the intervals are set in hours, and every day when they're set in days?

Cheers,
--
Kilian
Comment 13 Alejandro Sanchez 2018-09-18 10:58:34 MDT
(In reply to Kilian Cavalotti from comment #12)
> Which is a bit confusing to me, by the way: when we had those purge
> intervals in hours in slurmdbd.conf, the timeout (so, I assume the purges
> also), were occurring more or less on a daily basis, not on a hourly one.

Strange and interesting.
 
> Now that we have them set to days, they haven't seem to be happening at all
> over the week-end. Is that expected? 

No, it's not. I'm wondering whether, if you set DebugFlags=DB_ARCHIVE, you see relevant info related to this.

> Shouldn't those purges happen every
> hour when the intervals are set in hours, and every day when they're set in
> days?

Indeed. I'll revise the logic around this...
Comment 14 Kilian Cavalotti 2018-09-18 11:22:03 MDT
(In reply to Alejandro Sanchez from comment #13)
> > Now that we have them set to days, they haven't seem to be happening at all
> > over the week-end. Is that expected? 
> 
> No, it's not. I'm wondering if you set DebugFlags=DB_ARCHIVE if you see
> relevant info related to this.

Good point. I've set the debug flags and moved the purge intervals back to hours. I'll let you know what happens.

Thanks!
-- 
Kilian
Comment 15 Kilian Cavalotti 2018-09-18 19:21:19 MDT
(In reply to Kilian Cavalotti from comment #14)
> Good point. I've set the debug flags and moved the purge intervals back to
> hours. I'll let you know what happens.

So, apparently, the purges are correctly started at roughly the expected interval (hourly when the Purge values are set in hours):

Sep 18 13:10:39 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 18 13:15:25 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 1505764799 && time_end != 0 LIMIT 50000
Sep 18 13:18:46 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_job_table" where time_submit <= 1505764799 && time_end != 0 LIMIT 50000
Sep 18 13:20:46 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_assoc_usage_hour_table" where time_start <= 1505764799 LIMIT 50000
Sep 18 13:20:50 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_usage_hour_table" where time_start <= 1505764799 LIMIT 50000
Sep 18 14:12:42 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_event_table" where time_start <= 1505768399 && time_end != 0 LIMIT 50000
Sep 18 14:13:33 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 18 14:17:17 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 1505768399 && time_end != 0 LIMIT 50000
Sep 18 14:21:02 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_job_table" where time_submit <= 1505768399 && time_end != 0 LIMIT 50000
Sep 18 14:23:18 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_assoc_usage_hour_table" where time_start <= 1505768399 LIMIT 50000
Sep 18 14:23:23 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_usage_hour_table" where time_start <= 1505768399 LIMIT 50000
Sep 18 15:12:16 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_event_table" where time_start <= 1505771999 && time_end != 0 LIMIT 50000
Sep 18 15:13:02 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 18 15:16:30 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 1505771999 && time_end != 0 LIMIT 50000
Sep 18 15:19:57 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_job_table" where time_submit <= 1505771999 && time_end != 0 LIMIT 50000
Sep 18 15:22:06 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_assoc_usage_hour_table" where time_start <= 1505771999 LIMIT 50000
Sep 18 15:22:10 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_usage_hour_table" where time_start <= 1505771999 LIMIT 50000
Sep 18 16:14:04 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 18 16:19:04 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 1505775599 && time_end != 0 LIMIT 50000
Sep 18 16:23:57 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_job_table" where time_submit <= 1505775599 && time_end != 0 LIMIT 50000
Sep 18 16:26:02 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_assoc_usage_hour_table" where time_start <= 1505775599 LIMIT 50000
Sep 18 16:26:06 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_usage_hour_table" where time_start <= 1505775599 LIMIT 50000
Sep 18 17:00:14 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_event_table" where time_start <= 1505779199 && time_end != 0 LIMIT 50000
Sep 18 17:00:59 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 18 17:04:46 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 1505779199 && time_end != 0 LIMIT 50000
Sep 18 17:08:25 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_job_table" where time_submit <= 1505779199 && time_end != 0 LIMIT 50000
Sep 18 17:10:35 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_assoc_usage_hour_table" where time_start <= 1505779199 LIMIT 50000
Sep 18 17:10:39 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_usage_hour_table" where time_start <= 1505779199 LIMIT 50000
Sep 18 18:17:29 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_event_table" where time_start <= 1505782799 && time_end != 0 LIMIT 50000
Sep 18 18:18:15 sh-sl01 slurmdbd[89111]: 0(as_mysql_archive.c:3218) query#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000

which also shows that the timeout doesn't happen each time a purge starts.

Cheers,
-- 
Kilian
Comment 16 Alejandro Sanchez 2018-09-19 02:59:40 MDT
Looks promising. Can you corroborate that jobs/steps were actually deleted successfully with a SELECT COUNT? In case of a delete failure, you should be seeing messages like these:

error("Couldn't remove old data from %s table", sql_table);

or

error("Couldn't commit cluster (%s) purge", cluster_name);

but I guess you would have let us know if so. Assuming the size of the tables has been reduced, I think we could take another stab at upgrading.
Comment 17 Kilian Cavalotti 2018-09-19 11:08:31 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #16)
> Looks promising. Can you corroborate job/steps were actually deleted
> successfully with a select count? 

I'm not entirely sure, because I didn't check the number of records before setting the data retention to 1 year, but it seems to me that there's still more than a year's worth of job info in the database.

I'm going to try actually reducing the purge threshold to 6 months and then check how many records are kept in the DB.


> In case of delete failure, you should be
> seeing messages like these:
> 
> error("Couldn't remove old data from %s table", sql_table);
> 
> or
> 
> error("Couldn't commit cluster (%s) purge", cluster_name);
> 
> but I guess you would have let us know if so. 

Ok, I'll keep an eye out for those messages.

> Assuming the size of the
> tables has been reduced, I think we could give another stab at upgrading.

From your latest update in bug #5662, do I understand correctly that the upgrade took about 4 hours to complete? I will set up a test environment to practice the DB upgrade beforehand, because that will require scheduling a downtime for the whole system, and I'd be more comfortable knowing that the conversion worked on a copy of the DB before doing it on the production one.


Cheers,
-- 
Kilian
Comment 18 Alejandro Sanchez 2018-09-21 05:03:28 MDT
[2018-09-07T06:10:10.570] pre-converting step table for sherlock
[2018-09-07T11:23:42.922] pre-converting job table for sherlock
[2018-09-07T11:35:08.658] converting step table for sherlock
[2018-09-07T11:35:08.658] converting job table for sherlock
[2018-09-07T11:35:08.659] converting resv table for sherlock
[2018-09-07T11:35:08.659] converting cluster tables for sherlock
[2018-09-07T11:35:08.659] converting assoc table for sherlock
[2018-09-07T11:35:08.681] converting QOS table
[2018-09-07T11:35:08.681] Conversion done: success!
[2018-09-07T11:35:13.121] slurmdbd version 18.08.0 started

It took ~4h25m to upgrade, yes. Trying the conversion in a test environment sounds like a good idea. I'm very interested in your results. Please, let me know how it goes, thanks.
Comment 20 Kilian Cavalotti 2018-09-27 09:06:52 MDT
Hi,

We're still seeing those ER_LOCK_WAIT_TIMEOUT issues happening from time to time. Clearly not each time a purge occurs, but once every 2-3 days, more or less. Here are the latest occurrences:

Sep 12 08:17:31 sh-sl02 slurmdbd[118408]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 1505228399 && time_end != 0 LIMIT 50000
Sep 12 08:17:31 sh-sl02 slurmdbd[118408]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 12 20:18:23 sh-sl01 slurmdbd[36512]: debug2: DBD_GET_JOBS_COND: called
Sep 12 20:18:27 sh-sl01 slurmdbd[36512]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 12 20:18:27 sh-sl01 slurmdbd[36512]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 13 04:17:16 sh-sl02 slurmdbd[157728]: debug2: Error connecting slurm stream socket at 10.10.0.11:6819: Connection refused
Sep 13 04:17:17 sh-sl02 slurmdbd[157728]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 13 04:17:17 sh-sl02 slurmdbd[157728]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 14 00:17:18 sh-sl01 slurmdbd[109277]: debug2: DBD_GET_JOBS_COND: called
Sep 14 00:17:26 sh-sl01 slurmdbd[109277]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 14 00:17:26 sh-sl01 slurmdbd[109277]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 18 11:19:45 sh-sl01 slurmdbd[21701]: debug2: Closed connection 113 uid(398)
Sep 18 11:19:50 sh-sl01 slurmdbd[21701]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 18 11:19:50 sh-sl01 slurmdbd[21701]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 20 20:17:42 sh-sl01 slurmdbd[123796]: debug2: Closed connection 115 uid(398)
Sep 20 20:17:57 sh-sl01 slurmdbd[123796]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 20 20:17:57 sh-sl01 slurmdbd[123796]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 24 04:17:45 sh-sl01 slurmdbd[59989]: debug2: DBD_GET_JOBS_COND: called
Sep 24 04:18:10 sh-sl01 slurmdbd[59989]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_job_table" where time_submit <= 1522234799 && time_end != 0 LIMIT 50000
Sep 24 04:18:10 sh-sl01 slurmdbd[59989]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 24 14:16:25 sh-sl01 slurmdbd[93751]: debug2: Closed connection 41 uid(398)
Sep 24 14:16:25 sh-sl01 slurmdbd[93751]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 24 14:16:25 sh-sl01 slurmdbd[93751]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 25 23:16:50 sh-sl01 slurmdbd[155936]: debug2: Closed connection 75 uid(398)
Sep 25 23:17:00 sh-sl01 slurmdbd[155936]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 1522389599 && time_end != 0 LIMIT 50000
Sep 25 23:17:00 sh-sl01 slurmdbd[155936]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 26 12:16:48 sh-sl01 slurmdbd[119382]: debug2: DBD_GET_JOBS_COND: called
Sep 26 12:16:59 sh-sl01 slurmdbd[119382]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 1522436399 && time_end != 0 LIMIT 50000
Sep 26 12:16:59 sh-sl01 slurmdbd[119382]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 26 21:16:34 sh-sl01 slurmdbd[71425]: debug2: DBD_GET_JOBS_COND: called
Sep 26 21:17:00 sh-sl01 slurmdbd[71425]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 26 21:17:00 sh-sl01 slurmdbd[71425]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program


One thing I notice is that they all seem to be happening with the same time_start/time_submit thresholds (28800 and 1522436399). Is that expected?

I guess 28800 (01/01/1970) is probably hard-coded, but the other one is March 30, 2018, and it doesn't seem to match anything in our config (all our slurmdbd purge intervals are 4320 hours). Even if it did, the purge threshold on time_submit should increase at each purge interval, shouldn't it?

Thanks,
-- 
Kilian
Comment 21 Alejandro Sanchez 2018-09-27 11:19:25 MDT
From Epoch Converter:

1522436399
GMT: Friday, March 30, 2018 6:59:59 PM
Your time zone: Friday, March 30, 2018 8:59:59 PM GMT+02:00 DST
Relative: 6 months ago

which seems to match your configured 4320 hours, if I can still do math.

4320 hours * (1 day / 24 hours) * (1 month / 30 days) = 6 months, which matches the relative "6 months ago" of the configured interval.

I am not so sure that the time should be incremented at each purge interval. Inspecting the code, I see this before the DELETE query is generated:

                if (curr_end - record_start > MAX_ARCHIVE_AGE) {
                        time_t begin_next = _get_begin_next_month(record_start);
                        /* old stuff, catch up by archiving by month */
                        tmp_archive_period = SLURMDB_PURGE_MONTHS;
                        tmp_end = MIN(curr_end, begin_next);
                } else
                        tmp_end = curr_end;

This macro is defined as:

#define MAX_ARCHIVE_AGE (60 * 60 * 24 * 60) /* If archive data is older than
                                               this then archive by month to
                                               handle large datasets. */

It looks like, for records older than 1440 hours (24 * 60), which seems to be your case, the logic MINs curr_end against the beginning of the month following the oldest purgeable record, if I interpret things correctly.
Comment 22 Kilian Cavalotti 2018-09-27 11:38:58 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #21)

> 1522436399
> Your time zone: Friday, March 30, 2018 8:59:59 PM GMT+02:00 DST
> Relative: 6 months ago
> 
> which seems to match your configured 4320 hours if I still can do math.
> 
> 4320 hours * (1 day / 24 hours) * (1 month / 30 day) = 6 month which matches
> the relative 6 months ago off the configured interval.

Right, but shouldn't those 4320 hours be subtracted from the date at which the purge starts?
4320 hours ago today is not the same as 4320 hours ago last week. :) 
So to me, the purge threshold timestamps should be different each time the purge runs, since the time delta should be counted from NOW(), shouldn't they?

From the logs, that same SQL query with "time_submit <= 1522234799" happened at the following dates:
* Sep 24 04:18:10
* Sep 25 23:17:00
* Sep 26 12:16:59

1522234799 + 4320 hours = 1537786799, which is "Sep 24 03:59:59 PDT 2018". So it looks like the purge started on Sep 24 at ~4am was retried several times, on Sep 25 and Sep 26.


> I am not so sure if time should be incremented at each purge interval.
> Inspecting the code I see this before generating the DELETE query:
> 
>                 if (curr_end - record_start > MAX_ARCHIVE_AGE) {
>                         time_t begin_next =
> _get_begin_next_month(record_start);
>                         /* old stuff, catch up by archiving by month */
>                         tmp_archive_period = SLURMDB_PURGE_MONTHS;
>                         tmp_end = MIN(curr_end, begin_next);
>                 } else
>                         tmp_end = curr_end;
> 
> This macros is defined like
> 
> #define MAX_ARCHIVE_AGE (60 * 60 * 24 * 60) /* If archive data is older than
>                                                this then archive by month to
>                                                handle large datasets. */
> 
> It looks like stuff older than 1440 hours (24 * 60) which seems to be your
> case then the logic goes and MIN's the curr_end with the beginning of the
> next month from the oldest purgable record if I interpret things correctly.

Aaah, that would explain why hourly purges are using the same threshold, then.

But does that mean that for any purge interval larger than 60 days, there's no point in specifying it in hours, since the purge will proceed month by month anyway? Is that correct?

If that's the case, then right now we're just triggering the timeout issue much more often than we would with Purges set in months, since the purges start every hour but don't delete anything more than they would if set in months, right?

Cheers,
-- 
Kilian
Comment 23 Alejandro Sanchez 2018-10-04 06:06:45 MDT
I've been playing with the purge process this morning and looking at the code, and I now better understand this MAX_ARCHIVE_AGE 60-day thing. Let me try to help you understand it by walking through an example.

I've configured this:

PurgeJobAfter=4320hour (6 months, counting from the time the purge process runs. Since I used the 'hour' suffix, the purge process runs every hour, as documented).

So before the next hour, I submitted 3 bugs and manually UPDATE'd their time_submit, which is the column used as the reference for the PurgeJob case. This way I can emulate jobs that were submitted more than 6 months ago. So initially I had these 3 jobs:

alex@ibiza:~/t$ sacct -o jobid,submit
       JobID              Submit
------------ -------------------
20019        2018-10-03T11:55:32 <- not purgable (not older than 6 month)
20020        2018-02-04T10:19:11 <- purgable (older than 6 month)
20021        2017-10-04T11:55:32 <- purgable (older than 6 month)
alex@ibiza:~/t$

Theoretically, at the next hour the purge process should be executed, nested inside the rollup one. So I wait for the next hour and look at the logs:

[2018-10-04T13:00:00.486] debug2: running rollup at Thu Oct 04 13:00:00 2018 
# OK, next hour rollup process starts and purge/archive logs should follow

[2018-10-04T13:00:00.507] 0(as_mysql_archive.c:3012) query
select time_submit from "ibiza_job_table" where time_submit <= 1523098799 && time_end != 0 order by time_submit asc LIMIT 1
# Get the time_submit of the oldest purgeable job record, i.e. the oldest time_submit for any job where time_submit <= 6 months ago (the configured period). This returns job 20021, Submit 2017-10-04T11:55:32.

[2018-10-04T13:00:00.507] debug:  Purging ibiza_job_table before 1509490800
[2018-10-04T13:00:00.507] 0(as_mysql_archive.c:3218) query
delete from "ibiza_job_table" where time_submit <= 1509490800 && time_end != 0 LIMIT 50000
# This is the tricky step. curr_end is currently 6 months ago from now. If the difference between the oldest purgeable record's time_submit (2017-10-04T11:55:32) and curr_end is greater than 60 days, then instead of purging everything older than 6 months in this iteration, we purge everything older than the beginning of the month following the oldest record. Since the oldest record's time_submit is Oct 4th, 2017, this iteration purges everything (with a LIMIT of 50000 records) older than the beginning of the next month, so 1509490800 as logged in the message, which is November 1st, 2017. This is done to avoid purging/archiving, in one pass, a set of records spanning a period longer than 60 days, in order to handle large datasets.


[2018-10-04T13:00:00.512] 0(as_mysql_archive.c:3012) query
select time_submit from "ibiza_job_table" where time_submit <= 1523098799 && time_end != 0 order by time_submit asc LIMIT 1
# This is the next iteration. After the purge in the previous one, Slurm again gets the time_submit of the oldest purgeable job record. For this iteration, the oldest purgeable record is job 20020's 2018-02-04T10:19:11, since job 20021 was already purged in the previous iteration and job 20020's time_submit is now the oldest one older than 6 months.

[2018-10-04T13:00:00.512] debug:  Purging ibiza_job_table before 1519858800
[2018-10-04T13:00:00.512] 0(as_mysql_archive.c:3218) query
delete from "ibiza_job_table" where time_submit <= 1519858800 && time_end != 0 LIMIT 50000
# The tricky part again. Since the difference between job 20020's 2018-02-04T10:19:11 and 1523098799 (6 months ago, April 7, 2018) is greater than 60 days, we take the beginning of the month following job 20020's time_submit (1519858800 as logged in the message, March 1, 2018) and purge everything older than that. So job 20020 is now wiped out.

[2018-10-04T13:00:00.520] 0(as_mysql_archive.c:3012) query
select time_submit from "ibiza_job_table" where time_submit <= 1523098799 && time_end != 0 order by time_submit asc LIMIT 1
# In the next iteration we can't find any record older than 6 months ago, so the purge process finishes.

[2018-10-04T13:00:00.520] debug2: No need to roll cluster ibiza this day 1538604000 <= 1538604000
[2018-10-04T13:00:00.520] debug2: No need to roll cluster ibiza this month 1538344800 <= 1538344800
[2018-10-04T13:00:00.528] debug2: Got 1 of 1 rolled up
[2018-10-04T13:00:00.528] debug2: Everything rolled up

I hope this now makes sense to you and that the configured intervals are clearer. If you have more than 50000 records in one month, you could see more than one message giving the next month's timestamp as the reference for selecting purgable records.
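For illustration, the chunking logic described above can be sketched in Python. This is a toy model with a sorted list standing in for the job table, not Slurm's actual C code, and the function names are made up for the example:

```python
from datetime import datetime

SIXTY_DAYS = 60 * 86400  # the 60-day chunking threshold

def next_month_start(d: datetime) -> datetime:
    # beginning of the month following d
    return datetime(d.year + (d.month == 12), d.month % 12 + 1, 1)

def simulate_purge(submit_times, cutoff, limit=50000):
    """Toy model of the iterative purge: each pass deletes at most
    `limit` records older than either the cutoff or, if the oldest
    remaining record is more than 60 days older than the cutoff, the
    start of the month following that oldest record."""
    records = sorted(submit_times)
    passes = 0
    while records and records[0] <= cutoff:
        boundary = cutoff
        if cutoff - records[0] > SIXTY_DAYS:
            oldest = datetime.fromtimestamp(records[0])
            boundary = int(next_month_start(oldest).timestamp())
        # delete from job_table where time_submit <= boundary LIMIT `limit`
        doomed = [t for t in records if t <= boundary][:limit]
        records = records[len(doomed):]
        passes += 1
    return passes, records
```

With one record from October 2017 and one from February 2018 and a cutoff six months in the past, this walks through two passes, mirroring the two delete queries in the log above.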
Comment 24 Alejandro Sanchez 2018-10-04 06:10:34 MDT
(In reply to Alejandro Sanchez from comment #23)
...
> So before next hour, I submitted 3 bugs and ...

I submitted 3 jobs, not 3 bugs... :)
Comment 25 Kilian Cavalotti 2018-10-04 09:31:30 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #24)
> (In reply to Alejandro Sanchez from comment #23)
> ...
> > So before next hour, I submitted 3 bugs and ...
> 
> I submitted 3 jobs, not 3 bugs... :)

That's an awesome lapsus :)
But not as awesome as the detailed explanation you provided. That's extremely valuable, and indeed makes things much clearer now, thank you very much.

I know how difficult writing documentation can be, but this kind of description of the purge process would definitely have its place in the slurmdbd docs, especially the part about the 60-day threshold and the next-month "rounding".

Right now, since we've still been seeing the ER_LOCK_WAIT_TIMEOUT from time to time with Purge intervals in hours, we've changed them back to months, in order to minimize the frequency at which purges could fail and bring the whole slurmdbd down.

Plus, since we're archiving old records, having one archive file generated per hour was not really sustainable. :)

Cheers,
-- 
Kilian
Comment 26 Alejandro Sanchez 2018-10-05 07:00:29 MDT
(In reply to Kilian Cavalotti from comment #25)
> But not as awesome as the detailed explanation you provided. That's
> extremely valuable, and indeed makes things much clearer now, thank you very
> much.
> 
> I know how difficult writing documentation could be, but this kind of
> description of the purges process would definitely have its place in the
> slurmdbd docs, especially the part about the 60-day threshold and next-month
> "rounding".

I'll discuss documenting this internally, although at the end of the day it is an implementation detail: all records older than the configured value should be purged, whether they are purged in monthly chunks or, potentially, in a different way in the future.
 
> Right now, since we've still been seeing the ER_LOCK_WAIT_TIMEOUT from time
> to time with Purge intervals in hours, we've changed them back to months, in
> order to minimize the frequency at which purges could fail and bring the
> whole slurmdbd down.

OK. The original comment fatal message states

Sep  8 11:19:49 sh-sl01 slurmdbd[173502]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_job_table" where time_submit <= 1504893599 && time_end != 0 LIMIT 50000

but newer attached logs state

Sep 12 02:17:06 sh-sl01 slurmdbd[146907]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012delete from "sherlock_step_table" where time_start <= 28800 && time_end != 0 LIMIT 50000
Sep 12 02:17:06 sh-sl01 slurmdbd[146907]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

Note time_start <= 28800

Your time zone: Thursday, January 1, 1970 9:00:00 AM GMT+01:00

I'm wondering if there's a miscalculation in the curr_end variable, leading to the loop condition never evaluating to false and the loop never ending while holding the locks. The condition is

do {
...
} while (tmp_end < curr_end);

Or whether some step has a time_start incorrectly set to a value close to the Unix epoch, like 28800.
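As a quick sanity check (illustrative Python, not part of Slurm), the suspicious time_start value decodes to just 8 hours past the Unix epoch, which is why the converter above renders it as 9:00 AM in a GMT+01:00 zone:

```python
from datetime import datetime, timedelta, timezone

ts = 28800  # the time_start value from the failing delete query
utc = datetime.fromtimestamp(ts, tz=timezone.utc)
print(utc)  # 1970-01-01 08:00:00+00:00
# rendered in a GMT+01:00 zone, as in the conversion above:
print(utc.astimezone(timezone(timedelta(hours=1))))  # 1970-01-01 09:00:00+01:00
```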

> Plus, since we're archiving old records, having one archive file generated
> per hour was not really sustainable. :)

Let's see what I find.
Comment 33 Alejandro Sanchez 2018-10-11 10:09:27 MDT
Kilian, could you attach the output of these?

MariaDB [slurm_acct_db_5662]> select job_db_inx,id_step,time_start from sherlock_step_table order by time_start desc limit 100;

MariaDB [slurm_acct_db_5662]> select job_db_inx,id_job,time_submit from sherlock_job_table order by time_submit asc limit 100;

Thanks.
Comment 34 Alejandro Sanchez 2018-10-11 10:10:09 MDT
(In reply to Alejandro Sanchez from comment #33)
> Kilian, could you attach the output of these?
> 
> MariaDB [slurm_acct_db_5662]> select job_db_inx,id_step,time_start from
> sherlock_step_table order by time_start desc limit 100;

order by ASC in both queries, sorry.
Comment 36 Kilian Cavalotti 2018-10-11 10:22:26 MDT
(In reply to Alejandro Sanchez from comment #34)
> (In reply to Alejandro Sanchez from comment #33)
> > Kilian, could you attach the output of these?
> > 
> > MariaDB [slurm_acct_db_5662]> select job_db_inx,id_step,time_start from
> > sherlock_step_table order by time_start desc limit 100;
> 
> order by ASC in both queries, sorry.

Sure:

MariaDB [slurm_acct_db]> select job_db_inx,id_step,time_start from sherlock_step_table order by time_start asc limit 100;
+------------+---------+------------+
| job_db_inx | id_step | time_start |
+------------+---------+------------+
|  506953033 |      -2 |          0 |
|  506455957 |      -2 |          0 |
|  506457917 |      -2 |          0 |
|  506458069 |      -2 |          0 |
|  506460885 |      -2 |          0 |
|  506497929 |      -2 |          0 |
|  506497993 |      -2 |          0 |
|  506566429 |      -2 |          0 |
|  506677481 |      -2 |          0 |
|  506727193 |      -2 |          0 |
|  506727197 |      -2 |          0 |
|  506414293 |      -2 |          0 |
|  506412737 |      -2 |          0 |
|  507083229 |      -2 |          0 |
|  506917357 |      -2 |          0 |
|  506977305 |      -2 |          0 |
|  506990065 |      -2 |          0 |
|  507020305 |      -2 |          0 |
|  507034205 |      -2 |          0 |
|  507083437 |      -2 |          0 |
|  507059321 |      -2 |          0 |
|  507034201 |      -2 |          0 |
|  506919249 |      -2 |          0 |
|  505873369 |      -2 |          0 |
|  506730917 |      -2 |          0 |
|  506734041 |      -2 |          0 |
|  506868781 |      -2 |          0 |
|  506872193 |      -2 |          0 |
|  506872209 |      -2 |          0 |
|  506884801 |      -2 |          0 |
|  506898505 |      -2 |          0 |
|  506913329 |      -2 |          0 |
|  506913333 |      -2 |          0 |
|  506919801 |      -2 |          0 |
|  506933705 |      -2 |          0 |
|  506936549 |      -2 |          0 |
|  506866065 |      -2 |          0 |
|  506801741 |      -2 |          0 |
|  506801733 |      -2 |          0 |
|  506734049 |      -2 |          0 |
|  506736233 |      -2 |          0 |
|  506736297 |      -2 |          0 |
|  506736305 |      -2 |          0 |
|  506779277 |      -2 |          0 |
|  506784337 |      -2 |          0 |
|  506784349 |      -2 |          0 |
|  506793217 |      -2 |          0 |
|  506793225 |      -2 |          0 |
|  506801697 |      -2 |          0 |
|  506938917 |      -2 |          0 |
|  506964397 |      -2 |          0 |
|  507127081 |      -2 |          0 |
|  507127069 |      -2 |          0 |
|  507127029 |      -2 |          0 |
|  507127021 |      -2 |          0 |
|  507127017 |      -2 |          0 |
|  507127013 |      -2 |          0 |
|  507083441 |      -2 |          0 |
|  507085365 |      -2 |          0 |
|  507085573 |      -2 |          0 |
|  507085577 |      -2 |          0 |
|  507118941 |      -2 |          0 |
|  507127065 |      -2 |          0 |
|  507127073 |      -2 |          0 |
|  507127161 |      -2 |          0 |
|  507127089 |      -2 |          0 |
|  507127153 |      -2 |          0 |
|  507127097 |      -2 |          0 |
|  507127101 |      -2 |          0 |
|  507127085 |      -2 |          0 |
|  507127105 |      -2 |          0 |
|  507127093 |      -2 |          0 |
|  507127077 |      -2 |          0 |
|  507127025 |      -2 |          0 |
|  507101445 |      -2 |          0 |
|  507125905 |      -2 |          0 |
|  507125913 |      -2 |          0 |
|  506801749 |      -2 |          0 |
|  506891509 |      -2 |          0 |
|  506873145 |      -2 |          0 |
|  506841957 |      -2 |          0 |
|  506766217 |      -2 |          0 |
|  506454985 |      -2 |          0 |
|  506944281 |      -2 |          0 |
|  506913465 |      -2 |          0 |
|  506944677 |      -2 |          0 |
|  506953969 |      -2 |          0 |
|  506755421 |      -2 |          0 |
|  506793221 |      -2 |          0 |
|  506801629 |      -2 |          0 |
|  507126685 |      -2 |          0 |
|  507127001 |      -2 |          0 |
|  507127005 |      -2 |          0 |
|  507127009 |      -2 |          0 |
|  506734045 |      -2 |          0 |
|  506734053 |      -2 |          0 |
|  506726825 |      -2 |          0 |
|  506414325 |      -2 |          0 |
|  506726829 |      -2 |          0 |
|  506435717 |      -2 |          0 |
+------------+---------+------------+
100 rows in set (37.53 sec)

And

MariaDB [slurm_acct_db]> select job_db_inx,id_job,time_submit from sherlock_job_table order by time_submit asc limit 100;
+------------+----------+-------------+
| job_db_inx | id_job   | time_submit |
+------------+----------+-------------+
|  332580904 | 13125706 |  1522602001 |
|  332580988 | 13125712 |  1522602007 |
|  332580985 | 13125711 |  1522602007 |
|  332580982 | 13125710 |  1522602007 |
|  332580979 | 13125709 |  1522602007 |
|  332580976 | 13125708 |  1522602007 |
|  332580973 | 13125707 |  1522602007 |
|  332580991 | 13125713 |  1522602009 |
|  332581132 | 13125715 |  1522602018 |
|  332581201 | 13125716 |  1522602024 |
|  332581216 | 13125721 |  1522602025 |
|  332581213 | 13125720 |  1522602025 |
|  332581210 | 13125719 |  1522602025 |
|  332581207 | 13125718 |  1522602025 |
|  332581204 | 13125717 |  1522602025 |
|  332581219 | 13125722 |  1522602026 |
|  332581360 | 13125724 |  1522602036 |
|  332581447 | 13125731 |  1522602042 |
|  332581444 | 13125730 |  1522602042 |
|  332581441 | 13125729 |  1522602042 |
|  332581438 | 13125728 |  1522602042 |
|  332581435 | 13125727 |  1522602042 |
|  332581432 | 13125726 |  1522602042 |
|  332581381 | 13008848 |  1522602044 |
|  332581513 | 13125732 |  1522602053 |
|  332581582 | 13125734 |  1522602058 |
|  332581579 | 13125733 |  1522602058 |
|  332581585 | 13125735 |  1522602059 |
|  332581588 | 13125736 |  1522602059 |
|  332581591 | 13125737 |  1522602059 |
|  332581594 | 13125738 |  1522602059 |
|  332581597 | 13125739 |  1522602060 |
|  332581729 | 13125740 |  1522602069 |
|  332581795 | 13125742 |  1522602076 |
|  332581798 | 13125743 |  1522602076 |
|  332581801 | 13125744 |  1522602076 |
|  332581804 | 13125745 |  1522602076 |
|  332581807 | 13125746 |  1522602076 |
|  332581810 | 13125747 |  1522602076 |
|  332581813 | 13125748 |  1522602078 |
|  332581942 | 13125749 |  1522602087 |
|  332582023 | 13125755 |  1522602093 |
|  332582020 | 13125754 |  1522602093 |
|  332582017 | 13125753 |  1522602093 |
|  332582014 | 13125752 |  1522602093 |
|  332582011 | 13125751 |  1522602093 |
|  332582008 | 13125750 |  1522602093 |
|  332582026 | 13125756 |  1522602095 |
|  332582155 | 13125757 |  1522602104 |
|  332582236 | 13125763 |  1522602110 |
|  332582233 | 13125762 |  1522602110 |
|  332582230 | 13125761 |  1522602110 |
|  332582227 | 13125760 |  1522602110 |
|  332582224 | 13125759 |  1522602110 |
|  332582221 | 13125758 |  1522602110 |
|  332582239 | 13125764 |  1522602112 |
|  332582431 | 13125765 |  1522602142 |
|  332582446 | 13125770 |  1522602145 |
|  332582443 | 13125769 |  1522602145 |
|  332582440 | 13125768 |  1522602145 |
|  332582437 | 13125767 |  1522602145 |
|  332582434 | 13125766 |  1522602145 |
|  332582449 | 13125771 |  1522602146 |
|  332582578 | 13125772 |  1522602156 |
|  332582644 | 13125773 |  1522602162 |
|  332582647 | 13125774 |  1522602162 |
|  332582650 | 13125775 |  1522602162 |
|  332582659 | 13125778 |  1522602163 |
|  332582656 | 13125777 |  1522602163 |
|  332582653 | 13125776 |  1522602163 |
|  332582662 | 13125779 |  1522602164 |
|  332582728 | 13125780 |  1522602173 |
|  332582809 | 13125786 |  1522602179 |
|  332582806 | 13125785 |  1522602179 |
|  332582803 | 13125784 |  1522602179 |
|  332582800 | 13125783 |  1522602179 |
|  332582797 | 13125782 |  1522602179 |
|  332582794 | 13125781 |  1522602179 |
|  332582875 | 13125787 |  1522602180 |
|  332582941 | 13125788 |  1522602190 |
|  332583022 | 13125794 |  1522602196 |
|  332583019 | 13125793 |  1522602196 |
|  332583016 | 13125792 |  1522602196 |
|  332583013 | 13125791 |  1522602196 |
|  332583010 | 13125790 |  1522602196 |
|  332583007 | 13125789 |  1522602196 |
|  332583025 | 13125795 |  1522602198 |
|  332583154 | 13125796 |  1522602207 |
|  332583235 | 13125802 |  1522602213 |
|  332583232 | 13125801 |  1522602213 |
|  332583229 | 13125800 |  1522602213 |
|  332583226 | 13125799 |  1522602213 |
|  332583223 | 13125798 |  1522602213 |
|  332583220 | 13125797 |  1522602213 |
|  332583238 | 13125803 |  1522602215 |
|  332583367 | 13125804 |  1522602224 |
|  332583442 | 13125808 |  1522602230 |
|  332583439 | 13125807 |  1522602230 |
|  332583436 | 13125806 |  1522602230 |
|  332583445 | 13125809 |  1522602230 |
+------------+----------+-------------+
100 rows in set (31.64 sec)

Cheers,
-- 
Kilian
Comment 37 Alejandro Sanchez 2018-10-11 10:28:26 MDT
I find it very strange that you have batch steps with time_start 0. I suspect there's a bug somewhere producing such steps with no time_start, and that this may have an impact on the do-while loop conditions mentioned in comment 25. Let's see what we find.
Comment 38 Kilian Cavalotti 2018-10-11 10:37:21 MDT
(In reply to Alejandro Sanchez from comment #37)
> I find it very strange that you have batch steps with time_start 0. I
> suspect there's a bug somewhere deriving in such steps with no time_start
> and potentially this having an impact on the do while statements conditions
> mentioned in comment 25. Let's see what we find.

Indeed. Shouldn't they be detected (and potentially cleaned up) by "sacctmgr list runaway"?
Comment 39 Alejandro Sanchez 2018-10-12 05:51:16 MDT
(In reply to Kilian Cavalotti from comment #38)
> (In reply to Alejandro Sanchez from comment #37)
> > I find it very strange that you have batch steps with time_start 0. I
> > suspect there's a bug somewhere deriving in such steps with no time_start
> > and potentially this having an impact on the do while statements conditions
> > mentioned in comment 25. Let's see what we find.
> 
> Indeed. Shouldn't they be detected (and potentially cleaned up) by "sacctmgr
> list runaway"?

No, since the runawayjobs logic doesn't inspect the step table but the job table.
Comment 40 Kilian Cavalotti 2018-11-16 15:43:33 MST
Hi!

Just a quick update that we're still seeing that error, and the resulting slurmdbd crash, on a regular basis (now monthly based on our current purge interval settings).

Would manually removing job steps with time_start = 0 from the step table help avoid those errors? 


Cheers,
-- 
Kilian
Comment 41 Alejandro Sanchez 2018-11-19 07:54:17 MST
(In reply to Kilian Cavalotti from comment #40)
> Hi!
> 
> Just a quick update that we're still seeing that error, and the resulting
> slurmdbd crash, on a regular basis (now monthly based on our current purge
> interval settings).

Yeah, I guess that no matter the periodicity of the purges, slurmdbd eventually hits this.
 
> Would manually removing job steps with start_time = 0 from the steps table
> help avoid those errors? 

I'm not 100% confident about that, since I ran tests manually setting finished steps' time_start to 0 and re-running the purge process, and that didn't cause the ER_LOCK_WAIT_TIMEOUT. So far I haven't been able to reproduce it, so I'm not yet ready to prepare a fix that couldn't be verified. First I need to work further on this and identify who/what is holding the access locks on the tables and making the purge process time out.
 
> 
> Cheers,
> -- 
> Kilian
Comment 42 Alejandro Sanchez 2019-01-03 05:46:29 MST
Hi Kilian. In the other bug 5662, Galera was impeding a proper slurmdbd upgrade. I'm wondering if Galera is causing issues with locks on the job/step tables while the purging/archive logic is running. Would it be too much effort for you to temporarily (or permanently) remove Galera from your setup and see if that lets the purging process finish successfully? If the problem persists, at least we'll know it's not due to the Galera setup this time. Thanks, and sorry for the delay on this one.
Comment 43 Kilian Cavalotti 2019-01-11 12:34:29 MST
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #42)
> Hi Kilian. In the other bug 5662 Galera was impeding proper slurmdbd
> upgrade. I'm wondering if Galera is causing issues with locks to the
> job/step table while the purging/archive logic is running. 

That's a good point, indeed.

> Do you think
> it'll be too much of an effort for you to try to temporarily/definitely
> remove Galera from your setup and see if that lets the purging process
> finish successfully? if problem still persists, at least we know it's not
> due to this Galera setup this time. 

I can try to do this, but then we would basically lose the basis for our primary/secondary SlurmDBD setup.

Which makes me wonder how this is supposed to be implemented in the first place. Since Galera clusters are the default and easiest way to implement active/active MySQL setups, and given that they apparently cause a variety of problems with Slurm, what's the recommended architecture for a redundant SlurmDBD?

To give more context, our primary goal is to have an environment where we can service servers (updating kernels and firmware, rebooting...) without impacting production. That's why we need a way to fail Slurm services over to another node while we're servicing the primary one, and Galera was the obvious choice for SlurmDBD.
And our users rely pretty heavily on sacct, so we unfortunately can't afford to have the slurmdbd daemon unavailable for extended periods of time.

Thanks!
-- 
Kilian
Comment 44 Alejandro Sanchez 2019-01-14 00:32:30 MST
Our recommendation for slurmdbd HA is paradoxically no HA at all[1]:

For slurmdbd, the critical element in the failure domain is MySQL/MariaDB, not slurmdbd. slurmdbd itself is stateless.

slurmctld will cache accounting records (up to a limit) if slurmdbd is unavailable. Depending on your system, this can cover hours to days without data loss. More specifically, the number of messages slurmctld will cache for the dbd is:

MAX(10000, ((MaxJobCount * 2) + (#Nodes * 4)));
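That formula is easy to evaluate for a given configuration. A quick illustrative check (with hypothetical numbers, not any particular site's):

```python
def max_cached_dbd_messages(max_job_count: int, node_count: int) -> int:
    """Upper bound on accounting messages slurmctld queues while
    slurmdbd is unreachable, per the formula quoted above."""
    return max(10000, (max_job_count * 2) + (node_count * 4))

# e.g. MaxJobCount=10000 on a hypothetical 1500-node cluster:
print(max_cached_dbd_messages(10000, 1500))  # 26000
```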

IMNSHO, the additional complexity of a redundant MySQL/MariaDB deployment is more likely to cause an outage than to prevent one. So don't bother setting up a redundant slurmdbd; keep slurmdbd + MySQL/MariaDB local to a single server.

If you still need to do live maintenance on the host where slurmdbd/MySQL/MariaDB runs, you can either let slurmctld cache the messages to the dbd or use a temporary host to run the dbd/database and change [Accounting]StorageHost accordingly.

With respect to Galera, I've seen at least 3-4 bugs where it was the root cause of issues. I'm curious to see if the purge/archive process works properly without it. If not, it'd require more investigation from our side.

[1] https://slurm.schedmd.com/SLUG18/field_notes2.pdf
Comment 45 Kilian Cavalotti 2019-01-15 11:03:55 MST
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #44)
> Our recommendation for slurmdbd HA is paradoxically no HA at all[1]:

:)

> For slurmdbd, the critical element in the failure domain is MySQL/MariaDB,
> not slurmdbd. slurmdbd itself is stateless.

Indeed. Which is why a MySQL Galera cluster is a pretty easy way to make the db part more resilient. It works great in many cases; we use it routinely for other things, like a shared session cache for load-balanced Shibboleth SPs, without any issue.

> slurmctld will cache accounting records (up to a limit) if slurmdbd is
> unavailable. 
> This can be hours+ to days+ depending on your system without
> data loss. More specifically, the amount of ctld cached messages to dbd is:
> MAX(10000, ((MaxJobCount * 2) + (#Nodes * 4)));

Accounting record caching is definitely helpful, but our users rely on the Slurm database for many other use cases, such as using sacct to get information about their past jobs, sreport for stats and reporting, or wanting to check the various QOS settings with sacctmgr.

> IMNSHO, the additional complexity of a redundant MySQL/MariaDB deployment is
> more likely to cause an outage than it is to prevent one. So don’t bother
> setting up a redundant slurmdbd, keep slurmdbd + MySQL/MariaDB local to a
> single server.

I tend to disagree: the additional complexity of setting up a Galera cluster is literally a 5-line configuration file. And it's pretty reliable overall, the only problematic interaction we've seen so far is with slurmdbd, unfortunately.

> If you still need to do live maintenance on the host where
> slurmdbd/MySQL/MariaDB runs, you can either let ctld cache the messages to
> the dbd or use a temporal host to run dbd/database and change
> [Accounting]StorageHost accordingly.

Using another temporary dbd/db host would mean synchronizing the actual database to that host before switching, and guess what's the easiest way to do that? Setting up a Galera cluster. ;)

> With respect to Galera, I've seen at least 3-4 bugs where it was the root
> cause of issues. I'm curious to see if the purge/archive process works
> properly without it. If not, it'd require more investigation from our side.

Right now, we're still running SlurmDBD on top of our MariaDB/Galera cluster, and we haven't seen the ER_LOCK_WAIT_TIMEOUT error since we upgraded to 18.08 on Nov. 28th. Since we still have our Purge* settings set to "6months" in slurmdbd.conf, purges should be happening every month, and one has likely occurred already since the Nov. 28th upgrade. 

So bottom line, we would love to keep our Galera setup in place, and would appreciate it if it could become more officially supported, as it provides a lot of benefits (replication, high availability, etc.) for a very minimal installation effort. 

Thanks,
--
Kilian
Comment 46 Alejandro Sanchez 2019-01-16 03:09:09 MST
(In reply to Kilian Cavalotti from comment #45)
> I tend to disagree: the additional complexity of setting up a Galera cluster
> is literally a 5-line configuration file. And it's pretty reliable overall,
> the only problematic interaction we've seen so far is with slurmdbd,
> unfortunately.
> 
> > If you still need to do live maintenance on the host where
> > slurmdbd/MySQL/MariaDB runs, you can either let ctld cache the messages to
> > the dbd or use a temporal host to run dbd/database and change
> > [Accounting]StorageHost accordingly.
> 
> Using another temporal dbd/db host would mean synchronizing the actual
> database to that host before switching, and guess what's the easiest way to
> do this? Setting up a Galera cluster. ;)

I'm not saying it's not a good technology. Just that, from our experience with previous bugs (i.e. bug 5662), it's yet another layer external to Slurm that sometimes keeps slurmdbd from working properly.
 
> > With respect to Galera, I've seen at least 3-4 bugs where it was the root
> > cause of issues. I'm curious to see if the purge/archive process works
> > properly without it. If not, it'd require more investigation from our side.
> 
> Right now, we're still running SlurmDBD on top of our MariaDB/Galera
> cluster, and we haven't seen the ER_LOCK_WAIT_TIMEOUT error since we
> upgraded to 18.08 on Nov. 28th. Since we still have our Purge* settings set
> to "6months" in slurmdbd.conf, the purge should be happening every month,
> and they likely occurred once already since the Nov. 28th upgrade. 

So can you confirm this issue is currently gone? Are jobs older than 6 months being purged?

> So bottom line, we would love to keep our Galera setup in place, and would
> appreciate if it could become supported more officially, as it provides a
> lots of benefits (replication, high-availability, etc) for a very minimal
> installation effort. 

You can keep whatever setup best fits your needs. We just make config/architecture recommendations based upon our experience, not necessarily a universal truth that works for everybody.

While I've made efforts to set up a Galera cluster with a replica of your db, and to debug an HA approach similar to yours, it's yet another framework/layer adding complexity to the whole system. The config might be 5 lines, but the underlying interactions with the database fall a bit outside our Slurm support scope.
Comment 47 Kilian Cavalotti 2019-01-16 09:24:42 MST
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #46)
> So can you confirm this issue is gone currently? Are jobs older than 6month
> purged?

I believe that the last purge cycle ran without any issue, yes: the archive files were created on Jan 1st, and no ER_LOCK_WAIT_TIMEOUT has been logged by slurmdbd.

Is there a SQL query I could run on the database to make sure the old records have been purged?

> While I've made efforts to set up a Galera cluster with a replica of your db
> and I try to debug an HA approach similar to yours, it's yet another
> framework/layer adding more complexity to the whole system. The config might
> be 5 lines, but the underlying interactions with the database are something
> that fall a bit outside our Slurm support scope.

I completely understand, no worries. My main point is that Galera is an easy way to get HA and replication for a MariaDB/MySQL database, and that it could be an interesting enhancement to support this config for user sites.

Cheers,
-- 
Kilian
Comment 48 Alejandro Sanchez 2019-01-16 10:05:18 MST
(In reply to Kilian Cavalotti from comment #47)
> I believe that the last purge cycle ran without any issue yes: the archive
> files have been created on Jan 1st, and no ER_LOCK_WAIT_TIMEOUT has been
> logged by slurmdbd.

Not sure if you had DebugFlags=DB_ARCHIVE set by Jan 1st, but if so, and if you still have the logs from that day, you could inspect the messages related to purge/archive.
 
> Is there a SQL query I could run on the database to make sure the old
> records have been purged?

If you had Purge[Job|Step]After=6months configured by Jan 1st, then I guess the SQL query could be something like this, assuming I've properly calculated the Unix timestamp 6 months before Jan 1st in Stanford's timezone...

alex@polaris:~$ TZ='America/Tijuana' date --date="2018-07-01 00:00:00" +"%s"
1530428400
alex@polaris:~$

select count(*) from sherlock_job_table where time_submit < 1530428400;

and

select count(*) from sherlock_step_table where time_start < 1530428400;

They should both report 0 (again, assuming the numbers are OK and a purge happened on Jan 1st).
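The `date` invocation above can be cross-checked in Python. This sketch hardcodes the fixed UTC-07:00 offset that America/Tijuana observes on that date, rather than doing a full timezone lookup:

```python
from datetime import datetime, timedelta, timezone

# 2018-07-01 00:00:00 at UTC-07:00 (Pacific Daylight Time)
cutoff = datetime(2018, 7, 1, tzinfo=timezone(timedelta(hours=-7)))
print(int(cutoff.timestamp()))  # 1530428400
```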
Comment 49 Kilian Cavalotti 2019-01-16 10:56:55 MST
(In reply to Alejandro Sanchez from comment #48)
> Not sure if you had DebugFlags=DB_ARCHIVE set by Jan 1st, but if so and if
> you still have the logs from that day you could inspect the messages related
> to purge/archive.

We unfortunately didn't keep the DebugFlags active.

> They both should report 0 (again assuming the numbers are OK and a purge
> happened on Jan 1st).

Thanks!
Here's what I got:

MariaDB [slurm_acct_db]> SELECT COUNT(*) from sherlock_job_table where time_submit < UNIX_TIMESTAMP('2018-07-01 00:00:00');
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (5.58 sec)

MariaDB [slurm_acct_db]> SELECT COUNT(*) from sherlock_step_table where time_start < UNIX_TIMESTAMP('2018-07-01 00:00:00');
+----------+
| COUNT(*) |
+----------+
|     1835 |
+----------+
1 row in set (36.32 sec)

So jobs are ok, but it looks like we have leftover step records.

Looking more closely at them, they all seem to have a time_start of 0, and they all seem to be pretty recent jobs. Maybe they're just recent step records that are in the process of being written to the DB?

If they're not, please let me know if I should worry about them. :)

Cheers,
-- 
Kilian
Comment 50 Alejandro Sanchez 2019-01-17 04:42:42 MST
A time_start of 0 for the steps is very strange. Can you please attach the output of the following query? Let's see if there's a common pattern that clearly identifies those steps with time_start 0, and potentially lets us reproduce.

select job.job_db_inx, job.mod_time, job.exit_code, job.state, job.time_submit, job.time_eligible, job.time_start, job.time_end, job.time_suspended, step.job_db_inx, step.deleted, step.exit_code, step.id_step, step.kill_requid, step.state, step.time_start, step.time_end, step.time_suspended from sherlock_job_table as job left outer join sherlock_step_table as step on job.job_db_inx=step.job_db_inx where step.time_start=0;
Comment 51 Kilian Cavalotti 2019-01-17 09:46:50 MST
Created attachment 8954 [details]
db query output

Hi Alejandro,

Thanks for looking into this. Here's the output of the SQL query.

Cheers,
-- 
Kilian
Comment 52 Alejandro Sanchez 2019-01-17 10:23:35 MST
Mmmm, interestingly, most of the steps with time_start 0 happen to belong to a job in state 8 (JOB_PREEMPTED) with exit_code 0 or, less frequently, in state 7 (JOB_NODE_FAIL) with exit_code 256. Let's see if I find something.
Comment 53 Alejandro Sanchez 2019-03-20 03:14:05 MDT
(In reply to Kilian Cavalotti from comment #47)
> I believe that the last purge cycle ran without any issue yes: the archive
> files have been created on Jan 1st, and no ER_LOCK_WAIT_TIMEOUT has been
> logged by slurmdbd.

Hey Kilian, given the timeout hasn't reappeared would you be fine moving forward and closing this bug? Thanks.
Comment 54 Kilian Cavalotti 2019-03-20 09:17:53 MDT
(In reply to Alejandro Sanchez from comment #53)
> (In reply to Kilian Cavalotti from comment #47)
> > I believe that the last purge cycle ran without any issue yes: the archive
> > files have been created on Jan 1st, and no ER_LOCK_WAIT_TIMEOUT has been
> > logged by slurmdbd.
> 
> Hey Kilian, given the timeout hasn't reappeared would you be fine moving
> forward and closing this bug? Thanks.

Absolutely, thanks!

Cheers,
-- 
Kilian