2339 – slurmdbd crash with er_lock_wait_timeout

Ticket 2339 - slurmdbd crash with er_lock_wait_timeout

Summary: slurmdbd crash with er_lock_wait_timeout

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Database (show other tickets)
Version:	15.08.4
Hardware:	Linux Linux

Severity:	2 - High Impact
Assignee:	Alejandro Sanchez
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2016-01-12 20:59 MST by Yann
Modified:	2016-01-13 19:15 MST (History)
CC List:	1 user (show)

See Also:
Site:	Université de Genève
Slinky Site:	---
Alineos Sites:	---
Atos/Eviden Sites:	---
Confidential Site:	---
Coreweave sites:	---
Cray Sites:	---
DS9 clusters:	---
Google sites:	---
HPCnow Sites:	---
HPE Sites:	---
IBM Sites:	---
NOAA SIte:	---
NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---
OCF Sites:	---
Recursion Pharma Sites:	---
SFW Sites:	---
SNIC sites:	---
Tzag Elita Sites:	---
Linux Distro:	---
Machine Name:
CLE Version:
Version Fixed:
Target Release:	---
DevPrio:	---
Emory-Cloud Sites:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this ticket.

Description Yann 2016-01-12 20:59:11 MST

Dear team,

When I launch slurmdbd, it stop after some time.

The two last entries in the log are here:


[2016-01-13T11:34:51.474] error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
update "baobab_step_table" set time_end=1451801815, state=3, kill_requid=-1, exit_code=0, user_sec=4345412, user_usec=266923, sys_sec=1325725, sys_usec=625070, max_disk_read=1742.529670, max_disk_read_task=0, max_disk_read_node=0, ave_disk_read=1742.529670, max_disk_write=5966.815731, max_disk_write_task=0, max_disk_write_node=0, ave_disk_write=5966.815731, max_vsize=1088148, max_vsize_task=0, max_vsize_node=0, ave_vsize=797464.000000, max_rss=162116, max_rss_task=0, max_rss_node=0, ave_rss=140488.000000, max_pages=5, max_pages_task=0, max_pages_node=0, ave_pages=5.000000, min_cpu=617, min_cpu_task=0, min_cpu_node=0, ave_cpu=495.880000, act_cpufreq=282348, consumed_energy=0 where job_db_inx=2266880 and id_step=0
[2016-01-13T11:34:51.474] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

What does that mean? Is there a bug or should I increase the lock timeout of mysql (what would be the option and time)?

I don't know if this is related but the slurmdbd log is filled with entries like this one:

[2016-01-13T11:33:55.376] error: We have more allocated time than is possible (49906800000 > 34705591200) for cluster baobab(9640442) from 2016-01-13T10:00:00 - 2016-01-13T11:00:00 tres 2



Thanks

Comment 1 Alejandro Sanchez 2016-01-12 21:05:52 MST

As the message implies, the database wasn't responding for a long time,  normally for 15 minutes. When this happens the only way to fix the issue is to restart the calling program,  this is the reason for the fatal.  I don't know if there is much we can do here since this is a mysql issue.

It is highly advised you run the slurmdbd on top to database instead of the slurmctld.  In addition to the other numerous advantages the slurmdbd offers it would prevent the slurmctld getting a fatal here.

Comment 2 Alejandro Sanchez 2016-01-12 21:10:18 MST

I see you already use slurmdbd on top to database. Is the database used for anything else? Please, check if there is anything locking the database. A common cause of this is mysqldump running or something.  Running mysqldump will sometimes cause all sorts of issues on a live database.  If this is the situation please make sure you have the following mysqldump options...

    --single-transaction
    --quick
    --lock-tables=false

Without them the database will lock up and you will get these kind of issues.  I am hoping this is the case, or something like it was locking the tables messing things up.

Comment 3 Yann 2016-01-12 23:01:41 MST

I'm using mysqldump but it wasn't running when the crash occured.

I think slurmdbd is the only active user of mysql.

In mysqld log, I saw this entry:

160101  0:01:51  InnoDB: ERROR: the age of the last checkpoint is 9433858,
InnoDB: which exceeds the log group capacity 9433498.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.

I don't know if this is a problem.

I have set the innodb_log_file_size to 64M as it wasn't set before.


I'll let you know if the problem still occurs.

Comment 4 Alejandro Sanchez 2016-01-12 23:46:15 MST

All right. Maybe you could also consider taking a look at these other specific mysql parameters:

innodb_log_buffer_size
innodb_buffer_pool_size

But let's see if the one that you modified solves the issue.

Comment 6 Yann 2016-01-13 18:24:46 MST

It seems that innodb_log_file_size set to 64M did the trick as no crash happened anymore.

Thanks

Comment 7 Alejandro Sanchez 2016-01-13 19:15:29 MST

Glad to hear it solved the issue. Closing the ticket, if you have more issues please file a new bug.