Dear team,
When I launch slurmdbd, it stops after some time. The last two entries in the log are:
    [2016-01-13T11:34:51.474] error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction
    update "baobab_step_table" set time_end=1451801815, state=3, kill_requid=-1, exit_code=0, user_sec=4345412, user_usec=266923, sys_sec=1325725, sys_usec=625070, max_disk_read=1742.529670, max_disk_read_task=0, max_disk_read_node=0, ave_disk_read=1742.529670, max_disk_write=5966.815731, max_disk_write_task=0, max_disk_write_node=0, ave_disk_write=5966.815731, max_vsize=1088148, max_vsize_task=0, max_vsize_node=0, ave_vsize=797464.000000, max_rss=162116, max_rss_task=0, max_rss_node=0, ave_rss=140488.000000, max_pages=5, max_pages_task=0, max_pages_node=0, ave_pages=5.000000, min_cpu=617, min_cpu_task=0, min_cpu_node=0, ave_cpu=495.880000, act_cpufreq=282348, consumed_energy=0 where job_db_inx=2266880 and id_step=0
    [2016-01-13T11:34:51.474] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
What does that mean? Is this a bug, or should I increase the MySQL lock timeout (and if so, which option and what value)?
I don't know if this is related, but the slurmdbd log is filled with entries like this one:
    [2016-01-13T11:33:55.376] error: We have more allocated time than is possible (49906800000 > 34705591200) for cluster baobab(9640442) from 2016-01-13T10:00:00 - 2016-01-13T11:00:00 tres 2
Thanks
As the message implies, the database wasn't responding for a long time, normally 15 minutes. When this happens, the only way to recover is to restart the calling program, which is why the error is fatal. I don't know if there is much we can do here, since this is a MySQL issue. It is highly advised that you run slurmdbd on top of the database instead of having slurmctld talk to it directly. In addition to the numerous other advantages slurmdbd offers, it would prevent slurmctld from hitting a fatal error here.
I see you already run slurmdbd on top of the database. Is the database used for anything else? Please check whether anything is locking the database. A common cause is a running mysqldump, which can cause all sorts of issues on a live database. If that is the situation, please make sure you pass the following mysqldump options: --single-transaction --quick --lock-tables=false. Without them the database will lock up and you will get these kinds of issues. I am hoping this is the case, or that something similar was locking the tables and messing things up.
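As a rough illustration of the options above, here is a minimal Python sketch that builds a mysqldump command line containing them. The helper names (`build_dump_command`, `dump_database`) and the database name `slurm_acct_db` are assumptions made for the example, not something taken from this ticket; substitute whatever database your slurmdbd is configured to use.

```python
import subprocess

# The three options recommended above for dumping a live database.
SAFE_DUMP_OPTIONS = ["--single-transaction", "--quick", "--lock-tables=false"]


def build_dump_command(database):
    """Return the mysqldump argument list for a dump that avoids table locks."""
    return ["mysqldump", *SAFE_DUMP_OPTIONS, database]


def dump_database(database, outfile):
    """Run mysqldump and write its output to outfile.

    Requires mysqldump on PATH and working client credentials
    (for example in ~/.my.cnf); raises CalledProcessError on failure.
    """
    with open(outfile, "wb") as out:
        subprocess.run(build_dump_command(database), stdout=out, check=True)


if __name__ == "__main__":
    # "slurm_acct_db" is an assumed database name for illustration only.
    print(" ".join(build_dump_command("slurm_acct_db")))
```

Only `build_dump_command` is needed to see the resulting command line; `dump_database` actually runs it and so depends on the local MySQL installation.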
I'm using mysqldump, but it wasn't running when the crash occurred. I think slurmdbd is the only active user of MySQL. In the mysqld log, I saw this entry:
    160101 0:01:51 InnoDB: ERROR: the age of the last checkpoint is 9433858,
    InnoDB: which exceeds the log group capacity 9433498.
    InnoDB: If you are using big BLOB or TEXT rows, you must set the
    InnoDB: combined size of log files at least 10 times bigger than the
    InnoDB: largest such row.
I don't know if this is a problem. innodb_log_file_size wasn't set before, so I have now set it to 64M. I'll let you know if the problem still occurs.
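For readers wondering what that change looks like in practice, the sketch below renders the corresponding `[mysqld]` fragment using Python's standard `configparser`. The helper name `render_mysqld_fragment` is hypothetical, and where the fragment belongs (for example /etc/my.cnf) depends on the installation. mysqld has to be restarted for the setting to take effect, and depending on the MySQL version the old redo log files may need to be removed after a clean shutdown first.

```python
import configparser
import io


def render_mysqld_fragment(settings):
    """Render a [mysqld] section containing the given option/value pairs."""
    parser = configparser.ConfigParser()
    parser["mysqld"] = settings
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # 64M is the value reported in this thread; it is not a general recommendation.
    print(render_mysqld_fragment({"innodb_log_file_size": "64M"}))
```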
All right. You could also take a look at these other MySQL parameters: innodb_log_buffer_size and innodb_buffer_pool_size. But let's first see whether the one you modified solves the issue.
It seems that setting innodb_log_file_size to 64M did the trick, as no crash has happened since. Thanks
Glad to hear it solved the issue. Closing the ticket; if you have more issues, please file a new bug.