| Summary: | slurmdbd crash with er_lock_wait_timeout | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Yann <yann.sagon> |
| Component: | Database | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | alex |
| Version: | 15.08.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Université de Genève | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Yann
2016-01-12 20:59:11 MST
As the message implies, the database wasn't responding for a long time, normally for 15 minutes. When this happens, the only way to fix the issue is to restart the calling program, which is the reason for the fatal. I don't know if there is much we can do here, since this is a MySQL issue. It is highly advised that you run slurmdbd on top of the database rather than connecting slurmctld to it directly. In addition to the numerous other advantages slurmdbd offers, it would prevent slurmctld from getting a fatal here.

I see you already use slurmdbd on top of the database. Is the database used for anything else? Please check whether anything is locking the database. A common cause of this is a running mysqldump or something similar. Running mysqldump will sometimes cause all sorts of issues on a live database. If this is the situation, please make sure you use the following mysqldump options:
--single-transaction
--quick
--lock-tables=false
Without them the database will lock up and you will get these kinds of issues. I am hoping this is the case, or that something like it was locking the tables and messing things up.
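The options above can be combined into a single invocation. What follows is a minimal sketch, not a command taken from this ticket: the database name `slurm_acct_db` and the output file name are assumptions, so substitute the database your slurmdbd actually uses. The script only prints the command; the real invocation is left commented out.

```shell
# Assumed names, not from this ticket: adjust to your site.
DB_NAME="slurm_acct_db"
OUT_FILE="slurm_acct_db.sql"

# The three options recommended above.
DUMP_CMD="mysqldump --single-transaction --quick --lock-tables=false $DB_NAME"

# Dry run: show what would be executed.
echo "$DUMP_CMD > $OUT_FILE"

# Once the printed command looks right, run it for real:
# $DUMP_CMD > "$OUT_FILE"
```

Note that `--single-transaction` gives a consistent snapshot only for transactional tables such as InnoDB.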
I'm using mysqldump, but it wasn't running when the crash occurred. I think slurmdbd is the only active user of MySQL. In the mysqld log, I saw this entry:

160101 0:01:51 InnoDB: ERROR: the age of the last checkpoint is 9433858,
InnoDB: which exceeds the log group capacity 9433498.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.

I don't know if this is a problem. I have set innodb_log_file_size to 64M, as it wasn't set before. I'll let you know if the problem still occurs.

All right. Maybe you could also consider taking a look at these other MySQL parameters:
innodb_log_buffer_size
innodb_buffer_pool_size
But let's see if the one that you modified solves the issue.

It seems that setting innodb_log_file_size to 64M did the trick, as no crash has happened since. Thanks.

Glad to hear it solved the issue. Closing the ticket; if you have more issues, please file a new bug. |
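The two numbers in the InnoDB message can be compared directly to see how far the checkpoint age overshot the redo log capacity. This is a minimal sketch using the log entry quoted in this ticket; the variable names are illustrative, and the final query is printed rather than executed because running it assumes a working `mysql` client and credentials.

```shell
# Log entry quoted in this ticket (first two lines joined).
LOG_LINE='160101 0:01:51 InnoDB: ERROR: the age of the last checkpoint is 9433858, InnoDB: which exceeds the log group capacity 9433498.'

# Pull out the two byte counts.
age=$(printf '%s\n' "$LOG_LINE" | sed -n 's/.*last checkpoint is \([0-9][0-9]*\).*/\1/p')
capacity=$(printf '%s\n' "$LOG_LINE" | sed -n 's/.*log group capacity \([0-9][0-9]*\).*/\1/p')

overshoot=$((age - capacity))
echo "checkpoint age exceeds log group capacity by $overshoot bytes"

# Query to inspect the parameters discussed in this ticket.
QUERY="SHOW VARIABLES WHERE Variable_name IN ('innodb_log_file_size', 'innodb_log_buffer_size', 'innodb_buffer_pool_size');"
echo "mysql -e \"$QUERY\""
```

For the entry above the overshoot is 360 bytes, which suggests the redo log was only marginally too small at that moment and is consistent with a larger innodb_log_file_size resolving the problem.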