| Summary: | Deadlock found when trying to get lock | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | PDT Partners <customer-pdt> |
| Component: | slurmdbd | Assignee: | Director of Support <support> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | nate |
| Version: | 18.08.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7161 | ||
| Site: | PDT Partners | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmdbd log | ||
Please run this in mysql:
> SHOW ENGINE INNODB STATUS;
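For error 1213, the part of that output that usually matters is the section headed `LATEST DETECTED DEADLOCK`. The following is a minimal sketch, assuming the status text has already been saved to a string and that sections are delimited by lines of dashes as in standard InnoDB monitor output. The function name `extract_deadlock_section` is hypothetical, not part of Slurm or MySQL.

```python
import re

# A section rule in InnoDB monitor output is a line made only of dashes.
_RULE = re.compile(r"^-{3,}$")


def extract_deadlock_section(status_text):
    """Return the body of the LATEST DETECTED DEADLOCK section, or None.

    Assumes the usual layout: a dashed rule, the section title,
    another dashed rule, then the body up to the next dashed rule.
    """
    lines = status_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "LATEST DETECTED DEADLOCK":
            break
    else:
        return None  # no deadlock has been recorded in this output

    start = i + 1
    # Skip the dashed rule that closes the section title, if present.
    if start < len(lines) and _RULE.match(lines[start].strip()):
        start += 1

    body = []
    for line in lines[start:]:
        if _RULE.match(line.strip()):
            break  # the next section's opening rule ends this section
        body.append(line)
    return "\n".join(body).strip()
```

The returned text lists the transactions and locks involved in the most recent deadlock, which is the piece worth attaching to the ticket.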
(In reply to PDT Partners from comment #1)
> Sorry, by mistake I submitted the same case twice. Please delete the case
> 8105. I will upload logs here.

*** This ticket has been marked as a duplicate of ticket 8106 ***
Created attachment 12324: slurmdbd log

Hello,

This case is for our cluster cloud. We have slurmdbd set up on AWS instances, and the database is set up on AWS RDS instances (aurora-mysql). Out of nowhere we are seeing issues with the primary slurmdbd. Yesterday slurmdbd stopped responding and failed to work after the service was restarted. It kept saying it couldn't talk to the database, but I was able to connect to the database from the slurmdbd server as the slurm user. Upon checking show processlist, it looked like slurmdbd was trying to work on all the tables, so it took a while for it to start. Luckily, in the meantime I was able to get the secondary slurmdbd working. I am afraid that this could happen to the secondary as well, and then we'll be in big trouble. We depend heavily on slurmdbd for our work.

Can you please let us know why this could happen and how we can fix it? This is what I found upon checking the logs:

----------------------------------
from slurmdbd.log
----------------------------------
[2019-11-14T08:00:17.981] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction delete quick from "stgtfhpcdevc5c7e0_assoc_table" where lft between 1 AND 8;UPDATE "stgtfhpcdevc5c7e0_assoc_table" SET rgt = rgt - 8 WHERE rgt > 8;UPDATE "stgtfhpcdevc5c7e0_assoc_table" SET lft = lft - 8 WHERE lft > 8;
[2019-11-14T08:00:17.981] error: couldn't remove assoc
[2019-11-14T08:00:17.982] error: CONN:37 No error
[2019-11-14T08:00:18.585] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction delete quick from "stgtfhpcdev879f65_assoc_table" where lft between 1 AND 6;UPDATE "stgtfhpcdev879f65_assoc_table" SET rgt = rgt - 6 WHERE rgt > 6;UPDATE "stgtfhpcdev879f65_assoc_table" SET lft = lft - 6 WHERE lft > 6;
[2019-11-14T08:00:18.585] error: couldn't remove assoc
[2019-11-14T08:00:18.586] error: CONN:37 No error
---------------------------------------------------
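The failing statements in the log above look like the usual nested-set removal pattern: delete every row whose `lft` falls inside the removed range, then shift `rgt` and `lft` of the remaining rows left by the width of that range. The following is a minimal in-memory sketch of that arithmetic, assuming rows are plain dictionaries with `lft` and `rgt` keys. The function `remove_subtree` and the sample rows are hypothetical illustrations and are not taken from slurmdbd.

```python
def remove_subtree(rows, lft, rgt):
    """Mimic the three logged statements on a list of dict rows.

    1. delete ... where lft between <lft> AND <rgt>
    2. UPDATE ... SET rgt = rgt - width WHERE rgt > <rgt>
    3. UPDATE ... SET lft = lft - width WHERE lft > <rgt>
    """
    width = rgt - lft + 1  # 8 for "lft between 1 AND 8"
    # Step 1: drop every row inside the removed range (copy the rest).
    kept = [dict(r) for r in rows if not (lft <= r["lft"] <= rgt)]
    for r in kept:
        # Step 2: close the gap on the right-hand bounds.
        if r["rgt"] > rgt:
            r["rgt"] -= width
        # Step 3: close the gap on the left-hand bounds.
        if r["lft"] > rgt:
            r["lft"] -= width
    return kept


# Hypothetical rows: one subtree spanning 1..8 and a second spanning 9..12.
rows = [
    {"name": "a", "lft": 1, "rgt": 8},
    {"name": "b", "lft": 2, "rgt": 3},
    {"name": "c", "lft": 4, "rgt": 7},
    {"name": "d", "lft": 5, "rgt": 6},
    {"name": "e", "lft": 9, "rgt": 12},
    {"name": "f", "lft": 10, "rgt": 11},
]
remaining = remove_subtree(rows, 1, 8)
```

Because the two UPDATE statements touch every row to the right of the removed range, two sessions running this pattern against the same table at once can plausibly end up waiting on each other's row locks, which would be consistent with the 1213 error reported here.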
---------------------------------------------------
show processlist
---------------------------------------------------
MySQL [(none)]> show processlist;
+--------+----------+--------------------+---------------+---------+------+------------------------+------------------------------------------------------------------------------------------------------+
| Id     | User     | Host               | db            | Command | Time | State                  | Info                                                                                                 |
+--------+----------+--------------------+---------------+---------+------+------------------------+------------------------------------------------------------------------------------------------------+
| 626814 | rdsadmin | localhost          | NULL          | Sleep   |    0 | delayed send ok done   | NULL                                                                                                 |
| 626815 | rdsadmin | localhost          | NULL          | Sleep   |    0 | delayed send ok done   | NULL                                                                                                 |
| 626817 | rdsadmin | localhost          | NULL          | Sleep   |    5 | cleaned up             | NULL                                                                                                 |
| 626825 | rdsadmin | localhost          | NULL          | Sleep   |  205 | delayed send ok done   | NULL                                                                                                 |
| 762520 | slurm    | xxxxxxxx:37140     | slurm_acct_db | Query   |    0 | Sending data           | select table_name from table_defs_table where definition='alter table \"stgtfhpcdev2824b9_job_table\ |
| 762794 | slurm    | xxxxxxxxxx:37172   | NULL          | Query   |    0 | init                   | show processlist                                                                                     |
+--------+----------+--------------------+---------------+---------+------+------------------------+------------------------------------------------------------------------------------------------------+