Ticket 5400

Summary:	slurmdbd stuck waiting on table metadata lock
Product:	Slurm	Reporter:	Ryan Day <day36>
Component:	Database	Assignee:	Marshall Garey <marshall>
Status:	RESOLVED FIXED	QA Contact:
Severity:	3 - Medium Impact
Priority:	---
Version:	17.11.8
Hardware:	Linux
OS:	Linux
Site:	LLNL	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	18.08.0-pre2
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Ryan Day 2018-07-09 17:43:11 MDT

We're having problems where commands that query the slurmdbd take a really long time or fail. e.g.

[day36@pascal82:~]$ time sacctmgr show cluster
sacctmgr: error: slurmdbd: Getting response to message type 1412
sacctmgr: error: slurmdbd: DBD_GET_CLUSTERS failure: No error
 Problem with query.

real	15m0.222s
user	0m0.004s
sys	0m0.008s
[day36@pascal82:~]

I'm not seeing anything in the slurmdbd or mysqld logs, but eventually get errors about innodb lock timeouts, and the 15m that the sacctmgr command takes to timeout is the same as our innodb lock timeout. 'show engine innodb status' also shows a bunch of transactions Waiting on table metadata lock:

TRANSACTIONS
------------
Trx id counter 30CE066
Purge done for trx's n:o < 30B8909 undo n:o < 0
History list length 33629
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 1003308, OS thread handle 0x7fff931df700, query id 401257618 localhost root
show engine innodb status
---TRANSACTION 30CBE65, not started
mysql tables in use 1, locked 1
MySQL thread id 1003276, OS thread handle 0x7fff9334c700, query id 401242781 localhost slurm Waiting for table metadata lock
alter table "quartz_assoc_table" AUTO_INCREMENT=0
---TRANSACTION 30CE05E, not started
MySQL thread id 998193, OS thread handle 0x7fff93cbc700, query id 401257595 localhost slurm
---TRANSACTION 30CE061, not started
MySQL thread id 998190, OS thread handle 0x7fff935dd700, query id 401257613 localhost slurm
---TRANSACTION 30CE064, not started
MySQL thread id 998192, OS thread handle 0x7fff93626700, query id 401257617 localhost slurm
---TRANSACTION 30CDC52, not started
MySQL thread id 998191, OS thread handle 0x7fff934b9700, query id 401256014 localhost slurm
---TRANSACTION 2E8323A, not started
MySQL thread id 998183, OS thread handle 0x7fff92f05700, query id 401251601 localhost slurm
---TRANSACTION 30CE05B, ACTIVE 14 sec
3 lock struct(s), heap size 376, 12 row lock(s), undo log entries 2
MySQL thread id 1003307, OS thread handle 0x7ffff408f700, query id 401257591 localhost slurm Waiting for table metadata lock
select user from "quartz_assoc_table" where id_assoc=856
Trx read view will not see trx with id >= 30CE05C, sees < 30B8904
---TRANSACTION 30CDBAD, ACTIVE 330 sec
MySQL thread id 1003305, OS thread handle 0x7fff92fe0700, query id 401255605 localhost slurm Waiting for table metadata lock
select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.`partition`, t1.shares, t1.grp_tres_mins, t1.grp_tres_run_mins, t1.grp_tres, t1.grp_jobs, t1.grp_submit_jobs, t1.grp_wall, t1.max_tres_mins_pj, t1.max_tres_run_mins, t1.max_tres_pj, t1.max_tres_pn, t1.max_jobs, t1.max_submit_jobs, t1.max_wall_pj, t1.parent_acct, t1.def_qos_id, t1.qos, t1.delta_qos, t1.is_def from "quartz_assoc_table" as t1 where t1.deleted=0 && (t1.acct='root') && (t1.user='') order by lft
Trx read view will not see trx with id >= 30CDBAE, sees < 30B8904

<snip>

---TRANSACTION 30CBF91, ACTIVE 2655 sec
MySQL thread id 1003277, OS thread handle 0x7fff9e753700, query id 401243226 localhost slurm Waiting for table metadata lock
select distinct t1.id_assoc, t1.lft, t1.rgt, t1.user, t1.acct, t1.`partition`, t1.shares, t1.grp_tres_mins, t1.grp_tres_run_mins, t1.grp_tres, t1.grp_jobs, t1.grp_submit_jobs, t1.grp_wall, t1.max_tres_mins_pj, t1.max_tres_run_mins, t1.max_tres_pj, t1.max_tres_pn, t1.max_jobs, t1.max_submit_jobs, t1.max_wall_pj, t1.parent_acct, t1.def_qos_id, t1.qos, t1.delta_qos, t1.is_def from "quartz_assoc_table" as t1 where t1.deleted=0 && (t1.acct='root') && (t1.user='') order by lft
Trx read view will not see trx with id >= 30CBF92, sees < 30B8904
---TRANSACTION 30CBE6A, ACTIVE 2714 sec
MySQL thread id 998194, OS thread handle 0x7fff93029700, query id 401242787 localhost slurm Waiting for table metadata lock
select user from "quartz_assoc_table" where id_assoc=2112
Trx read view will not see trx with id >= 30CBE6B, sees < 30B8904
---TRANSACTION 30C9744, ACTIVE 5586 sec
MySQL thread id 1003254, OS thread handle 0x7fff93793700, query id 401102571 localhost slurm
Trx read view will not see trx with id >= 30C9745, sees < 30B8904
---TRANSACTION 30B8904, ACTIVE 26892 sec
MySQL thread id 1003137, OS thread handle 0x7fff93900700, query id 400226310 localhost slurm
Trx read view will not see trx with id >= 30B8905, sees < 30B8905

Restarting the slurmdbd gets sacctmgr going again.

We have the following settings for innodb:

[mysqld]
innodb_buffer_pool_size=1G
innodb_log_file_size=10M
innodb_log_files_in_group=8
innodb_lock_wait_timeout=900

Do you have any recommendations for fixing this? Is it just a matter of increasing the innodb_buffer_pool_size and if so do you have any recommendations for how big to make it?

Comment 2 Marshall Garey 2018-07-11 11:03:07 MDT

I'm wondering if the pending queries requested a very large amount of data and are bogging the database down. Killing those has solved the problem for others.

In 17.11, we added a MaxQueryTimeRange parameter to slurmdbd.conf, so administrators can limit the time range of a query. That may be of interest to you whenever you upgrade.

Comment 3 Marshall Garey 2018-07-11 11:05:18 MDT

Since your restart of the slurmdbd and database, have you seen this problem occur again?

Comment 4 Ryan Day 2018-07-12 11:36:30 MDT

It's been coming up pretty regularly. Always seems to be late in the afternoon. I'll dig around and see if some user has something abusive in a crontab or something. Is there a good way to get more information about where the pending queries are coming from? I guess a number of them are referencing quartz's tables.

Comment 5 Marshall Garey 2018-07-12 11:57:25 MDT

(In reply to Ryan Day from comment #4)
> It's been coming up pretty regularly. Always seems to be late in the
> afternoon. I'll dig around and see if some user has something abusive in a
> crontab or something. Is there a good way to get more information about
> where the pending queries are coming from? I guess a number of them are
> referencing quartz's tables.

You can use

sacctmgr show transaction

(documented in the man page). It doesn't show exactly what the query was, but it shows what it was doing, where it came from, who issued it, when it was issued, and some extra information.

You can limit the time period you want to see with "start=" and "end=". You can also specify the format, which is useful if you want to expand a field (such as info) where the width isn't enough to show all the information.

sacctmgr show transaction start=7/1/18 end=7/12/18 format=time,action,actor,where,info%100

Comment 6 Marshall Garey 2018-07-17 10:08:10 MDT

Have you been able to identify the offending queries? Are you still seeing this come up?


17.11.8 is due to be released this week. 18.08 is due to be released next month. When that happens, 17.02 will no longer be supported, so we encourage you to make a plan to upgrade soon. Both versions have the MaxQueryTimeRange parameter for slurmdbd.conf, which should help with this issue. If you're mainly concerned about stability, I'd encourage you to upgrade to 17.11.

Comment 7 Ryan Day 2018-07-20 14:35:14 MDT

Unfortunately, I haven't caught this in the act since filing the bug. I do have 17.11 running in our test environment and plan to get it out in production around when you release 18.08.

Comment 8 Marshall Garey 2018-07-20 14:44:04 MDT

In that case, I'm going to close this as resolved/infogiven for now. If you have problems again, feel free to re-open the ticket.

Comment 9 Ryan Day 2018-08-09 14:15:27 MDT

I'm re-opening this as I'm now able to reproduce it on 17.02.10 and 17.11.8

Our accounts staff (the folks who add / remove users from accounts, etc) tend to start sacctmgr sessions and leave them running. It appears that, under certain conditions, the mariadb is not releasing table metadata locks until the sacctmgr session is exited. E.g. I start two sacctmgr sessions. In the first, I add an account and then show a user:

sacctmgr: add account guests parent=overhead cl=opal
 Associations
  A = guests     C = opal      
 Settings
  Parent        = overhead
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
sacctmgr: show user day36 witha
      User   Def Acct     Admin    Cluster    Account  Partition     Share MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS 
---------- ---------- --------- ---------- ---------- ---------- --------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- 
     day36         lc Administ+      borax         lc                    1                               10000                               normal,standby           
     day36         lc Administ+   catalyst         lc                    1                               10000                         expedite,normal,sta+           
     day36         lc Administ+      flash         lc                    1                               10000                               normal,standby           
     day36         lc Administ+       opal         lc                    1                                                                   normal,standby           
     day36         lc Administ+     pascal         lc                    1                               10000                               normal,standby           
     day36         lc Administ+     quartz         lc                    1                               10000                         expedite,normal,sta+           
sacctmgr:

In the second session, I add a user to that account, which works fine, then delete a user from that account, which hangs:

[day36@opal185:~]$ sacctmgr
sacctmgr: add user day36 account=guests cl=opal
 Associations =
  U = day36     A = guests     C = opal      
 Non Default Settings
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
sacctmgr: delete user day36 account=guests cl=opal
 Deleting user associations...
  C = opal       A = guests     U = day36    
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
<hangs>

Looking at the db, the sacctmgr process that is running the delete (23626) is 'waiting for table metadata lock':
[root@elsie2:~]# mysqladmin proc stat
+-------+-------+-----------+---------------+---------+------+---------------------------------+-------------------------------------------------+----------+
| Id    | User  | Host      | db            | Command | Time | State                           | Info                                            | Progress |
+-------+-------+-----------+---------------+---------+------+---------------------------------+-------------------------------------------------+----------+
| 19    | slurm | localhost | slurm_acct_db | Sleep   | 159  |                                 |                                                 | 0.000    |
| 14243 | slurm | localhost | slurm_acct_db | Sleep   | 0    |                                 |                                                 | 0.000    |
| 23624 | slurm | localhost | slurm_acct_db | Sleep   | 39   |                                 |                                                 | 0.000    |
| 23626 | slurm | localhost | slurm_acct_db | Query   | 8    | Waiting for table metadata lock | alter table "opal_assoc_table" AUTO_INCREMENT=0 | 0.000    |
| 23635 | root  | localhost |               | Query   | 0    |                                 | show processlist                                | 0.000    |
+-------+-------+-----------+---------------+---------+------+---------------------------------+-------------------------------------------------+----------+
Uptime: 1740759  Threads: 5  Questions: 2100929  Slow queries: 6  Opens: 203  Flush tables: 2  Open tables: 164  Queries per second avg: 1.206
[root@elsie2:~]#

The delete in the second sacctmgr session completes as soon as I exit the first sacctmgr session, which leads me to believe that the lock is not released until the sacctmgr exits. Is this expected behavior? The accounts specialists say that they've used sacctmgr this way for a long time, and it never used to lead to problems.

Comment 10 Marshall Garey 2018-08-09 15:15:35 MDT

Thanks for the detailed instructions. I was able to reproduce it just like you described on both Debian and CentOS. I'll look into this further. I'm not sure if it's expected behavior or not, but my gut says it's not.

Comment 12 Marshall Garey 2018-08-21 11:11:36 MDT

This is definitely a bug. But it doesn't exist on the 18.08 branch (18.08 will be released by the end of the month). I haven't yet identified which commit fixes it.

I also have found that reproducing it is as simple as this:

- have 2 sacctmgr interactive sessions open
- in one
  * show users
- in the other
  * add user account=<accountname>
  * del user account=<accountname>

So I think it's something wrong with show users. Since it has already been fixed, I'm just trying to identify the commit that fixed it.

Comment 15 Marshall Garey 2018-08-21 17:53:00 MDT

This commit solves the problem.

https://github.com/schedmd/slurm/commit/4a16541bf0e

My guess is that sacctmgr show users is holding metadata locks that prevent AUTO_INCREMENT from being changed, and prior to that commit, deleting a user would reset AUTO_INCREMENT to 0 for the association table, but it can't delete the user because of show users. This commit fixes that issue.

We only put this into 18.08 because it's a change in behavior that we didn't want to introduce to 17.11, but you should be fine to apply this directly to 17.11 if you want.

Does this solution work for you?

Comment 16 Ryan Day 2018-08-22 10:20:49 MDT

That sounds good. I'll apply it to 17.11 and get it onto our test system to make sure it works for us. You can probably close this though and I'll re-open if I run into more problems.

(In reply to Marshall Garey from comment #15)
> This commit solves the problem.
> 
> https://github.com/schedmd/slurm/commit/4a16541bf0e
> 
> My guess is that sacctmgr show users is holding metadata locks that prevent
> AUTO_INCREMENT from being changed, and prior to that commit, deleting a user
> would reset AUTO_INCREMENT to 0 for the association table, but it can't
> delete the user because of show users. This commit fixes that issue.
> 
> We only put this into 18.08 because it's a change in behavior that we didn't
> want to introduce to 17.11, but you should be fine to apply this directly to
> 17.11 if you want.
> 
> Does this solution work for you?

Comment 17 Marshall Garey 2018-08-22 10:24:42 MDT

Okay. Closing as resolved/fixed in 18.08

https://github.com/schedmd/slurm/commit/4a16541bf0e