| Summary: | Slurmdbd hangs - DETECTED DEADLOCK | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Damien <damien.leong> |
| Component: | slurmdbd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 17.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Monash University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Damien, this sounds like a duplicate of bug 5400 (see bug 5400 comments 15,16,17), particularly since I see it's waiting to set AUTO_INCREMENT=0. This was originally reported on this post: https://lists.schedmd.com/pipermail/slurm-users/2018-February/000569.html You may apply this commit (it's only in 18.08, but should apply cleanly to 17.11) to fix it: https://github.com/schedmd/slurm/commit/4a16541bf0e Or you may use the workaround described by Jessica: set innodb_autoinc_lock_mode=2 Does this work for you? Have you had a chance to try a suggestion in comment 1? Hi Marshall Firstly thanks for your replies. We are leaning toward using "set innodb_autoinc_lock_mode=2" but unable to find more details on this. How this mechanism resolve locks ? Cheers Damien The mysql documentation describes it very well. (I've linked mysql 5.7 - you'd want to look at whatever version of mysql or mariadb you're using.) https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_autoinc_lock_mode "The lock mode to use for generating auto-increment values." https://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html#innodb-auto-increment-lock-modes Setting innodb_autoinc_lock_mode to 2 essentially just loosens up locks on auto increment. The document notes that auto increment values may not be consecutive in this mode - that is just fine. Note that innodb_autoinc_lock_mode was introduced in mysql 5.1. If you're using anything older than that, you'll need to upgrade mysql or mariadb. Note that you do not need this once you upgrade to 18.08, because of commit 4a16541bf0e. Can you verify that setting innodb_autoinc_lock_mode to 2 solves the issue for you? Have you had a chance to test this? Hi Marshall Thanks for your reply. We will put "set innodb_autoinc_lock_mode=2" in mariaDB. So far, this problem has not reoccurred. Cheers Damien Sounds good. I'm closing this ticket as infogiven. Please re-open it or file a new bug if you experience this issue again. |
Good Afternoon Slurm Support Last week, there was a case where the slurmdbd process hung and there were deadlocks in the corresponding MariaDB-SQL. We were trying to mass-change(using sacctmgr) a whole lot of user-accounts to a different default-accounts and remove their account=default. Our primary slurmdbd is at m3-mgmt2, while the backup slurmdbd is at m3-mgmt1 These are the observations: 180925 15:46:22 *** (1) TRANSACTION: TRANSACTION 265DBE7, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 7 lock struct(s), heap size 1248, 5 row lock(s), undo log entries 3 MySQL thread id 436243, OS thread handle 0x7f965fcdc700, query id 63356859 m3-mgmt2.massive.org.au 172.16.200.79 slurmdb Updating UPDATE "m3_assoc_table" SET rgt = rgt+2 WHERE rgt > 437 && deleted < 2 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 0 page no 723 n bits 256 index "PRIMARY" of table "slurm_acct_db"."m3_assoc_table" trx id 265DBE7 lock_mode X waiting *** (2) TRANSACTION: TRANSACTION 265DBE8, ACTIVE 0 sec fetching rows mysql tables in use 1, locked 1 16 lock struct(s), heap size 3112, 1787 row lock(s), undo log entries 1560 MySQL thread id 436244, OS thread handle 0x7f96b8285700, query id 63356858 m3-mgmt2.massive.org.au 172.16.200.79 slurmdb Updating UPDATE "m3_assoc_table" SET rgt = rgt+2 WHERE rgt > 437 && deleted < 2 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 0 page no 723 n bits 256 index "PRIMARY" of table "slurm_acct_db"."m3_assoc_table" trx id 265DBE8 lock_mode X *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 0 page no 1284 n bits 264 index "PRIMARY" of table "slurm_acct_db"."m3_assoc_table" trx id 265DBE8 lock_mode X waiting *** WE ROLL BACK TRANSACTION (1) ------------ Slurmdbd stuck: | 713991 | slurmdb | m3-mgmt2.massive.org.au:48916 | slurm_acct_db | Query | 6351 | Waiting for table metadata lock | alter table "m3_assoc_table" AUTO_INCREMENT=0 | 0.000 | | 713993 | slurmdb | m3-mgmt2.massive.org.au:48924 | slurm_acct_db | Sleep | 6350 | | NULL | 0.000 | | 713995 | slurmdb | m3-mgmt2.massive.org.au:48928 | slurm_acct_db | Query | 6350 | Waiting for table metadata lock | select distinct t1.lft, t1.rgt from "m3_assoc_table" as t1 where t1.deleted=0 && (t1.acct='default') | 0.000 | MariaDB [(none)]> SHOW ENGINE INNODB STATUS; +--------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | InnoDB | | ===================================== 181001 17:57:57 INNODB MONITOR OUTPUT ===================================== Per second averages calculated from the last 51 seconds ----------------- LOG --- Log sequence number 37410223571 Log flushed up to 37410223571 Last checkpoint at 37410219174 Max checkpoint age 7782360 Checkpoint age target 7539162 Modified age 4397 Checkpoint age 4397 0 pending log writes, 0 pending chkp writes 3075173 log i/o's done, 0.53 log i/o's/second ---------------------- BUFFER POOL AND MEMORY ---------------------- Total memory allocated 137756672; in additional pool allocated 0 Total memory allocated by read views 656 Internal hash tables (constant factor + variable factor) Adaptive hash index 2807408 (2213368 + 594040) Page hash 139112 (buffer pool 0 only) Dictionary cache 890448 (554768 + 335680) File system 83536 (82672 + 864) Lock system 337488 (332872 + 4616) Recovery system 0 (0 + 0) Dictionary memory allocated 335680 Buffer pool size 8191 Buffer pool size, bytes 134201344 Free buffers 1 Database pages 8154 Old database pages 2989 Modified db pages 8 Pending reads 0 Pending writes: LRU 0, flush list 0, single page 0 Pages made young 10918339, not young 0 0.04 youngs/s, 0.00 non-youngs/s Pages read 10959892, created 91918, written 7785310 0.00 reads/s, 0.00 creates/s, 2.37 writes/s Buffer pool hit rate 1000 / 1000, young-making rate 1 / 1000 not 0 / 1000 Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s LRU len: 8154, unzip_LRU len: 0 I/O sum[121]:cur[0], unzip sum[0]:cur[0] -------------- ROW OPERATIONS -------------- 0 queries inside InnoDB, 0 queries in queue 2 read views open inside InnoDB 1 transactions active inside InnoDB 1 out of 1000 descriptors used ---OLDEST VIEW--- Normal read view Read view low limit trx n:o 276C905 Read view up limit trx id 276C8FF Read view low limit trx id 276C905 Read view individually stored trx ids: Read view trx id 276C904 Read view trx id 276C903 Read view trx id 276C8FF ----------------- Main thread process no. 1036, id 140284856301312, state: flushing log Number of rows inserted 1575199, updated 3041895, deleted 0, read 2000002210 0.14 inserts/s, 0.61 updates/s, 0.00 deletes/s, 1.04 reads/s ------------------------ LATEST DETECTED DEADLOCK ------------------------ 180925 15:46:22 *** (1) TRANSACTION: TRANSACTION 265DBE7, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 7 lock struct(s), heap size 1248, 5 row lock(s), undo log entries 3 MySQL thread id 436243, OS thread handle 0x7f965fcdc700, query id 63356859 m3-mgmt2.massive.org.au 172.16.200.79 slurmdb Updating UPDATE "m3_assoc_table" SET rgt = rgt+2 WHERE rgt > 437 && deleted < 2 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 0 page no 723 n bits 256 index "PRIMARY" of table "slurm_acct_db"."m3_assoc_table" trx id 265DBE7 lock_mode X waiting *** (2) TRANSACTION: TRANSACTION 265DBE8, ACTIVE 0 sec fetching rows mysql tables in use 1, locked 1 16 lock struct(s), heap size 3112, 1787 row lock(s), undo log entries 1560 MySQL thread id 436244, OS thread handle 0x7f96b8285700, query id 63356858 m3-mgmt2.massive.org.au 172.16.200.79 slurmdb Updating UPDATE "m3_assoc_table" SET rgt = rgt+2 WHERE rgt > 437 && deleted < 2 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 0 page no 723 n bits 256 index "PRIMARY" of table "slurm_acct_db"."m3_assoc_table" trx id 265DBE8 lock_mode X *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 0 page no 1284 n bits 264 index "PRIMARY" of table "slurm_acct_db"."m3_assoc_table" trx id 265DBE8 lock_mode X waiting *** WE ROLL BACK TRANSACTION (1) ------------ TRANSACTIONS ------------ Trx id counter 276E788 Purge done for trx's n:o < 276C905 undo n:o < 0 History list length 3391 LIST OF TRANSACTIONS FOR EACH SESSION: ---TRANSACTION 0, not started MySQL thread id 714203, OS thread handle 0x7f965e57a700, query id 100920110 localhost root SHOW ENGINE INNODB STATUS ---TRANSACTION 276C8FF, not started MySQL thread id 713991, OS thread handle 0x7f965fc01700, query id 100896405 m3-mgmt2.massive.org.au 172.16.200.79 slurmdb Waiting for table metadata lock alter table "m3_assoc_table" AUTO_INCREMENT=0 ---TRANSACTION 276E786, not started MySQL thread id 683180, OS thread handle 0x7f96b8161700, query id 100920109 m3-mgmt2.massive.org.au 172.16.200.79 slurmdb ---TRANSACTION 275BB5E, not started MySQL thread id 683178, OS thread handle 0x7f965fd6e700, query id 100915512 m3-mgmt2.massive.org.au 172.16.200.79 slurmdb ---TRANSACTION 276C904, ACTIVE 6434 sec 13 lock struct(s), heap size 3112, 7 row lock(s), undo log entries 3 MySQL thread id 713993, OS thread handle 0x7f965ff6d700, query id 100896484 m3-mgmt2.massive.org.au 172.16.200.79 slurmdb Trx read view will not see trx with id >= 276C905, sees < 276C8FF ---------------------------- END OF INNODB MONITOR OUTPUT ============================ | +--------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.02 sec) Stop slurmdbd: MariaDB [(none)]> SHOW PROCESSLIST; +--------+------+-----------+------+---------+------+-------+------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +--------+------+-----------+------+---------+------+-------+------------------+----------+ | 714204 | root | localhost | NULL | Query | 0 | NULL | SHOW PROCESSLIST | 0.000 | +--------+------+-----------+------+---------+------+-------+------------------+----------+ 1 row in set (0.00 sec) Start slurmdbd: MariaDB [(none)]> SHOW PROCESSLIST; +--------+---------+-------------------------------+---------------+---------+------+-------+------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +--------+---------+-------------------------------+---------------+---------+------+-------+------------------+----------+ | 714204 | root | localhost | NULL | Query | 0 | NULL | SHOW PROCESSLIST | 0.000 | | 714206 | slurmdb | m3-mgmt1.massive.org.au:41686 | slurm_acct_db | Sleep | 18 | | NULL | 0.000 | | 714208 | slurmdb | m3-mgmt1.massive.org.au:41698 | slurm_acct_db | Query | 0 | NULL | commit | 0.000 | +--------+---------+-------------------------------+---------------+---------+------+-------+------------------+----------+ 3 rows in set (0.00 sec) MariaDB [(none)]> SHOW PROCESSLIST; +--------+---------+-------------------------------+---------------+---------+------+-------+------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +--------+---------+-------------------------------+---------------+---------+------+-------+------------------+----------+ | 714204 | root | localhost | NULL | Query | 0 | NULL | SHOW PROCESSLIST | 0.000 | | 714206 | slurmdb | m3-mgmt1.massive.org.au:41686 | slurm_acct_db | Sleep | 44 | | NULL | 0.000 | | 714210 | slurmdb | m3-mgmt2.massive.org.au:57842 | slurm_acct_db | Sleep | 45 | | NULL | 0.000 | | 714212 | slurmdb | m3-mgmt2.massive.org.au:57852 | slurm_acct_db | Sleep | 4 | | NULL | 0.000 | +--------+---------+-------------------------------+---------------+---------+------+-------+------------------+----------+ 4 rows in set (0.00 sec) Kindly advise, or let us if you need clarification. Thanks. Damien