Hello, I am attempting to delete an account in our SLURM db via "sacctmgr delete account". I've already deleted the sub-accounts and there are no users listed under this account:

[frenchwr@vmps13 ~]$ sacctmgr show account abf_lab_account
   Account                Descr                  Org
---------- -------------------- --------------------
abf_lab_a+      abf_lab_account      abf_lab_account

[frenchwr@vmps13 ~]$ sacctmgr show account abf_lab_account WithAssoc
   Account Descr Org Cluster Par Name User Share GrpJobs GrpNodes GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
abf_lab_a+ abf_lab_account abf_lab_account accre root 1 20 204800 normal

When I attempt to delete this account from the SLURM db I get the following error:

root@slurmsched1:~# sacctmgr delete account name=abf_lab_account
 Database is busy or waiting for lock from other user.
sacctmgr: error: slurmdbd: Getting response to message type 1435
sacctmgr: error: slurmdbd: DBD_REMOVE_ACCTS failure: No error
 Error with request: No error

I've tried this command multiple times over the last week and it always fails with this error, even if mysqld and slurmdbd are not busy prior to the request. Subsequent sacctmgr commands then hang, often for several minutes, without any "Database is busy" message.
This behavior can then continue for several minutes or hours. It actually appears that certain sacctmgr commands can complete while others will not:

root@slurmsched1:~# time sacctmgr show account abf_lab_account
   Account                Descr                  Org
---------- -------------------- --------------------
abf_lab_a+      abf_lab_account      abf_lab_account

real	0m0.174s
user	0m0.008s
sys	0m0.004s

root@slurmsched1:~# time sacctmgr show account abf_lab_account WithAssoc
^C

real	7m45.700s
user	0m0.006s
sys	0m0.006s

I also notice that this behavior coincides with high CPU usage from mysqld on our database server. Any ideas what's going on here? Am I issuing incorrect syntax, or is this a bug?
Do you see any errors in the slurmdbd logs? Can you provide the slurmdbd logs of when it happens? Will you also attach your slurm.conf and slurmdbd.conf? And when it's busy what does "show processlist" show in mysql?
What does this return? time sacctmgr show account withassoc | wc -l
Apologies for the delayed response. I'm trying to find a good time to test this since some of the SLURM db functionality will be affected. I'm hoping to test later today before I leave.
No problem. Will you also grab the output of "show engine innodb status \G" from mysql?
I'm finally getting around to this. I'll include the other data you asked for in subsequent replies, but for now...

(In reply to Brian Christiansen from comment #2)
> What does this return?
>
> time sacctmgr show account withassoc | wc -l

I'm assuming you mean before I've attempted to delete abf_lab_account:

[frenchwr@vmps13 ~]$ time sacctmgr show account withassoc | wc -l
5737

real	0m1.088s
user	0m0.100s
sys	0m0.017s
(In reply to Brian Christiansen from comment #1)
> Do you see any errors in the slurmdbd logs? Can you provide the slurmdbd
> logs of when it happens? Will you also attach your slurm.conf and
> slurmdbd.conf?
>
> And when it's busy what does "show processlist" show in mysql?

MariaDB [(none)]> show processlist\G
*************************** 1. row ***************************
      Id: 13871
    User: slurm
    Host: slurmdb:54144
      db: slurm_acct_db
 Command: Sleep
    Time: 2
   State:
    Info: NULL
Progress: 0.000
*************************** 2. row ***************************
      Id: 13872
    User: slurm
    Host: slurmsched1:55320
      db: slurm_jobcomp_db
 Command: Sleep
    Time: 18
   State:
    Info: NULL
Progress: 0.000
*************************** 3. row ***************************
      Id: 14076
    User: slurm
    Host: slurmdb:54349
      db: slurm_acct_db
 Command: Query
    Time: 205
   State: Sending data
    Info: select t0.id_assoc from "accre_job_table" as t0, "accre_assoc_table" as t1, "accre_assoc_table" as t
Progress: 0.000
*************************** 4. row ***************************
      Id: 14077
    User: slurm
    Host: slurmdb:54350
      db: slurm_acct_db
 Command: Sleep
    Time: 126
   State:
    Info: NULL
Progress: 0.000
*************************** 5. row ***************************
      Id: 14079
    User: root
    Host: localhost
      db: NULL
 Command: Query
    Time: 0
   State: init
    Info: show processlist
Progress: 0.000
5 rows in set (0.00 sec)
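On a busier server the same information can be pulled out of information_schema directly, which is easier to filter than the \G output. This is a sketch, assuming a MariaDB/MySQL build that ships the information_schema PROCESSLIST table (any recent release):

```sql
-- Show only statements that have been running for more than 60 seconds,
-- longest first; truncate the query text for readability.
SELECT id, user, host, db, time, state, LEFT(info, 80) AS query
FROM information_schema.processlist
WHERE command = 'Query' AND time > 60
ORDER BY time DESC;
```

Here that would return just row 3 above, the 205-second select against the job and association tables.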
(In reply to Brian Christiansen from comment #4)
> No problem. Will you also grab the output of "show engine innodb status \G"
> from mysql?

MariaDB [(none)]> show engine innodb status \G
*************************** 1. row ***************************
  Type: InnoDB
  Name:
Status:
=====================================
2015-06-23 18:08:59 7f63dee3e700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 39 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 355970 srv_active, 0 srv_shutdown, 664069 srv_idle
srv_master_thread log flush and writes: 1020024
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 382449
OS WAIT ARRAY INFO: signal count 357033
Mutex spin waits 303302, rounds 6558745, OS waits 204036
RW-shared spins 173232, rounds 5155456, OS waits 171760
RW-excl spins 166, rounds 199119, OS waits 6522
Spin rounds per wait: 21.62 mutex, 29.76 RW-shared, 1199.51 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 26340615
Purge done for trx's n:o < 26339249 undo n:o < 0 state: running but idle
History list length 541
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 14079, OS thread handle 0x7f63dee3e700, query id 98452558 localhost root init
show engine innodb status
---TRANSACTION 26340602, not started
MySQL thread id 13872, OS thread handle 0x7f63debf6700, query id 98452541 slurmsched1 10.0.0.49 slurm cleaning up
---TRANSACTION 26340614, not started flushing log
MySQL thread id 13871, OS thread handle 0x7f63dead2700, query id 98452557 slurmdb 10.0.0.51 slurm init
commit
---TRANSACTION 26339245, ACTIVE 338 sec fetching rows
mysql tables in use 3, locked 0
3 lock struct(s), heap size 1184, 1 row lock(s), undo log entries 1
MySQL thread id 14076, OS thread handle 0x7f63deb1b700, query id 98450680 slurmdb 10.0.0.51 slurm Sending data
select t0.id_assoc from "accre_job_table" as t0, "accre_assoc_table" as t1, "accre_assoc_table" as t2 where t1.lft between t2.lft and t2.rgt && (t2.acct='abf_lab_account') and t0.id_assoc=t1.id_assoc limit 1
Trx read view will not see trx with id >= 26339246, sees < 26339246
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 1; buffer pool: 0
85990 OS file reads, 7315873 OS file writes, 3117550 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 8.00 writes/s, 3.74 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 134 merges
merged operations:
 insert 601, delete mark 74, delete 36
discarded operations:
 insert 0, delete mark 0, delete 0
0.31 hash searches/s, 259.25 non-hash searches/s
---
LOG
---
Log sequence number 19741592137
Log flushed up to   19741591710
Pages flushed up to 19741587756
Last checkpoint at  19741587756
Max checkpoint age    80826164
Checkpoint age target 78300347
Modified age          4381
Checkpoint age        4381
1 pending log writes, 0 pending chkp writes
2238880 log i/o's done, 2.62 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 70330089472; in additional pool allocated 0
Total memory allocated by read views 480
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 1234624096 (1211076088 + 23548008)
    Page hash           8851208 (buffer pool 0 only)
    Dictionary cache    283419755 (283213232 + 206523)
    File system         827776 (812272 + 15504)
    Lock system         170001232 (169998968 + 2264)
    Recovery system     0 (0 + 0)
Dictionary memory allocated 206523
Buffer pool size        4194296
Buffer pool size, bytes 68719345664
Free buffers            4082306
Database pages          110553
Old database pages      40652
Modified db pages       24
Percent of dirty pages(LRU & free pages): 0.001
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 130, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 85948, created 24605, written 4865093
0.00 reads/s, 0.00 creates/s, 5.10 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 110553, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
----------------------
INDIVIDUAL BUFFER POOL INFO
----------------------
---BUFFER POOL 0
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            509962
Database pages          14143
Old database pages      5200
Modified db pages       4
Percent of dirty pages(LRU & free pages): 0.001
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 16, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 11277, created 2866, written 843671
0.00 reads/s, 0.00 creates/s, 0.77 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 14143, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 1
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510489
Database pages          13591
Old database pages      4996
Modified db pages       1
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 8, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10397, created 3194, written 469473
0.00 reads/s, 0.00 creates/s, 0.38 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 13591, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 2
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            509993
Database pages          14121
Old database pages      5192
Modified db pages       2
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 22, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 11098, created 3023, written 321467
0.00 reads/s, 0.00 creates/s, 0.26 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 14121, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 3
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510854
Database pages          13250
Old database pages      4877
Modified db pages       0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 10, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10412, created 2838, written 571805
0.00 reads/s, 0.00 creates/s, 0.92 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 13250, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 4
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            509481
Database pages          14622
Old database pages      5377
Modified db pages       6
Percent of dirty pages(LRU & free pages): 0.001
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 14, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 11282, created 3340, written 664583
0.00 reads/s, 0.00 creates/s, 0.28 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 14622, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 5
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510631
Database pages          13492
Old database pages      4960
Modified db pages       2
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 27, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10031, created 3461, written 665300
0.00 reads/s, 0.00 creates/s, 0.44 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 13492, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 6
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510090
Database pages          14030
Old database pages      5159
Modified db pages       8
Percent of dirty pages(LRU & free pages): 0.002
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 19, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10834, created 3196, written 679029
0.00 reads/s, 0.00 creates/s, 1.72 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 14030, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 7
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510806
Database pages          13304
Old database pages      4891
Modified db pages       1
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 14, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10617, created 2687, written 649765
0.00 reads/s, 0.00 creates/s, 0.33 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 13304, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
1 RW transactions active inside InnoDB
0 RO transactions active inside InnoDB
1 out of 1000 descriptors used
---OLDEST VIEW---
Normal read view
Read view low limit trx n:o 26339246
Read view up limit trx id 26339246
Read view low limit trx id 26339246
Read view individually stored trx ids:
-----------------
Main thread process no. 35356, id 140066995283712, state: sleeping
Number of rows inserted 1210262, updated 1291137, deleted 0, read 73093279454
1.10 inserts/s, 1.26 updates/s, 0.00 deletes/s, 1702371.79 reads/s
Number of system rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

1 row in set (0.00 sec)
That mysql output was gathered while the "Database is busy or waiting for lock from other user" message was displayed, after issuing the problematic "sacctmgr delete" command. The DB logs don't show anything interesting that I can see (the "sacctmgr delete" command was issued at 6:03 PM):

[2015-06-23T18:03:21.398] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:0 IP:10.0.0.49 CONN:8
[2015-06-23T18:05:01.457] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:0 IP:10.0.0.49 CONN:3
[2015-06-23T18:06:36.436] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:10
[2015-06-23T18:06:54.169] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:10:01.324] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:0 IP:10.0.0.49 CONN:10
[2015-06-23T18:10:03.864] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:1003 IP:10.0.22.51 CONN:12
[2015-06-23T18:10:59.177] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:12:16.870] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:12:29.161] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:12:37.583] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:12:49.234] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:13:19.293] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:13:47.768] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:13:49.677] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:13:56.949] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:14:47.894] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:15:01.408] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:0 IP:10.0.0.49 CONN:14
[2015-06-23T18:16:09.340] debug: DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
Created attachment 1990 [details] slurm.conf
Created attachment 1991 [details] slurmdbd.conf
How long does it take to run this command by hand? It looks like this one was taking a while (338 seconds at the time):

select t0.id_assoc
from accre_job_table as t0, accre_assoc_table as t1, accre_assoc_table as t2
where t1.lft between t2.lft and t2.rgt
  && (t2.acct='abf_lab_account')
  and t0.id_assoc=t1.id_assoc
limit 1;
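A cheaper check than running the statement to completion, assuming EXPLAIN is available on that MariaDB build (it is on anything recent), is to look at the plan and see whether the job table is being scanned without an index:

```sql
-- If the row for t0 shows type: ALL with key: NULL, the join is doing a
-- full scan of the job table for each candidate association row, which
-- would account for multi-hour runtimes on a large job table.
EXPLAIN select t0.id_assoc
from accre_job_table as t0, accre_assoc_table as t1, accre_assoc_table as t2
where t1.lft between t2.lft and t2.rgt
  && (t2.acct='abf_lab_account')
  and t0.id_assoc=t1.id_assoc
limit 1;
```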
I was able to reproduce the situation where deleting an account would take a long time. It's fixed in the following commit in 15.08 by adding a db index. https://github.com/SchedMD/slurm/commit/712122f97cad6a3347fa6f7ccebe6a6f265c36d9 You can apply this to 14.11. The dbd may take a little longer to start up depending on how many associations you have. Let me know if you have any questions.
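For illustration of what that commit does (this DDL is a sketch, not the patch itself — the real index is created by slurmdbd at startup after applying the commit, and the table name below assumes the "accre" cluster), the general shape is an index on the job table's association id so the join can probe instead of scanning:

```sql
-- Illustrative only: do not run by hand on a live accounting DB without
-- checking the linked commit first; let the patched slurmdbd do it.
ALTER TABLE accre_job_table ADD INDEX id_assoc (id_assoc);
```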
> How long does it take to run this command by hand? It looks like this one
> was taking a while (338 seconds at the time).

It takes quite a long time, apparently:

MariaDB [slurm_acct_db]> select t0.id_assoc from accre_job_table as t0, accre_assoc_table as t1, accre_assoc_table as t2 where t1.lft between t2.lft and t2.rgt && (t2.acct='abf_lab_account') and t0.id_assoc=t1.id_assoc limit 1;
Empty set (2 hours 48 min 1.96 sec)

> You can apply this to 14.11. The dbd may take a little longer to start up
> depending on how many associations you have.

I guess that all depends on what "a little longer" means. Also, will this patch be rolled into the 14.11 branch at all?
I just added 6,000 associations and restarted my dbd with the patch and it took about 10 minutes for the database to create all of the indexes. So you'll probably see about the same wait. We don't plan on putting this into 14.11 because of the wait. Let us know how the patch works for you.
Just checking in. Have you had an opportunity to apply the patch?
(In reply to Brian Christiansen from comment #15)
> Just checking in. Have you had an opportunity to apply the patch?

Not yet, but we hope to soon.
Just to mention that I've stumbled across this bug for a similar reason; I killed the following query in MySQL after just shy of an hour.

*************************** 14. row ***************************
           Id: 39550
         User: slurm
         Host: 10.11.0.41:43807
           db: slurm_acct_db
      Command: Query
         Time: 190
        State: Sending data
         Info: select t0.id_assoc from "avoca_job_table" as t0, "avoca_assoc_table" as t1, "avoca_assoc_table" as t2 where t1.lft between t2.lft and t2.rgt && (t2.acct='vr0282') and t0.id_assoc=t1.id_assoc limit 1
    Rows_sent: 0
Rows_examined: 0
    Rows_read: 0

It was triggered by:

root@my:~# /usr/local/slurm/latest/bin/sacctmgr -ip delete account name=VR0282
 Database is busy or waiting for lock from other user.
Chris, I believe you are on a much older 14.03 DBD, correct? In any case it would be nice to see if other queries indeed match this bug. I see you sent the 14th row, meaning there were probably other queries blocking that one from running. It would be nice to see what those were as well, to check whether this is indeed the same thing Will is hitting. I know 14.11 brought massive performance improvements to the database, so perhaps you are just hitting something that could be fixed by a simple upgrade to a more modern slurmdbd.
Chris, never mind my last comment; this is most likely the same thing. Upgrading would still help nonetheless :). If you could try the patch and report back, that would be helpful as well.
Hi Danny, that's the only query on our Slurm accounting database when this problem happens. In fact, other sacctmgr queries to slurmdbd block until I shoot that query inside MySQL (even restarting slurmdbd doesn't fix it). :-( We did have an upcoming outage in which we were looking to upgrade Slurm, but unfortunately electricians managed to drop power to our data centre last week, so we had to do our immediate work then and didn't have time to fit Slurm into that window. With a new cluster on the way, though, it's something that will happen in the next month, I would say. On Monday, however, we will backport that patch and test it out, as there's some tidying we need to do that is blocked on this.
Hey Chris, yeah, that 1 line patch should help out a lot, we found in our testing it did anyway. I am surprised restarting the slurmdbd didn't kill the query. I would expect other queries to block since that query is inside of a select for update so other queries on the same rows would block. Typically it only lasts for a few nanoseconds but in this case it holds on for a long time depending on how many jobs you have and how deep your association hierarchy is. I understand the upgrade situation, I was just giving you a hard time ;).
I can confirm that the patch fixed it for us too! I couldn't catch the exact lengths of the queries, but monitoring with "mytop" showed the queries on the clusters with the most jobs went for about 100 seconds, rather than having to kill the first of them after almost an hour. Phew! :-) Thanks all.
I'm going to close this for now. Please re-open if the patch doesn't fix it for you. Thanks, Brian
(In reply to Brian Christiansen from comment #23)
> I'm going to close this for now. Please re-open if the patch doesn't fix it
> for you.
>
> Thanks,
> Brian

We finally upgraded to 15.07.7 this week, so I tested whether this issue is now fixed. Happy to report success! Thanks.

root@slurmsched1:~# time sacctmgr -i delete account name=rodrigue_lab_account
 Database is busy or waiting for lock from other user.
 Deleting accounts...
  rodrigue_lab_account

real	0m5.523s
user	0m0.010s
sys	0m0.003s