Ticket 1746 - Deleting Parent Account with sacctmgr
Summary: Deleting Parent Account with sacctmgr
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 14.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Brian Christiansen
Reported: 2015-06-15 06:16 MDT by Will French
Modified: 2016-01-29 03:30 MST
CC List: 6 users

Site: Vanderbilt
Version Fixed: 15.08.0pre6


Attachments
slurm.conf (7.34 KB, text/plain)
2015-06-23 11:22 MDT, Will French
slurmdbd.conf (652 bytes, text/plain)
2015-06-23 11:23 MDT, Will French

Description Will French 2015-06-15 06:16:40 MDT
Hello, I am attempting to delete an account in our SLURM db via "sacctmgr delete account". I've already deleted the sub-accounts and there are no users listed under this account:

[frenchwr@vmps13 ~]$ sacctmgr show account abf_lab_account
   Account                Descr                  Org 
---------- -------------------- -------------------- 
abf_lab_a+      abf_lab_account      abf_lab_account 
[frenchwr@vmps13 ~]$ sacctmgr show account abf_lab_account WithAssoc
   Account                Descr                  Org    Cluster   Par Name       User     Share GrpJobs GrpNodes  GrpCPUs  GrpMem GrpSubmit     GrpWall  GrpCPUMins MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS 
---------- -------------------- -------------------- ---------- ---------- ---------- --------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- 
abf_lab_a+      abf_lab_account      abf_lab_account      accre       root                    1                        20  204800                                                                                                             normal


When I attempt to delete this account from the SLURM db I get the following error:

root@slurmsched1:~# sacctmgr delete account name=abf_lab_account
 Database is busy or waiting for lock from other user.
sacctmgr: error: slurmdbd: Getting response to message type 1435
sacctmgr: error: slurmdbd: DBD_REMOVE_ACCTS failure: No error
 Error with request: No error

I've tried this command multiple times over the last week and it always fails with this error, even if mysqld and slurmdbd are not busy prior to the request. Subsequent sacctmgr commands then hang, often for several minutes, without any "Database is busy" message. This behavior can continue for several minutes or hours... it actually appears that certain sacctmgr commands can complete while others will not:


root@slurmsched1:~# time sacctmgr show account abf_lab_account
   Account                Descr                  Org 
---------- -------------------- -------------------- 
abf_lab_a+      abf_lab_account      abf_lab_account 

real	0m0.174s
user	0m0.008s
sys	0m0.004s

root@slurmsched1:~# time sacctmgr show account abf_lab_account WithAssoc
^C

real	7m45.700s
user	0m0.006s
sys	0m0.006s

I also notice that this behavior coincides with high CPU usage from mysqld on our database server.

Any ideas what's going on here? Am I issuing incorrect syntax, or is this a bug?
Comment 1 Brian Christiansen 2015-06-15 11:08:51 MDT
Do you see any errors in the slurmdbd logs? Can you provide the slurmdbd logs of when it happens? Will you also attach your slurm.conf and slurmdbd.conf? 

And when it's busy what does "show processlist" show in mysql?
Comment 2 Brian Christiansen 2015-06-15 11:15:28 MDT
What does this return?

time sacctmgr show account withassoc | wc -l
Comment 3 Will French 2015-06-18 06:21:42 MDT
Apologies for the delayed response. I'm trying to find a good time to test this since some of the SLURM db functionality will be affected. I'm hoping to test later today before I leave.
Comment 4 Brian Christiansen 2015-06-18 10:08:44 MDT
No problem. Will you also grab the output of "show engine innodb status \G" from mysql?
Comment 5 Will French 2015-06-23 10:49:22 MDT
I'm finally getting around to this. I'll include other data you asked for in subsequent replies, but for now...

(In reply to Brian Christiansen from comment #2)
> What does this return?
> 
> time sacctmgr show account withassoc | wc -l

I'm assuming you mean before I've attempted to delete abf_lab_account: 

[frenchwr@vmps13 ~]$ time sacctmgr show account withassoc | wc -l
5737

real	0m1.088s
user	0m0.100s
sys	0m0.017s
Comment 6 Will French 2015-06-23 11:08:11 MDT
(In reply to Brian Christiansen from comment #1)
> Do you see any errors in the slurmdbd logs? Can you provide the slurmdbd
> logs of when it happens? Will you also attach your slurm.conf and
> slurmdbd.conf? 
> 
> And when it's busy what does "show processlist" show in mysql?

MariaDB [(none)]> show processlist\G
*************************** 1. row ***************************
      Id: 13871
    User: slurm
    Host: slurmdb:54144
      db: slurm_acct_db
 Command: Sleep
    Time: 2
   State: 
    Info: NULL
Progress: 0.000
*************************** 2. row ***************************
      Id: 13872
    User: slurm
    Host: slurmsched1:55320
      db: slurm_jobcomp_db
 Command: Sleep
    Time: 18
   State: 
    Info: NULL
Progress: 0.000
*************************** 3. row ***************************
      Id: 14076
    User: slurm
    Host: slurmdb:54349
      db: slurm_acct_db
 Command: Query
    Time: 205
   State: Sending data
    Info: select t0.id_assoc from "accre_job_table" as t0, "accre_assoc_table" as t1, "accre_assoc_table" as t
Progress: 0.000
*************************** 4. row ***************************
      Id: 14077
    User: slurm
    Host: slurmdb:54350
      db: slurm_acct_db
 Command: Sleep
    Time: 126
   State: 
    Info: NULL
Progress: 0.000
*************************** 5. row ***************************
      Id: 14079
    User: root
    Host: localhost
      db: NULL
 Command: Query
    Time: 0
   State: init
    Info: show processlist
Progress: 0.000
5 rows in set (0.00 sec)
Comment 7 Will French 2015-06-23 11:09:59 MDT
(In reply to Brian Christiansen from comment #4)
> No problem. Will you also grab the output of "show engine innodb status \G"
> from mysql?

MariaDB [(none)]> show engine innodb status \G
*************************** 1. row ***************************
  Type: InnoDB
  Name: 
Status: 
=====================================
2015-06-23 18:08:59 7f63dee3e700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 39 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 355970 srv_active, 0 srv_shutdown, 664069 srv_idle
srv_master_thread log flush and writes: 1020024
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 382449
OS WAIT ARRAY INFO: signal count 357033
Mutex spin waits 303302, rounds 6558745, OS waits 204036
RW-shared spins 173232, rounds 5155456, OS waits 171760
RW-excl spins 166, rounds 199119, OS waits 6522
Spin rounds per wait: 21.62 mutex, 29.76 RW-shared, 1199.51 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 26340615
Purge done for trx's n:o < 26339249 undo n:o < 0 state: running but idle
History list length 541
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 14079, OS thread handle 0x7f63dee3e700, query id 98452558 localhost root init
show engine innodb status
---TRANSACTION 26340602, not started
MySQL thread id 13872, OS thread handle 0x7f63debf6700, query id 98452541 slurmsched1 10.0.0.49 slurm cleaning up
---TRANSACTION 26340614, not started flushing log
MySQL thread id 13871, OS thread handle 0x7f63dead2700, query id 98452557 slurmdb 10.0.0.51 slurm init
commit
---TRANSACTION 26339245, ACTIVE 338 sec fetching rows
mysql tables in use 3, locked 0
3 lock struct(s), heap size 1184, 1 row lock(s), undo log entries 1
MySQL thread id 14076, OS thread handle 0x7f63deb1b700, query id 98450680 slurmdb 10.0.0.51 slurm Sending data
select t0.id_assoc from "accre_job_table" as t0, "accre_assoc_table" as t1, "accre_assoc_table" as t2 where t1.lft between t2.lft and t2.rgt && (t2.acct='abf_lab_account') and t0.id_assoc=t1.id_assoc limit 1
Trx read view will not see trx with id >= 26339246, sees < 26339246
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 1; buffer pool: 0
85990 OS file reads, 7315873 OS file writes, 3117550 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 8.00 writes/s, 3.74 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 134 merges
merged operations:
 insert 601, delete mark 74, delete 36
discarded operations:
 insert 0, delete mark 0, delete 0
0.31 hash searches/s, 259.25 non-hash searches/s
---
LOG
---
Log sequence number 19741592137
Log flushed up to   19741591710
Pages flushed up to 19741587756
Last checkpoint at  19741587756
Max checkpoint age    80826164
Checkpoint age target 78300347
Modified age          4381
Checkpoint age        4381
1 pending log writes, 0 pending chkp writes
2238880 log i/o's done, 2.62 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 70330089472; in additional pool allocated 0
Total memory allocated by read views 480
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 1234624096 	(1211076088 + 23548008)
    Page hash           8851208 (buffer pool 0 only)
    Dictionary cache    283419755 	(283213232 + 206523)
    File system         827776 	(812272 + 15504)
    Lock system         170001232 	(169998968 + 2264)
    Recovery system     0 	(0 + 0)
Dictionary memory allocated 206523
Buffer pool size        4194296
Buffer pool size, bytes 68719345664
Free buffers            4082306
Database pages          110553
Old database pages      40652
Modified db pages       24
Percent of dirty pages(LRU & free pages): 0.001
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 130, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 85948, created 24605, written 4865093
0.00 reads/s, 0.00 creates/s, 5.10 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 110553, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
----------------------
INDIVIDUAL BUFFER POOL INFO
----------------------
---BUFFER POOL 0
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            509962
Database pages          14143
Old database pages      5200
Modified db pages       4
Percent of dirty pages(LRU & free pages): 0.001
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 16, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 11277, created 2866, written 843671
0.00 reads/s, 0.00 creates/s, 0.77 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 14143, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 1
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510489
Database pages          13591
Old database pages      4996
Modified db pages       1
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 8, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10397, created 3194, written 469473
0.00 reads/s, 0.00 creates/s, 0.38 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 13591, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 2
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            509993
Database pages          14121
Old database pages      5192
Modified db pages       2
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 22, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 11098, created 3023, written 321467
0.00 reads/s, 0.00 creates/s, 0.26 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 14121, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 3
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510854
Database pages          13250
Old database pages      4877
Modified db pages       0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 10, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10412, created 2838, written 571805
0.00 reads/s, 0.00 creates/s, 0.92 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 13250, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 4
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            509481
Database pages          14622
Old database pages      5377
Modified db pages       6
Percent of dirty pages(LRU & free pages): 0.001
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 14, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 11282, created 3340, written 664583
0.00 reads/s, 0.00 creates/s, 0.28 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 14622, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 5
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510631
Database pages          13492
Old database pages      4960
Modified db pages       2
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 27, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10031, created 3461, written 665300
0.00 reads/s, 0.00 creates/s, 0.44 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 13492, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 6
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510090
Database pages          14030
Old database pages      5159
Modified db pages       8
Percent of dirty pages(LRU & free pages): 0.002
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 19, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10834, created 3196, written 679029
0.00 reads/s, 0.00 creates/s, 1.72 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 14030, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 7
Buffer pool size        524287
Buffer pool size, bytes 8589918208
Free buffers            510806
Database pages          13304
Old database pages      4891
Modified db pages       1
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 14, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 10617, created 2687, written 649765
0.00 reads/s, 0.00 creates/s, 0.33 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 13304, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
1 RW transactions active inside InnoDB
0 RO transactions active inside InnoDB
1 out of 1000 descriptors used
---OLDEST VIEW---
Normal read view
Read view low limit trx n:o 26339246
Read view up limit trx id 26339246
Read view low limit trx id 26339246
Read view individually stored trx ids:
-----------------
Main thread process no. 35356, id 140066995283712, state: sleeping
Number of rows inserted 1210262, updated 1291137, deleted 0, read 73093279454
1.10 inserts/s, 1.26 updates/s, 0.00 deletes/s, 1702371.79 reads/s
Number of system rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

1 row in set (0.00 sec)
Comment 8 Will French 2015-06-23 11:19:11 MDT
That mysql output was gathered while the "Database is busy or waiting for lock from other user" message was displayed after issuing the problematic "sacctmgr delete" command. The DB logs don't show anything interesting that I can see (the "sacctmgr delete" command was issued at 6:03PM):

[2015-06-23T18:03:21.398] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:0 IP:10.0.0.49 CONN:8
[2015-06-23T18:05:01.457] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:0 IP:10.0.0.49 CONN:3
[2015-06-23T18:06:36.436] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:10
[2015-06-23T18:06:54.169] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:10:01.324] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:0 IP:10.0.0.49 CONN:10
[2015-06-23T18:10:03.864] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:1003 IP:10.0.22.51 CONN:12
[2015-06-23T18:10:59.177] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:12:16.870] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:12:29.161] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:12:37.583] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:12:49.234] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:13:19.293] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:13:47.768] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:13:49.677] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:13:56.949] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:14
[2015-06-23T18:14:47.894] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
[2015-06-23T18:15:01.408] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:0 IP:10.0.0.49 CONN:14
[2015-06-23T18:16:09.340] debug:  DBD_INIT: CLUSTER:accre VERSION:7168 UID:233708 IP:10.0.2.248 CONN:12
Comment 9 Will French 2015-06-23 11:22:51 MDT
Created attachment 1990 [details]
slurm.conf
Comment 10 Will French 2015-06-23 11:23:29 MDT
Created attachment 1991 [details]
slurmdbd.conf
Comment 11 Brian Christiansen 2015-06-25 10:00:52 MDT
How long does it take to run this command by hand? It looks like this one was taking a while (338 seconds at the time).

select t0.id_assoc from accre_job_table as t0, accre_assoc_table as t1, accre_assoc_table as t2 where t1.lft between t2.lft and t2.rgt && (t2.acct='abf_lab_account') and t0.id_assoc=t1.id_assoc limit 1;
Comment 12 Brian Christiansen 2015-06-25 11:36:03 MDT
I was able to reproduce the situation where deleting an account would take a long time. It's fixed in the following commit in 15.08 by adding a db index. 

https://github.com/SchedMD/slurm/commit/712122f97cad6a3347fa6f7ccebe6a6f265c36d9

You can apply this to 14.11. The dbd may take a little longer to start up depending on how many associations you have.

Let me know if you have any questions.
Comment 13 Sean Johnson 2015-06-25 14:59:53 MDT
> How long does it take to run this command by hand? It looks like this one was taking a while (338 seconds at the time).

It takes quite a long time, apparently:

MariaDB [slurm_acct_db]> select t0.id_assoc from accre_job_table as t0, accre_assoc_table as t1, accre_assoc_table as t2 where t1.lft between t2.lft and t2.rgt && (t2.acct='abf_lab_account') and t0.id_assoc=t1.id_assoc limit 1;
Empty set (2 hours 48 min 1.96 sec)

> You can apply this to 14.11. The dbd may take a little longer to start up depending on how many associations you have.

I guess that all depends on what "a little longer" means. Also, will this patch be rolling into the 14.X branch at all?
Comment 14 Brian Christiansen 2015-06-26 05:06:02 MDT
I just added 6,000 associations and restarted my dbd with the patch and it took about 10 minutes for the database to create all of the indexes. So you'll probably see about the same wait.

We don't plan on putting this into 14.11 because of the wait. 

Let us know how the patch works for you.
Comment 15 Brian Christiansen 2015-07-01 11:48:45 MDT
Just checking in. Have you had an opportunity to apply the patch?
Comment 16 Will French 2015-07-02 03:20:26 MDT
(In reply to Brian Christiansen from comment #15)
> Just checking in. Have you had an opportunity to apply the patch?

Not yet, but we hope to soon.
Comment 17 Chris Samuel 2015-07-15 18:10:18 MDT
Just to mention that I've stumbled across this bug for a similar reason; I killed the following query in MySQL after just shy of an hour.


*************************** 14. row ***************************
           Id: 39550
         User: slurm
         Host: 10.11.0.41:43807
           db: slurm_acct_db
      Command: Query
         Time: 190
        State: Sending data
         Info: select t0.id_assoc from "avoca_job_table" as t0, "avoca_assoc_table" as t1, "avoca_assoc_table" as t2 where t1.lft between t2.lft and t2.rgt && (t2.acct='vr0282') and t0.id_assoc=t1.id_assoc limit 1
    Rows_sent: 0
Rows_examined: 0
    Rows_read: 0


Triggered by:

root@my:~# /usr/local/slurm/latest/bin/sacctmgr -ip delete account name=VR0282 
 Database is busy or waiting for lock from other user.
Comment 18 Danny Auble 2015-07-16 03:39:00 MDT
Chris, I believe you are on a much older 14.03 DBD, correct?  In any case it would be nice to see if other queries indeed match this bug.  I see you sent the 14th row, meaning there were probably other queries blocking that one from running.  It would be nice to see what those were as well, to see if this is indeed the same thing Will is hitting.  I know 14.11 brought massive performance improvements on the database, so perhaps you are just hitting something that could be fixed by a simple upgrade to a more modern slurmdbd.
Comment 19 Danny Auble 2015-07-16 04:23:56 MDT
Chris, never mind my last comment, this is most likely the same thing; upgrading would still help nonetheless :).  If you could try the patch and report back, that would be helpful as well.
Comment 20 Chris Samuel 2015-07-16 15:48:12 MDT
Hi Danny,

That's the only query on our Slurm accounting database when this problem happens - in fact other sacctmgr queries to slurmdbd block until I shoot that query inside MySQL (even restarting slurmdbd doesn't fix it). :-(

We did have an upcoming outage in which we were looking to upgrade Slurm, but unfortunately electricians managed to drop power to our data centre last week so we had to do our immediate work then and didn't have time to do Slurm in that window.

With a new cluster on the way though it'll be something that'll happen in the next month I would say.

Monday, though, we will backport that patch and test it out, as there's some tidying we need to do that is blocked on this.
Comment 21 Danny Auble 2015-07-16 16:12:36 MDT
Hey Chris, yeah, that one-line patch should help out a lot; we found in our testing it did, anyway.

I am surprised restarting the slurmdbd didn't kill the query.  I would expect other queries to block since that query is inside of a select for update so other queries on the same rows would block.  Typically it only lasts for a few nanoseconds but in this case it holds on for a long time depending on how many jobs you have and how deep your association hierarchy is.

I understand the upgrade situation, I was just giving you a hard time ;).
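The blocking Danny describes can be sketched in miniature with SQLite write locks. This is not MySQL/InnoDB row locking, but it has the same shape: one long-running write transaction holds a lock, a later writer waits for it, then errors out as busy; once the long transaction commits, the waiter proceeds. Names and timings here are illustrative:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "acct.db")
# isolation_level=None gives explicit transaction control;
# timeout is how long a writer waits before giving up as "busy".
a = sqlite3.connect(path, timeout=0.2, isolation_level=None)
b = sqlite3.connect(path, timeout=0.2, isolation_level=None)

a.execute("CREATE TABLE assoc (id INTEGER PRIMARY KEY, acct TEXT)")

# "a" plays the long-running delete: it opens a write transaction
# and holds the lock while its (slow) work is in flight.
a.execute("BEGIN IMMEDIATE")
a.execute("INSERT INTO assoc VALUES (1, 'abf_lab_account')")

# "b" plays a later sacctmgr command: its write attempt waits for
# the lock, then fails as busy instead of completing.
try:
    b.execute("BEGIN IMMEDIATE")
    blocked = False
except sqlite3.OperationalError:
    blocked = True
print(blocked)  # True: the second writer could not get the lock

# Once the long transaction commits, the waiter can proceed.
a.execute("COMMIT")
b.execute("BEGIN IMMEDIATE")
b.execute("INSERT INTO assoc VALUES (2, 'other_account')")
b.execute("COMMIT")
```

In the ticket's case the "long transaction" was the unindexed select inside the delete, so shortening it with the index also clears the pile-up of blocked sacctmgr commands.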
Comment 22 Chris Samuel 2015-07-19 14:49:29 MDT
I can confirm that the patch fixed it for us too!

I couldn't catch the exact lengths of the queries, but monitoring with "mytop" showed the queries on the clusters with the most jobs ran for about 100 seconds, rather than having to kill the first of them after almost an hour.

Phew! :-)

Thanks all.
Comment 23 Brian Christiansen 2015-07-31 04:16:27 MDT
I'm going to close this for now. Please re-open if the patch doesn't fix it for you.

Thanks,
Brian
Comment 24 Will French 2016-01-29 03:30:37 MST
(In reply to Brian Christiansen from comment #23)
> I'm going to close this for now. Please re-open if the patch doesn't fix it
> for you.
> 
> Thanks,
> Brian

We finally upgraded to 15.07.7 this week so I tested whether this issue is now fixed. Happy to report success! Thanks.

root@slurmsched1:~# time sacctmgr -i delete account name=rodrigue_lab_account
 Database is busy or waiting for lock from other user.
 Deleting accounts...
  rodrigue_lab_account

real	0m5.523s
user	0m0.010s
sys	0m0.003s