We upgraded from 23.02.7 a few days ago to 24.11.0. We have 184000 records in our cedar_assoc_table. The query that was modified in 23.11.1 seems to be extremely inefficient. During the function _check_is_def_acct_before_remove https://github.com/SchedMD/slurm/commit/8fcd96efda7a0476b4910b5ee62e55f94baa34f2#diff-7c227c4ce0cbb698c2fca7a2ff21bf59d01ad702d758d4f07c89c07248c0a423R283-R294 The added sanity check seems to be an exponentially growing lookup depending on the rows in the table. I tested this by chopping the table down to smaller sizes, 10k records, 20k, 30k, 40k. After 30k, the call took a VERY long time. Is it possible that this would fit the bill better? SELECT t1.user, t1.acct FROM cedar_assoc_table AS t1 JOIN ( SELECT t2.user FROM cedar_assoc_table AS t2 WHERE t2.deleted = 0 AND t2.user != '' AND t2.lineage LIKE '%/def-someaccount_gpu/%' GROUP BY t2.user HAVING MAX(t2.is_def) = 1 AND NOT COUNT(*) = ( SELECT COUNT(*) FROM cedar_assoc_table WHERE deleted = 0 AND user = t2.user ) ) AS subquery ON t1.user = subquery.user WHERE t1.is_def = 1 AND t1.deleted = 0; This query seemed to provide the same, correct, results, as far as I could tell. But did it in 0.2s I make no claim to the strength or accuracy or license of that query, it was produced by 'AI' With the state of the current code, we cannot delete accounts. It also seems there is something else causing the slurmdbd startup to take 1.15hr each time, but that is or may be a slightly different issue/query that I need to look into. Thank you.
A simple way that I also tested it was a new table with only the fields needed. create table test_table (id_assoc int(10) unsigned primary key, user tinytext, acct tinytext, lineage text, is_def tinyint(4), deleted tinyint(4), index(user,acct)); insert into test_table select (seq) id_assoc, concat('myname', (seq)) user, concat('myacct', (seq)) acct, concat('mylineage', (seq)) lineage, 1 is_def, 0 deleted from seq_1_to_40000;
Hi. That's good information, especially the test--thanks. I agree that the query is not constructed well. Another site has also identified this issue and I have an initial patch for it that I'm testing. I'll supply more once I've finished that step.
Hi, I'm curious if any progress has been made on this. Might it make it into 24.11.3 We have another site that upgraded to 24.11.x recently and they have 180,000 records in their assoc table, so they're unable to deletions now as well. Thanks, Adam
(In reply to Adam from comment #4) > I'm curious if any progress has been made on this. Might it make it into > 24.11.3 > > We have another site that upgraded to 24.11.x recently and they have 180,000 > records in their assoc table, so they're unable to deletions now as well. Hi. I've been out of the office for a bit and am catching up. I have a patch together for this but 24.11.3 needed to be pushed out quickly and I wasn't ready in time. The patch is a priority for me now and will let you know when we have it committed.
(In reply to Chad Vizino from comment #5) > (In reply to Adam from comment #4) > > I'm curious if any progress has been made on this. Might it make it into > > 24.11.3 > > > > We have another site that upgraded to 24.11.x recently and they have 180,000 > > records in their assoc table, so they're unable to deletions now as well. > Hi. I've been out of the office for a bit and am catching up. I have a patch > together for this but 24.11.3 needed to be pushed out quickly and I wasn't > ready in time. The patch is a priority for me now and will let you know when > we have it committed. Hi Chad, Any progress on this, by the time we have any movement on these SQL queries, we'll be at 24.11.6 it seems. This query is just one of a few, the startup time of SlurmDB is also very long due to other questionable queries during the startup process, but, this specific query is of more concern. Thanks, Adam
(In reply to Adam from comment #6) > Any progress on this, by the time we have any movement on these SQL queries, > we'll be at 24.11.6 it seems. > > This query is just one of a few, the startup time of SlurmDB is also very > long due to other questionable queries during the startup process, but, this > specific query is of more concern. Hi Adam. Sorry this has been delayed. As you point out, there are other long queries--we are looking at one in particular (it's in _get_user_coords()) that is delaying startup for sites with larger assoc tables. That query also does a 2-way join on the assoc table and we are eliminating the join so it's much faster (being internally reviewed). As I mentioned earlier, I have a fix for the delete issue but it's still not quite ready for internal review and may change a bit depending on how things go with this other one. So, hopefully not too much longer till the delete issue can get some attention and possible improvement from what we've learned. I'll give an update soon when we are certain of the direction we are going since it's is an important issue also, especially when the assoc table is larger.