Ticket 21937 - sacctmgr -i delete user where name=... takes 30+ minutes(never visibly completes)
Summary: sacctmgr -i delete user where name=... takes 30+ minutes(never visibly comple...
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database (show other tickets)
Version: 24.11.0
Hardware: Linux Linux
: 4 - Minor Issue
Assignee: Chad Vizino
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-01-30 13:58 MST by Adam
Modified: 2025-04-25 15:54 MDT (History)
2 users (show)

See Also:
Site: Simon Fraser University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: CentOS
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Note You need to log in before you can comment on or make changes to this ticket.
Description Adam 2025-01-30 13:58:44 MST
We upgraded from 23.02.7 a few days ago to 24.11.0.
We have 184000 records in our cedar_assoc_table.

The query that was modified in 23.11.1 seems to be extremely inefficient.  During the function _check_is_def_acct_before_remove
https://github.com/SchedMD/slurm/commit/8fcd96efda7a0476b4910b5ee62e55f94baa34f2#diff-7c227c4ce0cbb698c2fca7a2ff21bf59d01ad702d758d4f07c89c07248c0a423R283-R294

The added sanity check seems to be an exponentially growing lookup depending on the rows in the table.

I tested this by chopping the table down to smaller sizes, 10k records, 20k, 30k, 40k.  After 30k, the call took a VERY long time.

Is it possible that this would fit the bill better?

SELECT t1.user, t1.acct FROM cedar_assoc_table AS t1
  JOIN (
    SELECT t2.user     FROM cedar_assoc_table AS t2
    WHERE t2.deleted = 0 AND t2.user != '' AND
    t2.lineage LIKE '%/def-someaccount_gpu/%'
    GROUP BY t2.user HAVING MAX(t2.is_def) = 1 
    AND NOT COUNT(*) = (
      SELECT COUNT(*) FROM cedar_assoc_table WHERE deleted = 0
      AND user = t2.user
    )
  ) AS subquery ON
  t1.user = subquery.user WHERE t1.is_def = 1 AND t1.deleted = 0;

This query seemed to provide the same, correct, results, as far as I could tell.  But did it in 0.2s

I make no claim to the strength or accuracy or license of that query, it was produced by 'AI' 

With the state of the current code, we cannot delete accounts.

It also seems there is something else causing the slurmdbd startup to take 1.15hr each time, but that is or may be a slightly different issue/query that I need to look into.

Thank you.
Comment 1 Adam 2025-01-30 14:08:07 MST
A simple way that I also tested it was a new table with only the fields needed.

create table test_table (id_assoc int(10) unsigned primary key, user tinytext, acct tinytext, lineage text, is_def tinyint(4), deleted tinyint(4), index(user,acct));
insert into test_table select (seq) id_assoc, concat('myname', (seq)) user, concat('myacct', (seq)) acct, concat('mylineage', (seq)) lineage, 1 is_def, 0 deleted from seq_1_to_40000;
Comment 2 Chad Vizino 2025-01-30 14:55:06 MST
Hi. That's good information, especially the test--thanks. I agree that the query is not constructed well. Another site has also identified this issue and I have an initial patch for it that I'm testing. I'll supply more once I've finished that step.
Comment 4 Adam 2025-03-06 10:24:07 MST
Hi,

I'm curious if any progress has been made on this. Might it make it into 24.11.3

We have another site that upgraded to 24.11.x recently and they have 180,000 records in their assoc table, so they're unable to deletions now as well.

Thanks,

Adam
Comment 5 Chad Vizino 2025-03-12 10:02:10 MDT
(In reply to Adam from comment #4)
> I'm curious if any progress has been made on this. Might it make it into
> 24.11.3
> 
> We have another site that upgraded to 24.11.x recently and they have 180,000
> records in their assoc table, so they're unable to deletions now as well.
Hi. I've been out of the office for a bit and am catching up. I have a patch together for this but 24.11.3 needed to be pushed out quickly and I wasn't ready in time. The patch is a priority for me now and will let you know when we have it committed.
Comment 6 Adam 2025-04-25 11:16:33 MDT
(In reply to Chad Vizino from comment #5)
> (In reply to Adam from comment #4)
> > I'm curious if any progress has been made on this. Might it make it into
> > 24.11.3
> > 
> > We have another site that upgraded to 24.11.x recently and they have 180,000
> > records in their assoc table, so they're unable to deletions now as well.
> Hi. I've been out of the office for a bit and am catching up. I have a patch
> together for this but 24.11.3 needed to be pushed out quickly and I wasn't
> ready in time. The patch is a priority for me now and will let you know when
> we have it committed.

Hi Chad,

Any progress on this, by the time we have any movement on these SQL queries, we'll be at 24.11.6 it seems.

This query is just one of a few, the startup time of SlurmDB is also very long due to other questionable queries during the startup process, but, this specific query is of more concern.

Thanks,

Adam
Comment 7 Chad Vizino 2025-04-25 15:54:51 MDT
(In reply to Adam from comment #6)
> Any progress on this, by the time we have any movement on these SQL queries,
> we'll be at 24.11.6 it seems.
> 
> This query is just one of a few, the startup time of SlurmDB is also very
> long due to other questionable queries during the startup process, but, this
> specific query is of more concern.
Hi Adam. Sorry this has been delayed. As you point out, there are other long queries--we are looking at one in particular (it's in _get_user_coords()) that is delaying startup for sites with larger assoc tables. That query also does a 2-way join on the assoc table and we are eliminating the join so it's much faster (being internally reviewed).

As I mentioned earlier, I have a fix for the delete issue but it's still not quite ready for internal review and may change a bit depending on how things go with this other one. So, hopefully not too much longer till the delete issue can get some attention and possible improvement from what we've learned. I'll give an update soon when we are certain of the direction we are going since it's is an important issue also, especially when the assoc table is larger.