Ticket 16541

Summary: can't delete user account since slurm upgrade
Product: Slurm Reporter: Wei Feinstein <wfeinstein>
Component: DatabaseAssignee: Chad Vizino <chad>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 22.05.6   
Hardware: Linux   
OS: Linux   
Site: UC Berkley Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurmdbd.log
mysqldump

Description Wei Feinstein 2023-04-17 16:10:38 MDT
Dear support team,

We recently upgraded slurm from 20.11.08 to 22.05.6 on the Savio Cluster. Since then we have had troubles deleting users from slurmDB with the following error messages:

sacctmgr: accounting_storage/slurmdbd: acct_storage_p_remove_assocs: No error
 Nothing deleted

Please advise how to debug.

Thank you,
wei
Comment 1 Chad Vizino 2023-04-17 16:37:17 MDT
(In reply to Wei Feinstein from comment #0)
> We recently upgraded slurm from 20.11.08 to 22.05.6 on the Savio Cluster.
> Since then we have had troubles deleting users from slurmDB with the
> following error messages:
> 
> sacctmgr: accounting_storage/slurmdbd: acct_storage_p_remove_assocs: No error
>  Nothing deleted
Hi. Would you please supply your whole slurmdbd.log file covering when you ran the sacctmgr?

Also would you please run this over your current and previous (20.11.x) slurmdbd.log files and supply the output?
>$ grep -i "MySQL server version is:" <slurmdbd-log-files>
Comment 2 Wei Feinstein 2023-04-17 16:49:23 MDT
Created attachment 29881 [details]
slurmdbd.log
Comment 3 Wei Feinstein 2023-04-17 16:50:11 MDT
Hi Chad,

[root@master ~]# grep -i "MySQL server version is:"  /var/log/slurm/slurmdbd.log 
[2023-04-05T20:25:28.815] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 10.3.35-MariaDB
[2023-04-05T20:26:33.946] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 10.3.35-MariaDB
[2023-04-05T20:36:32.397] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 10.3.35-MariaDB
[2023-04-06T09:16:32.297] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 10.3.35-MariaDB

Thank you,

Wei
Comment 4 Chad Vizino 2023-04-17 17:10:23 MDT
Thanks. I believe this is due to a known MariaDB issue (described in detail in bug 13562 comment 21 if you care to read it). There's an easier way to resolve this than what is listed in that ticket and will give that to you but just want to confirm something first.

Would you please run and attach the output from the following (you may need to add a username and password (see your slurmdbd.conf file)--also am assuming your StorageLoc is either the default or is set to slurm_acct_db in your slurmdbd.conf):

>$ mysqldump --no-data slurm_acct_db > <output-file>
This will dump your slurm accounting db schema only (no data) and the output should be only around 1k lines.
Comment 5 Wei Feinstein 2023-04-17 17:23:29 MDT
Created attachment 29882 [details]
mysqldump
Comment 6 Wei Feinstein 2023-04-17 17:24:13 MDT
Yes, StorageLoc is default.

[root@master ~]# grep -i StorageLoc /etc/slurm/slurm.conf 
#AccountingStorageLoc=
Comment 7 Chad Vizino 2023-04-18 07:53:17 MDT
(In reply to Wei Feinstein from comment #5)
> Created attachment 29882 [details]
> mysqldump
Thanks. From your assoc table schema the dump lists:

>CREATE TABLE `0-a0-d1-ec-bc-c_assoc_table` (
> ...
>  `max_tres_pj` text NOT NULL,
> ...
What we want is this for max_tres_pj:

>  `max_tres_pj` text NOT NULL DEFAULT '',
To resolve (for this field and others like it in other tables), follow these steps:

* Shutdown slurmctld and slurmdbd

* Drop the table_defs_table:
>$ mysql -D slurm_acct_db -e "drop table table_defs_table"
* Restart slurmdbd and then slurmctld

Slurmdbd will rebuild the table_defs_table and make the necessary corrections to the affected text fields in the slurm_acct_db database. You should then be able to delete users with sacctmgr and no further action is needed (above procedure is only needed once).
Comment 8 Wei Feinstein 2023-04-18 08:45:41 MDT
Hi Chad,

Thank you!

Followed your instructions, deleting users from DB is allowed! 

I read the bug 13562 comment 21 you included, still not clear to me. Can you briefly please explain why deleting users were not permitted, then dropping 
table_defs_table made the operations possible?

MariaDB [slurm_acct_db]> drop table table_defs_table;

Thank you so very much,
Wei
Comment 9 Wei Feinstein 2023-04-18 09:15:10 MDT
Also we didn't upgrade MariaDB this time, why did the table "default" issue come up now?

Wei
Comment 10 Chad Vizino 2023-04-18 11:38:30 MDT
(In reply to Wei Feinstein from comment #8)
> Followed your instructions, deleting users from DB is allowed! 
Great!

> I read the bug 13562 comment 21 you included, still not clear to me. Can you
> briefly please explain why deleting users were not permitted, then dropping 
> table_defs_table made the operations possible?
I'll include an excerpt from that comment for context:

>...the tables (ex. assoc table) that contain fields that were created with “not null default ‘’” actually have
>a NULL default. When an attempt is made to set a DEFAULT value on one of the affected fields, an error is 
>produced (what you are seeing). The workaround is to alter the default value for these fields which changes 
>them from NULL to ‘’ (see below).
Basically, sacctmgr causes sql to be generated and sent to MariaDB that attempts to set a DEFAULT value on a field in the assoc table and this produces an error causing the sacctmgr operation (deleting a user for you) to fail.

The table_defs_table holds the schema for all the slurm tables and it's checked at slurmdbd startup to see if any tables need updating. This table holds the schema the tables were orgininally created with. Deleting this table effectively forces slurmdbd to make any updates necessary (from schema imbedded in the code) and then write the updates into this table so they're not repeated next time.

This leads to your next question from comment 9:
>Also we didn't upgrade MariaDB this time, why did the table "default" issue come up now?
I'm not sure. Most sites that have encountered this problem have upgraded their MariaDB from a version earlier than 10.2 to a 10.2 version or later (or from any version of MySQL). It could also happen if the slurm database is restored from a dump from the past when an older version (or MySQL) was in use. I was trying to learn more about why this happened to you from the questions I asked earlier but still am not sure.

I will also say that we've automated a fix staring in 22.05.7 that basically looks to see if this condition is present and then takes appropriate action at slurmdbd startup. Unfortunately, you are one minor version (.6) shy so it wasn't caught.

For now I'm glad things are working for you and will close this ticket.