Ticket 5662

Summary: "Message too long" during COMMIT
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: slurmdbd Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: akkornel, alex, brian
Version: 18.08.0   
Hardware: Linux   
OS: Linux   
Site: Stanford

Description Kilian Cavalotti 2018-08-31 19:09:01 MDT
Hi there, 


Still in the process of converting our 17.11.8 Slurm DB to the 18.08.0 format.

It's been happily chugging along for about 2 hours, adding columns to the _step_table table, and then populating them, but then it stopped with this message:


Aug 31 17:28:42 sh-sl01.int slurmdbd[170926]: error: mysql_query failed: 1180 Got error 90 "Message too long" during COMMIT
                                              create table if not exists table_defs_table (creation_time int unsigned not null, mod_time int unsigned default 0 not null, table_name text not null, definition text not null, primary key (table_name(50))) engine='innodb'


I tried to run that query by hand, and it worked fine and created the table correctly. Any idea what could have caused that issue?

The frustrating thing is that restarting slurmdbd now re-updates the _step_table, which in our case holds close to 25 million jobs, so that takes some time...


Thanks,
--
Kilian
Comment 1 Kilian Cavalotti 2018-09-01 09:39:45 MDT
Hi, 

We tried starting slurmdbd again, and it failed the same way, in the same place, after more than 2 hours (re-)adding records in the <cluster>_step_table.

I'm not sure why the "CREATE TABLE" query stalls and fails when executed by slurmdbd, while it succeeds when run in the MariaDB CLI. 

For background info, we use MariaDB in a Galera multi-node cluster. This info could certainly be of interest: https://www.percona.com/blog/2015/10/26/how-big-can-your-galera-transactions-be/ and seems to point towards excessively large transactions. Maybe there could be a way to split them up into smaller chunks?
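To illustrate the idea of splitting the work into smaller chunks, here is a generic sketch (NOT Slurm's actual conversion code, and using sqlite3 as a stand-in for MariaDB) of committing a bulk table update in fixed-size batches, so that no single transaction approaches a limit like Galera's writeset cap:

```python
# Hypothetical sketch: convert rows in small per-chunk transactions
# instead of one giant COMMIT. Table and column names are made up.
import sqlite3  # stand-in for MariaDB; the chunking pattern is the same

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE step_table (id INTEGER PRIMARY KEY, tres TEXT)")
conn.executemany("INSERT INTO step_table (id) VALUES (?)",
                 [(i,) for i in range(10000)])
conn.commit()

CHUNK = 1000  # rows per transaction; tune to stay under the writeset limit
rows = [r[0] for r in conn.execute("SELECT id FROM step_table")]
for start in range(0, len(rows), CHUNK):
    ids = rows[start:start + CHUNK]
    conn.executemany("UPDATE step_table SET tres = 'converted' WHERE id = ?",
                     [(i,) for i in ids])
    conn.commit()  # one small transaction per chunk, not one huge one

converted = conn.execute(
    "SELECT COUNT(*) FROM step_table WHERE tres = 'converted'").fetchone()[0]
print(converted)  # 10000
```

Each chunk replicates as its own writeset, so the per-transaction size stays bounded regardless of how many rows the conversion touches in total.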

Anyway, this is currently preventing our cluster from being upgraded to 18.08.0. We currently have a scheduled datacenter downtime until Monday afternoon, so we can't try anything right now, but it would be great to have a plan so we can move forward with the upgrade when the power comes back on Monday.

Any guidance will be much appreciated.

Thanks!
--
Kilian
Comment 2 Alejandro Sanchez 2018-09-03 07:19:22 MDT
Hi Kilian. I've still not looked at this much, but as a starting point could you please attach full slurmdbd.log and check your current wsrep_max_ws_size option value?
Comment 3 Kilian Cavalotti 2018-09-03 09:27:18 MDT
Hi Alejandro,

(In reply to Alejandro Sanchez from comment #2)
> Hi Kilian. I've still not looked at this much, but as a starting point could
> you please attach full slurmdbd.log and check your current wsrep_max_ws_size
> option value?

I don't have the log handy right now since the cluster is down for a datacenter maintenance, but wsrep_max_ws_size is not set, which makes it take the default and maximum value of 2GB, I believe.

Cheers,
-- 
Kilian
Comment 4 Kilian Cavalotti 2018-09-03 18:06:01 MDT
> (In reply to Alejandro Sanchez from comment #2)
> > Hi Kilian. I've still not looked at this much, but as a starting point could
> > you please attach full slurmdbd.log and check your current wsrep_max_ws_size
> > option value?
> 
> I don't have the log handy right now since the cluster is down for a
> datacenter maintenance, 

Here's the slurmdbd log, which covers the first start with 18.08.0:


Aug 31 13:55:38 sh-sl01 yum[170787]: Updated: slurm-slurmdbd-18.08.0-1.el7.x86_64
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: error: Database settings not recommended values: innodb_buffer_pool_size
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: pre-converting step table for sherlock
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_in_ave after tres_alloc in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_in_max after tres_usage_in_ave in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_in_max_taskid after tres_usage_in_max in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_in_max_nodeid after tres_usage_in_max_taskid in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_in_min after tres_usage_in_max_nodeid in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_in_min_taskid after tres_usage_in_min in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_in_min_nodeid after tres_usage_in_min_taskid in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_in_tot after tres_usage_in_min_nodeid in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_out_ave after tres_usage_in_tot in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_out_max after tres_usage_out_ave in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_out_max_taskid after tres_usage_out_max in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_out_max_nodeid after tres_usage_out_max_taskid in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_out_min after tres_usage_out_max_nodeid in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_out_min_taskid after tres_usage_out_min in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_out_min_nodeid after tres_usage_out_min_taskid in table "sherlock_step_table"
Aug 31 13:56:03 sh-sl01 slurmdbd[170926]: adding column tres_usage_out_tot after tres_usage_out_min_nodeid in table "sherlock_step_table"
Aug 31 16:56:46 sh-sl01 slurmdbd[170926]: pre-converting job table for sherlock
Aug 31 17:28:42 sh-sl01 slurmdbd[170926]: error: mysql_query failed: 1180 Got error 90 "Message too long" during COMMIT#012create table if not exists table_defs_table (creation_time int unsigned not null, mod_time int unsigned default 0 not null, table_name text not null, definition text not null, primary key (table_name(50))) engine='innodb'
Aug 31 17:28:42 sh-sl01 slurmdbd[170926]: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
Aug 31 17:28:42 sh-sl01 slurmdbd[170926]: error: cannot create accounting_storage context for accounting_storage/mysql
Aug 31 17:28:42 sh-sl01 slurmdbd[170926]: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
Aug 31 17:28:42 sh-sl01 systemd: slurmdbd.service: main process exited, code=exited, status=1/FAILURE
Aug 31 17:28:42 sh-sl01 systemd: Unit slurmdbd.service entered failed state.
Aug 31 17:28:42 sh-sl01 systemd: slurmdbd.service failed.
Aug 31 17:47:29 sh-sl01 systemd: PID file /var/run/slurmdbd.pid not readable (yet?) after start.
Aug 31 17:47:29 sh-sl01 slurmdbd[171721]: error: Database settings not recommended values: innodb_buffer_pool_size
Aug 31 17:47:29 sh-sl01 slurmdbd[171721]: pre-converting step table for sherlock
Aug 31 17:52:10 sh-sl01 systemd: PID file /var/run/slurmdbd.pid not readable (yet?) after start.
Aug 31 17:52:10 sh-sl01 slurmdbd[171888]: error: Database settings not recommended values: innodb_buffer_pool_size
Aug 31 17:52:10 sh-sl01 slurmdbd[171888]: pre-converting step table for sherlock
Aug 31 18:01:30 sh-sl01 slurmdbd[171888]: error: Could not execute statement 1213 Deadlock found when trying to get lock; try restarting transaction
Aug 31 19:51:52 sh-sl01 slurmdbd[171888]: pre-converting job table for sherlock
Aug 31 20:07:01 sh-sl01 slurmdbd[171888]: error: mysql_query failed: 2013 Lost connection to MySQL server during query#012create table if not exists table_defs_table (creation_time int unsigned not null, mod_time int unsigned default 0 not null, table_name text not null, definition text not null, primary key (table_name(50))) engine='innodb'
Aug 31 20:07:01 sh-sl01 slurmdbd[171888]: Accounting storage MYSQL plugin failed


> but wsrep_max_ws_size is not set, which makes it
> take the default and maximum value of 2GB, I believe.

I confirm:

MariaDB [slurm_acct_db]> SHOW VARIABLES LIKE 'wsrep_max_ws_size';
+-------------------+------------+
| Variable_name     | Value      |
+-------------------+------------+
| wsrep_max_ws_size | 2147483647 |
+-------------------+------------+
Comment 5 Kilian Cavalotti 2018-09-03 18:53:41 MDT
Tried again, with a single node in the Galera cluster to avoid issues related to replication, and it still fails in the same place; MariaDB logs this when creating the table:

Sep  3 17:45:54 sh-sl01 mysqld: 2018-09-03 17:45:54 139828804291328 [Warning] WSREP: transaction size limit (2147483647) exceeded: 2147483648
Sep  3 17:45:55 sh-sl01 mysqld: 2018-09-03 17:45:55 139828804291328 [ERROR] WSREP: rbr write fail, data_len: 0, 2

Any idea how to bypass this limit, or slice the transaction in smaller chunks?

This is currently holding off our ability to restore our system to production after the maintenance.

Thanks,
-- 
Kilian
Comment 6 Kilian Cavalotti 2018-09-03 22:38:14 MDT
(In reply to Kilian Cavalotti from comment #5)
> This is currently holding off our ability to restore our system to
> production after the maintenance.

We're moving back to 17.11.8.
Comment 7 Alejandro Sanchez 2018-09-04 04:05:39 MDT
I'm tempted to suggest setting the following values which seem to alleviate such error for other people:

mysql> set global wsrep_max_ws_size=4294901759;
mysql> set global binlog_row_image=minimal;

Although I'm still studying the conversion logic.
Comment 8 Kilian Cavalotti 2018-09-04 08:36:57 MDT
(In reply to Alejandro Sanchez from comment #7)
> I'm tempted to suggest setting the following values which seem to alleviate
> such error for other people:
> 
> mysql> set global wsrep_max_ws_size=4294901759;
> mysql> set global binlog_row_image=minimal;
> 
> Although I'm still studying the conversion logic.

Thanks.
I tried to set binlog_row_image=minimal yesterday before our last conversion try, but it didn't help.

And wsrep_max_ws_size maximum value is 2GB, apparently:
https://mariadb.com/kb/en/library/galera-cluster-system-variables/#wsrep_max_ws_size

Thanks,
-- 
Kilian
Comment 9 Brian Christiansen 2018-09-04 13:47:09 MDT
Hey Kilian, would it be possible to share the dump of the database so that we could try it locally? I can send you a link to a place that you can upload it to, if that's something you can do.

Thanks,
Brian
Comment 11 Brian Christiansen 2018-09-04 16:23:46 MDT
Thanks Kilian. I'm able to download it. I've made the comment with the link private so that it's not public to the world. We'll let you know how it goes.
Comment 12 Alejandro Sanchez 2018-09-05 09:26:16 MDT
Your database copy is still being imported locally... Although unrelated, while we're here, you might consider increasing how often you purge jobs and/or steps. From the last sherlock slurmdbd.conf copy we have

PurgeJobAfter=36month
PurgeStepAfter=36month
Comment 13 Kilian Cavalotti 2018-09-05 09:31:29 MDT
(In reply to Alejandro Sanchez from comment #12)
> Your database copy is still being imported locally... Although unrelated and
> while here, you might consider increasing how often you purge job and/or
> steps. From the last sherlock slurmdbd.conf copy we have
> 
> PurgeJobAfter=36month
> PurgeStepAfter=36month

Yup, I was considering this and thought we had everything set to 12 months, but apparently not. I'll make the change, thanks for pointing it out.

Cheers,
-- 
Kilian
Comment 18 Alejandro Sanchez 2018-09-06 11:02:54 MDT
Kilian, will you show the value of these MariaDB options? Thanks.

MariaDB [(none)]> show variables like 'innodb_buffer_pool_size';
MariaDB [(none)]> show variables like 'innodb_log_file_size';
MariaDB [(none)]> show variables like 'innodb_lock_wait_timeout';
Comment 19 Kilian Cavalotti 2018-09-06 11:12:48 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #18)
> Kilian, will you show the value of these MariaDB options? Thanks.

Sure, here they are:

MariaDB [slurm_acct_db]> show variables like 'innodb_buffer_pool_size';
+-------------------------+-------------+
| Variable_name           | Value       |
+-------------------------+-------------+
| innodb_buffer_pool_size | 34359738368 |
+-------------------------+-------------+
1 row in set (0.00 sec)

MariaDB [slurm_acct_db]> show variables like 'innodb_log_file_size';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| innodb_log_file_size | 67108864 |
+----------------------+----------+
1 row in set (0.00 sec)

MariaDB [slurm_acct_db]> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 900   |
+--------------------------+-------+
1 row in set (0.00 sec)

Cheers,
-- 
Kilian
Comment 22 Alejandro Sanchez 2018-09-07 08:41:17 MDT
Hi Kilian. While we're finally performing the upgrade with your database on a machine with enough memory, we're guessing that if you change the purge settings to an hourly rate:

PurgeStepAfter=8760hours
PurgeJobAfter=8760hours

you'll hopefully be able to successfully finish the process. In any case, this afternoon we should have results from the upgrade on the copy of your database.
Comment 24 Kilian Cavalotti 2018-09-07 08:50:29 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #22)
> Hi Kilian. While we're finally performing the upgrade with your database in
> a machine with enough memory, we're guessing that if you change the Purge at
> an hourly rate to:
> 
> PurgeStepAfter=8760hours
> PurgeJobAfter=8760hours

Uh, interesting. I guess I'm not familiar enough with the purging process to understand the difference between setting those values to 8760 hours and 12 months: they're the same duration so what does it change? The frequency at which those purges are done?

Am I right to assume that having PurgeStepAfter=8760hours will make slurmdbd purge 1-hour worth of jobs older than 12 months every hour, while having it at 12month will make it purge 1-month of old jobs every month?


> you'll hopefully be able to successfully finish the process. I any case this
> afternoon we should have results from the upgrade on the copy of your
> database.

Great, thanks!

Cheers,
-- 
Kilian
Comment 25 Alejandro Sanchez 2018-09-07 08:58:29 MDT
(In reply to Kilian Cavalotti from comment #24)
> Uh, interesting. I guess I'm not familiar enough with the purging process to
> understand the difference between setting those values to 8760 hours and 12
> months: they're the same duration so what does it change? The frequency at
> which those purges are done?

Correct.
 
> Am I right to assume that having PurgeStepAfter=8760hours will make slurmdbd
> purge 1-hour worth of jobs older than 12 months every hour, while having it
> at 12month will make it purge 1-month of old jobs every month?

PurgeStepAfter=8760hours will make the purge process be attempted at the beginning of every hour, and all steps older than 8760 hours (equivalent to 12 months) will be purged.

PurgeStepAfter=12month will make the purge process be attempted at the beginning of every month, and all steps older than 12 months will be purged.

As stated in slurmdbd.conf man page:

The purge takes place at the start of each purge interval.  For example, if the purge time is 2 months, the purge would happen at the beginning of each month.
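The scheduling rule above can be sketched in a few lines of Python. This is only an illustration of the described semantics, not slurmdbd code; the function name and the two-unit simplification are mine:

```python
# Hypothetical simplification of slurmdbd purge scheduling: the purge is
# attempted at the start of each interval unit ("hours" -> every hour,
# "month" -> every month), removing records older than the configured span.
from datetime import datetime, timedelta

def next_purge_run(now: datetime, unit: str) -> datetime:
    """Start of the next purge attempt for an 'hours' or 'month' setting."""
    if unit == "hours":
        # attempted at the top of every hour
        return (now.replace(minute=0, second=0, microsecond=0)
                + timedelta(hours=1))
    if unit == "month":
        # attempted at the beginning of every month
        if now.month == 12:
            year, month = now.year + 1, 1
        else:
            year, month = now.year, now.month + 1
        return now.replace(year=year, month=month, day=1,
                           hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(unit)

now = datetime(2018, 9, 7, 8, 58)
print(next_purge_run(now, "hours"))  # 2018-09-07 09:00:00
print(next_purge_run(now, "month"))  # 2018-10-01 00:00:00
```

So an hours-based setting purges a small sliver of old records every hour, while a month-based setting accumulates a whole month of expired records and deletes them in one much larger pass.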
Comment 26 Kilian Cavalotti 2018-09-07 09:01:37 MDT
(In reply to Alejandro Sanchez from comment #25)
> As stated in slurmdbd.conf man page:
> 
> The purge takes place at the start of each purge interval.  For example, if
> the purge time is 2 months, the purge would happen at the beginning of each
> month.

Got it, thanks for the clarification. 
Change applied!

Cheers,
-- 
Kilian
Comment 29 Kilian Cavalotti 2018-09-18 10:42:41 MDT
Hi there, 

I was wondering if you had any chance to reproduce the problem with the dump of our database?

Thanks,
-- 
Kilian
Comment 30 Alejandro Sanchez 2018-09-18 10:50:59 MDT
I got some unrelated errors:

alex@knl2:~/slurm/upgrade/knl2/log$ grep error slurmdbd.log 
[2018-09-07T06:10:10.559] error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout
[2018-09-07T07:25:39.347] error: Could not execute statement 1206 The total number of locks exceeds the lock table size
[2018-09-07T08:31:31.572] error: Could not execute statement 1206 The total number of locks exceeds the lock table size
[2018-09-07T09:41:30.887] error: Could not execute statement 1206 The total number of locks exceeds the lock table size
[2018-09-07T10:48:48.888] error: Could not execute statement 1206 The total number of locks exceeds the lock table size
alex@knl2:~/slurm/upgrade/knl2/log$ tail -n2 slurmdbd.log 
[2018-09-10T04:00:00.408] debug2: Got 1 of 1 rolled up
[2018-09-10T04:00:00.408] debug2: Everything rolled up
alex@knl2:~/slurm/upgrade/knl2/log$

But the upgrade finished:

alex@knl2:~/slurm/upgrade/knl2/log$ grep -ri conversion slurmdbd.log
[2018-09-07T11:35:08.681] Conversion done: success!

We think that when we manage to purge some of your jobs/steps properly you will be able to upgrade. Unfortunately, we also have problems with your purge process, as tracked in bug 5689.
Comment 31 Kilian Cavalotti 2018-09-28 14:33:18 MDT
(In reply to Alejandro Sanchez from comment #30)
> I got some unrelated errors:
> But the upgrade finished:
> 
> We thin when we manage to purge some of your job/steps properly you will be
> able to upgrade. Unfortunately, we also have problems with your purge
> process as tracked in bug 5689.

Mmmh, so I finally managed to import a dump of our thinned-out database (down to 6 months) in a test environment and tried the upgrade again. 

I unfortunately got the exact same error after 4h30 of DB conversion:

slurmdbd: pre-converting job table for sherlock
slurmdbd: error: mysql_query failed: 1180 Got error 90 "Message too long" during COMMIT
create table if not exists table_defs_table (creation_time int unsigned not null, mod_time int unsigned default 0 not null, table_name text not null, definition text not null, primary key (table_name(50))) engine='innodb'
slurmdbd: Accounting storage MYSQL plugin failed
slurmdbd: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

Not sure how to proceed here...

Cheers,
--
Kilian
Comment 32 Alejandro Sanchez 2018-10-01 09:10:30 MDT
Is the test environment also backed by a Galera cluster? I'm starting to wonder whether using Galera or not makes a difference, since we managed to upgrade your database copy locally...
Comment 33 Kilian Cavalotti 2018-10-01 09:19:19 MDT
(In reply to Alejandro Sanchez from comment #32)
> Is the test environment also backed by a Galera cluster? I'm starting to
> wonder whether using Galera or not makes a difference, since we managed to
> upgrade your database copy locally...

Yes it is, I wanted to make the test environment as similar to the production system as possible. And yes, I also think that the Galera layer is significant in this case.

I will try to start the DB conversion without the Galera component, and if it works, we can envision a workaround procedure where we stop the slurmdbds, restart MariaDB without Galera, start slurmdbd to convert the DB, and then, when it's done, shut down slurmdbd again, re-enable Galera and restart everything.
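That procedure could be sketched as the following dry-run script. It only prints the steps; the unit names, config path, and the idea of toggling wsrep_on in galera.cnf are assumptions to be adapted per site before anything is actually run:

```shell
#!/bin/sh
# Dry-run sketch of the Galera-bypass upgrade path described above.
# Unit names, config paths, and the wsrep_on toggle are assumptions.
run() { echo "+ $*"; }            # print each step instead of executing it

run systemctl stop slurmdbd       # on every slurmdbd host
run systemctl stop mariadb        # on every Galera node
# Disable Galera replication (assumed knob: wsrep_on in galera.cnf)
run sed -i 's/^wsrep_on=ON/wsrep_on=OFF/' /etc/my.cnf.d/galera.cnf
run systemctl start mariadb       # standalone, no replication
run systemctl start slurmdbd      # performs the 18.08 DB conversion
# ...wait for the conversion to finish in slurmdbd.log...
run systemctl stop slurmdbd
run systemctl stop mariadb
run sed -i 's/^wsrep_on=OFF/wsrep_on=ON/' /etc/my.cnf.d/galera.cnf
run systemctl start mariadb       # rejoin / re-bootstrap the Galera cluster
run systemctl start slurmdbd
```

The other Galera nodes would then resync (via SST/IST) from the converted node when the cluster is brought back up.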

But we would obviously prefer to understand why the DB conversion is failing with Galera, and possibly fix the problem. Because I think that having a Galera cluster as an underlying DB layer for slurmdbd would not be an uncommon case.

Thanks,
-- 
Kilian
Comment 34 Alejandro Sanchez 2018-10-04 02:55:50 MDT
Hey Kilian. Did you manage to proceed with the conversion without Galera?

I'm lowering the severity a bit, to 3, if that's not an issue.
Comment 35 Kilian Cavalotti 2018-10-04 09:09:11 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #34)
> Hey Kilian. Did you manage to proceed with the conversion without Galera?

Not yet, unfortunately, I've been a bit distracted by a lot of other things, but I'll give it another go today. And I'll keep you posted.

> I'm lowering a bit the severity to 3 if that's not an issue.

Absolutely, since we're staying with 17.11 for the time being, this is perfectly fine.

Thanks!
-- 
Kilian
Comment 36 Kilian Cavalotti 2018-10-04 18:52:31 MDT
(In reply to Kilian Cavalotti from comment #35)
> > Hey Kilian. Did you manage to proceed with the conversion without Galera?
> 
> Not yet, unfortunately, I've been a bit distracted by a lot of other things,
> but I'll give it another go today. And I'll keep you posted.

I finally got to it, and I can confirm that running MariaDB without the Galera components allowed the upgrade to proceed to completion:

slurmdbd: debug2: running rollup at Thu Oct 04 17:00:00 2018
slurmdbd: debug2: No need to roll cluster sherlock this day 1538636400 <= 1538636400
slurmdbd: debug2: No need to roll cluster sherlock this month 1538377200 <= 1538377200
slurmdbd: debug2: Got 1 of 1 rolled up
slurmdbd: debug2: Everything rolled up


So I guess we now have an upgrade path, but it would still be great to understand why the transactions fail with Galera.

Cheers,
-- 
Kilian
Comment 37 Alejandro Sanchez 2018-10-05 04:29:11 MDT
(In reply to Kilian Cavalotti from comment #36)
> I finally got to it, and I can confirm that running MariaDB without the
> Glaera components allowed the update to proceed to completion:
> 
> slurmdbd: debug2: running rollup at Thu Oct 04 17:00:00 2018
> slurmdbd: debug2: No need to roll cluster sherlock this day 1538636400 <=
> 1538636400
> slurmdbd: debug2: No need to roll cluster sherlock this month 1538377200 <=
> 1538377200
> slurmdbd: debug2: Got 1 of 1 rolled up
> slurmdbd: debug2: Everything rolled up

Sounds promising. I guess you're gonna try the upgrade on the production cluster using the strategy you mentioned.
 
> So I guess we now have an upgrade path, but it would still be great to
> understand why the transaction fail with Galera.

I'm gonna try to install a Galera cluster and retry the upgrade with your db copy on it, and see if I can reproduce and find the potential offending parameter or issue. Lowering the severity to 4 now since I expect your production upgrade to succeed.
Comment 38 Alejandro Sanchez 2018-10-05 05:09:29 MDT
Actually, may I know which MariaDB/Galera versions you are using?
Comment 39 Alejandro Sanchez 2018-10-05 06:37:47 MDT
Also, the my.cnf (with sensitive options reworded) might be useful, since there are a bunch of Galera-specific params that can be decisive in triggering the error. At least these:

MariaDB [(none)]> show status like 'wsrep%';
The value of:
binlog_format
default_storage_engine
innodb_autoinc_lock_mode
innodb_doublewrite
wsrep_on
innodb_flush_log_at_trx_commit

Also, not sure if you can retrieve the MariaDB logs (regular and/or error ones) to see the specific message when it failed, like:

[ERROR] WSREP: Maximum writeset size exceeded by XXXX: 90 (Message too long)

or similar that could point to anything relevant. In any case, this looks more like a DB admin / Galera issue than a Slurm one.
Comment 40 Kilian Cavalotti 2018-10-05 11:03:20 MDT
Hi Alejandro, 

We're using MariaDB 10.1.36, and Galera 25.3.23

(In reply to Alejandro Sanchez from comment #39)
> Also the my.cnf with worded sensitive options might be useful since there
> are a bunch of potential Galera-specific params that can be decisive to
> trigger the error. 

Here are the options we have in our config file:

-- 8< -------------------------------------------------
# cat /etc/my.cnf.d/galera.cnf
[mysqld]
bind-address=0.0.0.0
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
innodb_buffer_pool_size=2GB
innodb_buffer_pool_size=128M
innodb_lock_wait_timeout=900

tmpdir=/var/lib/mysql/.tmp

# Galera Provider Configuration
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name="test_slurm_cluster"
wsrep_cluster_address="gcomm://sherlock-slurm01,sherlock-slurm02"
wsrep_sst_method=rsync
wsrep_node_address="10.210.31.20"
wsrep_node_name="sherlock-slurm01"
wsrep_max_ws_size=4294901759
-- 8< -------------------------------------------------


The actual config values:
-- 8< -------------------------------------------------
MariaDB [(none)]> show status like 'wsrep%';
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| wsrep_apply_oooe             | 0.000000                             |
| wsrep_apply_oool             | 0.000000                             |
| wsrep_apply_window           | 0.000000                             |
| wsrep_causal_reads           | 0                                    |
| wsrep_cert_deps_distance     | 0.000000                             |
| wsrep_cert_index_size        | 0                                    |
| wsrep_cert_interval          | 0.000000                             |
| wsrep_cluster_conf_id        | 12                                   |
| wsrep_cluster_size           | 2                                    |
| wsrep_cluster_state_uuid     | 9fd9b841-c299-11e8-bb45-6a5bd809fa6b |
| wsrep_cluster_status         | Primary                              |
| wsrep_commit_oooe            | 0.000000                             |
| wsrep_commit_oool            | 0.000000                             |
| wsrep_commit_window          | 0.000000                             |
| wsrep_connected              | ON                                   |
| wsrep_desync_count           | 0                                    |
| wsrep_evs_delayed            |                                      |
| wsrep_evs_evict_list         |                                      |
| wsrep_evs_repl_latency       | 0/0/0/0/0                            |
| wsrep_evs_state              | OPERATIONAL                          |
| wsrep_flow_control_paused    | 0.000000                             |
| wsrep_flow_control_paused_ns | 0                                    |
| wsrep_flow_control_recv      | 0                                    |
| wsrep_flow_control_sent      | 0                                    |
| wsrep_gcomm_uuid             | 7e891008-c8ba-11e8-b09e-da972ea198ae |
| wsrep_incoming_addresses     | 10.210.31.20:3306,10.210.31.48:3306  |
| wsrep_last_committed         | 24189                                |
| wsrep_local_bf_aborts        | 0                                    |
| wsrep_local_cached_downto    | 18446744073709551615                 |
| wsrep_local_cert_failures    | 0                                    |
| wsrep_local_commits          | 0                                    |
| wsrep_local_index            | 0                                    |
| wsrep_local_recv_queue       | 0                                    |
| wsrep_local_recv_queue_avg   | 0.000000                             |
| wsrep_local_recv_queue_max   | 1                                    |
| wsrep_local_recv_queue_min   | 0                                    |
| wsrep_local_replays          | 0                                    |
| wsrep_local_send_queue       | 0                                    |
| wsrep_local_send_queue_avg   | 0.500000                             |
| wsrep_local_send_queue_max   | 2                                    |
| wsrep_local_send_queue_min   | 0                                    |
| wsrep_local_state            | 4                                    |
| wsrep_local_state_comment    | Synced                               |
| wsrep_local_state_uuid       | 9fd9b841-c299-11e8-bb45-6a5bd809fa6b |
| wsrep_protocol_version       | 8                                    |
| wsrep_provider_name          | Galera                               |
| wsrep_provider_vendor        | Codership Oy <info@codership.com>    |
| wsrep_provider_version       | 25.3.23(r3789)                       |
| wsrep_ready                  | ON                                   |
| wsrep_received               | 2                                    |
| wsrep_received_bytes         | 232                                  |
| wsrep_repl_data_bytes        | 0                                    |
| wsrep_repl_keys              | 0                                    |
| wsrep_repl_keys_bytes        | 0                                    |
| wsrep_repl_other_bytes       | 0                                    |
| wsrep_replicated             | 0                                    |
| wsrep_replicated_bytes       | 0                                    |
| wsrep_thread_count           | 2                                    |
+------------------------------+--------------------------------------+


> +
> The value of:
> binlog_format
> default_storage_engine
> innodb_autoinc_lock_mode
> innodb_doublewrite
> wsrep_on
> innodb_flush_log_at_trx_commit


+--------------------------------+--------+
| Variable_name                  | Value  |
+--------------------------------+--------+
| binlog_format                  | ROW    |
| default_storage_engine         | InnoDB |
| innodb_autoinc_lock_mode       | 2      |
| innodb_doublewrite             | ON     |
| innodb_flush_log_at_trx_commit | 1      |
| wsrep_on                       | ON     |
+--------------------------------+--------+

> Also not sure if you can retrieve the MariaDB logs (regular and/or error
> ones) to see the specific message when it failed, like:
> 
> [ERROR] WSREP: Maximum writeset size exceeded by XXXX: 90 (Message too long)

The only thing it logged was this:

2018-09-28 12:17:46 140549635750656 [Warning] WSREP: transaction size limit (2147483647) exceeded: 2147483648
2018-09-28 12:17:46 140549635750656 [ERROR] WSREP: rbr write fail, data_len: 0, 2

> or similar that could point to anything relevant. In any case, this looks
> more like a DB Admin / Galera issue than a Slurm one.

Agreed. Although maybe there is a way to reduce the size of the transactions during the DB upgrade to avoid triggering this limit?

Cheers,
-- 
Kilian
Comment 41 Alejandro Sanchez 2018-10-08 10:12:11 MDT
(In reply to Kilian Cavalotti from comment #40)
> Agreed. Although there may be way to reduce the size of the transactions
> during the DB upgrade to avoid triggering this limit?

I've been able to reproduce with Galera setup and similar options to yours:

alex@knl2:~/slurm/upgrade/knl2/log$ tail -f slurmdbd.log
...
[2018-10-08T09:01:41.677] error: mysql_query failed: 1180 Got error 90 "Message too long" during COMMIT
create table if not exists table_defs_table (creation_time int unsigned not null, mod_time int unsigned default 0 not null, table_name text not null, definition text not null, primary key (table_name(50))) engine='innodb'
[2018-10-08T09:01:41.677] Accounting storage MYSQL plugin failed
...

But I don't see these logged in my mariadb logs:

2018-09-28 12:17:46 140549635750656 [Warning] WSREP: transaction size limit (2147483647) exceeded: 2147483648
2018-09-28 12:17:46 140549635750656 [ERROR] WSREP: rbr write fail, data_len: 0, 2

as you reported.

I'm now trying the upgrade again, for the Nth time, with wsrep_max_ws_size increased to 8GB. The documentation states that the default and maximum value is 2GB, but according to this

https://www.percona.com/blog/2015/10/26/how-big-can-your-galera-transactions-be/

"You can actually set this higher than 2GB, but only the 2GB is being enforced in my tests."

So I'm trying to see if that makes a difference.
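
For completeness, the attempted change can be sketched as follows (wsrep_max_ws_size is a dynamic variable, so it can be raised at runtime; per the post above, values beyond the 2GB ceiling may not actually be enforced):

```sql
-- Attempt to raise the Galera write-set cap before the slurmdbd upgrade.
-- 2147483647 is the documented maximum; setting a larger value may be
-- accepted but silently capped at 2GB.
SET GLOBAL wsrep_max_ws_size = 2147483647;
```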

With regard to your question about reducing the size of the transactions: I wouldn't mind exploring that option in the Slurm code, but I find it strange that the error is logged right after executing a create table query:

create table if not exists table_defs_table ...

MariaDB [slurm_acct_db_5662]> describe table_defs_table;
+---------------+------------------+------+-----+---------+-------+
| Field         | Type             | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------+-------+
| creation_time | int(10) unsigned | NO   |     | NULL    |       |
| mod_time      | int(10) unsigned | NO   |     | 0       |       |
| table_name    | text             | NO   | PRI | NULL    |       |
| definition    | text             | NO   |     | NULL    |       |
+---------------+------------------+------+-----+---------+-------+
4 rows in set (0.01 sec)

MariaDB [slurm_acct_db_5662]>

I'm wondering if the 4th field, 'definition' (of type text), which contains the table definitions, is what is causing the limit to be hit. So I'm not sure which transactions we should be attempting to reduce in size...

In any case, I think if you manage to upgrade without Galera we'll go ahead and close the bug.
Comment 42 Alejandro Sanchez 2018-10-09 03:31:00 MDT
After attempting the upgrade with wsrep_max_ws_size set to 8GB, I got the error again.

I'd propose using the workaround of upgrading with Galera disabled. Please let us know if you encounter problems with that.
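
One way to do that (a sketch only; exact option names and the rejoin procedure depend on the MariaDB/Galera version) is to comment out the wsrep settings in my.cnf for the duration of the conversion:

```ini
# my.cnf sketch: temporarily run the node standalone for the slurmdbd
# upgrade, then restore the wsrep settings and let the node rejoin the
# cluster (which may trigger a full state transfer/SST) afterwards.
[mysqld]
wsrep_on = OFF
# wsrep_provider = none   # alternative spelling on some versions
```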
Comment 43 Alejandro Sanchez 2018-10-18 04:30:01 MDT
Hi Kilian, did you manage to upgrade to 18.08 on production? thanks.
Comment 44 Kilian Cavalotti 2018-10-18 08:41:25 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #43)
> Hi Kilian, did you manage to upgrade to 18.08 on production? thanks.

Thanks for checking. We haven't done the production upgrade yet, we'll need to schedule a downtime and align it with other maintenance operations, so that requires some coordination and will take some time. But we'll definitely let you know when it's done!

Cheers,
-- 
Kilian
Comment 45 Alejandro Sanchez 2018-10-19 04:04:30 MDT
Just to add some potentially useful notes: another site upgraded yesterday with a database containing 73M steps. The my.cnf tmpdir temporarily used 46G of space while altering tables. Once that settled, the server actively worked on the datadir copy. The overall conversion took 5.5 hours, thanks to a disk in a software RAID 1. Without such a setup, a parallel conversion of the same database took 11.5 hours (2x slower). As you may guess, upgrade time is highly dependent on the number of job/step records and on the underlying hardware/setup.
Comment 46 Kilian Cavalotti 2018-11-28 15:04:07 MST
Hi,

Just a quick note to let you know that we successfully converted our SlurmDBD databases to 18.08 format (finally!), after having disabled Galera.

We're now running 18.08.3, so we can close this bug.

Cheers,
-- 
Kilian
Comment 47 Alejandro Sanchez 2018-11-28 15:18:34 MST
(In reply to Kilian Cavalotti from comment #46)
> Hi,
> 
> Just a quick note to let you know that we successfully converted our
> SlurmDBD databases to 18.08 format (finally!), after having disabled Galera.
> 
> We're now running 18.08.3, so we can close this bug.
> 
> Cheers,
> -- 
> Kilian

Ahh, glad to hear it. That matches what I was able to corroborate with your database copy. I'm currently at a training this week; as soon as I get back to my regular schedule I'll put the focus on the purge timeout issue. Thanks for your feedback.