Ticket 5852 - Upgrade 17.02.09 to 18.08.1: slurmdbd conversion errors
Summary: Upgrade 17.02.09 to 18.08.1: slurmdbd conversion errors
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 18.08.1
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
Reported: 2018-10-15 08:47 MDT by HHLR Admins
Modified: 2018-10-29 10:09 MDT
Site: Hessen

Attachments
slurmdbd.log with all entries from DB conversion (8.80 KB, text/plain)
2018-10-15 08:47 MDT, HHLR Admins

Description HHLR Admins 2018-10-15 08:47:38 MDT
Created attachment 8022
slurmdbd.log with all entries from DB conversion

Dear SLURM maintainers,

During the first start of slurmdbd 18.08.1 and the subsequent conversion of our old 17.02.9 database, the log file showed some errors, e.g.:

error: Could not execute statement 1206 The total number of locks exceeds the lock table size

error: _update_unused_wall, total job time 924.000000 is greater than total resv time 462.

(see attachment for full slurmdbd.log).


Even though it eventually says
  Conversion done: success!
we are not sure whether these errors might adversely affect the converted database. 
Could you have a look and check?

Thanks!
Comment 1 Marshall Garey 2018-10-15 15:09:09 MDT
Hi,

I'll respond to each error inline.


[2018-10-15T10:49:20.336] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout

You should set these according to the guidelines in the accounting page:
https://slurm.schedmd.com/accounting.html
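
As a rough sketch of where these settings go: they belong in the [mysqld] section of the MySQL/MariaDB server configuration, and the database server has to be restarted for them to take effect. The file location and the values below are assumptions for illustration only. Take the actual recommended values from the accounting page, and size innodb_buffer_pool_size to the memory available on your database host.

```ini
# Example location only: /etc/my.cnf or a file under /etc/my.cnf.d/
# (the path varies by distribution).
[mysqld]
# Illustrative values, not tuned for any particular site.
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
# Lock wait timeout in seconds.
innodb_lock_wait_timeout=900
```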

[2018-10-15T16:20:48.083] error: _update_unused_wall, total job time 0.000000 is greater than total resv time -41780.
(Several of these errors.)

You'll have reservations with incorrect unused wall time, so the utilization of those reservations will be reported incorrectly. This will affect some utilization reports from sreport. Unfortunately, this error doesn't indicate which reservations are affected (the error message can certainly be improved). It doesn't impact anything else.


[2018-10-15T12:35:32.681] error: Could not execute statement 1206 The total number of locks exceeds the lock table size
[2018-10-15T13:33:33.209] error: Could not execute statement 1206 The total number of locks exceeds the lock table size
[2018-10-15T14:30:41.516] error: Could not execute statement 1206 The total number of locks exceeds the lock table size

This error indicates there were a few large queries and not enough memory for them, so you should increase innodb_buffer_pool_size.

I also noticed that your database update took several hours. That's a long time. How many job records do you have? You can consider archiving and/or purging job records. See the slurmdbd.conf man page.
https://slurm.schedmd.com/slurmdbd.conf.html
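
A minimal sketch of what archiving and purging could look like in slurmdbd.conf, assuming you want to keep roughly one year of job records. The retention periods and the archive directory are hypothetical choices, not recommendations; check the option names and time formats against the man page for your version before using them.

```ini
# slurmdbd.conf fragment (illustrative values)
# Hypothetical directory; it must exist and be writable by the slurmdbd user.
ArchiveDir=/var/spool/slurm/archive
# Write records to the archive before they are purged.
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
# Purge records older than these ages.
PurgeJobAfter=12month
PurgeStepAfter=12month
PurgeEventAfter=12month
```

Purged records no longer show up in sacct or sreport, so pick retention periods that cover the reporting you actually need.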

What version of MySQL are you using? We've had a number of reports that upgrading a pre-17.11 database on MySQL older than 5.5 is very slow. The good thing is that it did finish.

There is unfortunately very little information about what actually went wrong. To get more, you'd need to restore the old database and re-run the update at a higher debug level (debug2, for example). If you have a backup and want to do that, you may. But first, take care of these things:

- Set innodb_buffer_pool_size, innodb_log_file_size, and innodb_lock_wait_timeout
- Consider archiving and/or purging job records from the database
- I can give you a patch that will improve the error message about the unused wall time of reservations.
- Keep a backup of the old database.
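
If you do re-run the conversion from a restored backup, the higher debug level mentioned above would be set in slurmdbd.conf before starting slurmdbd. This is only a sketch; the log file path is an assumption, so use whatever path your installation already has.

```ini
# slurmdbd.conf fragment, intended only for the duration of the re-run
DebugLevel=debug2
# Hypothetical path; keep your existing LogFile setting if you have one.
LogFile=/var/log/slurm/slurmdbd.log
```

Setting DebugLevel back to its previous value afterwards keeps the log from growing unnecessarily during normal operation.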


I'll keep researching and see if I can find anything else.

- Marshall
Comment 2 Marshall Garey 2018-10-16 10:23:07 MDT
I realized that I didn't make this clear:

You're probably fine leaving the database as-is, but if you want to find out more information, then you'd need to re-run the upgrade with higher debugging. If you want to do that, let me know, and I'll provide you with a patch to improve the error message for unused wall. (I'll probably try to do that anyway.)

I do suggest adjusting the innodb parameters regardless.
Comment 3 Marshall Garey 2018-10-23 14:43:38 MDT
Is there anything else we can help you with?
Comment 4 Marshall Garey 2018-10-29 10:09:48 MDT
Closing as infogiven for now. Please re-open if you have more questions.