Ticket 4906 - upgrade from 16.05 - 17.11 slurmdbd over 8 hours converting job_table
Summary: upgrade from 16.05 - 17.11 slurmdbd over 8 hours converting job_table
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 17.11.4
Hardware: Linux Linux
Severity: 1 - System not usable
Assignee: Brian Christiansen
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-03-12 21:30 MDT by Wei Feinstein
Modified: 2018-03-13 09:19 MDT

See Also:
Site: UC Berkeley


Description Wei Feinstein 2018-03-12 21:30:30 MDT
I upgraded slurmdbd to 17.11.4, and when I tried to start it I got the following errors:

[2018-03-12T11:03:39.759] error: mysql_query failed: 1206 The total number of locks exceeds the lock table size
update "brc_job_table" as job left outer join ( select job_db_inx, SUM(consumed_energy) 'sum_energy' from "brc_step_table" where id_step >= 0 and consumed_energy != 18446744073709551614 group by job_db_inx ) step on job.job_db_inx=step.job_db_inx set job.tres_alloc=concat(job.tres_alloc, concat(',3=', case when step.sum_energy then step.sum_energy else 18446744073709551614 END)) where job.tres_alloc != '' && job.tres_alloc not like '%,3=%';
[2018-03-12T11:03:39.759] error: Can't convert brc_job_table info: Unknown error 1206
[2018-03-12T11:03:39.759] error: issue converting tables after create
[2018-03-12T11:03:39.759] Accounting storage MYSQL plugin failed
[2018-03-12T11:03:39.768] error: mysql_query failed: 1062 Duplicate entry '5' for key 'PRIMARY'
update tres_table set id=5 where id=1001;
[2018-03-12T11:03:39.816] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2018-03-12T11:03:39.819] error: cannot create accounting_storage context for accounting_storage/mysql
[2018-03-12T11:03:39.819] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
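
When a slurmdbd log is long, it can help to pull out just the MySQL failures and their error codes. The following is a minimal sketch, assuming the log line layout shown above; the function name `mysql_errors` and the `KNOWN_CODES` table are illustrative, and the symbolic names come from the MySQL server error reference rather than from this ticket.

```python
import re
from collections import Counter

# Matches slurmdbd lines such as:
# [2018-03-12T11:03:39.759] error: mysql_query failed: 1206 The total number ...
MYSQL_ERROR = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: mysql_query failed: (?P<code>\d+) (?P<msg>.*)$"
)

# Symbolic names for the two codes seen in this ticket (MySQL error reference).
KNOWN_CODES = {1206: "ER_LOCK_TABLE_FULL", 1062: "ER_DUP_ENTRY"}


def mysql_errors(log_lines):
    """Return (timestamp, code, message) for each 'mysql_query failed' line."""
    found = []
    for line in log_lines:
        match = MYSQL_ERROR.match(line.strip())
        if match:
            found.append(
                (match.group("ts"), int(match.group("code")), match.group("msg"))
            )
    return found


sample = [
    "[2018-03-12T11:03:39.759] error: mysql_query failed: 1206 The total number of locks exceeds the lock table size",
    "[2018-03-12T11:03:39.759] error: Can't convert brc_job_table info: Unknown error 1206",
    "[2018-03-12T11:03:39.768] error: mysql_query failed: 1062 Duplicate entry '5' for key 'PRIMARY'",
]

errors = mysql_errors(sample)
counts = Counter(code for _, code, _ in errors)
for ts, code, msg in errors:
    print(ts, code, KNOWN_CODES.get(code, "unknown"), msg)
```

Only the two `mysql_query failed` lines match; the follow-on "Can't convert" line repeats the code but is deliberately not counted.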


I restarted the process; it has been running since 11:37 this morning and is still going.
[2018-03-12T11:37:27.102] Warning: Note very large processing time from make table current "brc_job_table": usec=595063992 began=11:27:32.038
[2018-03-12T11:37:27.506] adding column pack_job_id after id_group in table "master_job_table"
[2018-03-12T11:37:27.506] adding column pack_job_offset after pack_job_id in table "master_job_table"
[2018-03-12T11:37:27.506] adding column mcs_label after kill_requid in table "master_job_table"
[2018-03-12T11:37:27.506] adding column work_dir after wckey in table "master_job_table"
[2018-03-12T11:37:27.506] adding key old_tuple (id_job, id_assoc, time_submit) to table "master_job_table"
[2018-03-12T11:37:27.507] adding key pack_job (pack_job_id) to table "master_job_table"
[2018-03-12T11:37:27.732] adding column pack_job_id after id_group in table "master.brc_job_table"
[2018-03-12T11:37:27.732] adding column pack_job_offset after pack_job_id in table "master.brc_job_table"
[2018-03-12T11:37:27.732] adding column mcs_label after kill_requid in table "master.brc_job_table"
[2018-03-12T11:37:27.732] adding column work_dir after wckey in table "master.brc_job_table"
[2018-03-12T11:37:27.732] adding key old_tuple (id_job, id_assoc, time_submit) to table "master.brc_job_table"
[2018-03-12T11:37:27.732] adding key pack_job (pack_job_id) to table "master.brc_job_table"
[2018-03-12T11:37:27.948] converting step table for 0-a0-d1-ec-bc-c
[2018-03-12T11:37:27.948] converting job table for 0-a0-d1-ec-bc-c
[2018-03-12T11:37:28.020] converting resv table for 0-a0-d1-ec-bc-c
[2018-03-12T11:37:28.020] converting cluster tables for 0-a0-d1-ec-bc-c
[2018-03-12T11:37:28.020] converting assoc table for 0-a0-d1-ec-bc-c
[2018-03-12T11:37:28.020] converting step table for brc
[2018-03-12T11:37:45.666] converting job table for brc
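
The "very large processing time" warning above reports its duration in microseconds, which is hard to read at a glance. As a rough worked check, the sketch below converts `usec=595063992` to a duration and adds it to the `began` time. The calendar date is an assumption taken from the log timestamp, since `began` carries only a time of day.

```python
from datetime import datetime, timedelta

# Values copied from the warning line for "brc_job_table".
usec = 595063992
began = datetime.strptime("2018-03-12 11:27:32.038", "%Y-%m-%d %H:%M:%S.%f")

elapsed = timedelta(microseconds=usec)
finished = began + elapsed

print(f"elapsed: {elapsed.total_seconds() / 60:.1f} minutes")
print(f"finished: {finished.time()}")
```

The result is just under ten minutes, and the computed finish time agrees with the 11:37:27 timestamp on the warning itself, so that step accounts for only a small part of the total wait.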

It is now 20:29, more than 8 hours after the restart. I mistakenly did not run it with the -D option, so I can't tell what is happening. Do you think there is an issue with the upgrade or the database conversion, or is this normal?

Thanks

Jackie
Comment 1 Brian Christiansen 2018-03-12 21:40:24 MDT
Hey Jackie. We’ve traced this to the version of MySQL. See Bug 4877.

Can you confirm that you are using MySQL 5.1?
Comment 2 Wei Feinstein 2018-03-12 21:58:41 MDT
We’re running MySQL version 5.1

Thanks

Jackie Scoggins
Comment 3 Brian Christiansen 2018-03-12 22:06:12 MDT
Do you have a backup of the database from before the upgrade? If so, I would consider upgrading MySQL, restoring the backup, and restarting slurmdbd, as they did in Bug 4877 Comment 16.

FYI, I still have a 5.1 upgrade running from last Friday.
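
The recovery path suggested here (upgrade MySQL, restore the pre-upgrade dump, restart slurmdbd) could be written down as an ordered checklist before anything is run. This is only a sketch: it prints commands and executes nothing, the database name `slurm_acct_db` and the dump path are assumptions for illustration, credentials are omitted, and the MySQL upgrade step is left as a note because it depends on the distribution.

```python
import shlex


def restore_plan(db_name, dump_path):
    """Return the recovery steps in order as strings; nothing is executed."""
    return [
        # Site-specific: how MySQL is upgraded depends on the distribution.
        "# upgrade MySQL using the site's own packaging",
        # Reload the dump taken before the slurmdbd upgrade.
        f"mysql {shlex.quote(db_name)} < {shlex.quote(dump_path)}",
        # Run slurmdbd in the foreground with verbose logging so the
        # table conversion can be watched as it happens.
        "slurmdbd -D -vvv",
    ]


# Hypothetical database name and dump location.
plan = restore_plan("slurm_acct_db", "/tmp/slurm_acct_db.sql")
for step in plan:
    print(step)
```

Starting slurmdbd with `-D` keeps it in the foreground, which addresses the earlier problem of not being able to see what the conversion was doing.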
Comment 4 Wei Feinstein 2018-03-12 22:25:27 MDT
Yes, we do have a backup from this morning. I don't know if we can upgrade MySQL, since it is not used only for Slurm. I will have to confer with my team to see if we can upgrade it. What version of MySQL do you suggest?
Comment 5 Brian Christiansen 2018-03-12 22:38:34 MDT
OK. I have MySQL 5.7, and Wyoming said 5.7 worked for them, so I think that's a safe choice.

Let us know what you decide in the morning.
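
A pre-upgrade check along these lines could flag the server version before slurmdbd is started. It is a minimal sketch that encodes only what this thread reports (5.1 converts very slowly, 5.7 worked); versions in between are not covered here, and the function and label names are made up for illustration.

```python
def parse_version(text):
    """Return (major, minor) from a version string such as '5.1.73'."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def assess(version_text):
    """Classify a MySQL version using only the reports in this ticket."""
    version = parse_version(version_text)
    if version == (5, 1):
        return "reported-slow"   # conversion ran for many hours on 5.1
    if version == (5, 7):
        return "reported-ok"     # two sites reported success on 5.7
    return "not-covered"         # this thread says nothing about it


for candidate in ("5.1.73", "5.5.60", "5.7.21"):
    print(candidate, assess(candidate))
```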
Comment 6 Wei Feinstein 2018-03-12 22:50:22 MDT
We’re not running SL7, and 5.7 isn’t part of the SL6 repos, so we’re looking into
building it or finding a package for SL6.

Thanks

Jackie Scoggins
Comment 7 Wei Feinstein 2018-03-13 00:14:18 MDT
The MySQL upgrade worked. Question: are there any concerns with using TRES when upgrading from 16.x to 17.11.4?

Just checking whether this is an issue:

[2018-03-12T23:12:01.055] error: _handle_qos_tres_run_secs: job 2176025: QOS acrb_gpu2_normal TRES billing grp_used_tres_run_secs underflow, tried to remove 600 seconds when only 0 remained.
Comment 9 Wei Feinstein 2018-03-13 07:24:04 MDT
We installed MySQL 5.7 and restored the database from the dump taken
that morning before the upgrade, and we’re up and running now.

Thanks

Jackie Scoggins
Comment 10 Brian Christiansen 2018-03-13 09:19:27 MDT
Glad to hear that the upgrade was successful. A fix for the TRES issue you saw is slated for 17.11.5.

Let us know if you have any questions.

Thanks,
Brian