Ticket 5033

Summary: slurmdbd conversion fails during upgrade to 17.11.5
Product: Slurm Reporter: ruth.a.braun
Component: slurmdbd Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: felip.moll, tim
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: EM
Attachments: slurmlogs-braun.tar.gz

Description ruth.a.braun 2018-04-05 13:38:35 MDT
Need assistance, slurmdbd failure during conversion


[root@clnxcat02 log]# slurmdbd -Dvvvv
slurmdbd: debug:  Log file re-opened
slurmdbd: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_mysql.so
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: adding column federation after flags in table cluster_table
slurmdbd: adding column features after federation in table cluster_table
slurmdbd: adding column fed_id after features in table cluster_table
slurmdbd: adding column fed_state after fed_id in table cluster_table
slurmdbd: debug:  Table cluster_table has changed.  Updating...
slurmdbd: debug:  Table txn_table has changed.  Updating...
slurmdbd: debug:  Table tres_table has changed.  Updating...
slurmdbd: pre-converting job table for descartes
slurmdbd: adding column admin_comment after account in table "descartes_job_table"
slurmdbd: debug:  Table "descartes_job_table" has changed.  Updating...
slurmdbd: Warning: Note very large processing time from make table current "descartes_job_table": usec=7214190 began=15:24:47.926
slurmdbd: debug:  Table "descartes_assoc_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_assoc_usage_day_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_assoc_usage_hour_table" has changed.  Updating...
slurmdbd: Warning: Note very large processing time from make table current "descartes_assoc_usage_hour_table": usec=5402238 began=15:24:56.063
slurmdbd: debug:  Table "descartes_assoc_usage_month_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_usage_day_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_usage_hour_table" has changed.  Updating...
slurmdbd: debug:  Note large processing time from make table current "descartes_usage_hour_table": usec=1087765 began=15:25:01.937
slurmdbd: debug:  Table "descartes_usage_month_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_event_table" has changed.  Updating...
slurmdbd: adding column pack_job_id after id_group in table "descartes_job_table"
slurmdbd: adding column pack_job_offset after pack_job_id in table "descartes_job_table"
slurmdbd: adding column mcs_label after kill_requid in table "descartes_job_table"
slurmdbd: adding column work_dir after wckey in table "descartes_job_table"
slurmdbd: adding key old_tuple (id_job, id_assoc, time_submit) to table "descartes_job_table"
slurmdbd: adding key pack_job (pack_job_id) to table "descartes_job_table"
slurmdbd: debug:  Table "descartes_job_table" has changed.  Updating...
slurmdbd: Warning: Note very large processing time from make table current "descartes_job_table": usec=11895096 began=15:25:03.547
slurmdbd: debug:  Table "descartes_last_ran_table" has changed.  Updating...
slurmdbd: adding column unused_wall after tres in table "descartes_resv_table"
slurmdbd: debug:  Table "descartes_resv_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_step_table" has changed.  Updating...

slurmdbd: Warning: Note very large processing time from make table current "descartes_step_table": usec=39191376 began=15:25:15.828
slurmdbd: debug:  Table "descartes_suspend_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_wckey_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_wckey_usage_day_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_wckey_usage_hour_table" has changed.  Updating...
slurmdbd: debug:  Table "descartes_wckey_usage_month_table" has changed.  Updating...
slurmdbd: converting step table for descartes
slurmdbd: converting job table for descartes
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: pre-converting job table for descartes
slurmdbd: dropping column pack_job_id from table "descartes_job_table"
slurmdbd: dropping column pack_job_offset from table "descartes_job_table"
slurmdbd: dropping column mcs_label from table "descartes_job_table"
slurmdbd: dropping column work_dir from table "descartes_job_table"
slurmdbd: dropping key old_tuple from table "descartes_job_table"
slurmdbd: dropping key pack_job from table "descartes_job_table"
slurmdbd: debug:  Table "descartes_job_table" has changed.  Updating...




slurmdbd: error: mysql_query failed: 1091 Can't DROP 'pack_job_id'; check that column/key exists
alter table "descartes_job_table" modify `job_db_inx` bigint unsigned not null auto_increment, modify `mod_time` bigint unsigned default 0 not null, modify `deleted` tinyint default 0 not null, modify `account` tinytext, modify `admin_comment` text, modify `array_task_str` text, modify `array_max_tasks` int unsigned default 0 not null, modify `array_task_pending` int unsigned default 0 not null, modify `cpus_req` int unsigned not null, modify `derived_ec` int unsigned default 0 not null, modify `derived_es` text, modify `exit_code` int unsigned default 0 not null, modify `job_name` tinytext not null, modify `id_assoc` int unsigned not null, modify `id_array_job` int unsigned default 0 not null, modify `id_array_task` int unsigned default 0xfffffffe not null, modify `id_block` tinytext, modify `id_job` int unsigned not null, modify `id_qos` int unsigned default 0 not null, modify `id_resv` int unsigned not null, modify `id_wckey` int unsigned not null, modify `id_user` int unsigned not null, modify `id_group` int unsigned not null, modify `kill_requid` int default -1 not null, modify `mem_req` bigint unsigned default 0 not null, modify `nodelist` text, modify `nodes_alloc` int unsigned not null, modify `node_inx` text, modify `partition` tinytext not null, modify `priority` int unsigned not null, modify `state` int unsigned not null, modify `timelimit` int unsigned default 0 not null, modify `time_submit` bigint unsigned default 0 not null, modify `time_eligible` bigint unsigned default 0 not null, modify `time_start` bigint unsigned default 0 not null, modify `time_end` bigint unsigned default 0 not null, modify `time_suspended` bigint unsigned default 0 not null, modify `gres_req` text not null default '', modify `gres_alloc` text not null default '', modify `gres_used` text not null default '', modify `wckey` tinytext not null default '', modify `track_steps` tinyint not null, modify `tres_alloc` text not null default '', modify `tres_req` text not null default '', drop pack_job_id, drop pack_job_offset, drop mcs_label, drop work_dir, drop primary key, add primary key (job_db_inx), drop index id_job, add unique index (id_job, id_assoc, time_submit), drop key rollup, add key rollup (time_eligible, time_end), drop key rollup2, add key rollup2 (time_end, time_eligible), drop key nodes_alloc, add key nodes_alloc (nodes_alloc), drop key wckey, add key wckey (id_wckey), drop key qos, add key qos (id_qos), drop key association, add key association (id_assoc), drop key array_job, add key array_job (id_array_job), drop key reserv, add key reserv (id_resv), drop key sacct_def, add key sacct_def (id_user, time_start, time_end), drop key sacct_def2, add key sacct_def2 (id_user, time_end, time_eligible), drop key old_tuple, drop key pack_job;
slurmdbd: error: issue converting tables before create
slurmdbd: Accounting storage MYSQL plugin failed
slurmdbd: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
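
Most of the wall-clock time in the run above went into the "make table current" steps that slurmdbd flagged. As a rough sketch (not part of Slurm; the function name `slow_tables` and the regular expression are assumptions based only on the log lines shown here), the per-table totals can be pulled out of a saved log like this:

```python
import re
from collections import defaultdict

# Matches both the "Note large" and "Note very large" timing lines.
# The pattern is inferred from the log excerpt in this ticket only.
TIMING_RE = re.compile(
    r'from make table current "(?P<table>[^"]+)": usec=(?P<usec>\d+)'
)


def slow_tables(log_text):
    """Return {table_name: total_seconds} for every timing note in the log."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match:
            totals[match.group("table")] += int(match.group("usec")) / 1_000_000
    return dict(totals)


# The five timing lines copied from the log in this ticket.
SAMPLE = '''\
slurmdbd: Warning: Note very large processing time from make table current "descartes_job_table": usec=7214190 began=15:24:47.926
slurmdbd: Warning: Note very large processing time from make table current "descartes_assoc_usage_hour_table": usec=5402238 began=15:24:56.063
slurmdbd: debug:  Note large processing time from make table current "descartes_usage_hour_table": usec=1087765 began=15:25:01.937
slurmdbd: Warning: Note very large processing time from make table current "descartes_job_table": usec=11895096 began=15:25:03.547
slurmdbd: Warning: Note very large processing time from make table current "descartes_step_table": usec=39191376 began=15:25:15.828
'''

if __name__ == "__main__":
    # Print the slowest tables first.
    for table, seconds in sorted(slow_tables(SAMPLE).items(), key=lambda kv: -kv[1]):
        print(f"{table}: {seconds:.1f}s")
```

On these lines the step table accounts for roughly 39 s and the two job-table passes for about 19 s combined.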
Comment 1 Tim Wickberg 2018-04-05 15:09:24 MDT
Created attachment 6556: slurmlogs-braun.tar.gz

I'm trying to chase down a cause for this and/or a workaround, but some additional details may help -

What Slurm version were you on prior to the upgrade?

What MySQL version are you currently running?
Comment 2 ruth.a.braun 2018-04-05 15:11:51 MDT
Slurm 16.05.7
MySQL 5.1.73

I restarted slurmdbd again, and this time the log shows:
[2018-04-05T15:46:57.958] Conversion done: success!

Regards,

Ruth A. Braun
Sr IT Analyst
High Performance Computing

Scientific Computing, Technology Platforms
Research and Engineering IT
Fuels, Lubricants and Chemicals IT, Information Technology
1545 US Rt 22 East
Annandale, NJ 08801
908-335-3694

Comment 3 Tim Wickberg 2018-04-05 15:20:00 MDT
Glad that's behaving at least, but I'm still not sure what happened here.

A couple more questions, if you could:

- Do you have a backup slurmdbd process running somewhere?

- Can you attach a copy of slurmdbd.conf (with StoragePass redacted preferably)?
Comment 4 ruth.a.braun 2018-04-05 15:26:17 MDT
Tim,
Here you go.
We don't use a backup controller for dbd; slurmdbd runs on the same server as slurmctld.
Ruth

Contents of: /etc/slurm/slurmdbd.conf
# Archive info
PurgeEventAfter        = 8760hours
# Keep only 1 year (8760hours). For two years use 17520hours; for three years use 26280hours.
PurgeJobAfter          = 8760hours
PurgeResvAfter         = 8760hours
PurgeStepAfter         = 8760hours
PurgeSuspendAfter      = 8760hours
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=info
DebugFlags=DB_ARCHIVE,DB_EVENT,DB_JOB,DB_STEP
#DebugFlags=DB_ARCHIVE,DB_EVENT
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#
# Database info
StorageType=accounting_storage/mysql
StorageUser=slurm
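
The purge periods quoted in the config comment above are whole 365-day years expressed in hours. A minimal sketch of that arithmetic follows; `purge_after` is a hypothetical helper, not a Slurm tool, and it ignores leap days:

```python
HOURS_PER_YEAR = 365 * 24  # 8760; leap days are deliberately ignored


def purge_after(years):
    """Format a retention period in the '<N>hours' form used in slurmdbd.conf above."""
    return f"{years * HOURS_PER_YEAR}hours"


if __name__ == "__main__":
    for years in (1, 2, 3):
        print(f"PurgeJobAfter = {purge_after(years)}  # {years} year(s)")
```

This reproduces the three values in the comment: 8760, 17520, and 26280 hours.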


Comment 5 Tim Wickberg 2018-04-05 15:38:17 MDT
Thank you. I think I spot part of what caused this, and you shouldn't have any problems going forward.

I'm going to lower the severity on this as I assume you're back up and running now, and have Alex look into a proper fix. But - this case should not repeat for you - we just want to make sure to prevent it for anyone else in the future.

- Tim
Comment 6 ruth.a.braun 2018-04-05 15:50:39 MDT
Perfect, thanks.  Please post details when available.

Also FYI, <clustername>_step_table had ~1084525 rows just before I started
(purging back to 1 year didn't get it below 1M).

MySQL command used:
SELECT TABLE_NAME,TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'slurm_acct_db';

The slurm_acct_db database as a whole was also over 1M rows:
# mysqlshow --count
+--------------------+--------+--------------+
|     Databases      | Tables |  Total Rows  |
+--------------------+--------+--------------+
| information_schema |     28 |         2988 |
| mysql              |     23 |         2039 |
| slurm_acct_db      |     27 |      1792324 |  <-purging got this down from ~2.9M
| slurm_jobcomp_db   |      2 |       335317 |
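
For anyone wanting to repeat this check before an upgrade, here is a small sketch that reads `mysqlshow --count` output like the table above and lists databases over a row threshold. The function name `parse_mysqlshow_count` and the 1M threshold are assumptions for illustration (the threshold is simply the figure mentioned in this comment), and the parser only handles the three-column layout shown here:

```python
def parse_mysqlshow_count(output):
    """Return {database: total_rows} from `mysqlshow --count` style output."""
    totals = {}
    for line in output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows look like: | name | tables | total rows |
        # Border lines and the header row fail the digit check and are skipped.
        if len(parts) >= 4 and parts[3].isdigit():
            totals[parts[1]] = int(parts[3])
    return totals


# Row counts copied from the output pasted in this ticket.
SAMPLE = """\
+--------------------+--------+--------------+
|     Databases      | Tables |  Total Rows  |
+--------------------+--------+--------------+
| information_schema |     28 |         2988 |
| mysql              |     23 |         2039 |
| slurm_acct_db      |     27 |      1792324 |
| slurm_jobcomp_db   |      2 |       335317 |
"""

THRESHOLD = 1_000_000  # assumed cut-off, taken from the "1M" mentioned above

if __name__ == "__main__":
    for name, rows in parse_mysqlshow_count(SAMPLE).items():
        if rows > THRESHOLD:
            print(f"{name}: {rows} rows")
```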



Comment 8 Alejandro Sanchez 2018-04-06 09:07:25 MDT
Hi. Will you attach the slurmdbd and slurmctld logs for the upgrade day? Thanks.
Comment 11 ruth.a.braun 2018-04-06 12:57:27 MDT
Sure – by Monday

Comment 12 ruth.a.braun 2018-04-06 14:50:28 MDT
Files attached
Comment 14 Alejandro Sanchez 2018-04-23 08:18:11 MDT
Hi. Our hypothesis is that while slurmdbd was in the process of being upgraded, something bad external to Slurm happened, such as a full or failed filesystem, slurmdbd being killed, or similar. Since the upgrade finally succeeded, I'm going to go ahead and close the bug. Please reopen if there's anything left here.