Need assistance, slurmdbd failure during conversion

[root@clnxcat02 log]# slurmdbd -Dvvvv
slurmdbd: debug: Log file re-opened
slurmdbd: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
slurmdbd: debug: Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
slurmdbd: debug: Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_mysql.so
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: adding column federation after flags in table cluster_table
slurmdbd: adding column features after federation in table cluster_table
slurmdbd: adding column fed_id after features in table cluster_table
slurmdbd: adding column fed_state after fed_id in table cluster_table
slurmdbd: debug: Table cluster_table has changed. Updating...
slurmdbd: debug: Table txn_table has changed. Updating...
slurmdbd: debug: Table tres_table has changed. Updating...
slurmdbd: pre-converting job table for descartes
slurmdbd: adding column admin_comment after account in table "descartes_job_table"
slurmdbd: debug: Table "descartes_job_table" has changed. Updating...
slurmdbd: Warning: Note very large processing time from make table current "descartes_job_table": usec=7214190 began=15:24:47.926
slurmdbd: debug: Table "descartes_assoc_table" has changed. Updating...
slurmdbd: debug: Table "descartes_assoc_usage_day_table" has changed. Updating...
slurmdbd: debug: Table "descartes_assoc_usage_hour_table" has changed. Updating...
slurmdbd: Warning: Note very large processing time from make table current "descartes_assoc_usage_hour_table": usec=5402238 began=15:24:56.063
slurmdbd: debug: Table "descartes_assoc_usage_month_table" has changed. Updating...
slurmdbd: debug: Table "descartes_usage_day_table" has changed. Updating...
slurmdbd: debug: Table "descartes_usage_hour_table" has changed. Updating...
slurmdbd: debug: Note large processing time from make table current "descartes_usage_hour_table": usec=1087765 began=15:25:01.937
slurmdbd: debug: Table "descartes_usage_month_table" has changed. Updating...
slurmdbd: debug: Table "descartes_event_table" has changed. Updating...
slurmdbd: adding column pack_job_id after id_group in table "descartes_job_table"
slurmdbd: adding column pack_job_offset after pack_job_id in table "descartes_job_table"
slurmdbd: adding column mcs_label after kill_requid in table "descartes_job_table"
slurmdbd: adding column work_dir after wckey in table "descartes_job_table"
slurmdbd: adding key old_tuple (id_job, id_assoc, time_submit) to table "descartes_job_table"
slurmdbd: adding key pack_job (pack_job_id) to table "descartes_job_table"
slurmdbd: debug: Table "descartes_job_table" has changed. Updating...
slurmdbd: Warning: Note very large processing time from make table current "descartes_job_table": usec=11895096 began=15:25:03.547
slurmdbd: debug: Table "descartes_last_ran_table" has changed. Updating...
slurmdbd: adding column unused_wall after tres in table "descartes_resv_table"
slurmdbd: debug: Table "descartes_resv_table" has changed. Updating...
slurmdbd: debug: Table "descartes_step_table" has changed. Updating...
slurmdbd: Warning: Note very large processing time from make table current "descartes_step_table": usec=39191376 began=15:25:15.828
slurmdbd: debug: Table "descartes_suspend_table" has changed. Updating...
slurmdbd: debug: Table "descartes_wckey_table" has changed. Updating...
slurmdbd: debug: Table "descartes_wckey_usage_day_table" has changed. Updating...
slurmdbd: debug: Table "descartes_wckey_usage_hour_table" has changed. Updating...
slurmdbd: debug: Table "descartes_wckey_usage_month_table" has changed. Updating...
slurmdbd: converting step table for descartes
slurmdbd: converting job table for descartes
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: pre-converting job table for descartes
slurmdbd: dropping column pack_job_id from table "descartes_job_table"
slurmdbd: dropping column pack_job_offset from table "descartes_job_table"
slurmdbd: dropping column mcs_label from table "descartes_job_table"
slurmdbd: dropping column work_dir from table "descartes_job_table"
slurmdbd: dropping key old_tuple from table "descartes_job_table"
slurmdbd: dropping key pack_job from table "descartes_job_table"
slurmdbd: debug: Table "descartes_job_table" has changed. Updating...
slurmdbd: error: mysql_query failed: 1091 Can't DROP 'pack_job_id'; check that column/key exists
alter table "descartes_job_table" modify `job_db_inx` bigint unsigned not null auto_increment, modify `mod_time` bigint unsigned default 0 not null, modify `deleted` tinyint default 0 not null, modify `account` tinytext, modify `admin_comment` text, modify `array_task_str` text, modify `array_max_tasks` int unsigned default 0 not null, modify `array_task_pending` int unsigned default 0 not null, modify `cpus_req` int unsigned not null, modify `derived_ec` int unsigned default 0 not null, modify `derived_es` text, modify `exit_code` int unsigned default 0 not null, modify `job_name` tinytext not null, modify `id_assoc` int unsigned not null, modify `id_array_job` int unsigned default 0 not null, modify `id_array_task` int unsigned default 0xfffffffe not null, modify `id_block` tinytext, modify `id_job` int unsigned not null, modify `id_qos` int unsigned default 0 not null, modify `id_resv` int unsigned not null, modify `id_wckey` int unsigned not null, modify `id_user` int unsigned not null, modify `id_group` int unsigned not null, modify `kill_requid` int default -1 not null, modify `mem_req` bigint unsigned default 0 not null, modify `nodelist` text, modify `nodes_alloc` int unsigned not null, modify `node_inx` text, modify `partition` tinytext not null, modify `priority` int unsigned not null, modify `state` int unsigned not null, modify `timelimit` int unsigned default 0 not null, modify `time_submit` bigint unsigned default 0 not null, modify `time_eligible` bigint unsigned default 0 not null, modify `time_start` bigint unsigned default 0 not null, modify `time_end` bigint unsigned default 0 not null, modify `time_suspended` bigint unsigned default 0 not null, modify `gres_req` text not null default '', modify `gres_alloc` text not null default '', modify `gres_used` text not null default '', modify `wckey` tinytext not null default '', modify `track_steps` tinyint not null, modify `tres_alloc` text not null default '', modify `tres_req` text not null default '', drop pack_job_id, drop pack_job_offset, drop mcs_label, drop work_dir, drop primary key, add primary key (job_db_inx), drop index id_job, add unique index (id_job, id_assoc, time_submit), drop key rollup, add key rollup (time_eligible, time_end), drop key rollup2, add key rollup2 (time_end, time_eligible), drop key nodes_alloc, add key nodes_alloc (nodes_alloc), drop key wckey, add key wckey (id_wckey), drop key qos, add key qos (id_qos), drop key association, add key association (id_assoc), drop key array_job, add key array_job (id_array_job), drop key reserv, add key reserv (id_resv), drop key sacct_def, add key sacct_def (id_user, time_start, time_end), drop key sacct_def2, add key sacct_def2 (id_user, time_end, time_eligible), drop key old_tuple, drop key pack_job;
slurmdbd: error: issue converting tables before create
slurmdbd: Accounting storage MYSQL plugin failed
slurmdbd: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
[
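The log above shows columns such as pack_job_id being added to "descartes_job_table" and then, after a second mysql_connect(), being dropped again just before the 1091 error. A minimal Python sketch for pulling that pattern out of a saved log follows. It assumes only the "adding column" and "dropping column" message formats visible in this report; the function name `added_then_dropped` and the sample text are illustrative and not part of Slurm.

```python
import re
from collections import defaultdict

# Message formats taken from the slurmdbd output in this report.
# Table names may or may not be quoted, so the quotes are optional.
ADD_RE = re.compile(r'adding column (\w+) after \w+ in table "?(\w+)"?')
DROP_RE = re.compile(r'dropping column (\w+) from table "?(\w+)"?')


def added_then_dropped(log_text):
    """Return {table: [columns]} for columns that are added and later dropped."""
    events = []
    for m in ADD_RE.finditer(log_text):
        events.append((m.start(), "add", m.group(2), m.group(1)))
    for m in DROP_RE.finditer(log_text):
        events.append((m.start(), "drop", m.group(2), m.group(1)))

    added = defaultdict(list)
    result = defaultdict(list)
    # Walk the events in the order they appear in the log.
    for _, kind, table, column in sorted(events):
        if kind == "add":
            added[table].append(column)
        elif column in added[table] and column not in result[table]:
            result[table].append(column)
    return dict(result)


# Hypothetical excerpt built from lines in the log above.
sample = (
    'slurmdbd: adding column admin_comment after account in table "descartes_job_table"\n'
    'slurmdbd: adding column pack_job_id after id_group in table "descartes_job_table"\n'
    'slurmdbd: adding column work_dir after wckey in table "descartes_job_table"\n'
    'slurmdbd: dropping column pack_job_id from table "descartes_job_table"\n'
    'slurmdbd: dropping column work_dir from table "descartes_job_table"\n'
)

print(added_then_dropped(sample))
```

This only summarizes what the log reports; it does not inspect the database, so the actual table schema still has to be checked in MySQL.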
Created attachment 6556: slurmlogs-braun.tar.gz

I'm trying to chase down a cause for this and/or a workaround, but some additional details may help:
- What Slurm version were you on prior to the upgrade?
- What MySQL version are you currently running?
Slurm 16.05.7
Mysql 5.1.73

I restarted the slurmdbd again and I got
[2018-04-05T15:46:57.958] Conversion done: success!

Regards,
Ruth A. Braun
Sr IT Analyst High Performance Computing
Scientific Computing, Technology Platforms Research and Engineering IT Fuels, Lubricants and Chemicals IT, Information Technology
1545 US Rt 22 East
Annandale, NJ 08801
908-335-3694

From: bugs@schedmd.com
Sent: Thursday, April 5, 2018 5:09 PM
To: Braun, Ruth A <ruth.a.braun@exxonmobil.com>
Subject: [Bug 5033] slurmdbd conversion fails during upgrade to 17.11.5

Tim Wickberg <tim@schedmd.com> changed bug 5033 <https://bugs.schedmd.com/show_bug.cgi?id=5033>
What     | Removed             | Added
Assignee | support@schedmd.com | tim@schedmd.com

(In reply to Tim Wickberg from comment #1)
Glad that's behaving at least, but I'm still not sure what happened here. A couple more questions, if you could: - Do you have a backup slurmdbd process running somewhere? - Can you attach a copy of slurmdbd.conf (with StoragePass redacted preferably)?
(In reply to Tim Wickberg from comment #3)

Tim,

Here you go. We don't use a backup controller for dbd; slurmdbd runs on the same server as slurmctld.

Ruth

Contents of /etc/slurm/slurmdbd.conf:

# Archive info
PurgeEventAfter = 8760hours # Keep only 1 yr (8760 hours) FOR THREE YEARS, USE 26280hours FOR TWO YEARS 17520 hours
PurgeJobAfter = 8760hours
PurgeResvAfter = 8760hours
PurgeStepAfter = 8760hours
PurgeSuspendAfter = 8760hours
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=info
DebugFlags=DB_ARCHIVE,DB_EVENT,DB_JOB,DB_STEP
#DebugFlags=DB_ARCHIVE,DB_EVENT
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#
# Database info
StorageType=accounting_storage/mysql
StorageUser=slurm
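The Purge*After values in the configuration above are expressed in hours, and the comment lists 8760, 17520 and 26280 hours for one, two and three years. As a small worked check of that arithmetic, the sketch below assumes 365-day years (which is what 8760 hours implies); the helper name `purge_hours` is hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: non-leap years, since 8760 = 365 * 24


def purge_hours(years):
    """Retention period in hours for a whole number of 365-day years."""
    return years * DAYS_PER_YEAR * HOURS_PER_DAY


for years in (1, 2, 3):
    # Prints a value in the same "<N>hours" form used in slurmdbd.conf above.
    print(f"{years} yr -> {purge_hours(years)}hours")
```

The three results match the figures quoted in the configuration comment.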
Thank you. I think I spot part of what caused this, and you shouldn't have any problems going forward. I'm going to lower the severity on this as I assume you're back up and running now, and have Alex look into a proper fix. But - this case should not repeat for you - we just want to make sure to prevent it for anyone else in the future. - Tim
(In reply to Tim Wickberg from comment #5)

Perfect, thanks. Please post details when available.

Also FYI, <clustername>_step_table was at ~1084525 rows just before I started (purging back to 1 yr didn't get it below 1M). MySQL command used:

SELECT TABLE_NAME,TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'slurm_acct_db';

The slurm_acct_db was also over 1M:

# mysqlshow --count
+--------------------+--------+--------------+
| Databases          | Tables | Total Rows   |
+--------------------+--------+--------------+
| information_schema |     28 |         2988 |
| mysql              |     23 |         2039 |
| slurm_acct_db      |     27 |      1792324 | <-purging got this down from ~2.9M
| slurm_jobcomp_db   |      2 |       335317 |

Regards,
Ruth A. Braun
Sr IT Analyst High Performance Computing

Tim Wickberg <tim@schedmd.com> changed bug 5033:
What     | Removed         | Added
CC       |                 | tim@schedmd.com
Assignee | tim@schedmd.com | alex@schedmd.com
Severity | 2 - High Impact | 3 - Medium Impact
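To compare row counts before and after a purge, the `mysqlshow --count` output quoted above can be turned into numbers. The following is a rough sketch that assumes the pipe-delimited layout shown in this comment; the function name `parse_mysqlshow_count` is illustrative, and the sample rows are copied from the report.

```python
def parse_mysqlshow_count(text):
    """Return {database: (tables, total_rows)} from `mysqlshow --count` style output."""
    counts = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows look like: | name | tables | rows | (optional trailing note).
        # Border and header rows fail the digit checks and are skipped.
        if len(cells) < 4 or not cells[2].isdigit() or not cells[3].isdigit():
            continue
        counts[cells[1]] = (int(cells[2]), int(cells[3]))
    return counts


# Rows copied from the output in this report, including the trailing annotation.
sample = """\
+--------------------+--------+--------------+
| Databases          | Tables | Total Rows   |
+--------------------+--------+--------------+
| information_schema |     28 |         2988 |
| mysql              |     23 |         2039 |
| slurm_acct_db      |     27 |      1792324 | <-purging got this down from ~2.9M
| slurm_jobcomp_db   |      2 |       335317 |
"""

counts = parse_mysqlshow_count(sample)
print(counts["slurm_acct_db"])
```

Text after the last column separator, such as the purge note on the slurm_acct_db row, is ignored by the parser.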
Hi. Will you attach the slurmdbd and slurmctld logs for the upgrade day? Thanks.
(In reply to Alejandro Sanchez from comment #8)

Sure – by Monday
Files attached
Hi. Our hypothesis is that while slurmdbd was in the process of being upgraded, something bad external to Slurm happened: a full or failed filesystem, slurmdbd being killed, or similar. Since the upgrade finally succeeded, I'm gonna go ahead and close the bug. Please reopen if there's anything left here.