| Summary: | slurmdbd conversion fails during upgrade to 17.11.5 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | ruth.a.braun |
| Component: | slurmdbd | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | felip.moll, tim |
| Version: | 17.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | EM | ||
| Attachments: | slurmlogs-braun.tar.gz | ||
|
Description
ruth.a.braun
2018-04-05 13:38:35 MDT
Created attachment 6556: slurmlogs-braun.tar.gz

Tim Wickberg (comment 1):
I'm trying to chase down a cause for this and/or a workaround, but some additional details may help:
- What Slurm version were you on prior to the upgrade?
- What MySQL version are you currently running?

Bug change accompanying comment 1: Assignee changed from support@schedmd.com to tim@schedmd.com.

ruth.a.braun:
Slurm 16.05.7, MySQL 5.1.73.
I restarted the slurmdbd again and I got:
    [2018-04-05T15:46:57.958] Conversion done: success!

Tim Wickberg (comment 3):
Glad that's behaving at least, but I'm still not sure what happened here. A couple more questions, if you could:
- Do you have a backup slurmdbd process running somewhere?
- Can you attach a copy of slurmdbd.conf (with StoragePass redacted preferably)?

ruth.a.braun:
Tim, here you go. We don't use a backup controller for dbd; slurmdbd runs on the same server as slurmctld.
Ruth

Contents of /etc/slurm/slurmdbd.conf:
    # Archive info
    PurgeEventAfter = 8760hours # Keep only 1 yr (8760 hours) FOR THREE YEARS, USE 26280hours FOR TWO YEARS 17520 hours
    PurgeJobAfter = 8760hours
    PurgeResvAfter = 8760hours
    PurgeStepAfter = 8760hours
    PurgeSuspendAfter = 8760hours
    #
    # Authentication info
    AuthType=auth/munge
    #AuthInfo=/var/run/munge/munge.socket.2
    #
    # slurmDBD info
    DbdAddr=localhost
    DbdHost=localhost
    #DbdPort=7031
    SlurmUser=slurm
    #MessageTimeout=300
    DebugLevel=info
    DebugFlags=DB_ARCHIVE,DB_EVENT,DB_JOB,DB_STEP
    #DebugFlags=DB_ARCHIVE,DB_EVENT
    LogFile=/var/log/slurm/slurmdbd.log
    PidFile=/var/run/slurmdbd.pid
    #
    # Database info
    StorageType=accounting_storage/mysql
    StorageUser=slurm

Tim Wickberg (comment 5):
Thank you. I think I spot part of what caused this, and you shouldn't have any problems going forward.
I'm going to lower the severity on this as I assume you're back up and running now, and have Alex look into a proper fix. But this case should not repeat for you; we just want to make sure to prevent it for anyone else in the future.
- Tim

Bug change accompanying comment 5: CC added tim@schedmd.com; Assignee changed from tim@schedmd.com to alex@schedmd.com; Severity changed from 2 - High Impact to 3 - Medium Impact.

ruth.a.braun:
Perfect, thanks. Please post details when available.
Also FYI, <clustername>_step_table had ~1084525 rows just before I started (purging back to 1 yr didn't get it below 1M). MySQL command used:
    SELECT TABLE_NAME,TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'slurm_acct_db';
The slurm_acct_db was also over 1M:
    # mysqlshow --count
    +--------------------+--------+--------------+
    | Databases          | Tables | Total Rows   |
    +--------------------+--------+--------------+
    | information_schema | 28     | 2988         |
    | mysql              | 23     | 2039         |
    | slurm_acct_db      | 27     | 1792324      | <-purging got this down from ~2.9M
    | slurm_jobcomp_db   | 2      | 335317       |
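
The per-database row counts reported above can also be checked with a script. The following is a minimal Python sketch, not something from the original thread: it parses text shaped like the `mysqlshow --count` output quoted above and lists databases over a row threshold. The function name `parse_mysqlshow_counts`, the `SAMPLE` text, and the 1M `THRESHOLD` are illustrative assumptions, and other `mysqlshow` versions may format their output differently.

```python
def parse_mysqlshow_counts(text):
    """Parse `mysqlshow --count` style output into {database: (tables, total_rows)}."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and blanks
        # Take the first three cells; anything after the third "|" is ignored.
        cells = [c.strip() for c in line.split("|")[1:4]]
        if len(cells) < 3 or not (cells[1].isdigit() and cells[2].isdigit()):
            continue  # skip the header row
        counts[cells[0]] = (int(cells[1]), int(cells[2]))
    return counts


# Sample text copied from the report (hypothetical input for this sketch).
SAMPLE = """\
+--------------------+--------+--------------+
| Databases          | Tables | Total Rows   |
+--------------------+--------+--------------+
| information_schema | 28     | 2988         |
| mysql              | 23     | 2039         |
| slurm_acct_db      | 27     | 1792324      | <-purging got this down from ~2.9M
| slurm_jobcomp_db   | 2      | 335317       |
"""

THRESHOLD = 1_000_000  # assumed cutoff, matching the "over 1M" remark
counts = parse_mysqlshow_counts(SAMPLE)
large = sorted(db for db, (_tables, rows) in counts.items() if rows > THRESHOLD)
print(large)  # prints ['slurm_acct_db']
```

In practice the text would come from running `mysqlshow --count` with suitable credentials; that step is left out here because it depends on the site's setup.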

Alejandro Sanchez (comment 8):
Hi. Will you attach the slurmdbd and slurmctld logs for the upgrade day? Thanks.

ruth.a.braun:
Sure, by Monday.

ruth.a.braun:
Files attached.

Closing comment (SchedMD):
Hi. Our hypothesis is that while slurmdbd was in the process of being upgraded, something bad external to Slurm happened: the filesystem filled up or failed, slurmdbd was killed, or similar. Since the upgrade finally succeeded, I'm gonna go ahead and close the bug. Please reopen if there's anything left here. |
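
Since the thread turns on whether the conversion eventually completed, a small log check may be useful after a restart. This is a hedged Python sketch that searches log text for the success line quoted earlier in the thread (`Conversion done: success!`) and returns the timestamp of its last occurrence. The helper name `last_conversion_success` and the regular expression are assumptions based solely on that one quoted line; other Slurm versions or log settings may format it differently.

```python
import re

# Pattern modeled on the single line quoted in the thread:
# [2018-04-05T15:46:57.958] Conversion done: success!
SUCCESS_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\] Conversion done: success!",
    re.MULTILINE,
)


def last_conversion_success(log_text):
    """Return the timestamp of the last success line, or None if there is none."""
    matches = SUCCESS_RE.findall(log_text)
    return matches[-1] if matches else None


sample_log = "[2018-04-05T15:46:57.958] Conversion done: success!\n"
print(last_conversion_success(sample_log))  # prints 2018-04-05T15:46:57.958
```

To use it on a real system, read the file named by `LogFile` in slurmdbd.conf (here `/var/log/slurm/slurmdbd.log`) and pass its contents to the function.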