Summary: | slurmdbd upgrade failure to 22.05.2 | ||
---|---|---|---|
Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
Component: | Database | Assignee: | Scott Hilton <scott> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 1 - System not usable | ||
Priority: | --- | CC: | mcmullan, scott, sts, tripiana |
Version: | 22.05.2 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=16746 | ||
Site: | Harvard University | Alineos Sites: | --- |
Attachments: |
slurmdbd log from upgrade
mariadb log from upgrade
messages from upgrade |
Description
Paul Edmon
2022-07-12 04:28:41 MDT
Created attachment 25828 [details]
slurmdbd log from upgrade
Created attachment 25829 [details]
mariadb log from upgrade
Created attachment 25830 [details]
messages from upgrade
The reimport of the database is going to take 7 hours. We've already burned a day on this, so if we can figure out a route forward I'd rather finish the upgrade and move on than reimport, downgrade, and then re-upgrade at a later date, which will likely take as long.

What is taking up most of the space in the database is job_env and job_scripts. Can I just drop those columns from the database, recreate them as empty, and then do the upgrade? Would that work? I don't care much about the data in job_env and job_scripts; we use it mainly for debugging jobs, and I can afford to drop what is in there for now.

I'm going to bump the importance of this to inoperable because we are dead in the water until this database reimport is done, and I'd like to see if we can get this upgrade through.

Comment 5
Carlos Tripiana Montes

Paul,

After looking at the logs I think the DB got corrupted, so I think the reimport of v21 data is needed. Correct me if I'm wrong; otherwise please keep it running.

Once the reimport is done, I think dropping the scripts/envs fields is safe and would suffice to make the migration work.

You can also think about adding more disk space somehow (plugging in a new HDD, NFS to a large drive, ...) and redoing the same migration without removing anything. But if you just don't care about scripts/envs, I think the fastest way is your option.

So, please, reimport the v21 data, drop the fields, and fire the DB migration to v22 again. I think this will work, but if you experience any problem we'll all try to help you ASAP.

We're afraid your system is unusable right now.

Regards,
Carlos.

Comment 6
Carlos Tripiana Montes

Clarification: "dropping the fields" is not a literal action. I meant setting those table columns to their default (empty/NULL) value for all rows in the table.

In the v21 database, this means the following columns of *_job_table:

batch_script
env_vars

Set them to NULL for all job rows.

Sorry.

Okay, that's what I figured. The reimport of the database from v21 is going on right now.
Once that is there I will try dropping those columns. As a favor, can you send me the precise commands for this? While I can probably figure this out, it would be helpful to know the correct commands for removing those columns and re-adding them (so that they are there and empty). Our reimport of the database is going to take a few more hours, so we have some time before I retry the upgrade.

-Paul Edmon-

Do you have a command for that? My plan was to just drop the columns and then re-add them, but if that's not permitted then it would be good to have the command for nulling out the entries.
-Paul Edmon-

Should be something like:

MariaDB > update odyssey_job_table set batch_script=NULL;
MariaDB > update odyssey_job_table set env_vars=NULL;

Comment 10
Carlos Tripiana Montes

To test if all are NULL, query:

MariaDB > select * from odyssey_job_table where batch_script is not NULL;
MariaDB > select * from odyssey_job_table where env_vars is not NULL;

If no rows are returned then all are NULL, so everything got wiped.

Great. Thanks.

-Paul Edmon-

After reimporting the database (8 hours) and purging the job scripts and envs (5 hours), the upgrade of the database finished after 3 hours. The scheduler is back online and up at 22.05.2. Thanks for the assistance.
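For anyone who wants to rehearse the NULL-and-verify steps above before touching a production database, here is a minimal sketch. It uses Python's built-in sqlite3 with an in-memory database standing in for MariaDB. The table name odyssey_job_table and the two column names come from this ticket; the three-column schema and the sample rows are hypothetical, and the real job table has many more columns. Two deliberate variations from the commands above: both columns are cleared in a single UPDATE, and the check uses COUNT(*) so that any leftover scripts are not printed to the terminal. Whether a single UPDATE is faster on a real MariaDB instance has not been measured here.

```python
import sqlite3

# In-memory SQLite stands in for MariaDB. The schema is a hypothetical,
# cut-down version of the job table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE odyssey_job_table ("
    "job_db_inx INTEGER PRIMARY KEY, batch_script TEXT, env_vars TEXT)"
)
conn.executemany(
    "INSERT INTO odyssey_job_table (batch_script, env_vars) VALUES (?, ?)",
    [
        ("#!/bin/bash\nsrun hostname", "PATH=/usr/bin"),
        (None, "HOME=/home/user"),
        ("#!/bin/bash\nsleep 1", None),
    ],
)

# Clear both columns with one statement instead of two separate UPDATEs.
conn.execute("UPDATE odyssey_job_table SET batch_script = NULL, env_vars = NULL")
conn.commit()

# Count rows that still hold data; 0 means everything was wiped.
remaining = conn.execute(
    "SELECT COUNT(*) FROM odyssey_job_table "
    "WHERE batch_script IS NOT NULL OR env_vars IS NOT NULL"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM odyssey_job_table").fetchone()[0]
print(f"{remaining} of {total} rows still hold a script or environment")
```

The same UPDATE and SELECT COUNT(*) statements should be valid in MariaDB as written, but test them on a copy of the data first.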
Anything that can be done to make upgrades faster for the database would be appreciated.

Comment 16
Scott Hilton

I'm glad that did the trick.

In 22.05 we store batch_script and env_vars in their own tables and use ids in the job table to avoid storing duplicate values. So this shouldn't be a big storage issue again.

If each purge transaction still has issues, you could try switching to a daily purge (i.e. PurgeJobAfter=31days instead of PurgeJobAfter=1). This would mean smaller purges more often.

Do you have any follow-up questions or issues for this ticket?

-Scott

Nope. We are all set.

-Paul Edmon-

Closing ticket
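The storage change described for 22.05 (scripts and environments kept in their own tables and referenced by id from the job table) can be illustrated with a small sketch. This is not Slurm's actual schema or code: the table names script_table and job_table, the column names, and the use of a SHA-256 digest to detect duplicates are all assumptions made for the example. It only shows why identical scripts submitted many times need to be stored once.

```python
import hashlib
import sqlite3

# Hypothetical schema: one row per distinct script, jobs point at it by id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE script_table (
        script_id INTEGER PRIMARY KEY,
        script_hash TEXT UNIQUE NOT NULL,
        batch_script TEXT NOT NULL
    );
    CREATE TABLE job_table (
        job_id INTEGER PRIMARY KEY,
        script_id INTEGER REFERENCES script_table(script_id)
    );
    """
)


def store_job(job_id, batch_script):
    """Record a job, storing its script only if an identical one is absent."""
    digest = hashlib.sha256(batch_script.encode()).hexdigest()
    # The UNIQUE constraint on script_hash makes this a no-op for duplicates.
    conn.execute(
        "INSERT OR IGNORE INTO script_table (script_hash, batch_script) VALUES (?, ?)",
        (digest, batch_script),
    )
    script_id = conn.execute(
        "SELECT script_id FROM script_table WHERE script_hash = ?", (digest,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO job_table (job_id, script_id) VALUES (?, ?)",
        (job_id, script_id),
    )


# Three jobs share one script; a fourth job uses a different one.
shared = "#!/bin/bash\nsrun ./a.out\n"
for job_id in (1, 2, 3):
    store_job(job_id, shared)
store_job(4, "#!/bin/bash\nsrun ./b.out\n")

n_jobs = conn.execute("SELECT COUNT(*) FROM job_table").fetchone()[0]
n_scripts = conn.execute("SELECT COUNT(*) FROM script_table").fetchone()[0]
print(f"{n_jobs} jobs reference {n_scripts} stored scripts")
```

With array jobs or repeated submissions of the same script, the number of stored scripts grows with the number of distinct scripts rather than the number of jobs, which is the storage saving the comment above refers to.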