We attempted to upgrade from 21.08.8-2 to 22.05.2 yesterday. After several hours waiting for the database to convert, the upgrade failed because the system ran out of space. I've attached the logs; the failure happens around 23:12:39, and the upgrade had been running since 11:15. The database in question is about 586 GB uncompressed. We have MySQL compression turned on, so it is 198 GB on disk when running. The upgrade, however, caused the host it is on to run out of its 800 GB of space:

[root@holy-slurm02 backup]# df -h /
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  893G  383G  510G  43% /

I'm going to reimport our old database and revert to 21.08.8-2, as our cluster has been down since yesterday morning because of the upgrade. We will need a path forward, though: under 21.08.8-2 we were suffering from deadlock issues when the database did its monthly purge, so we disabled the purge to bridge until 22.05.2, which in theory handles it more smoothly. As a result our database keeps growing in size and will only become more difficult to upgrade. Let me know what other information you need from me.
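One way to see which tables dominate that footprint is a quick query against information_schema. This is only a sketch, not something from this ticket, and it assumes the accounting database uses the default name slurm_acct_db; substitute the real schema name.

-- sketch: list the ten largest tables in the accounting DB
-- (assumes the database name is slurm_acct_db)
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'slurm_acct_db'
ORDER BY (data_length + index_length) DESC
LIMIT 10;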
Created attachment 25828 [details] slurmdbd log from upgrade
Created attachment 25829 [details] mariadb log from upgrade
Created attachment 25830 [details] messages from upgrade
The reimport of the database is going to take 7 hours. We've already burned a day on this, so if we can figure out a route forward I'd rather just finish the upgrade and move on, rather than reimport, downgrade, and then re-upgrade at a later date, which will likely take just as long.

What is taking up most of the space in the database are the job_env and job_scripts. Can I just drop those columns from the database, recreate them empty, and then do the upgrade? Would that work? I don't actually care that much about the data in job_env and job_scripts, as we use it mainly for debugging jobs, and I can afford to lose what is in there for now.

I'm going to bump the importance of this to inoperable because we are currently dead in the water until this database reimport is done, and I'd like to see if we can get this upgrade through.
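As a sanity check before touching anything, the footprint of those fields can be estimated directly. This is only a sketch, not from the ticket; it assumes the columns are batch_script and env_vars in the cluster's job table (odyssey_job_table, per the commands later in this thread).

-- sketch: estimate how much of the job table the script/env columns hold
SELECT COUNT(*) AS jobs,
       ROUND(AVG(COALESCE(LENGTH(batch_script), 0))) AS avg_script_bytes,
       ROUND(AVG(COALESCE(LENGTH(env_vars), 0)))     AS avg_env_bytes,
       ROUND(SUM(COALESCE(LENGTH(batch_script), 0)
               + COALESCE(LENGTH(env_vars), 0)) / 1024 / 1024 / 1024, 1) AS script_env_gb
FROM odyssey_job_table;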
Paul,

After looking at the logs, I think the DB got corrupted, so the reimport of the v21 data is needed. Correct me if I'm wrong; otherwise please keep it running.

Once the reimport is done, I think dropping the scripts/envs fields is safe and should be enough to make the migration work.

You could also think about adding more disk space somehow (plugging in a new HDD, NFS to a large drive, ...) and redoing the same migration without removing anything. But if you just don't care about the scripts/envs, I think the fastest way is your option.

So please reimport the v21 data, drop the fields, and start the DB migration to v22 again. I think this will work, but if you experience any problem we'll all try to help you ASAP.

We're afraid your system is unusable right now.

Regards,
Carlos.
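For the reimport step itself, a minimal sketch from within the MariaDB client would look like the following. The database name and dump path are assumptions (default slurm_acct_db and a hypothetical file), not details from this ticket.

-- assumes the default accounting DB name and a hypothetical dump path
MariaDB > USE slurm_acct_db;
MariaDB > SOURCE /path/to/slurm_v21_dump.sql;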
Clarification: "dropping the fields" is not meant literally. What I meant is setting those table columns to their default (empty/NULL) value for all rows in the table.

In the v21 database this means, in the *_job_table tables, setting:

batch_script
env_vars

to NULL for all job rows.

Sorry.
Okay, that's what I figured. The reimport of the database from v21 is going on right now. Once that is there I will try dropping those columns.

As a favor, can you send me the precise commands for this? While I can probably figure it out, it would be helpful to know the correct commands for removing those columns and re-adding them (so that they are there and empty). Our reimport of the database is going to take a few more hours, so we have some time before I retry the upgrade.

-Paul Edmon-
Do you have a command for that? My plan was to just drop the columns and then re-add them, but if that's not permitted then it would be good to have the command for nulling out the entries.

-Paul Edmon-
Should be something like:

MariaDB > update odyssey_job_table set batch_script=NULL;
MariaDB > update odyssey_job_table set env_vars=NULL;
To test that all values are NULL, query:

MariaDB > select * from odyssey_job_table where batch_script is not NULL;
MariaDB > select * from odyssey_job_table where env_vars is not NULL;

If no rows are returned then all the values are NULL, so everything got wiped.
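If a single full-table UPDATE turns out to be too heavy on a table this size, a batched variant is possible. This is only a sketch, not something run in this ticket, and it assumes the job table's auto-increment primary key is job_db_inx.

-- sketch: clear the columns in id ranges so each transaction stays small
-- (assumes job_db_inx is the job table's auto-increment primary key)
SELECT MIN(job_db_inx), MAX(job_db_inx) FROM odyssey_job_table;
UPDATE odyssey_job_table
   SET batch_script = NULL, env_vars = NULL
 WHERE job_db_inx BETWEEN 1 AND 1000000;
-- repeat with the next range (1000001-2000000, ...) until the max id is covered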
Great. Thanks.

-Paul Edmon-
After reimporting the database (8 hours) and purging the job scripts and envs (5 hours), the upgrade of the database finished after 3 hours. The scheduler is back online and running 22.05.2.

Thanks for the assistance. Anything that can be done to make database upgrades faster would be appreciated.
I'm glad that did the trick.

In 22.05 we store batch_script and env_vars in their own tables and use ids in the job table to avoid storing duplicate values, so this shouldn't be a big storage issue again.

If each purge transaction still has issues, you could try switching to a daily purge (i.e. PurgeJobAfter=31days instead of PurgeJobAfter=1). This would mean smaller purges more often.

Do you have any follow-up questions or issues for this ticket?

-Scott
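For reference, a sketch of what the relevant slurmdbd.conf lines might look like for that; the values are illustrative only, not a recommendation for this site.

# per the note above: specifying the value in days makes the purge run daily,
# so records older than 31 days are removed in smaller, more frequent batches
PurgeJobAfter=31days
# related purge options (e.g. PurgeStepAfter, PurgeResvAfter) take the same form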
Nope. We are all set.

-Paul Edmon-
Closing ticket