Ticket 14514 - slurmdbd upgrade failure to 22.05.2
Summary: slurmdbd upgrade failure to 22.05.2
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 22.05.2
Hardware: Linux
Severity: 1 - System not usable
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-07-12 04:28 MDT by Paul Edmon
Modified: 2023-06-19 14:43 MDT
CC List: 4 users

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmdbd log from upgrade (1.97 KB, text/plain)
2022-07-12 04:33 MDT, Paul Edmon
Details
mariadb log from upgrade (648.93 KB, application/x-bzip)
2022-07-12 04:35 MDT, Paul Edmon
Details
messages from upgrade (2.02 MB, application/x-bzip)
2022-07-12 04:35 MDT, Paul Edmon
Details

Description Paul Edmon 2022-07-12 04:28:41 MDT
We attempted to upgrade from 21.08.8-2 to 22.05.2 yesterday.  After several hours of waiting, the database upgrade failed because the system ran out of space.  I've attached the logs; the failure happens around 23:12:39.  The upgrade itself had been running since 11:15.  The database in question is about 586 GB uncompressed.  We have MySQL compression turned on, so it is 198 GB on disk when running.  The upgrade, however, caused the host it is on to run out of its 800 GB of space:

[root@holy-slurm02 backup]# df -h /
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  893G  383G  510G  43% /

I'm going to reimport our old database and revert to 21.08.8-2, as our cluster has been down since yesterday morning due to the upgrade.  We will need a path forward, though: under 21.08.8-2 we were suffering from deadlock issues when the database did its monthly purge, so we disabled the purge to bridge until 22.05.2, which in theory can handle it more smoothly.  As a result, our database keeps growing in size and will only become more difficult to upgrade.

I've attached the logs from the system.  Let me know what other information you need from me.
Comment 1 Paul Edmon 2022-07-12 04:33:00 MDT
Created attachment 25828 [details]
slurmdbd log from upgrade
Comment 2 Paul Edmon 2022-07-12 04:35:12 MDT
Created attachment 25829 [details]
mariadb log from upgrade
Comment 3 Paul Edmon 2022-07-12 04:35:33 MDT
Created attachment 25830 [details]
messages from upgrade
Comment 4 Paul Edmon 2022-07-12 06:01:58 MDT
The reimport of the database is going to take 7 hours.  We've already burned a day on this, so if we can figure out a route forward I'd rather just finish the upgrade and move on, rather than reimport, downgrade, and then re-upgrade at a later date, which would likely take just as long.

What is taking up most of the space in the database is the job_env and job_scripts data.  Can I just drop those columns from the database, recreate them empty, and then do the upgrade?  Would that work?  I don't care much about the data in job_env and job_scripts; we use it mainly for debugging jobs, and I can afford to drop what is in there for now.

I'm going to bump the importance of this ticket to inoperable, because we are currently dead in the water until this database reimport is done, and I'd like to see if we can actually get this upgrade through.
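To confirm which tables dominate the on-disk footprint before deciding what to clear, a query along these lines against information_schema can be used (the schema name slurm_acct_db is an assumption; substitute the actual name of the accounting database):

```sql
-- List the ten largest tables in the accounting database by on-disk size.
-- 'slurm_acct_db' is an assumed name; replace with your actual schema.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'slurm_acct_db'
ORDER BY (data_length + index_length) DESC
LIMIT 10;
```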
Comment 5 Carlos Tripiana Montes 2022-07-12 07:28:33 MDT
Paul,

After looking at the logs, I think the DB got corrupted, so the reimport of the v21 data is needed. Correct me if I'm wrong; otherwise, please keep it running.

Once the reimport is done, I think dropping the scripts/envs fields is safe and should suffice to make the migration work.

You could also consider adding more disk space somehow (plugging in a new HDD, NFS to a large drive, ...) and redoing the same migration without removing anything. But if you just don't care about the scripts/envs, I think your option is the fastest way.

So please reimport the v21 data, drop the fields, and fire off the DB migration to v22 again. I think this will work, but if you experience any problems we'll all try to help you ASAP.

We're afraid your system is unusable right now.

Regards,
Carlos.
Comment 6 Carlos Tripiana Montes 2022-07-12 07:36:43 MDT
Clarification: "dropping" the fields is not meant literally. I meant setting those table columns to their default (empty/NULL) value for all rows in the table.

In the v21 database, this means the following columns of the *_job_table:

batch_script
env_vars

Set to NULL for all job rows.

Sorry.
Comment 7 Paul Edmon 2022-07-12 07:40:11 MDT
Okay, that's what I figured.  The reimport of the v21 database is going on right now.  Once that is done, I will try clearing those columns.

As a favor, can you send me the precise commands for this?  While I can probably figure it out, it would be helpful to know the correct commands for removing those columns and re-adding them (so that they are there and empty).

Our reimport of the database is going to take a few more hours, so we have some time before I retry the upgrade.

-Paul Edmon-

Comment 8 Paul Edmon 2022-07-12 07:47:19 MDT
Do you have a command for that?  My plan was to just drop the columns and then re-add them, but if that's not permitted, it would be good to have the command for nulling out the entries.

-Paul Edmon-

Comment 9 Carlos Tripiana Montes 2022-07-12 08:00:10 MDT
It should be something like:

MariaDB > update odyssey_job_table set batch_script=NULL;
MariaDB > update odyssey_job_table set env_vars=NULL;
Comment 10 Carlos Tripiana Montes 2022-07-12 08:02:01 MDT
To test whether they are all NULL, query:

MariaDB > select * from odyssey_job_table where batch_script is not NULL;
MariaDB > select * from odyssey_job_table where env_vars is not NULL;

If no rows are returned, then they are all NULL and everything got wiped.
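One caveat on a table this size: a single UPDATE over every row is one very large transaction, which can run for hours and bloat the undo log. A batched variant along these lines is a sketch of an alternative, assuming the same odyssey_job_table (MariaDB allows LIMIT on a single-table UPDATE):

```sql
-- Sketch: clear batch_script in batches to keep each transaction small.
-- Rerun until it reports 0 rows affected; use the same pattern for env_vars.
UPDATE odyssey_job_table
   SET batch_script = NULL
 WHERE batch_script IS NOT NULL
 LIMIT 100000;
```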
Comment 11 Paul Edmon 2022-07-12 08:04:29 MDT
Great.  Thanks.

-Paul Edmon-

Comment 15 Paul Edmon 2022-07-12 21:19:12 MDT
After reimporting the database (8 hours) and purging the job scripts and envs (5 hours), the upgrade of the database finished after 3 hours.

The scheduler is back online and up at 22.05.2

Thanks for the assistance.  Anything that can be done to make database upgrades faster would be appreciated.
Comment 16 Scott Hilton 2022-07-13 10:40:12 MDT
I'm glad that did the trick.

In 22.05 we store batch_script and env_vars in their own tables and use IDs in the job table to avoid storing duplicate values, so this shouldn't become a big storage issue again.
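For reference, the stored data can be pulled back out through sacct in 22.05 (the job ID below is an example; check your sacct man page for these options):

```
# Retrieve the stored batch script / environment for a completed job
sacct -j 12345 --batch-script
sacct -j 12345 --env-vars
```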

If each purge transaction still has issues, you could try switching to a daily purge (e.g. PurgeJobAfter=31days instead of PurgeJobAfter=1, where a bare number is interpreted as months).  This would mean smaller purges, more often.
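As an illustration, a slurmdbd.conf fragment for day-granularity purging might look like the following (the values are examples, not recommendations; tune the retention to your site's needs):

```
# slurmdbd.conf -- example purge settings. A bare number means months and
# purges monthly; a "days" suffix makes the purge run daily in smaller chunks.
PurgeJobAfter=31days
PurgeStepAfter=31days
PurgeEventAfter=31days
```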

Do you have any follow up questions or issues for this ticket?

-Scott
Comment 17 Paul Edmon 2022-07-13 11:45:44 MDT
Nope.  We are all set.

-Paul Edmon-

Comment 18 Scott Hilton 2022-07-13 13:29:20 MDT
Closing ticket