Ticket 14514

Summary:	slurmdbd upgrade failure to 22.05.2
Product:	Slurm	Reporter:	Paul Edmon <pedmon>
Component:	Database	Assignee:	Scott Hilton <scott>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	1 - System not usable
Priority:	---	CC:	mcmullan, scott, sts, tripiana
Version:	22.05.2
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=16746
Site:	Harvard University	Slinky Site:	---
Attachments:	slurmdbd log from upgrade; mariadb log from upgrade; messages from upgrade

Description Paul Edmon 2022-07-12 04:28:41 MDT

We attempted to upgrade to 22.05.2 from 21.08.8-2 yesterday.  After we had waited several hours for the database to upgrade, the upgrade failed because the system ran out of space.  I've attached the logs.  The failure happens around 23:12:39.  The upgrade itself had been running since 11:15.  The database in question is about 586 GB uncompressed.  We have MySQL compression turned on, so it is 198 GB on disk when running.  The upgrade, however, caused the host it is on to run out of its 800 GB of space:

[root@holy-slurm02 backup]# df -h /
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  893G  383G  510G  43% /

I'm going to reimport our old database and revert to 21.08.8-2, as our cluster has been down since yesterday morning due to the upgrade.  We will still need a path forward, though: under 21.08.8-2 we were suffering from deadlock issues when the database did its monthly purge, so we disabled the purge as a bridge until 22.05.2, which in theory could handle it more smoothly.  As a result, our database keeps growing in size and will only become more difficult to upgrade.

I've attached the logs from the system.  Let me know what other information you need from me.

Comment 1 Paul Edmon 2022-07-12 04:33:00 MDT

Created attachment 25828 [details]
slurmdbd log from upgrade

Comment 2 Paul Edmon 2022-07-12 04:35:12 MDT

Created attachment 25829 [details]
mariadb log from upgrade

Comment 3 Paul Edmon 2022-07-12 04:35:33 MDT

Created attachment 25830 [details]
messages from upgrade

Comment 4 Paul Edmon 2022-07-12 06:01:58 MDT

The reimport of the database is going to take 7 hours.  We've already burned a day on this, so if we can figure out a route forward I'd rather just finish the upgrade and move on than reimport, downgrade and then reupgrade at a later date, which will likely take as long.

What is taking up most of the space in the database are the job_env and job_scripts.  Can I just drop those columns from the database, recreate them as empty and then do the upgrade?  Would that work?  I don't actually care that much about the data in job_env and job_scripts as we use it mainly for debugging jobs and I can afford to drop what is in there for now.

I'm going to bump the importance of this to inoperable because we are currently dead in the water until this database reimport is done, and I'd like to see if we can actually get this upgrade through.
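
For reference, one way to confirm which tables account for most of the space is to query information_schema. This is only a sketch: the schema name slurm_acct_db is an assumption (it is the usual default, but use the StorageLoc value from slurmdbd.conf if it differs), and the sizes MariaDB reports for InnoDB tables are estimates.

```sql
-- Assumption: the accounting database is named slurm_acct_db.
-- Lists the ten largest tables by estimated data + index size.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS approx_size_gb
FROM information_schema.tables
WHERE table_schema = 'slurm_acct_db'
ORDER BY data_length + index_length DESC
LIMIT 10;
```

If the job table dominates the list, that is consistent with the stored job scripts and environments being the bulk of the data.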

Comment 5 Carlos Tripiana Montes 2022-07-12 07:28:33 MDT

Paul,

After looking at the logs I think the DB got corrupted so I think the reimport of v21 data is needed. Correct me if I'm wrong, otherwise please keep it running.

Once reimport is done, I think dropping scripts/envs fields is safe and would suffice to make the migration work.

You can also think about adding more disk space somehow (plugging new HDD, NFS to a large drive,...) and redo the same migration w/o removing anything. But if you just don't care about scripts/envs I think the fastest way is your option.

So, please, reimport v21 data, drop the fields and fire the DB migration to v22 again. I think this will work but if you experience any problem we'll all try to help you asap.

We're afraid your system is unusable right now.

Regards,
Carlos.

Comment 6 Carlos Tripiana Montes 2022-07-12 07:36:43 MDT

Clarification: dropping the fields is not meant literally. I meant setting those table columns to their default (empty/NULL) value for all rows in the table.

In the v21 database, this applies to the following columns of table *_job_table:

batch_script
env_vars

Set to NULL for all job rows.

Sorry.

Comment 7 Paul Edmon 2022-07-12 07:40:11 MDT

Okay, that's what I figured.  The reimport of the database from v21 is 
going on right now.  Once that is there I will try dropping those columns.

As a favor, can you send me the precise commands for this?  While I can 
probably figure this out, it would be helpful to know the correct 
commands for removing those columns and re-adding them (so that they are 
there and empty).

Our reimport of the database is going to take a few more hours so we 
have some time before I will retry the upgrade.

-Paul Edmon-

Comment 8 Paul Edmon 2022-07-12 07:47:19 MDT

Do you have a command for that?  My plan was to just drop the columns 
and then readd them but if that's not permitted then it would be good to 
have the command for nulling out the entries.

-Paul Edmon-

Comment 9 Carlos Tripiana Montes 2022-07-12 08:00:10 MDT

Should be something like:

MariaDB > update odyssey_job_table set batch_script=NULL;
MariaDB > update odyssey_job_table set env_vars=NULL;
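
As a possible variation on the two statements above, both columns can be cleared in a single pass so the table is scanned once rather than twice. This is an untested sketch, not a verified procedure; it assumes slurmdbd is stopped and a backup of the v21 database exists before it is run.

```sql
-- Clears both columns in one statement.
-- The WHERE clause skips rows that are already empty in both columns.
UPDATE odyssey_job_table
SET batch_script = NULL,
    env_vars = NULL
WHERE batch_script IS NOT NULL
   OR env_vars IS NOT NULL;
```

The end state is the same as after running the two separate updates: every row has NULL in both columns.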

Comment 10 Carlos Tripiana Montes 2022-07-12 08:02:01 MDT

To test whether all are NULL, query:

MariaDB > select * from odyssey_job_table where batch_script is not NULL;
MariaDB > select * from odyssey_job_table where env_vars is not NULL;

If no rows are returned then all are null so everything got wiped.
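
A lighter-weight form of the same check, offered as a sketch, is to count the remaining rows instead of selecting them, so that any leftover script or environment text is not printed to the terminal:

```sql
-- Counts rows that still hold a script or an environment.
-- A result of 0 means both columns are NULL for every job row.
SELECT COUNT(*) AS rows_with_data
FROM odyssey_job_table
WHERE batch_script IS NOT NULL
   OR env_vars IS NOT NULL;
```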

Comment 11 Paul Edmon 2022-07-12 08:04:29 MDT

Great.  Thanks.

-Paul Edmon-

Comment 15 Paul Edmon 2022-07-12 21:19:12 MDT

After reimporting the database (8 hours) and purging the job scripts and envs (5 hours), the upgrade of the database finished after 3 hours.

The scheduler is back online and running 22.05.2.

Thanks for the assistance.  Anything that can be done to make upgrades faster for the database would be appreciated.

Comment 16 Scott Hilton 2022-07-13 10:40:12 MDT

I'm glad that did the trick.

In 22.05 we store batch_script and env_vars in their own tables and use ids in the job table to avoid storing duplicate values. So this shouldn't be a big storage issue again.

If each purge transaction still has issues, you could try switching to a daily purge (i.e. PurgeJobAfter=31days instead of PurgeJobAfter=1; a value without units is interpreted as months). This would mean smaller purges more often.

Do you have any follow up questions or issues for this ticket?

-Scott

Comment 17 Paul Edmon 2022-07-13 11:45:44 MDT

Nope.  We are all set.

-Paul Edmon-

Comment 18 Scott Hilton 2022-07-13 13:29:20 MDT

Closing ticket