4319 – upgrading from 16.05.8 to 17.02.8

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Other (show other tickets)
Version:	17.02.8
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Felip Moll
Reported:	2017-10-31 08:28 MDT by Robert Yelle
Modified:	2017-11-03 10:49 MDT (History)
CC List:	1 user (show)

See Also:
Site:	University of Oregon

Description Robert Yelle 2017-10-31 08:28:40 MDT

Hello,

This is a generic upgrade ticket.  We will be attempting to upgrade Slurm from 16.05.8 (Bright 7.3 managed RPMs) to a custom-built 17.02.8.  I understand the first step here is to update slurmdbd.  It seems that the only thing involved in the upgrade itself is installing the new slurmdbd and then starting it.  So just to be clear, are the database tables automatically read and changed when the new slurmdbd is launched for the first time?  What is the procedure for rolling back (in case I need to)?

I understand that in addition to the mysqldump, we should back up the StateSaveLocation.  Is there anything else we should back up?

Thanks,

Rob

Comment 1 Felip Moll 2017-10-31 10:28:32 MDT

(In reply to Robert Yelle from comment #0)

Hello Rob,

As you say, SlurmDBD must be upgraded first.

When you restart the daemon, SlurmDBD itself will modify and adjust the required tables and fields. Once that is done, you will not be able to downgrade the daemon directly. If you need to downgrade, restore the database from a mysqldump file and then start the older daemon.

The state files should also be backed up if you want to keep the option of downgrading. Failing to back up this data would result in the loss of all running and pending jobs.
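
As a rough illustration of the two backups discussed above, here is a minimal Python sketch that only builds the shell commands and prints them, so they can be reviewed before anything is run. Everything site-specific is an assumption: the database name `slurm_acct_db` (the usual slurmdbd default, check StorageLoc in slurmdbd.conf), the example paths `/var/spool/slurmctld` and `/root/slurm-backup`, MySQL credentials being supplied outside the command (for example via a client option file), and the helper name `backup_commands` itself, which is hypothetical.

```python
import shlex
from datetime import datetime


def backup_commands(state_save_location, dest_dir, db_name="slurm_acct_db", stamp=None):
    """Return the shell commands for a pre-upgrade backup, in order.

    Nothing is executed here; the caller decides how and when to run them.
    """
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    dump_file = f"{dest_dir}/{db_name}-{stamp}.sql"
    state_archive = f"{dest_dir}/statesave-{stamp}.tar.gz"
    return [
        f"mkdir -p {shlex.quote(dest_dir)}",
        # --databases makes the dump include CREATE DATABASE / USE statements,
        # so it can later be restored with a plain "mysql < dump_file".
        f"mysqldump --single-transaction --databases {shlex.quote(db_name)}"
        f" > {shlex.quote(dump_file)}",
        # Archive the StateSaveLocation; ideally done while slurmctld is stopped.
        f"tar -czf {shlex.quote(state_archive)} -C {shlex.quote(state_save_location)} .",
    ]


if __name__ == "__main__":
    # Example paths are assumptions; take the real StateSaveLocation from
    # "scontrol show config".
    for command in backup_commands("/var/spool/slurmctld", "/root/slurm-backup"):
        print(command)
```

Printing the commands instead of running them is deliberate: on a production cluster it is safer to read each line and run it by hand.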

Things like MPI libraries with Slurm integration should be recompiled, because libslurm.so changes between these versions (16.05 -> 17.02).

Remember that the slurmctld daemon must be upgraded before (or at the same time as) the slurmd daemons.

In general, I recommend following the steps and advice at https://slurm.schedmd.com/quickstart_admin.html

Regarding the rollback procedure:
------------------------------------
1. Stop all daemons
2. Downgrade all daemons
3. Delete database contents and restore the mysqldump copy
4. Restore the StateSaveLocation
5. Start SlurmDBD and check that it works: sacctmgr show cluster/assoc
6. Start slurmctld and check sinfo --version, sinfo, squeue, etc.
7. Start slurmd daemons

In this case the jobs and queue will be recovered, but be careful with the timeouts.
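
The seven rollback steps above could be written down as an ordered plan, for example with the following minimal Python sketch. It only returns and prints commands; it does not run them. The assumptions are: the daemons are managed by systemd units named slurmd, slurmctld and slurmdbd; the database is called `slurm_acct_db`; the dump was taken with `mysqldump --databases` so that it recreates the database on restore; and the state backup is a tar archive of the StateSaveLocation contents. The function name `rollback_plan` and the example paths are hypothetical, and step 2 is left empty because the downgrade command depends on how Slurm was packaged at your site.

```python
import shlex


def rollback_plan(dump_file, state_archive, state_save_location, db_name="slurm_acct_db"):
    """Return the rollback as an ordered list of (step, [shell commands])."""
    drop_sql = f"DROP DATABASE IF EXISTS {db_name}"
    return [
        ("Stop all daemons", [
            "systemctl stop slurmd",      # on every compute node
            "systemctl stop slurmctld",
            "systemctl stop slurmdbd",
        ]),
        # Site-specific: reinstall the old packages or the old build by hand.
        ("Downgrade all daemons", []),
        ("Delete database contents and restore the mysqldump copy", [
            f"mysql -e {shlex.quote(drop_sql)}",
            f"mysql < {shlex.quote(dump_file)}",
        ]),
        ("Restore the StateSaveLocation", [
            f"tar -xzf {shlex.quote(state_archive)} -C {shlex.quote(state_save_location)}",
        ]),
        ("Start SlurmDBD and check that it works", [
            "systemctl start slurmdbd",
            "sacctmgr show cluster",
            "sacctmgr show assoc",
        ]),
        ("Start slurmctld and check it", [
            "systemctl start slurmctld",
            "sinfo --version",
            "sinfo",
            "squeue",
        ]),
        ("Start slurmd daemons", [
            "systemctl start slurmd",     # on every compute node
        ]),
    ]


if __name__ == "__main__":
    plan = rollback_plan("/root/slurm-backup/slurm_acct_db.sql",
                         "/root/slurm-backup/statesave.tar.gz",
                         "/var/spool/slurmctld")
    for number, (step, commands) in enumerate(plan, start=1):
        print(f"{number}. {step}")
        for command in commands:
            print(f"     {command}")
```

Keeping the plan as data makes the ordering explicit: the database and state files are restored only after every daemon is stopped and downgraded, and slurmd is started last.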

If you have any other questions, don't hesitate to reopen the bug.

Best Regards
Felip M

Comment 2 Robert Yelle 2017-11-01 17:14:48 MDT

Hi Felip,

Thank you for the info.  Our Slurm upgrade seems to have gone well, no issues discovered yet...

Rob

Comment 3 Robert Yelle 2017-11-03 10:49:38 MDT

Hi Felip,

So far, so good with the Slurm upgrade: no issues except for the MPI libraries that you already mentioned.

We would also like to implement the network topology plugin, but were unable to get this done during our outage earlier this week.  Would implementing this plugin require an outage, or can we implement this while the cluster is in production?

Thanks,

Rob
