Ticket 4319 - upgrading from 16.05.8 to 17.02.8
Summary: upgrading from 16.05.8 to 17.02.8
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.02.8
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
 
Reported: 2017-10-31 08:28 MDT by Robert Yelle
Modified: 2017-11-03 10:49 MDT
Site: University of Oregon


Description Robert Yelle 2017-10-31 08:28:40 MDT
Hello,

This is a generic upgrade ticket.  We will be attempting to upgrade Slurm from 16.05.8 (Bright 7.3 managed RPMs) to a custom-built 17.02.8.  I understand the first step here is to update slurmdbd.  It seems that the only thing involved in the upgrade itself is installing the new slurmdbd and then starting it.  So just to be clear, are the database tables automatically read and changed when the new slurmdbd is launched for the first time?  What is the procedure for rolling back (in case I need to)?

I understand that in addition to the mysqldump, we should back up the StateSaveLocation.  Is there anything else we should back up?

Thanks,

Rob
Comment 1 Felip Moll 2017-10-31 10:28:32 MDT
(In reply to Robert Yelle from comment #0)
Hello Rob,

As you say, SlurmDBD must be upgraded first.

When you start the new daemon, SlurmDBD itself will modify and adjust the required tables and fields. Once that is done, you will not be able to downgrade the daemon directly. If you need to downgrade, restore the database from a mysqldump file and then start the older daemon.

The state files should also be backed up if you want to keep the option of downgrading. Failing to back up this data would result in the loss of all running and pending jobs.
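
The two backups above could be scripted along these lines. This is only a minimal sketch: the database name slurm_acct_db, the StateSaveLocation path, and the destination directory are assumptions for illustration, so take the real values from your slurmdbd.conf and slurm.conf. The function only builds the command lines and does not run anything.

```python
import shlex
from datetime import date


def backup_commands(db_name, state_dir, dest_dir, stamp):
    """Return shell command lines for a pre-upgrade backup (nothing is executed)."""
    dump_file = f"{dest_dir}/{db_name}-{stamp}.sql"
    state_tar = f"{dest_dir}/statesave-{stamp}.tar.gz"
    return [
        # Dump the accounting database so it can be restored before a downgrade.
        "mysqldump --single-transaction --databases "
        f"{shlex.quote(db_name)} > {shlex.quote(dump_file)}",
        # Archive the StateSaveLocation directory (running and pending job state).
        f"tar -czf {shlex.quote(state_tar)} -C {shlex.quote(state_dir)} .",
    ]


if __name__ == "__main__":
    # All four arguments are hypothetical example values.
    for cmd in backup_commands(
        "slurm_acct_db",
        "/var/spool/slurmctld",
        "/root/slurm-backup",
        date.today().isoformat(),
    ):
        print(cmd)
```

Printing the commands first makes it easy to review them before running them by hand during the outage.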

Software that links against Slurm, such as MPI libraries with Slurm integration, should be recompiled because libslurm.so changes in your case (16.05 -> 17.02).

Remember that the slurmctld daemon must be upgraded before (or at the same time as) the slurmd daemons.
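
As a rough illustration of the ordering rule (slurmdbd first, then slurmctld, then slurmd), a planned sequence can be checked like this. The helper name order_violations is made up for this example and is not part of Slurm; the sketch also treats the order as strict and does not model the "at the same time" case.

```python
# Order in which the daemons should be upgraded, as described in this comment.
UPGRADE_ORDER = ["slurmdbd", "slurmctld", "slurmd"]


def order_violations(plan):
    """Return (earlier, later) pairs in `plan` that contradict UPGRADE_ORDER."""
    rank = {name: i for i, name in enumerate(UPGRADE_ORDER)}
    # Ignore anything in the plan that is not one of the three daemons.
    known = [name for name in plan if name in rank]
    violations = []
    for i, first in enumerate(known):
        for second in known[i + 1:]:
            if rank[first] > rank[second]:
                violations.append((first, second))
    return violations


if __name__ == "__main__":
    print(order_violations(["slurmdbd", "slurmctld", "slurmd"]))
    print(order_violations(["slurmd", "slurmdbd", "slurmctld"]))
```

An empty list means the plan respects the order; each returned pair names a daemon that is scheduled too early.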

Basically, I recommend following the steps and advice from https://slurm.schedmd.com/quickstart_admin.html

Regarding the rollback procedure:
------------------------------------
1. Stop all daemons
2. Downgrade all daemons
3. Delete database contents and restore the mysqldump copy
4. Restore the StateSaveLocation
5. Start SlurmDBD and check that it works: sacctmgr show cluster, sacctmgr show assoc
6. Start slurmctld and check sinfo --version, sinfo, squeue, etc.
7. Start slurmd daemons

In this case the jobs and the queue will be recovered, but be careful with the timeouts.
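
The rollback steps above can be kept as a small checklist that pairs each step with the verification commands mentioned for it. This is a sketch for printing a runbook only: it does not stop, downgrade, or start anything, and the function name rollback_checklist is hypothetical.

```python
# Each entry: (step description, verification commands to run after the step).
ROLLBACK_STEPS = [
    ("Stop all daemons", []),
    ("Downgrade all daemons", []),
    ("Delete database contents and restore the mysqldump copy", []),
    ("Restore the StateSaveLocation", []),
    ("Start SlurmDBD", ["sacctmgr show cluster", "sacctmgr show assoc"]),
    ("Start slurmctld", ["sinfo --version", "sinfo", "squeue"]),
    ("Start slurmd daemons", []),
]


def rollback_checklist():
    """Return the rollback steps as numbered lines with their checks indented."""
    lines = []
    for number, (step, checks) in enumerate(ROLLBACK_STEPS, start=1):
        lines.append(f"{number}. {step}")
        for check in checks:
            lines.append(f"   verify: {check}")
    return lines


if __name__ == "__main__":
    print("\n".join(rollback_checklist()))
```

Keeping the checks next to the step they belong to makes it harder to start slurmctld before SlurmDBD has been confirmed to work.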

If you have any other questions, don't hesitate to reopen the bug.

Best Regards
Felip M
Comment 2 Robert Yelle 2017-11-01 17:14:48 MDT
Hi Felip,

Thank you for the info.  Our Slurm upgrade seems to have gone well, no issues discovered yet...

Rob


Comment 3 Robert Yelle 2017-11-03 10:49:38 MDT
Hi Felip,

So far, so good with the Slurm upgrade: no issues except for the MPI libraries that you already mentioned.

We would also like to implement the network topology plugin, but we were unable to get this done during our outage earlier this week.  Would implementing this plugin require an outage, or can we do it while the cluster is in production?

Thanks,

Rob


