Ticket 3770 - 14.03 -> 17.02 upgrade plan
Summary: 14.03 -> 17.02 upgrade plan
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.02.2
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
 
Reported: 2017-05-03 15:25 MDT by Lyn
Modified: 2017-05-22 13:50 MDT

See Also:
Site: NASA - NCCS
Machine Name: Discover


Description Lyn 2017-05-03 15:25:23 MDT
Aloha from NCCS-Way-Way-West, SchedMD Colleagues:

Circumstances have evolved such that we not only can, but must, upgrade from 14.03 to 17.02, and pretty darn soon (before the next large hw delivery hits the loading dock, this summer). We'll be upgrading a small cluster first, as a proving ground, before upgrading the (main) Discover cluster.  

We know that there is at least one big DB conversion that must happen along this transition path. Is there more than one, in that range?

Per an earlier ticket (3112), we will also be folding in the update of UIDs in all job entries. (Agency-wide updates to all UIDs took place during the last calendar year or so.)

Given the two-releases limit on upgrades, it would seem that the minimal upgrade path is 14.03 -> 15.08 -> 17.02. Does that sound right?

Given this list of things to do, we'll appreciate any guidance, insights, etc. We see this process as potentially useful to other sites, and would plan to document it for sharing at future SLUGs and SC conferences. If SchedMD has any specific interests in the process, methods, or how we document it, perhaps a phone conversation would be in order?

Many thanks,
Lyn & Bruce
Comment 1 Tim Wickberg 2017-05-03 15:34:24 MDT
(In reply to Lyn from comment #0)
> Aloha from NCCS-Way-Way-West, SchedMD Colleagues:
>
> Circumstances have evolved such that we not only can, but must, upgrade from
> 14.03 to 17.02, and pretty darn soon (before the next large hw delivery hits
> the loading dock, this summer). We'll be upgrading a small cluster first, as
> a proving ground, before upgrading the (main) Discover cluster.  
> 
> We know that there is at least one big DB conversion that must happen along
> this transition path. Is there more than one, in that range?

The 15.08 conversion is the "big" one, and can take a bit of time depending on the size of the job table. The 17.02 conversion should be almost immediate; nothing exciting changed in 16.05 or 17.02.

Obviously, you'd be well advised to make a backup of the mysql database before any conversion work.
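That backup step might be sketched as follows. The database name, dump path, and service name are assumptions (check your slurmdbd.conf StorageLoc); the `run` helper only echoes each command so the sequence can be reviewed before being run for real:

```shell
#!/bin/sh
# Sketch: back up the accounting DB before the 15.08 schema conversion.
# 'slurm_acct_db' and the dump path are hypothetical defaults.
DB=slurm_acct_db
DUMP=/var/backups/${DB}-$(date +%Y%m%d).sql

run() { echo "+ $*"; }   # swap 'echo' for real execution when ready

run systemctl stop slurmdbd                          # quiesce writers first
run mysqldump --single-transaction "$DB" ">" "$DUMP" # consistent snapshot
run gzip "$DUMP"
```

With the dump safely stored, the first start of the 15.08 slurmdbd can perform the conversion knowing there is a restore point.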

> Per an earlier ticket (3112), we will also be folding in the update of UIDs
> in all job entries. (Agency-wide updates to all UIDs took place during the
> last calendar year or so.)
> 
> Given the two-releases limit on upgrades, it would seem that the minimal
> upgrade path is 14.03 -> 15.08 -> 17.02. Does that sound right?

Yep. That'd be my recommendation.

> Given this list of things to do, we'll appreciate any guidance, insights,
> etc. We see this process as potentially useful to other sites, and would
> plan to document it for sharing at future SLUGs and SC conferences. If
> SchedMD has any specific interests in the process, methods, or how we
> document it, perhaps a phone conversation would be in order?

I assume the cluster will be down for other maintenance when this happens, and that you're not worried about anything in the queue?

If so, then just move slurmdbd to 15.08, then 17.02, and then bring up the cluster fresh on 17.02. You'd lose the various queues, though I'm imagining that's not a huge problem.

If you're trying to do an in-place upgrade, you'd need to get everything to 15.08 first, then move it all again to 17.02. That's likely more hassle than it's worth.
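The two-hop slurmdbd move could be sketched like this. Package names and systemd units are assumptions; the `-D -vvv` foreground start is a common way to watch the schema conversion finish before daemonizing. The helper only echoes the commands:

```shell
#!/bin/sh
# Dry-run sketch of the slurmdbd hop: 14.03 -> 15.08 -> 17.02.
# The schema conversion runs automatically the first time the newer
# slurmdbd starts against the old database.
run() { echo "+ $*"; }

for ver in 15.08 17.02; do
    run systemctl stop slurmdbd
    run yum install "slurm-slurmdbd-${ver}"  # hypothetical package name
    # First start performs the conversion; run in the foreground so you
    # can watch it complete, then hand off to the service manager.
    run slurmdbd -D -vvv
    run systemctl start slurmdbd
done
```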

You might want to peruse the various release notes for anything exciting, especially if you have any custom plugins.

If you want to attach the current slurm.conf I can skim through it for any likely problems, although everything should just work.

cheers,
- Tim
Comment 2 Lyn 2017-05-11 10:16:25 MDT
Thanks for your responses, Tim. We are indeed trying to get to 17.02 w/zero service downtime and w/out losing running or queued work.

Sidebar: we have eliminated the UID update from the plan; we completed that work this week.

Upgrade Process:
We believe we have the ability to get the slurmdbd 15.08 upgrade and DB conversion done before the 14.03 ctrld would be overwhelmed w/cached job records.

Our main concern is the optimal dance for the primary and backup ctrld upgrades. Note: our slurmdbd and primary slurmctld are on different nodes. We have a weak, old-hardware backup ctrld node that we need to replace w/a larger, current-hardware node.

Here are the options we see, and related questions. Consider step 0 to be the dbd upgrade, prior to these steps:

Option 1:
1) Bring up a new node w/15.08 ctrld installed and running
2) Modify slurm.conf to point to the new node as backup ctrld (and remove the weak backup ctrld), and reconfig.
3) Make sure ctrld changes are reflected in scontrol show conf.
3) Take down the primary ctrld and upgrade.
5) Bring primary ctrld back up at 15.08.
6) Update ~3000 slurmd nodes on a phased basis, over a couple days.
7) Repeat for 17.02.
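Step 2 above amounts to a slurm.conf change along these lines (hostnames are hypothetical; in the 15.08/17.02 era the controllers were named via ControlMachine/BackupController rather than the later SlurmctldHost):

```
# slurm.conf excerpt -- hypothetical hostnames
ControlMachine=discover-ctl1
#BackupController=old-weak-node   # removed weak backup
BackupController=new-ctl2          # new, current-hardware node
BackupAddr=new-ctl2
```

followed by `scontrol reconfigure` to push the change out.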

Main question: Would the 14.03 primary ctld freak out upon seeing a 15.08 backup come online, however briefly, before we fail over to the backup? 

Is there a benefit to failing over to the existing, hardware-weak backup initially, before taking down the primary ctld for its upgrade? Like so:

Option 2:
1) Fail over to the weak 14.03 backup ctrld.
2) Upgrade primary ctrld and restart. (So, fail over ctrld back to primary.)
3) Bring up the new node w/15.08 ctrld installed and running.
4) Modify slurm.conf to point to the new node as backup ctrld (and remove the weak backup ctrld), and reconfig.
5) Update ~3000 slurmd nodes on a phased basis, over a couple days.
6) Repeat for 17.02

Any showstopper gotchas? Appreciate your feedback for these specifics.

Thanks much,
Lyn
Comment 3 Tim Wickberg 2017-05-11 10:36:47 MDT
(In reply to Lyn from comment #2)
> Thanks for your responses, Tim. We are indeed trying to get to 17.02 w/zero
> service downtime and w/out losing running or queued work.
> 
> Sidebar: we have eliminated the UID update from the plan; we completed that
> work this week.
> 
> Upgrade Process:
> We believe we have the ability to get the slurmdbd 15.08 upgrade and DB
> conversion done before the 14.03 ctrld would be overwhelmed w/cached job
> records.
> 
> Our main concern is the optimal dance for the primary and backup ctrld
> upgrades. Note: our slurmdbd and primary slurmctld are on different nodes.
> We have a weak, old-hardware backup ctrld node that we need to replace w/a
> larger, current-hardware node.
> 
> Here are the options we see, and related questions. Consider step 0 to be
> the dbd upgrade, prior to these steps:
> 
> Option 1:
> 1) Bring up a new node w/15.08 ctrld installed and running
> 2) Modify slurm.conf to point to the new node as backup ctrld (and remove
> the weak backup ctrld), and reconfig.
> 3) Make sure ctrld changes are reflected in scontrol show conf.
> 3) Take down the primary ctrld and upgrade.
> 5) Bring primary ctrld back up at 15.08.
> 6) Update ~3000 slurmd nodes on a phased basis, over a couple days.
> 7) Repeat for 17.02.
> 
> Main question: Would the 14.03 primary ctld freak out upon seeing a 15.08
> backup come online, however briefly, before we fail over to the backup? 

Yes. Both primary and backup need to be on the same version; having both online at different versions will lead to problems.

If you amended the procedure to shut down the backup right before starting the primary that should be fine (as a third step 3 in your procedure). The cluster can operate without slurmctld available for brief periods of time.

If you can mock this up on some test hardware, I'd encourage you to run through the conversions to make sure there aren't any problems, and that the compiled versions are all working properly.

The one caveat with all of these upgrades is that you cannot fall back to earlier versions - once the state files have been written out in the newer version there's no way to revert them. So finding out there was a problem with the slurmctld binary mid-upgrade would be unpleasant.
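Putting the amendment (backup down before the primary restarts) together with the no-rollback caveat, the controller swap might be sketched like this. All hostnames, paths, and package names are hypothetical, and the script only prints the intended commands:

```shell
#!/bin/sh
# Sketch: upgrade the primary slurmctld with the backup shut down first,
# so no mixed-version primary/backup pair is ever online together.
run() { echo "+ $*"; }

STATE=/var/spool/slurmctld   # hypothetical StateSaveLocation

run ssh new-ctl2 systemctl stop slurmctld          # backup down first
run systemctl stop slurmctld                       # primary down briefly
# State files written by the newer version cannot be reverted, so take
# a copy of StateSaveLocation as a last-resort restore point.
run tar czf /root/statesave-pre15.08.tgz "$STATE"
run yum install slurm-15.08                        # hypothetical package
run systemctl start slurmctld
run ssh new-ctl2 systemctl start slurmctld         # backup rejoins, same version
```

The cluster tolerates the brief window with no slurmctld running; slurmd and user commands simply retry.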

> Is there a benefit to failing over to the existing, hardware-weak backup
> initially, before taking down the primary ctld for its upgrade? Like so:
> 
> Option 2:
> 1) Fail over to the weak 14.03 backup ctrld.
> 2) Upgrade primary ctrld and restart. (So, fail over ctrld back to primary.)
> 3) Bring up the new node w/15.08 ctrld installed and running.
> 4) Modify slurm.conf to point to the new node as backup ctrld (and remove
> the weak backup ctrld), and reconfig.
> 5) Update ~3000 slurmd nodes on a phased basis, over a couple days.
> 6) Repeat for 17.02
> 
> Any showstopper gotchas? Appreciate your feedback for these specifics.

This is a lot riskier - changing out the primary vs. backup addresses doesn't buy you much, and there's a chance that the cut over between may not go smoothly. Keep in mind that Slurm internally builds tree-shaped communication hierarchies to offload a lot of communication overhead to the slurmd on the nodes; a disagreement between nodes as to who they should be talking to can lead to serious problems. (Although this is usually more obvious when making changes to the Node definitions; out-of-sync configurations between nodes can result in messages being delivered to the wrong destination.)
Comment 4 Tim Wickberg 2017-05-22 13:39:38 MDT
Hey Lyn -

Marking this resolved/infogiven; please reopen if you had any further questions, or you're always welcome to file a new bug if something comes up. Best of luck with the transition.

- Tim
Comment 5 Lyn 2017-05-22 13:50:20 MDT
Yep, sounds fine, Tim.

Thanks for all the info.

Best,
Lyn
