Ticket 11418 - suggestions for how to upgrade cluster
Summary: suggestions for how to upgrade cluster
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 20.11.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
Reported: 2021-04-20 06:50 MDT by Mike Woodson
Modified: 2021-04-20 11:43 MDT
Site: Cornell ITSG


Description Mike Woodson 2021-04-20 06:50:40 MDT
Hi,

I am in the process of upgrading about 100 systems from Ubuntu 16 (Slurm 20.02.3) to Ubuntu 20 (Slurm 20.11.4). Since that is a lot of systems to rebuild, we wanted to know if there is a recommended way of doing it. We are currently not using any clustering software (such as Rocks Cluster or Bright Computing). Also, do you recommend a specific clustering software?

Thanks, 

Mike
Comment 1 Mike Woodson 2021-04-20 06:52:14 MDT
BTW, the new cluster is already running with a few nodes in it, so I don't need info on how to configure the headnode, just the best way to migrate 100 nodes. 

Mike
Comment 3 Ben Roberts 2021-04-20 11:20:09 MDT
Hi Mike,

The recommendations I have for you primarily revolve around the normal upgrade procedure for Slurm.  Upgrading the OS is outside the realm of what we can help with.  When you do upgrade the nodes to 20.11 you would want to make sure that the Slurm controllers (slurmctld and slurmdbd) are already on 20.11.  Slurm is designed to allow a newer version of a controller to communicate with an older slurmd instance, but not the other way around.  We do have documentation that gives a good overview of how to upgrade your cluster that I would recommend you review.
https://slurm.schedmd.com/quickstart_admin.html#upgrade
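
As a rough illustration of the ordering rule described above, here is a minimal Python sketch. It is not SchedMD tooling: the function names are hypothetical, and it assumes a release is identified by the first two components of the version string (for example, 20.11 in 20.11.4). It only encodes the direction of compatibility (controllers at or ahead of slurmd) and ignores any limit on how old a slurmd may be, for which the linked documentation is the reference.

```python
from typing import Tuple


def release(version: str) -> Tuple[int, int]:
    """Return the release pair of a version string, e.g. '20.11.4' -> (20, 11).

    Assumption: the first two dot-separated components identify the release.
    """
    major, minor, *_ = (int(part) for part in version.split("."))
    return major, minor


def controller_accepts(slurmctld: str, slurmd: str) -> bool:
    """True if the controller release is the same as or newer than slurmd's.

    Hypothetical helper: models only "newer controller, older slurmd is fine,
    but not the other way around".
    """
    return release(slurmctld) >= release(slurmd)


def safe_to_upgrade_nodes(slurmdbd: str, slurmctld: str, target: str) -> bool:
    """True if both controllers already run the target release or newer."""
    wanted = release(target)
    return release(slurmdbd) >= wanted and release(slurmctld) >= wanted


if __name__ == "__main__":
    # Versions taken from this ticket: old nodes on 20.02.3, target 20.11.4.
    print(controller_accepts("20.11.4", "20.02.3"))
    print(controller_accepts("20.02.3", "20.11.4"))
    print(safe_to_upgrade_nodes("20.11.4", "20.11.4", "20.11.4"))
```

With the versions from this ticket, a 20.11.4 controller paired with 20.02.3 nodes passes the check, while the reverse pairing does not, which is why the controllers are upgraded before the compute nodes.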

Please let me know if you have any questions about the procedure as outlined in the documentation.

Thanks,
Ben
Comment 4 Mike Woodson 2021-04-20 11:31:38 MDT
I have always found the upgrade portion of the quickstart_admin page to be vague and not much help. Since I have upgraded the cluster a couple of times now, that is not the issue.

However, as I think I stated, we already have the new cluster up and running and were just trying to find a good way to migrate close to 100 servers. We know that you all do not handle the OS, but wondered if you had any recommendations on best practices for cluster software. It sounds like you don't, which answers my question.

Thanks,

Mike


Comment 5 Ben Roberts 2021-04-20 11:43:14 MDT
My apologies that I skipped over the cluster management part of your question.  I'm afraid we can't endorse any kind of cluster management software.  Since it sounds like you have the upgrade portion of the question under control I'll go ahead and close this ticket.  Let us know if there's anything else we can do to help.

Thanks,
Ben