Ticket 3894

Summary: Slurm upgrade with rpm installation
Product: Slurm
Reporter: Davide Vanzo <davide.vanzo>
Component: Other
Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.x   
Hardware: Linux   
OS: Linux   
Site: Vanderbilt

Description Davide Vanzo 2017-06-14 09:00:55 MDT
We currently install Slurm on a parallel file system shared across all nodes, the two controllers, and the database node. This lets us upgrade easily as follows:

- install the new version in the shared FS directory
- shut down the slurmd, slurmctld and slurmdbd daemons
- create a new symlink to the new installation
- bring all services up
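The steps above amount to an atomic symlink flip. A minimal sketch is below; the paths, version numbers, and daemon commands are illustrative only (the demo runs in a temporary directory rather than a real shared filesystem):

```shell
#!/bin/sh
set -eu

# Illustrative layout; in practice these directories would live on the
# shared filesystem visible to all nodes and controllers.
base=$(mktemp -d)
mkdir -p "$base/slurm-17.02.5" "$base/slurm-17.11.0"
ln -s "$base/slurm-17.02.5" "$base/slurm-current"

# 1. Install the new version into its own versioned directory
#    (e.g. ./configure --prefix=$base/slurm-17.11.0 && make install).
# 2. Stop the daemons, e.g.: systemctl stop slurmdbd slurmctld slurmd
# 3. Atomically repoint the symlink at the new installation
#    (-n replaces the link itself instead of descending into the old target):
ln -sfn "$base/slurm-17.11.0" "$base/slurm-current"
# 4. Bring the services back up: systemctl start slurmdbd slurmctld slurmd

readlink "$base/slurm-current"
```

Because the `-sfn` swap is a single rename at the filesystem level, nothing ever sees a half-updated install path.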

We are now considering installing Slurm locally on each node via rpm. In that case, what would be the preferred way of upgrading it without a cluster-wide downtime?

Thank you
Comment 1 Tim Wickberg 2017-06-14 09:08:15 MDT
(In reply to Davide Vanzo from comment #0)
> We currently install Slurm in a parallel file system shared across all nodes
> and the two controllers and database node. In this way we can easily upgrade
> as follows:
> 
> - install the new version in the shared FS directory
> - shut down the slurmd, slurmctld and slurmdbd daemons
> - create a new symlink to the new installation
> - bring all services up

That's actually my personal preference; it gives you a nice path to gradually moving versions over without needing all the daemons to be offline at once.

> We are now considering the idea of having slurm installed locally on each
> node via rpm. In such case, what would be the preferred way of upgrading it
> without a cluster-wide downtime?

Two factors to consider:

- Do you intend to always install each version in the same location, or would you still separate them by version, and possibly manually move a symlink over?

- Is the target install location local to the node, or would it be installed once on the shared filesystem?

My suggestion would be either to install once on the shared FS in a versioned directory, or to install locally on each node in a consistent location.

Either of those should give you ways to slowly roll out changes, assuming that the slurmctld and slurmdbd are on separate machines.
Comment 2 Davide Vanzo 2017-06-14 09:14:37 MDT
No, if we install via rpm, the installation will be on the local node's filesystem. And it will be installed in the same location, without having to manage symlinks.
Comment 3 Tim Wickberg 2017-06-14 09:24:38 MDT
(In reply to Davide Vanzo from comment #2)
> No, if we install via rpm, the installation will be on the local node's
> filesystem. And it will be installed in the same location, without having to
> manage symlinks.

That should be fine, with one important caveat - you need to ensure no jobs launch on a set of compute nodes with a mix of versions. If you drain all of the compute nodes, then restart into the new version once they free up, this should be taken care of for you automatically.
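A drain-based rolling restart might look like the following command sequence. This is a sketch that assumes a live cluster; the node name, reason string, and package commands are illustrative:

```shell
# Drain the node so no new jobs are scheduled there (hypothetical node name):
scontrol update NodeName=cn001 State=DRAIN Reason="slurm upgrade"

# Wait for running jobs to finish; the node should report as "drained":
sinfo --nodes=cn001 --format="%N %T"

# Upgrade the locally installed packages and restart slurmd, e.g.:
#   yum upgrade 'slurm*'     (or rpm -Uvh slurm-*.rpm)
#   systemctl restart slurmd

# Return the node to service:
scontrol update NodeName=cn001 State=RESUME
```

Repeating this node by node keeps capacity online while guaranteeing no job ever spans a mixed set of slurmd versions.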

If you don't drain all the nodes, the risk is that a job launches on nodeA and B, with nodeA on version 17.02 but nodeB back on 16.05. When that happens, user commands on nodeA - srun in particular - will generate RPC messages in a format that nodeB cannot process, and the job will likely fail.

Note that this is only an issue when moving between major versions; maintenance releases should be unaffected.
Comment 7 Tim Wickberg 2017-06-14 10:57:12 MDT
> If you don't drain all the nodes, the risk is that a job launches on nodeA
> and B, with nodeA on version 17.02 but nodeB back on 16.05. When that
> happens, user commands on nodeA - srun in particular - will generate RPC
> messages in a format that nodeB cannot process, and the job will likely fail.
> 
> Note that this is only an issue when moving between major versions;
> maintenance releases should be unaffected.

One minor retraction - running jobs across mixed slurmd versions should, as of 15.08, be okay, as long as they're within two major releases. E.g., a job with slurmds running on versions 16.05, 17.02, and 17.11 should be fine, and any failure would be a bug we'd want to resolve ASAP.
Comment 8 Tim Wickberg 2017-06-21 14:52:13 MDT
Hey Davide -

I'm marking this resolved/infogiven. Please reopen if there's anything else I can help answer on this.

cheers,
- Tim