Ticket 12690

Summary: Upgrade plan 19.05 to 21.08
Product: Slurm
Reporter: Pär Lindfors <par.lindfors>
Component: Other
Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 21.08.2   
Hardware: Linux   
OS: Linux   
Site: SNIC
SNIC sites: UPPMAX
Attachments: 19.05 config from rackham and snowy

Description Pär Lindfors 2021-10-19 07:33:33 MDT
Created attachment 21814
19.05 config from rackham and snowy

We are planning an upgrade of the clusters Rackham and Snowy from the old/unsupported 19.05 to 21.08.

We know that the version difference means that we have to start slurmdbd and the slurmctlds briefly at 20.02 or 20.11 to update the database and slurmctld state files, before going to 21.08. We will most certainly do the upgrade without any running jobs.

Is there anything specific we need to look out for in this upgrade? We have started reviewing the RELEASE_NOTES for 20.02 to 21.08 ourselves but would appreciate any input from you.

I am attaching our current (19.05) config. Could you do a configuration review of that with 21.08 in mind? Other than changed/deprecated options we are interested to know if there are new best practices we should consider.

Regards,
Pär
Comment 1 Ben Roberts 2021-10-19 13:04:03 MDT
Hi Pär,

It looks like you have a good idea of what you need to do for the upgrade.  You do need to start slurmctld and slurmdbd on an intermediate version to give them time to update the records in the database and the information in the state files.  One thing I would add to what is in the upgrade guide is to start slurmdbd by running the executable directly rather than through systemd.  When it is started with systemd, slurmdbd can take long enough to update the records in the database that systemd decides the service isn't responding and kills it.  If this happens, it's usually while slurmdbd is in the middle of updating records.  When upgrading, we recommend starting slurmdbd manually to avoid this situation.
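
As a rough sketch of what a manual start could look like: the helper names, the `slurm` service account, and the use of `sudo` are assumptions for illustration, not something taken from your configuration. The `-D` flag keeps slurmdbd in the foreground and `-vvv` raises the log verbosity so the conversion progress is visible. The sketch only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
# Assumption: the SlurmUser account is called "slurm"; adjust for your site.
SLURM_USER="${SLURM_USER:-slurm}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop the systemd unit, then run slurmdbd by hand.
start_slurmdbd_manually() {
    # Make sure systemd is not managing (and possibly killing) the daemon.
    run systemctl stop slurmdbd
    # -D: stay in the foreground; -vvv: verbose logging during the conversion.
    run sudo -u "$SLURM_USER" slurmdbd -D -vvv
}
```

Calling `start_slurmdbd_manually` with the default settings prints the two commands for review. Once the database conversion has finished, the foreground process can be stopped and the service started through systemd as usual.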

It sounds like you are looking through the release notes for deprecated options, which is a good plan.  You have probably seen these, but I'll point out the parameters I noticed that have been deprecated:
SallocDefaultCommand
FastSchedule
TaskAffinity


I would encourage you to take a minute to confirm that you still need the behavior these options provided (mostly FastSchedule). If you do want to keep that functionality, the options above have been replaced, in the same order, by the following:
LaunchParameters=use_interactive_step
SlurmdParameters=config_overrides
TaskPlugin=task/affinity
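
One way to check whether any of the deprecated options are still set is a small search over the config files. This is a minimal sketch: the function name is hypothetical, and it assumes the options appear one per line at the start of a line (commented-out lines are ignored). Matching is case-insensitive, on the assumption that Slurm treats parameter names that way.

```shell
#!/bin/sh
# Sketch: list lines in Slurm config files that still set the deprecated
# options named above. Output format is file:line:content.
find_deprecated_options() {
    # "$@": one or more config files, e.g. slurm.conf and cgroup.conf.
    # /dev/null is added so grep always prints the file name.
    grep -inE '^[[:space:]]*(SallocDefaultCommand|FastSchedule|TaskAffinity)[[:space:]]*=' "$@" /dev/null
}
```

For example, `find_deprecated_options /etc/slurm/slurm.conf /etc/slurm/cgroup.conf` (the paths are placeholders) prints each matching line with its file name and line number, and returns a non-zero status when nothing is found.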


One more thing I'll mention is that, depending on the size of your database, you may want to trim the amount of data you are keeping to reduce the time the upgrade takes.  There are options in slurmdbd.conf that let you define how long to keep job, event, and other records before they are purged.  You can also configure slurmdbd to write the records it purges to an archive file, which you can later load into another instance of slurmdbd if you need to query that information.  This is especially useful if you only occasionally run reports over a large date range of old records, so those records don't need to be immediately available to users on your production system.  You can read more about these options here:
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_PurgeJobAfter
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveJobs
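
To illustrate, the sketch below prints an example purge and archive fragment for slurmdbd.conf. The retention periods and the archive directory are placeholder assumptions and should be chosen to suit your reporting needs; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print an example purge/archive fragment for slurmdbd.conf.
# The retention periods and ArchiveDir below are placeholder assumptions.
print_purge_fragment() {
    cat <<'EOF'
# Keep 12 months of job and step records and 6 months of events.
PurgeJobAfter=12months
PurgeStepAfter=12months
PurgeEventAfter=6months
# Write purged records to archive files instead of discarding them.
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
ArchiveDir=/var/spool/slurm/archive
EOF
}
```

The output of `print_purge_fragment` can be reviewed and merged into slurmdbd.conf by hand. Archive files written this way can later be loaded into another slurmdbd instance with `sacctmgr archive load`.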

I hope this helps.  Please let me know if you have any questions about any of the information I sent.

Thanks,
Ben
Comment 2 Ben Roberts 2021-11-23 08:27:05 MST
Hi Pär,

I wanted to make sure you didn't have any additional questions about upgrading and to see whether you have been able to do the upgrade already.  Let me know if there's anything else I can do to help or if this ticket is ok to close.

Thanks,
Ben
Comment 3 Pär Lindfors 2021-11-29 15:43:54 MST
I think we got the answers we need, but let me double-check with my colleagues.
For the last couple of weeks I have not been directly involved in the 21.08 tests.

Pär
Comment 4 Ben Roberts 2021-11-30 08:25:16 MST
Sounds good, I'll wait to hear if you have any follow-up questions.

Thanks,
Ben
Comment 5 Ben Roberts 2021-12-27 11:36:31 MST
Hi Pär,

I haven't heard any follow-up questions.  I'll go ahead and close this ticket.  If something does come up, feel free to update the ticket and I'll respond.

Thanks,
Ben