| Summary: | Upgrade plan 19.05 to 21.08 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Pär Lindfors <par.lindfors> |
| Component: | Other | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 21.08.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | SNIC | Alineos Sites: | --- |
| SNIC sites: | UPPMAX | Linux Distro: | --- |
| Attachments: | 19.05 config from rackham and snowy | ||
Hi Pär,

It looks like you have a good idea of what you need to do for the upgrade. You do need to start slurmctld and slurmdbd on an intermediate version to allow them time to update records in the database and the information in the state files.

One thing that I would add to what is in the upgrade guide is to start slurmdbd by running the executable directly rather than using systemd. When starting it with systemd, slurmdbd can take long enough to update the records in the database that systemd thinks the service isn't responding and kills it. If this happens, it's usually while slurmdbd is in the middle of updating records. When upgrading we recommend starting slurmdbd manually to avoid getting into this scenario.

It sounds like you are looking through the release notes for deprecated options, which is a good plan. You have probably seen these, but I'll point out the parameters I noticed that have been deprecated:

SallocDefaultCommand
FastSchedule
TaskAffinity

I would encourage you to take a minute to make sure they are still options you find useful for your environment (mostly the FastSchedule parameter). If you do intend to continue using the functionality, then the above options have been replaced with the following, respectively:

LaunchParameters=use_interactive_step
SlurmdParameters=config_overrides
TaskPlugin=task/affinity

One more thing I'll mention is that, depending on the size of your database, you may consider trimming the amount of data you are keeping to reduce the amount of time it will take to do the upgrade. There are options for the slurmdbd.conf file that allow you to define how long you want to keep job, event, etc. information before purging it. You can also configure slurmdbd to put the information it purges in an archive file that you can then load on another instance of slurmdbd if you want to be able to query that information at some point. This is especially useful if you need to occasionally run reports on a large date range of old records and so don't need those records immediately available to most users on your production system. You can read more about these options here:

https://slurm.schedmd.com/slurmdbd.conf.html#OPT_PurgeJobAfter
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveJobs

I hope this helps. Please let me know if you have any questions about any of the information I sent.

Thanks,
Ben

Hi Pär,

I wanted to make sure you didn't have any additional questions about upgrading and to see whether you have been able to do the upgrade already. Let me know if there's anything else I can do to help or if this ticket is ok to close.

Thanks,
Ben

I think we got the answers we need, but let me double-check with my colleagues. During the last couple of weeks I have not been directly involved with the 21.08 tests.

Pär

Sounds good, I'll wait to hear if you have any follow-up questions.

Thanks,
Ben

Hi Pär,

I haven't heard any follow-up questions. I'll go ahead and close this ticket. If something does come up, feel free to update the ticket and I'll respond.

Thanks,
Ben
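
For readers who want to check a configuration against the three deprecated parameters named in the first reply, a minimal sketch follows. It is not a SchedMD tool: the function name `find_deprecated` and the sample excerpt are hypothetical, the mapping covers only the options mentioned in this ticket, and the replacement is spelled with the `TaskPlugin` parameter name used in slurm.conf.

```python
import re

# Deprecated option -> replacement, as listed in the reply above.
# This is not a complete list of options deprecated between 19.05 and 21.08.
DEPRECATED = {
    "SallocDefaultCommand": "LaunchParameters=use_interactive_step",
    "FastSchedule": "SlurmdParameters=config_overrides",
    "TaskAffinity": "TaskPlugin=task/affinity",
}


def find_deprecated(config_text):
    """Return (line_number, option, suggested_replacement) tuples."""
    hits = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        match = re.match(r"(\w+)\s*=", line)
        if not match:
            continue
        key = match.group(1).lower()  # option names are case-insensitive
        for name, replacement in DEPRECATED.items():
            if key == name.lower():
                hits.append((lineno, name, replacement))
    return hits


# Hypothetical excerpt for illustration; NOT the attached Rackham/Snowy config.
sample = """\
# example only
ClusterName=example
FastSchedule=2
SallocDefaultCommand="srun --pty $SHELL"
TaskPlugin=task/cgroup
"""

for lineno, name, replacement in find_deprecated(sample):
    print(f"line {lineno}: {name} is deprecated; consider {replacement}")
```

The sketch only flags option names. Whether the replacement is actually wanted is still a site decision, as the reply notes for FastSchedule in particular.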
Created attachment 21814
19.05 config from rackham and snowy

We are planning an upgrade of the clusters Rackham and Snowy from the old/unsupported 19.05 to 21.08.

We know that the version difference means that we have to start slurmdbd and the slurmctlds briefly at 20.02 or 20.11 to update the database and slurmctld state files, before going to 21.08. We will most certainly do the upgrade without any running jobs.

Is there anything specific we need to look out for in this upgrade? We have started reviewing the RELEASE_NOTES for 20.02 to 21.08 ourselves but would appreciate any input from you.

I am attaching our current (19.05) config. Could you do a configuration review of that with 21.08 in mind? Other than changed/deprecated options, we are interested to know if there are new best practices we should consider.

Regards,
Pär
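
The intermediate-version requirement described in this request can be illustrated with a small sketch. It assumes that an upgrade may skip at most one major release (that is, upgrades are supported from the two previous major releases) and that the release order is 19.05, 20.02, 20.11, 21.08. The names `RELEASES`, `MAX_JUMP`, `path_is_valid`, and `intermediate_options` are hypothetical; check the upgrade guide for the versions you actually run.

```python
# Major releases in order. Both this list and the two-release window are
# assumptions for this sketch, not values read from Slurm itself.
RELEASES = ["19.05", "20.02", "20.11", "21.08"]
MAX_JUMP = 2  # assumed: upgrades supported from the two previous major releases


def path_is_valid(path):
    """True if every hop moves forward by at most MAX_JUMP major releases."""
    idx = [RELEASES.index(version) for version in path]
    return all(0 < b - a <= MAX_JUMP for a, b in zip(idx, idx[1:]))


def intermediate_options(start, target):
    """Single intermediate releases x that make start -> x -> target valid."""
    return [v for v in RELEASES if path_is_valid([start, v, target])]


print(path_is_valid(["19.05", "21.08"]))       # False: the direct jump is too large
print(intermediate_options("19.05", "21.08"))  # ['20.02', '20.11']
```

Under those assumptions the result matches the plan stated above: stop briefly at either 20.02 or 20.11 before going to 21.08.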