| Summary: | How to configure a Backup controller | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Hjalti Sveinsson <hjalti.sveinsson> |
| Component: | Configuration | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 17.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | deCODE | Alineos Sites: | --- |
|
Description
Hjalti Sveinsson
2018-10-23 04:01:25 MDT
Hi Hjalti,

This is the basic workflow. On the controller node:

1. Shut down slurmctld.
2. Back up the StateSaveLocation folder.
3. Move the StateSaveLocation files to a folder over NFS.
4. Mount that NFS folder over the StateSaveLocation shown in slurm.conf.
5. Restart slurmctld.

It should be straightforward to start up the backup controller in a similar way. Make sure that the log files for the primary and backup controller are still written locally, and not over NFS. Also make sure that you remove the local files under the StateSaveLocation mount point, so you don’t accidentally load an old state if NFS fails to mount.

As always, it’s recommended to try this out on a dummy cluster. Running jobs should still run properly, even when the controller is down. However, my testing shows that if a job completes while the controller is down, it may remain in a running state after the controller comes back up. So be aware of that. If you want to be extra cautious and do this when no jobs are running, you may want to consider creating reservations with the “maint” flag. See https://slurm.schedmd.com/reservations.html.

I hope that helps.

-Michael

Thank you, this clears it up.

Best Regards,
Hjalti Sveinsson

You're welcome! Please reopen if you have any other questions.

-Michael

Hi again,

there was one big thing I forgot to mention. We have an underlying slurmdbd running on our current head node with a MySQL database. How do we make this redundant?

Install slurm packages on newdbhost.
Shut down services (slurmctld/slurmdbd)?
Move database (mysqldump?) to separate DB host?
Change config to match DB host?

    ##
    StorageHost=localhost
    StoragePort=3306

To

    StorageHost="newdbhost"
    StoragePort=3306
    ##

    ##
    # slurmDBD info
    DbdAddr=localhost
    DbdHost=localhost
    DbdPort=6819

To

    # slurmDBD info
    DbdAddr="newdbhost"
    DbdHost="newdbhost"
    DbdPort=6819
    #

Would be good to get information on how we go about doing this.
Best regards,
Hjalti Sveinsson

Hjalti,

First of all, slurmctld itself backs up all cluster transactions when slurmdbd goes down, for a limited time. So in effect, slurmctld is a backup slurmdbd. Things like sacct won’t work while slurmdbd is down, but no data will be lost.

When slurmdbd is down, slurmctld writes job-related data to the local disk. So make sure to monitor `DBD Agent Queue size` with the sdiag command. Once this is full, you may start losing data. But this shouldn’t fill up for a while, unless you have a heavy workload. The sdiag man page explains how long slurmctld will cache information under "DBD Agent Queue Size": https://slurm.schedmd.com/sdiag.html.

Second, I would not recommend duplicating the DB like you describe. The two databases will get out of sync if the backup slurmdbd kicks in, and you will have to merge the DBs back together manually. Instead, you should look into DB replication. How you do DB replication is out of the scope of what we support, but it appears that two prominent replication strategies are Master-Slave and Galera Cluster. Here are a few resources you could look into:

* https://mariadb.com/kb/en/library/replication-cluster-multi-master/
* https://mariadb.com/kb/en/library/what-is-mariadb-galera-cluster/
* https://mariadb.com/kb/en/library/setting-up-replication/
* https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-in-mysql

Third, if you still want a backup *slurmdbd*, put it on a different node from the primary slurmdbd, but have both daemons use the same underlying replicated DB. To install the backup slurmdbd, you don’t have to shut anything down. When you start it up, it will automatically go into backup mode.

As for your slurmdbd.conf, I think you have the right idea. Just make sure the MySQL/MariaDB permissions are set up to accept incoming connections from hosts other than localhost.

-Michael

Another alternative to DB replication is to simply back up your database regularly.

Closing ticket.
Please reopen if you have any other questions or if something doesn't make sense. Thanks, Michael |
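
Editor's note: the advice above to watch the DBD Agent queue size with sdiag could be automated roughly as follows. This is a minimal sketch, not something from the ticket. It assumes sdiag prints a line of the form `DBD Agent queue size: N` (check the label against your Slurm version), and the function names and the `warn_threshold` default are hypothetical placeholders, not Slurm settings.

```python
import re
import subprocess

# Assumed label format; matched case-insensitively because the exact
# capitalization may differ between Slurm versions.
_QUEUE_RE = re.compile(
    r"^\s*DBD Agent queue size:\s*(\d+)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_dbd_agent_queue_size(sdiag_output):
    """Return the DBD Agent queue size from sdiag text, or None if absent."""
    match = _QUEUE_RE.search(sdiag_output)
    return int(match.group(1)) if match else None


def queue_needs_attention(size, warn_threshold=1000):
    """True when the queue size is known and at or above the threshold.

    warn_threshold is an arbitrary illustrative value; choose one that
    suits your workload.
    """
    return size is not None and size >= warn_threshold


def read_sdiag():
    """Run sdiag on the controller host and return its standard output."""
    result = subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    )
    return result.stdout


def main():
    """Print a one-line status suitable for a cron job or monitoring probe."""
    size = parse_dbd_agent_queue_size(read_sdiag())
    if size is None:
        print("DBD Agent queue size not found in sdiag output")
    elif queue_needs_attention(size):
        print(f"WARNING: DBD Agent queue size is {size}")
    else:
        print(f"OK: DBD Agent queue size is {size}")
```

Calling `main()` on the controller node while slurmdbd is down would show whether the queue is growing. The parsing is kept separate from the subprocess call so it can be checked against captured sdiag output without a running cluster.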