| Summary: | backup controller | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Stuart Midgley <stuartm> |
| Component: | Other | Assignee: | David Bigagli <david> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | da |
| Version: | 14.03.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Stuart Midgley
2014-05-14 06:04:21 MDT
They need shared state save files (the StateSaveLocation directory). Ideally a cross-mounted RAID disk.

Right, so I've been kidding myself that I've actually got failover...

By a cross mounted raid server, you mean nfs mounted from 3rd machine? or using drbd?

How much bandwidth is done to this shared state... is drbd an option? or perhaps, the real question is, if the save files have high latency, will that be a problem for slurm?

(In reply to Stuart Midgley from comment #2)
> Right, so I've been kidding myself that I've actually got failover...
>
> By a cross mounted raid server, you mean nfs mounted from 3rd machine? or
> using drbd?

Correct

> How much bandwidth is done to this shared state... is drbd an option?

Slurm will use whatever bandwidth is available. Something slow will not delay anything, but it would result in state changes not yet written to storage being lost in a fail-over situation.

Is it appropriate to run the save state dir on lustre? With global or local locking?

(In reply to Stuart Midgley from comment #5)
> Is it appropriate to run the save state dir on lustre? With global or local
> locking?

Lustre should work fine. Since only the primary or backup will be active at any point in time, either global or local locking would be fine.

To move the StateSaveLocation:

1. Shutdown both your primary and backup slurmctld
2. Install new slurm.conf files with the new path
3. Copy the files into the new file system (that will include subdirectories with user scripts and environment variables)
4. Restart the primary slurmctld and make sure it is running fine
5. Restart the backup slurmctld
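
The copy step in the procedure above can be sketched in Python. This is only an illustration and not part of Slurm: the function names `copy_state_dir` and `list_mismatches` are hypothetical, as are any paths passed to them. It assumes both slurmctld daemons are already stopped and that the new directory does not exist yet. It copies the whole tree, including the per-job subdirectories, and then reports any file that is missing or differs in the copy.

```python
import filecmp
import os
import shutil


def list_mismatches(old_dir, new_dir):
    """Return relative paths of files that are missing or differ in new_dir."""
    mismatches = []
    for root, _dirs, files in os.walk(old_dir):
        rel_root = os.path.relpath(root, old_dir)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            src = os.path.join(old_dir, rel)
            dst = os.path.join(new_dir, rel)
            # shallow=False forces a byte-by-byte content comparison.
            if not os.path.isfile(dst) or not filecmp.cmp(src, dst, shallow=False):
                mismatches.append(rel)
    return sorted(mismatches)


def copy_state_dir(old_dir, new_dir):
    """Copy old_dir to new_dir (which must not exist yet), keeping subdirectories.

    Hypothetical helper: run it only while both slurmctld daemons are stopped.
    """
    # copytree copies recursively and, by default, preserves timestamps and
    # permission bits. It does not preserve file ownership.
    shutil.copytree(old_dir, new_dir, symlinks=True)
    return list_mismatches(old_dir, new_dir)
```

An empty list from `copy_state_dir` means every file in the old directory has an identical counterpart in the new one. Because ownership is not carried over, it is worth checking that the account slurmctld runs as can still read and write the copy before restarting the primary.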