Ticket 813

Summary: backup controller
Product: Slurm
Reporter: Stuart Midgley <stuartm>
Component: Other
Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: da
Version: 14.03.0
Hardware: Linux
OS: Linux
Site: DownUnder GeoSolutions
Alineos Sites: ---

Description Stuart Midgley 2014-05-14 06:04:21 MDT
For my own peace of mind: the Slurm controller and backup controller don't need shared storage, right?  If that is true, how does the backup controller rebuild state if the primary controller goes down for an extended time?  It doesn't have all the job files (as far as I can see).
Comment 1 Moe Jette 2014-05-14 06:06:39 MDT
They need shared state save files (the StateSaveLocation directory). Ideally a cross-mounted RAID disk.
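
As a rough illustration of what to check, here is a minimal Python sketch that pulls the failover-related settings out of slurm.conf text so they can be compared between the two controller hosts. It assumes the 14.03-era parameter names ControlMachine and BackupController; the host names, the path in SAMPLE and the function names are hypothetical. The parser is deliberately simplified (it only handles one setting per line), and it cannot tell whether the directory really is on shared storage; that still has to be confirmed on each host.

```python
def parse_slurm_conf(text):
    """Return a dict of simple Key=Value settings from slurm.conf text.

    Simplified: keeps only the first Key=Value on each line and treats
    '#' as the start of a comment. Keys are lowercased because slurm.conf
    parameter names are case insensitive.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def failover_settings(text):
    """Extract the three settings that matter for controller failover."""
    conf = parse_slurm_conf(text)
    return {
        "primary": conf.get("controlmachine"),
        "backup": conf.get("backupcontroller"),
        "state_dir": conf.get("statesavelocation"),
    }


# Hypothetical excerpt; host names and path are made up for illustration.
SAMPLE = """\
ControlMachine=ctl1
BackupController=ctl2
StateSaveLocation=/shared/slurm/state
"""

if __name__ == "__main__":
    print(failover_settings(SAMPLE))
```

Running this against the slurm.conf on each controller and comparing the state_dir values is only a first sanity check; the path must also resolve to the same file system on both machines.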
Comment 2 Stuart Midgley 2014-05-14 06:09:41 MDT
Right, so I've been kidding myself that I've actually got failover...

By a cross-mounted RAID server, do you mean NFS-mounted from a third machine?  Or using DRBD?

How much bandwidth does this shared state need... is DRBD an option?
Comment 3 Stuart Midgley 2014-05-14 06:15:33 MDT
Or perhaps the real question is: if the save files have high latency, will that be a problem for Slurm?
Comment 4 Moe Jette 2014-05-14 06:16:37 MDT
(In reply to Stuart Midgley from comment #2)
> Right, so I've been kidding myself that I've actually got failover...
> 
> By a cross mounted raid server, you mean nfs mounted from 3rd machine?  or
> using drbd?

Correct

> How much bandwidth is done to this shared state... is drbd an option?

Slurm will use whatever bandwidth is available. Slow storage will not delay anything, but any state changes not yet written to storage would be lost in a failover.
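
To get a rough feel for how slow a candidate file system is, a small probe like the following Python sketch can time a write plus fsync in the target directory. This is not how slurmctld itself writes its state; the function name, payload size and number of rounds are arbitrary assumptions chosen for illustration.

```python
import os
import tempfile
import time


def probe_write_latency(directory, payload_size=64 * 1024, rounds=5):
    """Time write+fsync of a small temporary file in `directory`.

    Returns a list with the elapsed seconds for each round. The probe
    files are removed again after each measurement.
    """
    payload = b"\0" * payload_size
    timings = []
    for _ in range(rounds):
        fd, path = tempfile.mkstemp(dir=directory, prefix="latency_probe_")
        try:
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to push the data to storage
            timings.append(time.perf_counter() - start)
        finally:
            os.close(fd)
            os.unlink(path)
    return timings


if __name__ == "__main__":
    # Probe a scratch directory here; point it at the candidate
    # state save directory to test real storage.
    with tempfile.TemporaryDirectory() as scratch:
        for seconds in probe_write_latency(scratch):
            print(f"{seconds * 1000:.2f} ms")
```

Higher numbers suggest a longer window in which recent state changes have not yet reached storage and could be lost in a failover.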
Comment 5 Stuart Midgley 2014-05-14 10:16:06 MDT
Is it appropriate to put the state save directory on Lustre?  With global or local locking?
Comment 6 Moe Jette 2014-05-14 10:22:28 MDT
(In reply to Stuart Midgley from comment #5)
> Is it appropriate to run the save state dir on lustre?  With global or local
> locking?

Lustre should work fine. Since only the primary or backup will be active at any point in time, either global or local locking would be fine.

To move the StateSaveLocation:

Shut down both your primary and backup slurmctld
Install new slurm.conf files with the new path
Copy the state files into the new file system (this includes the subdirectories with user scripts and environment variables)
Restart the primary slurmctld and make sure it is running fine
Restart the backup slurmctld
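
The copy step above could be sketched in Python roughly as follows, to be run only while both slurmctld daemons are stopped. The function names are hypothetical, and the sketch makes two assumptions worth checking: the new directory does not exist yet, and file ownership is fixed up separately, since shutil preserves permissions and timestamps but not the owner.

```python
import os
import shutil


def list_files(root):
    """Return the sorted relative paths of all files under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


def copy_state_dir(old_dir, new_dir):
    """Copy the state save tree, subdirectories included, then verify.

    Raises RuntimeError if any file from `old_dir` is missing in
    `new_dir`; otherwise returns the number of files copied.
    """
    # copytree requires that new_dir does not already exist.
    shutil.copytree(old_dir, new_dir, symlinks=True)
    copied = list_files(new_dir)
    missing = set(list_files(old_dir)) - set(copied)
    if missing:
        raise RuntimeError(f"files missing after copy: {sorted(missing)}")
    return len(copied)
```

A call such as copy_state_dir(old_path, new_path), with both paths taken from the old and new slurm.conf, would fit between installing the new configuration and restarting the primary slurmctld.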