Hi there, we are taking steps to remediate some of the timeout issues
encountered in Bug 3692 and Bug 3554.

One issue is that our state files are stored on GPFS, and when GPFS is under
heavy load (e.g. due to a massive restripe or rebalance of data on /scratch)
SLURM tends to likewise get laggy and timeout, presumably because the SLURM
state files (and possibly binaries?) live on GPFS and are impacted by the
heavy load.

What is your recommended setup for sharing state files between the two SLURM
HA servers? Do most sites use GPFS/Lustre or a NFS mount?

One approach we are considering is DRBD
(https://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device). Do you
have any experience with DRBD? Specifically, do you know if any other sites
store state files and do some sort of SLURM HA through DRBD (rather than
using SLURM's built in HA)?

Thanks for any advice or comments you can provide.

Best,
Will
(In reply to Will French from comment #0)
> Hi there, we are taking steps to remediate some of the timeout issues
> encountered in Bug 3692 and Bug 3554.
>
> One issue is that our state files are stored on GPFS, and when GPFS is under
> heavy load (e.g. due to a massive restripe or rebalance of data on /scratch)
> SLURM tends to likewise get laggy and timeout, presumably because the SLURM
> state files (and possibly binaries?) live on GPFS and are impacted by the
> heavy load.

As some additional background, what's happening here is that the state save
thread needs to acquire and hold the read locks on most of the internal data
structures. This precludes any other threads from acquiring the write locks
on those same structures, which is a fairly common operation. Delays writing
will thus cause other operations to stall.

> What is your recommended setup for sharing state files between the two SLURM
> HA servers? Do most sites use GPFS/Lustre or a NFS mount?

A shared NFS mount is the most common by far. I believe some very large
installations (10k+) have set it on a dedicated NetApp, although that's
overkill in most situations.

If you have some other filesystem present in the cluster switching to that
may be sufficient. A lot of installations seem to keep a smaller NFS array
around for home directories / cluster-wide application installations, and if
that becomes unavailable the user jobs would fail to launch anyways.

> One approach we are considering is DRBD
> (https://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device). Do you
> have any experience with DRBD? Specifically, do you know if any other sites
> store state files and do some sort of SLURM HA through DRBD (rather than
> using SLURM's built in HA)?

I'm not aware of anyone using DRBD in production, although I believe it would
work in theory. You'd want to test that extensively, and make sure that it
doesn't introduce too much additional latency.
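One rough way to compare candidate storage before committing is to time a
write-plus-fsync cycle on each filesystem, which approximates what the state
save thread does on every save. A quick sketch (not a Slurm tool; the example
paths in the comment are hypothetical):

```python
import os
import tempfile
import time

def fsync_latency(directory, size=1 << 20, samples=5):
    """Return the median time in seconds to write `size` bytes
    and fsync them in `directory`."""
    timings = []
    payload = os.urandom(size)
    for _ in range(samples):
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            t0 = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage, as a state save would
            timings.append(time.monotonic() - t0)
        finally:
            os.close(fd)
            os.unlink(path)
    timings.sort()
    return timings[len(timings) // 2]

# Compare candidate StateSaveLocation paths, e.g. (hypothetical mounts):
# print(fsync_latency("/var/spool/slurm"), fsync_latency("/nfs/slurm-state"))
```

Running this on each candidate mount, both idle and while the filesystem is
under load, gives a feel for how badly a restripe or rebalance would stall
state saves.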
I do know some places run slurm within other HA stacks with mixed results,
and it's not a setup I recommend.

I've personally used OCFS2 in the past with good results - for that all you'd
need is a shared storage array accessible from both controllers over SAS /
iSCSI / FC, plus some additional configuration.

One caveat - until we have a further patch ready, it's best to ensure that the
network path used to communicate between the two controllers and the compute
nodes is the same as is used to access the shared storage. This helps avoid
the most common 'split-brain' scenarios where the controllers have lost
contact but are both able to access and modify the state files simultaneously,
leading to potential corruption.
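For concreteness, a minimal sketch of the shared-state HA pair discussed
above. The hostnames (ctl1, ctl2) and mount point are hypothetical, and
parameter names vary between Slurm versions (newer releases use SlurmctldHost
lines in place of ControlMachine/BackupController), so treat this as an
illustration rather than a drop-in configuration:

```
# slurm.conf (excerpt) - hypothetical hostnames and paths
ControlMachine=ctl1            # primary slurmctld
BackupController=ctl2          # backup slurmctld
SlurmctldTimeout=120           # seconds before the backup takes over

# Both controllers must see the same state directory; per the discussion
# above, a shared NFS mount (or OCFS2 volume) rather than GPFS.
StateSaveLocation=/shared/slurm-state
```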
Thanks, Tim.

(In reply to Tim Wickberg from comment #1)
> (In reply to Will French from comment #0)
> > Hi there, we are taking steps to remediate some of the timeout issues
> > encountered in Bug 3692 and Bug 3554.
> >
> > One issue is that our state files are stored on GPFS, and when GPFS is under
> > heavy load (e.g. due to a massive restripe or rebalance of data on /scratch)
> > SLURM tends to likewise get laggy and timeout, presumably because the SLURM
> > state files (and possibly binaries?) live on GPFS and are impacted by the
> > heavy load.
>
> As some additional background, what's happening here is that the state save
> thread needs to acquire and hold the read locks on most of the internal data
> structures. This precludes any other threads from acquiring the write locks
> on those same structures, which is a fairly common operation. Delays writing
> will thus cause other operations to stall.

By "other operations" I assume this includes stalls on common commands like
sbatch, squeue, etc?

> > What is your recommended setup for sharing state files between the two SLURM
> > HA servers? Do most sites use GPFS/Lustre or a NFS mount?
>
> A shared NFS mount is the most common by far. I believe some very large
> installations (10k+) have set it on a dedicated NetApp, although that's
> overkill in most situations.
>
> If you have some other filesystem present in the cluster switching to that
> may be sufficient. A lot of installations seem to keep a smaller NFS array
> around for home directories / cluster-wide application installations, and if
> that becomes unavailable the user jobs would fail to launch anyways.
>
> > One approach we are considering is DRBD
> > (https://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device). Do you
> > have any experience with DRBD? Specifically, do you know if any other sites
> > store state files and do some sort of SLURM HA through DRBD (rather than
> > using SLURM's built in HA)?
>
> I'm not aware of anyone using DRBD in production, although I believe it
> would work in theory. You'd want to test that extensively, and make sure
> that it doesn't introduce too much additional latency.
>
> I do know some places run slurm within other HA stacks with mixed results,
> and it's not a setup I recommend.

Can you comment on how this might impact our support, or ability to get
support from SchedMD on HA-related issues?

> I've personally used OCFS2 in the past with good results - for that all
> you'd need is a shared storage array accessible from both controllers over
> SAS / iSCSI / FC, plus some additional configuration.
>
> One caveat - until we have a further patch ready, it's best to ensure that
> the network path used to communicate between the two controllers and the
> compute nodes is the same as is used to access the shared storage. This
> helps avoid the most common 'split-brain' scenarios where the controllers
> have lost contact but are both able to access and modify the state files
> simultaneously, leading to potential corruption.

Can you confirm that the secondary controller (while inactive) is not
expected to write or hold locks?
> > As some additional background, what's happening here is that the state save
> > thread needs to acquire and hold the read locks on most of the internal data
> > structures. This precludes any other threads from acquiring the write locks
> > on those same structures, which is a fairly common operation. Delays writing
> > will thus cause other operations to stall.
>
> By "other operations" I assume this includes stalls on common commands like
> sbatch, squeue, etc?

sbatch/salloc/srun in particular, yes. Along with anything else changing the
status of a job or node - e.g., the job completing. squeue and sinfo aren't
directly impacted, but may be delayed by other outstanding requests.

> Can you comment on how this might impact our support, or ability to get
> support from SchedMD on HA-related issues?

We don't have a specific list of supported configurations for HA (although I
do plan to document a recommended setup at some point; none of our public
documentation covers this at present).

We'll help to the best of our abilities, but if it looks like the state files
were corrupted due to the HA setup there's not much we can do at that point,
and you may lose the job queue. That's almost always a worst-case scenario,
but it has happened a few times, and is why I'm always very conservative with
the deployment recommendations I make.

At the end of the day, it's your system, and your call on how to assess and
mitigate risks from different deployment strategies. All I can do is advise. :)

> > I've personally used OCFS2 in the past with good results - for that all
> > you'd need is a shared storage array accessible from both controllers over
> > SAS / iSCSI / FC, plus some additional configuration.
> >
> > One caveat - until we have a further patch ready, it's best to ensure that
> > the network path used to communicate between the two controllers and the
> > compute nodes is the same as is used to access the shared storage.
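The stall pattern described above - a long-held read lock (the state save)
blocking any thread that needs the write lock (e.g., a job submission) - can
be illustrated with a toy sketch. This is not Slurm's actual locking code,
just a minimal readers-writer lock in Python showing why slow storage delays
unrelated commands:

```python
import threading
import time

class RWLock:
    """Minimal readers-writer lock: a writer waits until all readers leave."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        self._cond.acquire()
        while self._readers:
            self._cond.wait()

    def release_write(self):
        self._cond.release()

lock = RWLock()
events = []

def state_save():
    # The "state save" holds a read lock for the duration of its (slow) I/O.
    lock.acquire_read()
    events.append("save: read lock held")
    time.sleep(0.5)  # simulated slow write+fsync to overloaded storage
    lock.release_read()

def submit_job():
    # An "sbatch" needs the write lock, and stalls until the save finishes.
    t0 = time.monotonic()
    lock.acquire_write()
    events.append(f"submit: waited {time.monotonic() - t0:.2f}s")
    lock.release_write()

saver = threading.Thread(target=state_save)
writer = threading.Thread(target=submit_job)
saver.start()
time.sleep(0.1)  # let the state save begin first
writer.start()
saver.join()
writer.join()
print(events)
```

The submission thread ends up waiting roughly as long as the simulated I/O
takes, which is the stall users see on sbatch/salloc/srun when the save is
slow.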
> > This helps avoid the most common 'split-brain' scenarios where the
> > controllers have lost contact but are both able to access and modify the
> > state files simultaneously, leading to potential corruption.
>
> Can you confirm that the secondary controller (while inactive) is not
> expected to write or hold locks?

While inactive, the secondary controller does not access the state files, or
directly affect the primary. Its only job at that stage is to ping the primary
and ensure it's still operational; this happens every 40 seconds by default
(one third of the SlurmctldTimeout value, which defaults to 120 seconds), and
should not add any perceptible load to the primary.
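The interval rule above is simple enough to state as a one-liner (a
hypothetical helper, not part of Slurm; it just mirrors the arithmetic Tim
describes):

```python
def backup_ping_interval(slurmctld_timeout=120):
    """Seconds between the backup controller's health pings to the primary:
    one third of SlurmctldTimeout (default 120s -> a ping every 40s)."""
    return slurmctld_timeout // 3

print(backup_ping_interval())  # prints 40 for the default 120s timeout
```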
Marking resolved/infogiven; please reopen if there's anything further I can
address.

cheers,
- Tim