Ticket 6719

Summary: slurm database failure impact on cluster
Product: Slurm Reporter: Kumaresan <kperiyasamy>
Component: Database Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: sts
Version: 18.08.2   
Hardware: Linux   
OS: Other   
Site: -Other-

Description Kumaresan 2019-03-19 01:01:29 MDT
Hello Team,

We have a small cluster running Slurm 18.08.2 with a MySQL database for accounting and limits.

1. We are trying to figure out how much a sudden failure of the database or slurmdbd would impact the cluster.

2. As I understand it, Slurm can still dispatch jobs if the database fails, but we are not sure how long it can keep running without the database, or whether service will be affected by the failure. While the database is offline, where are the accounting details cached or stored?

3. Since the database holds the accounting details, what happens during a database failure to the records for jobs that have consumed resources or been dispatched under accounts, and how are the fairshare details updated in the database when it comes back online?

Please let me know if any further details are required from our end.

Thanks.

_Kumaresan.