Ticket 6719 - slurm database failure impact on cluster
Summary: slurm database failure impact on cluster
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 18.08.2
Hardware: Linux Other
: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2019-03-19 01:01 MDT by Kumaresan
Modified: 2020-02-19 12:59 MST

See Also:
Site: -Other-


Description Kumaresan 2019-03-19 01:01:29 MDT
Hello Team,

We have a small cluster running Slurm 18.08.2 with a MySQL database for accounting and limits.

1. We are trying to work out how much a sudden failure of the database or slurmdbd would affect the cluster.

2. As I understand it, Slurm can still dispatch jobs if the database fails, but we are not sure how long it can keep running without the database, or whether service will be affected by the failure. While the database is offline, where are the accounting details cached or stored?

3. Since the database holds the accounting details, what happens during a database failure to the records of jobs that consumed resources or were dispatched under accounts, and how are the fairshare details updated in the database when it comes back online?

Please let me know if any further details are required from our end.

Thanks.

_Kumaresan.