Ticket 12731

Summary: GPU cluster slurmctld down frequently
Product: Slurm Reporter: Hiroshi Kobayashi <hiroshi.kobayashi>
Component: Cloud Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: nick
Version: 20.11.4   
Hardware: Linux   
OS: Linux   
Site: WDC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: logfiles and config files

Description Hiroshi Kobayashi 2021-10-23 14:45:16 MDT
Created attachment 21907 [details]
logfiles and config files

Hi SchedMD support

I am running a GPU cluster on GCP.
After a while, I updated the cluster config so that a compute node is spun up for every job to refresh the GPUs, using the following steps.


1. stopped all jobs
2. added "OverSubscribe=Exclusive" to /usr/local/etc/slurm/slurm.conf
3. switched the exclusive option from "false" to "true" in /slurm/scripts/config.yaml
4. restarted slurmctld on the controller node
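
For reference, the changes in steps 2 and 3 might look something like the following. This is only a sketch: the partition name and the exact placement of the keys are assumptions, not taken from the attached config files.

```
# /usr/local/etc/slurm/slurm.conf (excerpt; partition name "gpu" is hypothetical)
# OverSubscribe=EXCLUSIVE gives each job whole nodes, so a fresh node is
# spun up per job.
PartitionName=gpu Nodes=ALL Default=YES OverSubscribe=EXCLUSIVE

# /slurm/scripts/config.yaml (excerpt; surrounding structure is hypothetical)
# The GCP deployment scripts use this flag to mark the partition exclusive.
exclusive: true
```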


After a few array job submissions and cancellations, slurmctld became unstable and went down several times.

I cannot find the exact reason for the service going down.
Could you take a look at the log files?
Comment 3 Jason Booth 2021-10-25 11:43:34 MDT
The issue is fixed by commit 6b6b3879e97208a0[1], which was merged into the Slurm 20.11 branch and released in 20.11.7.

https://github.com/SchedMD/slurm/commit/6b6b3879e97208a041c104df1ccf2574a60ecf27

Please upgrade to obtain the fix.

*** This ticket has been marked as a duplicate of ticket 11480 ***