| Summary: | GPU cluster slurmctld down frequently | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Hiroshi Kobayashi <hiroshi.kobayashi> |
| Component: | Cloud | Assignee: | Director of Support <support> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | nick |
| Version: | 20.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | WDC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | logfiles and config files | ||
The issue is fixed by commit 6b6b3879e97208a0 [1], which was merged into the Slurm 20.11 branch and released in 20.11.7. Please upgrade to obtain the fix.

[1] https://github.com/SchedMD/slurm/commit/6b6b3879e97208a041c104df1ccf2574a60ecf27

*** This ticket has been marked as a duplicate of ticket 11480 ***
Created attachment 21907 [details] logfiles and config files

Hi SchedMD support,

I am running a GPU cluster on GCP. I updated the cluster configuration so that a fresh compute node is spun up for every job (to refresh the GPUs), in the following steps:

1. Stop all jobs.
2. Add "OverSubscribe=Exclusive" to /usr/local/etc/slurm/slurm.conf.
3. Switch the exclusive option from "false" to "true" in /slurm/scripts/config.yaml.
4. Restart slurmctld on the controller node.

After a few array job submissions and job cancellations, slurmctld became unstable and went down several times. I cannot find the exact reason for the service going down. Could you take a look at the log files?
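For reference, the configuration change described above would look roughly like the following sketch. The partition name and node list are illustrative placeholders, and the exact config.yaml schema depends on the slurm-gcp release in use; only the `OverSubscribe=EXCLUSIVE` parameter and the `exclusive` flag come from the report itself.

```
# /usr/local/etc/slurm/slurm.conf (step 2)
# Illustrative partition line; the real PartitionName/Nodes values
# come from the cluster's own configuration.
PartitionName=gpu Nodes=gpu-compute-[0-9] OverSubscribe=EXCLUSIVE State=UP

# /slurm/scripts/config.yaml (step 3, slurm-gcp deployment config)
# exclusive: true tears down the compute node after each job finishes.
exclusive: true
```

With `OverSubscribe=EXCLUSIVE`, jobs in the partition get whole nodes to themselves, which combined with `exclusive: true` causes a node to be created per job and destroyed afterward.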