| Summary: | Slurm won't restart in | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sanjaya Gajurel <sxg125> |
| Component: | slurmctld | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 1 - System not usable | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Case | ||
Description
Sanjaya Gajurel
2018-01-24 16:48:10 MST
That's a really old version of Slurm. Our support model requires customers to stay within the last two major releases (currently 17.02 or 17.11). You should upgrade ASAP.

Will you attach your slurm.conf and the full logs? Will you start the slurmctld with a high debug level (debug3) and attach those logs? Are there any core files? If so, will you give the backtraces from them?

Were you able to get the cluster back up?

Hi Brian,

We had to issue the command "service slurm startclean" to bring the cluster back, though it killed the running jobs. This is our old cluster (RedCat) and utilization was low. The issue seems to be the result of a reservation made earlier, when the cluster had more cores (2236) than it had when the reservation actually started (2044 cores), after transitioning 10 nodes to our new cluster (Rider).

Thank you,
-Sanjaya

On Thu, Jan 25, 2018 at 11:12 AM, <bugs@schedmd.com> wrote:
> Brian Christiansen <brian@schedmd.com> changed bug 4680 <https://bugs.schedmd.com/show_bug.cgi?id=4680>
>
> What      Removed              Added
> Assignee  support@schedmd.com  brian@schedmd.com
>
> Comment # 2 <https://bugs.schedmd.com/show_bug.cgi?id=4680#c2> on bug 4680 <https://bugs.schedmd.com/show_bug.cgi?id=4680> from Brian Christiansen <brian@schedmd.com>
>
> Were you able to get the cluster back up?
>
> You are receiving this mail because:
> - You reported the bug.

Hey Sanjaya,

Glad to hear that you're back up. I'm not able to reproduce the crash in 17.11. Let us know if you have any other questions.

Thanks,
Brian
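
The core-count mismatch described above (2236 cores when the reservation was made, 2044 when it started) is the kind of thing that could be checked before removing nodes. The following is only a minimal sketch, not something taken from this ticket. It assumes that `scontrol show reservation` prints whitespace-separated `Key=Value` fields including `ReservationName` and `CoreCnt`, which may not hold on older releases. The sample text, the reservation names, and the helper names `parse_reservations` and `oversized` are all hypothetical, and the sketch further assumes the reservation covered every core, which the thread does not state.

```python
import re


def parse_reservations(text):
    """Parse Key=Value records; records are separated by blank lines."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = dict(re.findall(r"(\w+)=(\S*)", block))
        if fields:
            records.append(fields)
    return records


def oversized(reservations, cluster_cores):
    """Return names of reservations whose CoreCnt exceeds cluster_cores."""
    return [
        r["ReservationName"]
        for r in reservations
        if int(r.get("CoreCnt", 0)) > cluster_cores
    ]


# Hypothetical sample, standing in for `scontrol show reservation` output.
SAMPLE = """\
ReservationName=example_big StartTime=2000-01-01T00:00:00 EndTime=2000-01-02T00:00:00
   Nodes=node[001-140] NodeCnt=140 CoreCnt=2236 State=INACTIVE

ReservationName=example_small StartTime=2000-01-03T00:00:00 EndTime=2000-01-03T04:00:00
   Nodes=node[001-002] NodeCnt=2 CoreCnt=32 State=INACTIVE
"""

# 2044 is the core count the cluster had left when the reservation started.
print(oversized(parse_reservations(SAMPLE), cluster_cores=2044))
```

Any reservation flagged this way could then be reviewed by hand before the node change takes effect. Whether that would have avoided the restart failure reported here is not established by the thread.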