| Summary: | automated node reboots can incorrectly resume prior to reboot | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | KNL | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 16.05.8 | | |
| Hardware: | Cray XC | | |
| OS: | Linux | | |
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 16.05.10 and 17.02.1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
Description
Doug Jacobsen 2017-01-10 23:03:24 MST

Comment #5 (Moe Jette):
I'm not seeing this behavior unless capmc returns an error. Could you look for lines in your slurmctld log file around the reboot time that include the string "capmc_resume"? If found, they are likely to be rather terse, such as this one:

    error: capmc_resume[23797]: capmc(node_reinit,-n,1): 256

Comment #7 (Moe Jette):
Any update on this?

Comment #8 (Doug Jacobsen, 2017-02-02):
Hello,

We'll deploy 16.05.9 on cori next week (and 17.01.0-0rc1 on gerty as well). We will also re-enable dynamic mode reprovisions at that time, so I should be able to let you know by late next week if these corrections helped.

Thanks so much,
Doug
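The log check suggested in comment #5 can be run directly against the controller log. A minimal sketch follows; the sample log file, its path, and the first log line are illustrative only — on a real system, point grep at the file named by SlurmctldLogFile in slurm.conf:

```shell
# Build a tiny sample log so the search is reproducible; real entries
# live in the file named by SlurmctldLogFile in slurm.conf.
cat > /tmp/slurmctld.sample.log <<'EOF'
[2017-01-10T22:58:01.000] power_save: waking nodes nid00001
[2017-01-10T22:59:14.000] error: capmc_resume[23797]: capmc(node_reinit,-n,1): 256
EOF

# Scan for capmc_resume entries around the reboot window.
grep 'capmc_resume' /tmp/slurmctld.sample.log
```

An error line carrying a nonzero capmc exit status, like the 256 above, is the signature of capmc failing the reinit, which is the precondition for the premature resume this bug describes.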
Moe Jette:
Any updates?

The "capmc reinit" command that Slurm uses to reboot nodes still does not return any information about specific failures. If Slurm requests a reboot of 1000 nodes and something bad happens to any of them, I just get back a generic error and have to try to determine what happened. I've got a request to Cray for node-specific errors for the "reinit" command like they have for "set_mcdram_cfg" and "set_numa_cfg", but that doesn't seem to be going anywhere.

Comment #10 (Moe Jette):
I was revisiting this bug and was just able to reproduce the failure.

Comment #12 (Moe Jette, 2017-03-07):
Actually, I reproduced the bug with Slurm version 16.05.9 and can no longer reproduce it with version 16.05.10 (or 17.02.1). I believe the bug was fixed in this commit:
https://github.com/SchedMD/slurm/commit/f6d42fdbb293ca89da609779db8d8c04a86a8d13

Since there's been no update on this since Feb 2 and indications here are that the bug was fixed, I'll close this.

Doug Jacobsen:
Thanks, Moe!
Already fixed.
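The coarse-grained error behavior of "capmc reinit" complained about above can be illustrated with a stub. Everything here is hypothetical: capmc_stub merely stands in for the real `capmc node_reinit -n <nids>` invocation and mimics the reported behavior of one exit code covering the whole node list:

```shell
# Hypothetical stand-in for Cray's capmc CLI. Like the real node_reinit,
# it returns a single exit status for the entire nid list, with no
# indication of which node actually failed.
capmc_stub() {
    _cmd=$1
    nids=$2
    case "$nids" in
        *00002*) return 1 ;;   # pretend nid 00002 fails its reinit
        *)       return 0 ;;
    esac
}

if capmc_stub node_reinit "00001,00002,00003"; then
    echo "reinit ok"
else
    echo "reinit failed somewhere in 00001,00002,00003 (no per-node detail)"
fi
```

Until per-node errors are exposed for "reinit" the way they are for "set_mcdram_cfg" and "set_numa_cfg", a caller in this position can only report the generic failure and investigate the node set by other means.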