Summary: | slurmctld core dumps on job dealloc | | |
---|---|---|---|
Product: | Slurm | Reporter: | ifisk |
Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 2 - High Impact | ||
Priority: | --- | CC: | dsimon |
Version: | 17.02.10 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Simons Foundation & Flatiron Institute | |
Description
ifisk
2018-09-06 11:04:59 MDT
Marshall Garey

Hi, I think we've fixed this. The backtrace looks familiar. I'm double-checking that right now.

- Marshall

Marshall Garey

Yeah, this looks like a duplicate of bug 5452. It was fixed in this commit on 17.11:
https://github.com/SchedMD/slurm/commit/fef07a40972

You can cherry-pick this into 17.02. The NEWS file will conflict, but you can ignore that conflict and apply the rest of the change. This patch will prevent future occurrences, and I *think* it will also get the slurmctld up and running again. Can you verify that it does?

If it doesn't, here's a patch that will get the slurmctld running. You should still use the other patch to prevent future issues.
https://bugs.schedmd.com/attachment.cgi?id=7321&action=diff

Marshall Garey

Is slurmctld running again?

Marshall Garey

Dropping severity to sev-2 since we've identified the problem and have a fix available. Please let us know if it doesn't work and you can move it back to sev-1.
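For reference, the cherry-pick described above would look roughly like this. This is a sketch only: it assumes a git clone of https://github.com/SchedMD/slurm checked out on its 17.02 maintenance branch, and the branch name used here is an assumption.

```sh
# Sketch: apply the 17.11 fix to a 17.02 source tree.
# Branch name assumed; the commit hash is from the comment above.
git checkout slurm-17.02
git cherry-pick fef07a40972

# The NEWS file will conflict; per the comment above, that hunk can be
# discarded. Keep the 17.02 copy of NEWS and finish the cherry-pick:
git checkout --ours NEWS
git add NEWS
git cherry-pick --continue
```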
Ian (ifisk)

Hi,

Thanks. We are running Bright, so it’s less easy to patch. I think we are going to add

EnforcePartLimits=ALL

to the configuration, which should be a temporary patch.

Thanks, Ian
> On Sep 6, 2018, at 5:22 PM, bugs@schedmd.com wrote:
>
> Marshall Garey <marshall@schedmd.com> changed bug 5675 (https://bugs.schedmd.com/show_bug.cgi?id=5675):
>
> What     | Removed               | Added
> Severity | 1 - System not usable | 2 - High Impact
>
> Comment #4 (https://bugs.schedmd.com/show_bug.cgi?id=5675#c4) from Marshall Garey:
>
> Dropping severity to sev-2 since we've identified the problem and have a fix
> available. Please let us know if it doesn't work and you can move it back to
> sev-1.
Marshall Garey

You'll want to REMOVE EnforcePartLimits=ALL. (ALL is the problem.) EnforcePartLimits=ANY shouldn't have that problem.

Jason

Hi Ian,

Bright should be able to re-roll the tarballs with just the patch, or with a more recent version of Slurm. If they provide an updated tarball with a more recent Slurm, it may break some of the integration work they have done for the specific version, although they would not know for certain without testing.

Kind regards,
Jason

Follow-up from the site

For the record, we managed to work around the repeated crash by removing the node running the affected job (worker1017 from the stack dump) from slurm.conf. It has been stable since then. (The problem originally started on Slurm 17.02.2; we upgraded to 17.02.10, but the crashes continued until we removed the node.) Only a small number of jobs were lost.

We have since changed to EnforcePartLimits=ANY and will see if Bright can provide a patch, or their patched source so we can build our own. Thanks.

Marshall Garey

Thanks for the update. Definitely see if you can get that patch in.

Since 18.08 was released, we've technically dropped support for 17.02. We're giving sites time to upgrade to either 17.11 or 18.08, so we encourage you to make a plan to upgrade sometime soon. I'm keeping this ticket open for now.

Marshall Garey

Since you've found a workaround and know the fix, I'm going to close this ticket. Please reopen it if you have more issues.

- Marshall

*** This ticket has been marked as a duplicate of ticket 5452 ***
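For reference, a sketch of the slurm.conf pieces discussed in this ticket. The values are placeholders, and the NodeName parameters are hypothetical, not the site's actual node definition.

```
# EnforcePartLimits=ALL is what triggered the crash on unpatched 17.02
# (per the comments above); ANY avoids the affected code path.
EnforcePartLimits=ANY

# Temporary workaround used by the site: remove (here, comment out) the
# node that was running the affected job, then restart slurmctld.
# CPU/memory values below are placeholders.
#NodeName=worker1017 CPUs=28 RealMemory=256000
```

Note that changing node definitions in slurm.conf generally requires restarting slurmctld; an `scontrol reconfigure` alone is not sufficient on releases of this vintage.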