Summary: | unable to create cgroup | |
---|---|---|---
Product: | Slurm | Reporter: | Robert Yelle <ryelle>
Component: | slurmstepd | Assignee: | Marshall Garey <marshall>
Status: | RESOLVED INFOGIVEN | QA Contact: |
Severity: | 3 - Medium Impact | |
Priority: | --- | CC: | ahkumar, charles.wright, novosirj, rmoye, slurm-support, tim
Version: | 17.11.5 | |
Hardware: | Linux | OS: | Linux
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9721, https://bugs.schedmd.com/show_bug.cgi?id=14013 | |
Site: | University of Oregon | |
Attachments: | current cgroup.conf; lscgroup \| grep -i slurm | |
There was another ticket that came in with this same error. It's a private ticket, so you can't see it. The workaround for them is to reboot the node. This was occurring on some nodes, but not all. From bug 3749 comment 1:

"I'm guessing that the afflicted node has run more jobs than the other okay nodes? If you reboot midway2-0414 does that clear up the error? If it does, then I think we can conclude the Linux kernel bug discussed in bug 3364 is still present and causing problems; you'd need to find a way to upgrade to a fixed Linux kernel to resolve that if so."

Does this apply to you? Is this occurring on all nodes, or just some of the nodes? Does rebooting the node fix the problem?

In the past this error was caused by a Linux kernel bug. I'm not sure if your version of the kernel has the fixes or not. Can you contact RedHat and see if they've included the fixes linked in https://bugs.schedmd.com/show_bug.cgi?id=3890#c3 in your version of the kernel? I've copied them here for reference:

https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9

The CgroupReleaseAgentDir cgroup.conf parameter is obsolete and completely ignored, so taking it out of cgroup.conf doesn't make any functional difference (as you saw). The other recent bug was for Slurm version 17.02.9, so I don't think this is a problem with 17.11.5.

Have you had a chance to look at this ticket?

Hi Marshall,

Yes I did, thanks. Rebooting the affected nodes did resolve the issue. Have not found out if there is a fixed RHEL7 kernel for this yet, but if I do, I will let you know.

Rob

> Rebooting the affected nodes did resolve the issue. Have not found out if
> there is a fixed RHEL7 kernel for this yet, but if I do, I will let you know.

Great, I'm glad you at least have a workaround, even if it's not ideal. I'm going to close this as a duplicate of bug 3749 for now. If or when you do find out if your kernel has the fix, let us know. If your kernel has the fix and you're still seeing these issues, please reopen this ticket by changing the status from resolved to unconfirmed and leave a comment.

*** This ticket has been marked as a duplicate of ticket 3749 ***

Actually, I'm going to reopen this to ask you to run a test for me. Can you run

    lscgroup | grep -i slurm

(or grep for the node name; either one works) on a node when it has this problem? Maybe cgroups aren't properly being cleaned up, and thus a whole bunch of old cgroups are hanging around, causing an error like ENOSPC.
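A minimal sketch of that check, assuming lscgroup from libcgroup-tools is installed on the node (nothing below is specific to this site):

```bash
# Count leftover Slurm cgroups on an affected node. lscgroup prints lines of
# the form "controller:/path", so keying on the part before the colon gives a
# per-controller breakdown of what is still lying around.
lscgroup | grep -i slurm | cut -d: -f1 | sort | uniq -c

# Total number of Slurm cgroup directories still present:
lscgroup | grep -ci slurm
```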
Created attachment 6729 [details]
lscgroup | grep -i slurm
We are seeing this on 17.11.2 also.
Thanks Charles. That doesn't seem like enough cgroups to cause an ENOSPC error. I'm going to do some more testing on our CentOS 7 machine to see if I can reproduce the issue and find out anything more.

Hi Marshall,

The issue has surfaced again recently. I ran "lscgroup | grep -i slurm | wc -l" on one of the affected nodes and got 337 instances (many of them old), even though the node had been drained and was not in current use. So I think this confirms that cgroups are not being cleaned up. I rebooted one of the affected nodes this morning and that cleaned it up.

Rob

Thanks Robert. I wouldn't think only 337 would cause this, but since they're old and not getting properly cleaned up, that's definitely an issue. I'm looking into it.

I've replicated the issue of job and uid cgroups in freezer not getting cleaned up. I'm working on a patch to fix that. I'm not certain whether that causes or contributes to the original error (copied here):

    slurmstepd: error: task/cgroup: unable to add task[pid=46628] to memory cg '(null)'
    slurmstepd: error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_1333/job_1061833' : No space left on device
    slurmstepd: error: jobacct_gather/cgroup: unable to instanciate job 1061833 memory cgroup

I learned that lscgroup doesn't actually say how many cgroups there are. Can either of you upload the output of the following on a node that currently has the "No space left on device" error?

    cat /proc/cgroups

It will say how many cgroups actually exist for each subsystem. I believe there's a 64k limit on the number of current cgroups allowed. If your kernel turns out not to have the following patch (committed in 2016, so I don't know why RedHat wouldn't have included it, but I don't know), then I think there's a limit of 64k cgroups that can ever be made:

https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546

I'll let you know as soon as we have a patch ready for you to test.
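A small sketch for reading that output, assuming the standard /proc/cgroups layout; note that on kernels without the patch linked above, the 64k limit may apply to cgroups ever created, which /proc/cgroups does not show:

```bash
# Print live cgroup counts per subsystem and how close each is to the 64k
# (65535) ceiling discussed above. Column 3 of /proc/cgroups is num_cgroups;
# the first line is a header and is skipped.
awk 'NR > 1 { printf "%-12s %6d cgroups (%.1f%% of 65535)\n", $1, $3, $3 * 100 / 65535 }' /proc/cgroups
```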
Thanks Marshall. Here's a few nodes' worth. Also, I opened a case with Redhat about the commits.

Redhat case number: 02089400

    [root@mgt1.farnam ~]# ssh c13n06 cat /proc/cgroups
    #subsys_name  hierarchy  num_cgroups  enabled
    cpuset        5          47           1
    cpu           10         76           1
    cpuacct       10         76           1
    memory        11         84           1
    devices       3          101          1
    freezer       9          76           1
    net_cls       6          1            1
    blkio         7          76           1
    perf_event    2          1            1
    hugetlb       8          1            1
    pids          4          1            1
    net_prio      6          1            1

    [root@mgt1.farnam ~]# ssh c28n08 cat /proc/cgroups
    #subsys_name  hierarchy  num_cgroups  enabled
    cpuset        9          59           1
    cpu           2          63           1
    cpuacct       2          63           1
    memory        10         68           1
    devices       6          110          1
    freezer       3          72           1
    net_cls       8          1            1
    blkio         4          63           1
    perf_event    5          1            1
    hugetlb       11         1            1
    pids          7          1            1
    net_prio      8          1            1

    [root@mgt1.farnam ~]# ssh c28n07 cat /proc/cgroups
    #subsys_name  hierarchy  num_cgroups  enabled
    cpuset        10         75           1
    cpu           3          67           1
    cpuacct       3          67           1
    memory        11         71           1
    devices       5          138          1
    freezer       6          129          1
    net_cls       7          1            1
    blkio         8          67           1
    perf_event    4          1            1
    hugetlb       2          1            1
    pids          9          1            1
    net_prio      7          1            1

    [root@mgt1.farnam ~]# ssh c28n06 cat /proc/cgroups
    #subsys_name  hierarchy  num_cgroups  enabled
    cpuset        10         95           1
    cpu           4          67           1
    cpuacct       4          67           1
    memory        2          75           1
    devices       9          141          1
    freezer       11         126          1
    net_cls       6          1            1
    blkio         3          67           1
    perf_event    7          1            1
    hugetlb       5          1            1
    pids          8          1            1
    net_prio      6          1            1

Hello,

Redhat told me they have opened a bugzilla ticket, which I can't access, but they seem to be studying my SOS report and have escalated my support case.

https://bugzilla.redhat.com/show_bug.cgi?id=1470325

If you have any information you think would be helpful to pass to Redhat, I have their attention. Also, should this bug be set to CONFIRMED now that SchedMD has reproduced it? Thanks.

I've been perusing the internet, and found a few other places where people have posted that they're having the exact same "No space left on device" bug when trying to create memory cgroups:

https://github.com/moby/moby/issues/6479#issuecomment-97503551
https://github.com/moby/moby/issues/24559
https://bugzilla.kernel.org/show_bug.cgi?id=124641

It looks like there have been multiple problems with memory cgroups in the Linux kernel, and they've been fixed in kernel 4.<something> (different minor versions depending on which fix you look at). I believe to get all the fixes, at least from those threads, you need kernel 4.4.<something>. It sounds like none of those fixes have been backported to Linux 3.10, including the latest RedHat/CentOS 7 kernels. You may want to point them at those threads and bug them to backport the fixes into the 3.10 kernel.

One of the posts said a problem had something to do with kernel memory limits. You could try disabling kmem limits in your cgroup.conf file by setting

    ConstrainKmemSpace=No

but I'm not convinced that will fix all of the problems. I haven't specifically reproduced the "No space left" problem, unfortunately, so I can't test that.

We're still working on the patch to fix the leaking of freezer cgroups, but I'm pretty convinced that doesn't have anything to do with the "No space left" bug with memory cgroups. We still haven't been able to reproduce it inside or outside of Slurm on CentOS 7 (or any system, for that matter). Hopefully RedHat can say whether or not they've backported fixes to their kernel - their kernel is heavily patched, so it's hard to compare barebones Linux 3.10 to RedHat Linux 3.10. We'll keep looking into this bug. Keep us updated on what RedHat says.

Hi Marshall,

The "No space left" bug has bitten us again. It had been quiet for a while, but has surfaced on five of our nodes today. "lscgroup | grep slurm | wc -l" gives values between 2229 and 2754. Again, well below 65000.
Is there any other information I can pull from these nodes that might be helpful to you before I reboot them and put them back into service?

Thanks, Rob

Just an lscgroup to see what's actually there, then feel free to reboot the node. Thanks for your help and patience on this one.

*** Ticket 5130 has been marked as a duplicate of this ticket. ***

Per Redhat Support: Neither of the commits are currently in the kernel. The first commit, "mm: memcontrol: fix cgroup creation failure after many small jobs", was included and removed twice due to causing a regression with openshift nodes. The changes in question are being tracked within a bug. The second one, 'cpuset: use trialcs->mems_allowed as a temp variable', is not currently tracked as far as we can tell. If you have a reproducer, I can work to open a bugzilla report on this one as well.

Hi Marshall,

Previously you stated:

> I've replicated the issue of job and uid cgroups in freezer not getting
> cleaned up. I'm working on a patch to fix that. I'm not certain if that
> causes or contributes to the original error.

Redhat states:

> If you have a reproducer, I can work to open a bugzilla report on this one
> as well.

Do you want me to pass along your work to Redhat? Thanks.

(In reply to Charles Wright from comment #29)
> Do you want me to pass along your work to Redhat?

No, I believe the leaking of freezer cgroups is a completely separate issue. I have finally managed to reproduce the memory cgroup "no space left on device" bug using Slurm in a CentOS VM, and I'm working on creating a reproducer outside of Slurm. If/when I get one working, I'll send it over to you to pass along to RedHat. I'm sure they'd appreciate a reproducer that isn't 550k lines of C.

Thanks for following up with RedHat. It makes me more confident that I can create a reproducer outside of Slurm if I mimic how Slurm creates and destroys cgroups.

After further research, I believe my suspicion in comment 22 was correct - that constraining kmem.limit_in_bytes leaks memory. See these two comments from the same bug report:

https://github.com/moby/moby/issues/6479#issuecomment-97503551
https://github.com/moby/moby/issues/6479#issuecomment-97352331

I was able to reproduce the issue following the directions there (but launching sbatch jobs instead of kubernetes pods). Turning off the kmem space constraint in my cgroup.conf fixes the issue:

    ConstrainKmemSpace=no

You shouldn't need to restart the slurmd for this to take effect. This will affect new jobs. Can you let us know if you continue to see this problem after that?
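A minimal sketch of applying that workaround, assuming cgroup.conf lives at /etc/slurm/cgroup.conf (the actual path and contents for this site are in the attachment, not shown here); run as root on each compute node:

```bash
# Disable kernel-memory enforcement in cgroup.conf. Per the comment above,
# slurmd does not need a restart; the change applies to newly started jobs.
conf=/etc/slurm/cgroup.conf    # assumed path; adjust for your installation
if grep -q '^ConstrainKmemSpace' "$conf"; then
    sed -i 's/^ConstrainKmemSpace=.*/ConstrainKmemSpace=no/' "$conf"
else
    echo 'ConstrainKmemSpace=no' >> "$conf"
fi
grep ConstrainKmemSpace "$conf"    # verify the setting
```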
The relevant kernel patch is here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6e0b7fa11862433773d986b5f995ffdf47ce672

Can you ask RedHat if this has been patched into their kernel?

(In reply to Marshall Garey from comment #31)
> I was able to reproduce the issue following the directions there (but
> launching sbatch jobs instead of kubernetes pods).

Sorry, this comment was relating to yet another link: https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-377512486

I tested the following bug in CentOS 7.4 (there's a reproducing program in the description):

https://patchwork.kernel.org/patch/9184539/

I didn't reproduce the bug that patch fixes, so it would seem that particular bug is not the problem.

*** Ticket 4937 has been marked as a duplicate of this ticket. ***

Hi all,

It looks like this issue has already been reported to RedHat, and they're working on a fix:

https://bugzilla.redhat.com/show_bug.cgi?id=1507149

That thread confirmed what I just found - disabling kernel memory enforcement fixes the issue. I'm closing this ticket as resolved/infogiven. I'll be opening an internal bug about Slurm leaking freezer cgroups, since that's a separate issue. (I already have a patch that is pending review.) I'll also be opening another internal ticket to discuss documenting this known bug and possibly disabling kernel memory enforcement as the default. Please reopen if you continue to have issues after disabling kernel memory enforcement.

- Marshall

*** Ticket 5497 has been marked as a duplicate of this ticket. ***

Marshall, can I somehow get some information about this other bug that you've reported on leaking cgroups and what version of SLURM is required to fix the problem?

This is an automated reply; I am out of the office until Sept 11 and will not be able to reply to you immediately. I will get back to you as soon as I am able.

(In reply to Ryan Novosielski from comment #37)
> Marshall, can I somehow get some information about this other bug that
> you've reported on leaking cgroups and what version of SLURM is required to
> fix the problem?

I'm assuming you're referring to comment 35 in which I mentioned Slurm leaked freezer cgroups occasionally. That was fixed in commit 7f9c4f7368d in 17.11.8. In general, you can look at our NEWS file to see a list of bug fixes relevant to administrators. This bug fix is in there:

https://github.com/SchedMD/slurm/blob/slurm-19-05-2-1/NEWS#L1211

-- Prevent occasionally leaking freezer cgroups.

If you have more questions, create a new ticket, since this is an old ticket with a lot of people on CC.
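If it's useful, here is a quick check (a sketch, not taken from the ticket) of whether a given installation already carries that fix; the source-tree path is illustrative:

```bash
# Check the running Slurm version; the freezer-cgroup leak fix referenced
# above landed in 17.11.8.
scontrol --version        # e.g. "slurm 17.11.8"

# Or, from a Slurm source checkout (path is illustrative), confirm the fix
# commit mentioned above is an ancestor of the build you ship:
git -C /path/to/slurm merge-base --is-ancestor 7f9c4f7368d HEAD && echo "fix present"
```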
*** Ticket 8490 has been marked as a duplicate of this ticket. ***

Created attachment 6657 [details]
current cgroup.conf

Hello,

We have had a growing number of cases lately where jobs are failing with:

    slurmstepd: error: task/cgroup: unable to add task[pid=46628] to memory cg '(null)'
    slurmstepd: error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_1333/job_1061833' : No space left on device
    slurmstepd: error: jobacct_gather/cgroup: unable to instanciate job 1061833 memory cgroup

We are running RHEL7.4 with the following kernel: 3.10.0-693.17.1.el7.x86_64

I saw the reference to the deprecated CgroupReleaseAgentDir in related tickets, so I have commented that out, but we have had instances fail since that point. Is this a bug in 17.11.5 (we did not receive complaints about this prior to 17.11.5), or in the kernel itself? Do you have a workaround for this?

Our current cgroup.conf file is attached; let me know if you need anything else.

Thanks,
Rob
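A hedged triage sketch for a node that is showing this error, before it is rebooted; the paths assume the cgroup v1 layout RHEL 7 uses, and nothing here is taken from the attached cgroup.conf:

```bash
# Capture basic state from an affected node before rebooting it.
uname -r                      # running kernel, e.g. 3.10.0-693.17.1.el7.x86_64
cat /proc/cgroups             # live cgroup counts per subsystem

# Leftover Slurm job cgroups in the memory hierarchy that were never removed:
ls -d /sys/fs/cgroup/memory/slurm/uid_*/job_* 2>/dev/null | wc -l
```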