Ticket 5082

Summary: unable to create cgroup
Product: Slurm Reporter: Robert Yelle <ryelle>
Component: slurmstepd Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: ahkumar, charles.wright, novosirj, rmoye, slurm-support, tim
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9721
https://bugs.schedmd.com/show_bug.cgi?id=14013
Site: University of Oregon Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: current cgroup.conf
lscgroup | grep -i slurm

Description Robert Yelle 2018-04-19 13:08:07 MDT
Created attachment 6657 [details]
current cgroup.conf

Hello,

We have had a growing number of cases lately where jobs are failing with:

slurmstepd: error: task/cgroup: unable to add task[pid=46628] to memory cg '(null)'
slurmstepd: error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_1333/job_1061833' : No space left on device
slurmstepd: error: jobacct_gather/cgroup: unable to instanciate job 1061833 memory cgroup

We are running RHEL7.4 with the following kernel:

3.10.0-693.17.1.el7.x86_64

I saw the reference to the deprecated CgroupReleaseAgentDir in related tickets, so I have commented that out, but we have continued to see failures since then.  Is this a bug in 17.11.5 (we did not receive complaints about this prior to 17.11.5), or in the kernel itself?  Do you have a workaround for this?

Our current cgroup.conf file is attached, let me know if you need anything else.

Thanks,

Rob
Comment 1 Marshall Garey 2018-04-19 14:33:10 MDT
There was another ticket that came in with this same error. It's a private ticket, so you can't see it. The workaround for them is to reboot the node. This was occurring on some nodes, but not all.

From bug 3749 comment 1:

"I'm guessing that the afflicted node has run more jobs than the other okay nodes? If you reboot midway2-0414 does that clear up the error?

If it does, then I think we can conclude the Linux kernel bug discussed in bug 3364 is still present and causing problems; you'd need to find a way to upgrade to a fixed Linux kernel to resolve that if so."

Does this apply to you?
Is this occurring on all nodes, or just some of the nodes?
Does rebooting the node fix the problem?

In the past this error was caused by a Linux kernel bug. I'm not sure if your version of the kernel has the fixes or not. Can you contact RedHat and see if they've included the fixes linked in https://bugs.schedmd.com/show_bug.cgi?id=3890#c3 in your version of the kernel? I've copied them here for reference:

https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546
https://github.com/torvalds/linux/commit/24ee3cf89bef04e8bc23788aca4e029a3f0f06d9
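
For reference, one rough way to check whether your installed kernel claims to carry either fix is to search the kernel RPM changelog for the commit subjects. This is only a heuristic (the changelog wording is RedHat's own), so their support is still the authoritative answer:

uname -r
rpm -q --changelog kernel-$(uname -r) | grep -i "memcontrol\|cpuset"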



The CgroupReleaseAgentDir cgroup.conf parameter is obsolete and completely ignored, so taking it out of cgroup.conf doesn't make any functional difference (as you saw).
Comment 2 Marshall Garey 2018-04-19 14:38:14 MDT
The other recent bug was for Slurm version 17.02.9, so I don't think this is a problem with 17.11.5.
Comment 3 Marshall Garey 2018-04-26 10:13:11 MDT
Have you had a chance to look at this ticket?
Comment 4 Robert Yelle 2018-04-26 10:30:28 MDT
Hi Marshall,

Yes I did, thanks.

Rebooting the affected nodes did resolve the issue.  Have not found out if there is a fixed RHEL7 kernel for this yet, but if I do, I will let you know.

Rob
Comment 5 Marshall Garey 2018-04-26 10:40:31 MDT
> Rebooting the affected nodes did resolve the issue.  Have not found out if
> there is a fixed RHEL7 kernel for this yet, but if I do, I will let you know.
Great, I'm glad you at least have a workaround, even if it's not ideal.

I'm going to close this as a duplicate of bug 3749 for now. If or when you do find out if your kernel has the fix, let us know. If your kernel has the fix and you're still seeing these issues, please reopen this ticket by changing the status from resolved to unconfirmed and leave a comment.

*** This ticket has been marked as a duplicate of ticket 3749 ***
Comment 6 Marshall Garey 2018-04-26 10:47:31 MDT
Actually, I'm going to reopen this to ask you to run a test for me. Can you run

lscgroup | grep -i slurm (or grep for nodename, either one works)

on a node when it has this problem? Maybe cgroups aren't properly being cleaned up, and thus a whole bunch of old cgroups are hanging around, causing an error like ENOSPC.
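
For a quick summary rather than the full listing, something like the following should give a total and a per-controller breakdown (assuming the libcgroup-tools package that provides lscgroup is installed):

lscgroup | grep -ci slurm
lscgroup | grep -i slurm | cut -d: -f1 | sort | uniq -c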
Comment 7 Charles Wright 2018-04-30 15:15:04 MDT
Created attachment 6729 [details]
lscgroup | grep -i slurm

We are seeing this on 17.11.2 also.
Comment 8 Marshall Garey 2018-05-01 12:40:49 MDT
Thanks Charles. That doesn't seem like enough cgroups to cause an ENOSPC error. I'm going to do some more testing on our CentOS 7 machine to see if I can reproduce the issue and find out anything more.
Comment 9 Robert Yelle 2018-05-01 12:57:41 MDT
Hi Marshall,

The issue has surfaced again recently.  I ran "lscgroup | grep -i slurm | wc -l" on one of the affected nodes and it returned 337 cgroups (many of them old), even though the node had been drained and was not in current use.  So I think this confirms that cgroups are not being cleaned up.  I rebooted one of the affected nodes this morning and that cleaned it up.

Rob
Comment 10 Marshall Garey 2018-05-01 12:58:46 MDT
Thanks Robert. I wouldn't think only 337 would cause this, but since they're old and not getting properly cleaned up, that's definitely an issue. I'm looking into this.
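
As a rough way to see which of those leftover cgroups belong to jobs that are no longer running on the node, something like this should work (a sketch; it assumes GNU grep with -P support and that the node's short hostname matches its Slurm node name):

lscgroup | grep -oP 'job_\K[0-9]+' | sort -u > /tmp/cgroup_jobs
squeue -h -w "$(hostname -s)" -o '%A' | sort -u > /tmp/squeue_jobs
comm -23 /tmp/cgroup_jobs /tmp/squeue_jobs    # job IDs that still have a cgroup but no running job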
Comment 13 Marshall Garey 2018-05-04 17:06:28 MDT
I've replicated the issue of job and uid cgroups in freezer not getting cleaned up. I'm working on a patch to fix that. I'm not certain if that causes or contributes to the original error (copied here):

slurmstepd: error: task/cgroup: unable to add task[pid=46628] to memory cg '(null)'
slurmstepd: error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_1333/job_1061833' : No space left on device
slurmstepd: error: jobacct_gather/cgroup: unable to instanciate job 1061833 memory cgroup

I've learned that lscgroup doesn't actually say how many cgroups exist. Can either of you upload the output of the following on a node that currently has the "No space left on device" error?

cat /proc/cgroups

It will say how many cgroups actually exist for each subsystem. I believe there's a 64k limit on the number of current cgroups allowed.
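
For the memory controller specifically, the interesting number is the third column (num_cgroups), e.g.:

awk '$1 == "memory" {print $1, $3}' /proc/cgroups

If that number is nowhere near 65536, then the limit on concurrently existing cgroups probably isn't what's being hit.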

If your kernel turns out not to have the following patch (it was committed upstream in 2016, so I would expect RedHat to have included it, but I don't know for sure), then I think there's a limit of 64k cgroups that can ever be created, not just 64k existing at once.

https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546


I'll let you know as soon as we have a patch ready for you to test.
Comment 15 Charles Wright 2018-05-04 17:29:45 MDT
Thanks Marshall

Here's a few nodes worth.

Also, I opened a case with Redhat about the commits.
Redhat case number: 02089400

[root@mgt1.farnam ~]# ssh c13n06 cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	5	47	1
cpu	10	76	1
cpuacct	10	76	1
memory	11	84	1
devices	3	101	1
freezer	9	76	1
net_cls	6	1	1
blkio	7	76	1
perf_event	2	1	1
hugetlb	8	1	1
pids	4	1	1
net_prio	6	1	1
[root@mgt1.farnam ~]# ssh c28n08 cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	9	59	1
cpu	2	63	1
cpuacct	2	63	1
memory	10	68	1
devices	6	110	1
freezer	3	72	1
net_cls	8	1	1
blkio	4	63	1
perf_event	5	1	1
hugetlb	11	1	1
pids	7	1	1
net_prio	8	1	1
[root@mgt1.farnam ~]# ssh c28n07 cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	10	75	1
cpu	3	67	1
cpuacct	3	67	1
memory	11	71	1
devices	5	138	1
freezer	6	129	1
net_cls	7	1	1
blkio	8	67	1
perf_event	4	1	1
hugetlb	2	1	1
pids	9	1	1
net_prio	7	1	1
[root@mgt1.farnam ~]# ssh c28n06 cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	10	95	1
cpu	4	67	1
cpuacct	4	67	1
memory	2	75	1
devices	9	141	1
freezer	11	126	1
net_cls	6	1	1
blkio	3	67	1
perf_event	7	1	1
hugetlb	5	1	1
pids	8	1	1
net_prio	6	1	1
Comment 21 Charles Wright 2018-05-08 18:52:33 MDT
Hello, 
   Redhat told me they have opened a bugzilla ticket, which I can't access, but they seem to be studying my SOS report and have escalated my support case.

https://bugzilla.redhat.com/show_bug.cgi?id=1470325

If you have any information you think would be helpful to pass to redhat, I have their attention.

Also, should this ticket be set to CONFIRMED now that SchedMD has reproduced it?

Thanks.
Comment 22 Marshall Garey 2018-05-09 09:39:19 MDT
I've been perusing the internet, and found a few other places where people have posted they're having the exact same "No space left on device" bug when trying to create memory cgroups.

https://github.com/moby/moby/issues/6479#issuecomment-97503551
https://github.com/moby/moby/issues/24559
https://bugzilla.kernel.org/show_bug.cgi?id=124641

It looks like there have been multiple problems with memory cgroups in the Linux kernel, and they've been fixed in kernel 4.<something> (different minor versions depending on which fix you look at). I believe to get all the fixes, at least from those threads, you need kernel 4.4.<something>.

It sounds like none of those fixes have been backported to Linux 3.10, including the latest RedHat/CentOS 7 kernels. You may want to point them at those threads and bug them to backport fixes into the 3.10 kernel.

One of the posts said a problem had something to do with kernel memory limits. You could try disabling kmem limits in your cgroup.conf file by setting

ConstrainKmemSpace=No

but I'm not convinced that will fix all of the problems. I haven't specifically reproduced the "No space left" problem, unfortunately, so I can't test that.
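
If it helps with debugging, you can also check whether a kmem limit is actually being applied to a job's memory cgroup by reading the limit file directly (the uid/job numbers below are placeholders taken from the error message above):

cat /sys/fs/cgroup/memory/slurm/uid_1333/job_1061833/memory.kmem.limit_in_bytes

A very large value (on the order of 9223372036854771712) means no kmem limit is set; anything smaller means kmem is being constrained.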

We're still working on the patch to fix the leaking of freezer cgroups, but I'm pretty convinced that doesn't have anything to do with the "No space left" bug with memory cgroups.
Comment 23 Marshall Garey 2018-05-10 11:11:52 MDT
We still haven't been able to reproduce inside or outside of Slurm on CentOS 7 (or any system for that matter). Hopefully RedHat can say whether or not they've backported fixes to their kernel - their kernel is heavily patched, so it's hard to compare barebones Linux 3.10 to RedHat Linux 3.10.

We'll keep looking into this bug. Keep us updated on what RedHat says.
Comment 24 Robert Yelle 2018-05-10 15:06:48 MDT
Hi Marshall,

The “No space left” bug has bitten us again.  It had been quiet for a while, but has surfaced on five of our nodes today.

lscgroup | grep slurm | wc -l

gives values between 2229 and 2754. Again, well below 65000.

Is there any other information I can pull from these nodes that might be helpful to you before I reboot them and put them back into service?

Thanks,

Rob
Comment 25 Marshall Garey 2018-05-10 15:14:56 MDT
Just an lscgroup to see what's actually there, then feel free to reboot the node.

Thanks for your help and patience on this one.
Comment 27 Marshall Garey 2018-05-11 17:50:54 MDT
*** Ticket 5130 has been marked as a duplicate of this ticket. ***
Comment 28 Charles Wright 2018-05-16 16:15:32 MDT
Per Redhat Support:

Neither of the commits is currently in the kernel. The first commit, "mm: memcontrol: fix cgroup creation failure after many small jobs", was included and removed twice due to causing a regression with OpenShift nodes. The changes in question are being tracked within a bug. The second one, "cpuset: use trialcs->mems_allowed as a temp variable", is not currently tracked as far as we can tell. If you have a reproducer, I can work to open a bugzilla report on this one as well.
Comment 29 Charles Wright 2018-05-17 11:26:34 MDT
Hi Marshall,

Previously you stated

I've replicated the issue of job and uid cgroups in freezer not getting cleaned up. I'm working on a patch to fix that. I'm not certain if that causes or contributes to the original error.


Redhat states: 

If you have a reproducer, I can work to open a bugzilla report on this one as well.

Do you want me to pass along your work to Redhat?

Thanks.
Comment 30 Marshall Garey 2018-05-17 11:42:37 MDT
(In reply to Charles Wright from comment #29)
> Hi Marshall,
> 
> Previously you stated
> 
> I've replicated the issue of job and uid cgroups in freezer not getting
> cleaned up. I'm working on a patch to fix that. I'm not certain if that
> causes or contributes to the original error.
> 
> 
> Redhat states: 
> 
> If you have a reproducer, I can work to open a bugzilla report on this one
> as well.
> 
> Do you want me to pass along your work to Redhat?
> 
> Thanks.

No, I believe the leaking of freezer cgroups is a completely separate issue.

I have finally successfully reproduced the memory cgroup "no space left on device" bug using Slurm in a CentOS VM, and I'm working on creating a reproducer outside of Slurm. If/when I get one working, I'll send that over to you to pass along to RedHat. I'm sure they'd appreciate a reproducer that isn't 550k lines of C.

Thanks for following up with RedHat. It makes me more confident that I can create a reproducer outside of Slurm if I mimic how Slurm creates and destroys cgroups.
Comment 31 Marshall Garey 2018-05-29 14:47:31 MDT
After further research, I believe my suspicion in comment 22 was correct - that constraining kmem.limit_in_bytes leaks memory.

See these two comments from the same bug report:

https://github.com/moby/moby/issues/6479#issuecomment-97503551
https://github.com/moby/moby/issues/6479#issuecomment-97352331

I was able to reproduce the issue following the directions there (but launching sbatch jobs instead of kubernetes pods).
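
Roughly, the reproduction just amounts to churning through a large number of short jobs so that memory cgroups are created and destroyed over and over while kmem limits are being applied; a minimal sketch (the job count is arbitrary):

for i in $(seq 1 5000); do
    sbatch --wrap='true' --output=/dev/null >/dev/null
done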

Turning off the kmem space constraint in my cgroup.conf fixes the issue:

ConstrainKmemSpace=no

You shouldn't need to restart the slurmd for this to take effect. This will affect new jobs. Can you let us know if you continue to see this problem after that?
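
For context, that line just sits alongside the other constraint settings in cgroup.conf; a sketch (the surrounding lines are only an example, keep your existing values):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainKmemSpace=no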


The relevant kernel patch is here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6e0b7fa11862433773d986b5f995ffdf47ce672

Can you ask RedHat if this has been patched into their kernel?
Comment 32 Marshall Garey 2018-05-29 15:25:52 MDT
(In reply to Marshall Garey from comment #31)
> I was able to reproduce the issue following the directions there (but
> launching sbatch jobs instead of kubernetes pods).

Sorry, this comment was relating to yet another link:

https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-377512486
Comment 33 Marshall Garey 2018-05-29 15:29:00 MDT
I tested for the following bug in CentOS 7.4 (there's a reproducer program in its description).

https://patchwork.kernel.org/patch/9184539/

I didn't reproduce the bug that patch fixes, so it would seem that particular bug is not the problem.
Comment 34 Marshall Garey 2018-05-29 16:08:34 MDT
*** Ticket 4937 has been marked as a duplicate of this ticket. ***
Comment 35 Marshall Garey 2018-05-29 16:25:52 MDT
Hi all,

It looks like this issue has already been reported to RedHat, and they're working on a fix:

https://bugzilla.redhat.com/show_bug.cgi?id=1507149

That thread confirmed what I just found - disabling kernel memory enforcement fixes the issue.

I'm closing this ticket as resolved/infogiven.

I'll be opening an internal bug about Slurm leaking freezer cgroups, since that's a separate issue. (I already have a patch that is pending review.)

I'll also be opening another internal ticket to discuss documenting this known bug and possibly disabling kernel memory enforcement as the default.

Please reopen if you continue to have issues after disabling kernel memory enforcement.

- Marshall
Comment 36 Marshall Garey 2018-08-06 09:55:41 MDT
*** Ticket 5497 has been marked as a duplicate of this ticket. ***
Comment 37 Ryan Novosielski 2019-09-09 15:08:22 MDT
Marshall, can I somehow get some information about this other bug that you've reported on leaking cgroups and what version of SLURM is required to fix the problem?
Comment 39 Marshall Garey 2019-09-09 15:26:52 MDT
(In reply to Ryan Novosielski from comment #37)
> Marshall, can I somehow get some information about this other bug that
> you've reported on leaking cgroups and what version of SLURM is required to
> fix the problem?

I'm assuming you're referring to comment 35 in which I mentioned Slurm leaked freezer cgroups occasionally. That was fixed in commit 7f9c4f7368d in 17.11.8. In general, you can look at our NEWS file to see a list of bug fixes relevant to administrators. This bug fix is in there.

https://github.com/SchedMD/slurm/blob/slurm-19-05-2-1/NEWS#L1211
 -- Prevent occasionally leaking freezer cgroups.
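
If you want to confirm whether a particular installation has it, you can check a source checkout for that commit or grep the NEWS file (paths assume a Slurm source tree):

git log --oneline -1 7f9c4f7368d
grep -n "leaking freezer cgroups" NEWS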

If you have more questions, create a new ticket, since this is an old ticket with a lot of people on CC.
Comment 40 Albert Gil 2020-03-05 00:48:51 MST
*** Ticket 8490 has been marked as a duplicate of this ticket. ***