Ticket 671

Summary: sbatch error: unable to remove step memcg
Product: Slurm    Reporter: David Gloe <david.gloe>
Component: slurmd    Assignee: David Bigagli <david>
Status: RESOLVED INVALID
Severity: 4 - Minor Issue
Version: 14.11.x
Hardware: Linux
OS: Linux
Site: CRAY
CC: da
Attachments: slurmd log file showing the problem

Description David Gloe 2014-03-31 07:45:23 MDT
On sbatch jobs, we're consistently seeing this error at the end of job output:
slurmstepd: task/cgroup: unable to remove step memcg : Device or resource busy

This results in leftover memory control groups with no tasks like this one:
# cat /dev/mcgroup/slurm/uid_29597/job_7045/step_4294967294/tasks
#

There must be some process left in the memory control group when it's being removed.

We're using TaskPlugin=task/affinity,task/cgroup,task/cray.
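The leftover groups described above can be enumerated from the shell. A minimal sketch, assuming a helper run against the site's memory cgroup mountpoint (the example path is illustrative, not from the ticket):

```shell
# Sketch: list step-level memory cgroups that still exist on disk but hold
# no tasks. The mountpoint argument is site-specific (this site mounts
# cgroups under /dev, per the cgroup.conf later in the ticket).
find_empty_steps() {
    find "$1" -type d -name 'step_*' 2>/dev/null | while read -r d; do
        # A zero-length tasks file means no process is attached anymore.
        [ -s "$d/tasks" ] || echo "$d"
    done
}

# Example (hypothetical path):
# find_empty_steps /dev/mcgroup/slurm
```

Any directory this prints matches the symptom: an empty tasks file under a step_* group that rmdir nevertheless refuses to remove.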
Comment 1 David Bigagli 2014-03-31 08:03:15 MDT
Could you please upload the slurmd log file?
Do you use the release agent provided with Slurm?


Comment 2 David Gloe 2014-03-31 08:35:05 MDT
Created attachment 723 [details]
slurmd log file showing the problem

We don't use the Slurm release agent. Here's our cgroup.conf:

###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
#CgroupReleaseAgentDir="/etc/slurm/cgroup"
CgroupMountpoint="/dev"
ConstrainCores=yes
ConstrainRAMSpace=yes
TaskAffinity=no
Comment 3 David Bigagli 2014-03-31 08:44:18 MDT
Thanks, I am investigating the problem.

Comment 4 David Bigagli 2014-03-31 10:47:27 MDT
The messages you see indicate that there are still processes using the cgroup,
or jobs belonging to the same user running on the machine. The release agent
provided with Slurm should clean up those directories after the last process
exits. Why are you not using the release agent?

David
Comment 5 David Gloe 2014-04-02 05:55:38 MDT
(In reply to David Bigagli from comment #4)
> The messages you see indicate that there are still processes using the
> cgroup, or jobs belonging to the same user running on the machine. The
> release agent provided with Slurm should clean up those directories after
> the last process exits. Why are you not using the release agent?
> 
> David

We don't use the Slurm release agent due to some quirkiness on our compute nodes. Slurm is installed in a chroot environment, along with its release agent, so the release agent path (which is based on the actual root, not the chroot) isn't set correctly in the cgroup by Slurm.

We do have our own release agent we use for cpusets; I'll try setting the mcgroup release agent to that as well and see if that helps at all.
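Wiring a site-local release agent into the memory subsystem amounts to two writes at the subsystem root. A sketch, with both paths assumed for illustration:

```shell
# Sketch: attach a custom release agent to a cgroup v1 subsystem root.
# Both paths below are assumptions for illustration, not from the ticket.
set_release_agent() {   # set_release_agent SUBSYSTEM_ROOT AGENT_PATH
    printf '%s\n' "$2" > "$1/release_agent"
    # Ask the kernel to invoke the agent whenever a group becomes empty.
    printf '1\n' > "$1/notify_on_release"
}

# e.g. set_release_agent /dev/memory /opt/site/cgroup_release_agent
```

Note that the kernel resolves the release_agent path against the real root filesystem, which is exactly why a path recorded by a chroot-installed Slurm does not resolve correctly here.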
Comment 6 David Bigagli 2014-04-02 06:00:35 MDT
One thing we do in our release agent is to lock the entire subsystem, using
the flock command, before doing anything. Since the code that creates the
cgroups uses the flock() syscall as well, this allows for mutual exclusion.
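The locking idea can be sketched with the flock(1) utility: hold an exclusive lock on a per-subsystem lock file around any cleanup work. The lock file path here is an assumption for illustration:

```shell
# Sketch: serialize release-agent cleanup against cgroup creation by taking
# an exclusive flock on a per-subsystem lock file before removing anything.
# The lock file path is an assumption, not Slurm's actual location.
LOCKFILE="${LOCKFILE:-/tmp/slurm_memcg.lock}"
(
    flock -x 9          # blocks until no other holder of the lock remains
    # ... safe to rmdir empty step/job cgroup directories here ...
    echo "subsystem locked"
) 9> "$LOCKFILE"
```

Because the creating side takes the same lock, the agent can never observe (and try to remove) a directory mid-creation.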

Comment 7 David Gloe 2014-04-03 09:55:28 MDT
I've dug up a potential cause:
http://linux-kernel.2935.n7.nabble.com/PATCH-cgroup-fix-rmdir-EBUSY-regression-in-3-11-td710818.html

I'm looking to see if we have that patch on our compute nodes.
Comment 8 David Bigagli 2014-04-22 09:27:03 MDT
David, do you have an update on this issue?

Thanks, David
Comment 9 David Gloe 2014-04-22 09:29:12 MDT
After some more investigation, this is definitely a Cray kernel issue. I don't believe Slurm is at fault unless it's doing something strange that results in the memory control group having memory usage remaining when no tasks are left.
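That failure mode can be checked directly: on a kernel with the regression referenced in comment 7, a step memcg reports nonzero memory.usage_in_bytes even after its tasks file is empty, and rmdir then returns EBUSY. A sketch of the check (cgroup v1 file names; the helper is hypothetical):

```shell
# Sketch: decide whether a step memcg is actually removable. Under the
# kernel regression from comment 7, usage_in_bytes can stay nonzero after
# the last task exits, so rmdir fails with "Device or resource busy".
memcg_ready_for_rmdir() {   # memcg_ready_for_rmdir STEPDIR
    [ -s "$1/tasks" ] && return 1        # processes still attached
    usage=$(cat "$1/memory.usage_in_bytes" 2>/dev/null || echo 0)
    [ "$usage" -eq 0 ]                   # nonzero charges -> EBUSY on rmdir
}

# On cgroup v1, `echo 0 > memory.force_empty` asks the kernel to drop the
# remaining charges before retrying rmdir.
```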