Ticket 3108

Summary: some cgroups not cleaned up as expected after a job finishes
Product: Slurm Reporter: Charles Wright <charles.wright>
Component: Configuration Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 16.05.4   
Hardware: Linux   
OS: Linux   
Site: Yale Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 16.05.5 17.02-pre2 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Charles Wright 2016-09-21 11:56:34 MDT
In the output below, job 107 has finished (and should have no cgroups); job 108 is running.

While a job is running we see cgroups for memory, cpuset, devices, and freezer, but after a job finishes only the memory and freezer cgroups are cleaned up.

It is unclear to me how to debug this, or how to configure Slurm to clean up all cgroups after a job finishes.

[root@c13n08.farnam ~]# lscgroup 
cpu,cpuacct:/
net_cls:/
memory:/
memory:/slurm
memory:/slurm/uid_10374
memory:/slurm/uid_10374/job_108
memory:/slurm/uid_10374/job_108/step_batch
cpuset:/
cpuset:/slurm
cpuset:/slurm/uid_10374
cpuset:/slurm/uid_10374/job_108
cpuset:/slurm/uid_10374/job_108/step_batch
cpuset:/slurm/uid_10374/job_107
cpuset:/slurm/uid_10374/job_107/step_batch
perf_event:/
blkio:/
hugetlb:/
devices:/
devices:/slurm
devices:/slurm/uid_10374
devices:/slurm/uid_10374/job_108
devices:/slurm/uid_10374/job_108/step_batch
devices:/slurm/uid_10374/job_107
devices:/slurm/uid_10374/job_107/step_batch
freezer:/
freezer:/slurm
freezer:/slurm/uid_10374
freezer:/slurm/uid_10374/job_108
freezer:/slurm/uid_10374/job_108/step_batch

[root@c13n08.farnam cgroup]# ls -l /etc/slurm/cgroup
total 4
-rwxr-xr-x 1 slurm slurm 3307 Sep 20 09:35 release_common
lrwxrwxrwx 1 slurm slurm   14 Sep 20 10:07 release_cpuset -> release_common
lrwxrwxrwx 1 slurm slurm   14 Sep 20 10:08 release_devices -> release_common
lrwxrwxrwx 1 slurm slurm   14 Sep 20 10:08 release_freezer -> release_common
lrwxrwxrwx 1 slurm slurm   14 Sep 20 10:08 release_memory -> release_common

[root@c13n08.farnam cgroup]# cat /etc/slurm/cgroup.conf 
# update this to where your release agents are installed:
CgroupReleaseAgentDir="/etc/slurm/cgroup"
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
TaskAffinity=yes
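The stray directories in the lscgroup listing above can be spotted mechanically. A minimal sketch, using sample data that mirrors the listing (on a real node, `RUNNING` would come from squeue and `listing` from lscgroup itself):

```shell
# Hypothetical sketch: given lscgroup-style output and the IDs of jobs that
# are still running, print cgroup directories left over from finished jobs.
RUNNING='108'
listing='cpuset:/slurm/uid_10374/job_108
cpuset:/slurm/uid_10374/job_108/step_batch
cpuset:/slurm/uid_10374/job_107
cpuset:/slurm/uid_10374/job_107/step_batch
devices:/slurm/uid_10374/job_107'
# Keep job cgroup lines, then drop any belonging to a running job ID.
orphans=$(printf '%s\n' "$listing" | grep 'job_' | grep -v -E "job_($RUNNING)(/|\$)")
printf '%s\n' "$orphans"
```

Here the job_107 entries are reported as orphans while the running job_108 entries are filtered out.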
Comment 1 Tim Wickberg 2016-09-21 12:12:30 MDT
What OS is installed on the node? I'm guessing RHEL7 or some other systemd-based distribution?

systemd tends to remove the release_agent setting on the cgroup mount, which then leads to these stray cgroup directories. They generally shouldn't cause any problems (until the node has processed thousands of jobs), and the 17.02 release has some additional work that should remove the need for the release_agent setting entirely, avoiding this conflict. (The 16.05 release already has similar work done to remove the release_agent requirement for memory and freezer.)

In the meantime, can you check the output from 'cat /proc/mounts' and see what options are set for the cgroup directory? If the release_agent line isn't there then that would explain the problem.
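That check boils down to string matching on the mount options. An illustrative sketch (the sample line imitates a /proc/mounts entry; on a real node you would loop over `grep cgroup /proc/mounts`):

```shell
# Illustrative sketch: test whether a cgroup mount entry still carries a
# release_agent= option. This sample line has had the option stripped,
# which is the failure mode described above.
line='cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0'
case "$line" in
  *release_agent=*) result="release_agent present" ;;
  *)                result="release_agent missing" ;;
esac
echo "$result"
```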
Comment 2 Charles Wright 2016-09-21 14:03:15 MDT
[root@c13n08.farnam cgroup]# cat /etc/redhat-release  ; mount -t cgroup
Red Hat Enterprise Linux Server release 7.2 (Maipo)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
[root@c13n08.farnam cgroup]#
Comment 3 Charles Wright 2016-09-22 08:56:05 MDT
The mount options on memory and cpuset are the same. One cleans up and the other doesn't. I'm not sure how to change the default cgroup mount options, or what to change them to. Can you test RHEL7 and provide instructions?

Thanks.
Comment 4 Tim Wickberg 2016-09-22 11:59:49 MDT


(In reply to Charles Wright from comment #3)
> The mount options on memory and cpuset are the same.   One cleans up and the
> other doesn't.   I'm not sure how to change the default cgroup mount options
> or what to change them to.   Can you test rhel7 and provide instructions?
> 
> Thanks.

Your mount output showed that Slurm's release_agent option had been removed in favor of systemd's:

> cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)

This has been a recurring problem on RHEL7, but we haven't been able to isolate the exact cause; it appears something within systemd eventually re-mounts the cgroup hierarchies and removes Slurm's release_agent option.

The reason you only see this on some of the hierarchies is that the memory and freezer hierarchies already had code in place to handle cleanup within Slurm itself, without relying on the release_agent. The cpuset and devices hierarchies were both missing this code. We'd previously committed a patch to the master branch that handles this, but it's become apparent that RHEL7 could really use that code now, and that we shouldn't wait until the next release to get this fix out.

Commit 66beca68217 pulls that extended cleanup logic in from the master branch, and it will be included in 16.05.5, which we expect to release shortly. (You can apply the patch in the meantime if you'd like, although, as mentioned, the orphaned directories shouldn't cause any issues until there are a significant number of them.)

From 16.05.5 on, you should no longer need the CgroupReleaseAgentDir setting in cgroup.conf at all; Slurm will then handle cleaning up all the cgroup directories properly on its own. I'll be revising the documentation to note that the setting is no longer required.

Just to summarize: once 16.05.5 is released, please update and remove the CgroupReleaseAgentDir setting from cgroup.conf.
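For reference, a cgroup.conf on 16.05.5 without the release-agent setting might look like the following sketch, which simply mirrors the options shown earlier in this ticket with CgroupReleaseAgentDir dropped:

```
# cgroup.conf (sketch for 16.05.5+; CgroupReleaseAgentDir no longer needed)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
TaskAffinity=yes
```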