Ticket 3604 - cpuset cgroups don't seem to be cleaned up after job exits
Summary: cpuset cgroups don't seem to be cleaned up after job exits
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.02.1
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-03-20 11:56 MDT by Kilian Cavalotti
Modified: 2020-07-20 15:15 MDT
CC: 3 users

See Also:
Site: Stanford
Version Fixed: 17.02.3


Attachments
cgroup.conf (253 bytes, text/plain)
2017-03-21 07:56 MDT, Kilian Cavalotti
Details
slurm.conf (2.55 KB, text/plain)
2017-03-21 07:56 MDT, Kilian Cavalotti
Details
slurmd.log (31.93 KB, text/x-log)
2017-03-21 07:57 MDT, Kilian Cavalotti
Details

Description Kilian Cavalotti 2017-03-20 11:56:08 MDT
Hi!

Unless I missed something, I understand that Slurm now automatically cleans up cgroups when jobs exit, without needing to define any release agent in cgroup.conf.

It seems to work fine for the most part, except for the cpuset subsystem, where cgroup directories seem to persist after jobs are done.


For instance, on compute node sh-101-37, I can see that those subsystems are defined:

[root@sh-101-37 ~]# find /sys/fs/cgroup/ -type d -name slurm
/sys/fs/cgroup/memory/slurm
/sys/fs/cgroup/devices/slurm
/sys/fs/cgroup/cpuset/slurm
/sys/fs/cgroup/freezer/slurm

All jobs are done:

[root@sh-101-37 ~]# squeue -w localhost
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Yet I can still see job_XXX directories in the cpuset subsystem (but not in the others):

[root@sh-101-37 ~]# tree -L 1 /sys/fs/cgroup/*/slurm/uid*
/sys/fs/cgroup/cpuset/slurm/uid_215845
├── cgroup.clone_children
├── cgroup.event_control
├── cgroup.procs
├── cpuset.cpu_exclusive
├── cpuset.cpus
├── cpuset.mem_exclusive
├── cpuset.mem_hardwall
├── cpuset.memory_migrate
├── cpuset.memory_pressure
├── cpuset.memory_spread_page
├── cpuset.memory_spread_slab
├── cpuset.mems
├── cpuset.sched_load_balance
├── cpuset.sched_relax_domain_level
├── job_258
├── job_280
├── job_281
├── job_283
├── job_70
├── job_79
├── notify_on_release
└── tasks


Is this expected? Is there still a release agent to define for cpuset?

Thanks!
Kilian
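[Editorial note] For background on the release-agent question: before automatic cleanup, empty cgroup-v1 directories were typically reaped via the kernel's release-agent mechanism, sketched below. This is generic kernel behavior, not Slurm configuration — the helper path is illustrative, and on 16.05.5+ Slurm is supposed to remove these directories itself (see comment 2).

```shell
# Illustrative only: the classic cgroup-v1 release-agent wiring that
# Slurm's built-in cleanup replaces. The helper path is an example.

# 1) A tiny helper the kernel invokes with the emptied cgroup's path
#    (relative to the hierarchy root) as $1; it removes the directory.
cat > /tmp/cpuset_release_agent <<'EOF'
#!/bin/sh
rmdir "/sys/fs/cgroup/cpuset/$1" 2>/dev/null
EOF
chmod +x /tmp/cpuset_release_agent

# 2) Register the helper at the hierarchy root and request notifications
#    for the slurm subtree. Guarded so the sketch is harmless on nodes
#    without a mounted cgroup-v1 cpuset hierarchy.
if [ -w /sys/fs/cgroup/cpuset/release_agent ]; then
    echo /tmp/cpuset_release_agent > /sys/fs/cgroup/cpuset/release_agent
    echo 1 > /sys/fs/cgroup/cpuset/slurm/notify_on_release 2>/dev/null || true
fi
```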
Comment 1 Alejandro Sanchez 2017-03-21 04:47:01 MDT
Hi Kilian. We're going to take a look at this and will get back to you.
Comment 2 Alejandro Sanchez 2017-03-21 06:23:41 MDT
Kilian, the automatic cleanup of task/cgroup cpuset and devices subsystems after steps are done was introduced in the following commit:

https://github.com/SchedMD/slurm/commit/66beca68217f

which is available since 16.05.5 and 17.02.0pre2. I've just built Slurm 17.02.1 to test whether anything was accidentally broken, but the cleanup is done properly and automatically in my testbed without the need for any release-agent configuration:

While job is being executed:
$ lscgroup | grep slurm
freezer:/slurm_compute1
freezer:/slurm_compute1/uid_1001
freezer:/slurm_compute1/uid_1001/job_20004
freezer:/slurm_compute1/uid_1001/job_20004/step_0
memory:/slurm_compute1
memory:/slurm_compute1/uid_1001
memory:/slurm_compute1/uid_1001/job_20004
memory:/slurm_compute1/uid_1001/job_20004/step_0
cpuset:/slurm_compute1
cpuset:/slurm_compute1/uid_1001
cpuset:/slurm_compute1/uid_1001/job_20004
cpuset:/slurm_compute1/uid_1001/job_20004/step_0
devices:/slurm_compute1
devices:/slurm_compute1/uid_1001
devices:/slurm_compute1/uid_1001/job_20004
devices:/slurm_compute1/uid_1001/job_20004/step_0


After the job finishes:
$ lscgroup | grep slurm
freezer:/slurm_compute1
memory:/slurm_compute1
cpuset:/slurm_compute1
devices:/slurm_compute1

The behavior you're seeing is definitely _not_ expected. I'm wondering:

1. Are you sure these directories weren't created before the upgrade?
2. Can you check the slurmd version on this node 'slurmd -V'?
3. Is this happening in this specific node or in all nodes?
4. Which OS and Linux kernel version is being used in this node?
5. Could you attach your slurm.conf, cgroup.conf, and the slurmd.log messages from the time a test job starts until right after it finishes (to catch any relevant cgroup-related info), along with a job submission example and the output of 'lscgroup | grep slurm' both while the job is running and again after it finishes?

In the meantime, I'm going to see whether there's any difference in how the code cleans up the cpuset subsystem vs. the other subsystems. Thanks.
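[Editorial note] The requested diagnostics can be gathered in one pass with a script along these lines — a convenience sketch, not an official tool; each section prints its header and degrades to an empty body if the command is unavailable on the node:

```shell
#!/bin/sh
# Collect the cgroup-cleanup diagnostics requested above in one pass.
# Every command's failure is tolerated so partial output is still useful.

gather_diag() {
    printf '== slurmd version ==\n'
    slurmd -V 2>/dev/null || true
    printf '\n== OS release ==\n'
    cat /etc/redhat-release /etc/os-release 2>/dev/null | head -2
    printf '\n== kernel ==\n'
    uname -a
    printf '\n== slurm cgroups ==\n'
    lscgroup 2>/dev/null | grep slurm || true
}

gather_diag
```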
Comment 3 Kilian Cavalotti 2017-03-21 07:55:33 MDT
Hi Alejandro, 

Thanks for looking into this. Replies inline below.

(In reply to Alejandro Sanchez from comment #2)
> The behavior you're seeing is definitely _not_ expected. I'm wondering:
> 
> 1. Are you sure these directories weren't created before the upgrade?

That's a brand new system (we're building our next-gen cluster) that was installed directly with 17.02.


> 2. Can you check the slurmd version on this node 'slurmd -V'?

[root@sh-101-37 ~]# slurmd -V
slurm 17.02.1-2


> 3. Is this happening in this specific node or in all nodes?

I noticed it on all nodes, actually.


> 4. Which OS and Linux kernel version is being used in this node?

# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)

# uname -a
Linux sh-101-37.int 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


> 5. Could you attach your slurm.conf, cgroup.conf, and the slurmd.log messages
> from the time a test job starts until right after it finishes (to catch any
> relevant cgroup-related info), along with a job submission example and the
> output of 'lscgroup | grep slurm' both while the job is running and again
> after it finishes?

Sure, here they are (next post). To make sure everything was fresh, I rebooted the node right before submitting the job and capturing the logs. That also allowed me to narrow it down a little: I think it only happens when a multi-node job is executed (it doesn't seem to happen with single-node jobs):

Job submission: 
$ srun -N 2 -n 2 sleep 10

While the job is running:

[root@sh-101-37 ~]# lscgroup | grep slurm
cpu,cpuacct:/slurm
cpu,cpuacct:/slurm/uid_215845
cpu,cpuacct:/slurm/uid_215845/job_312
cpu,cpuacct:/slurm/uid_215845/job_312/step_0
cpu,cpuacct:/slurm/uid_215845/job_312/step_0/task_0
cpu,cpuacct:/slurm/uid_215845/job_312/step_extern
cpu,cpuacct:/slurm/uid_215845/job_312/step_extern/task_0
freezer:/slurm
freezer:/slurm/uid_215845
freezer:/slurm/uid_215845/job_312
freezer:/slurm/uid_215845/job_312/step_0
freezer:/slurm/uid_215845/job_312/step_extern
cpuset:/slurm
cpuset:/slurm/uid_215845
cpuset:/slurm/uid_215845/job_312
cpuset:/slurm/uid_215845/job_312/step_0
cpuset:/slurm/uid_215845/job_312/step_extern
memory:/slurm
memory:/slurm/uid_215845
memory:/slurm/uid_215845/job_312
memory:/slurm/uid_215845/job_312/step_0
memory:/slurm/uid_215845/job_312/step_0/task_0
memory:/slurm/uid_215845/job_312/step_extern
memory:/slurm/uid_215845/job_312/step_extern/task_0
memory:/slurm/system
devices:/slurm
devices:/slurm/uid_215845
devices:/slurm/uid_215845/job_312
devices:/slurm/uid_215845/job_312/step_0
devices:/slurm/uid_215845/job_312/step_extern

After the job is done:
[root@sh-101-37 ~]# lscgroup | grep slurm
cpu,cpuacct:/slurm
freezer:/slurm
cpuset:/slurm
cpuset:/slurm/uid_215845
cpuset:/slurm/uid_215845/job_312
cpuset:/slurm/uid_215845/job_312/step_0
cpuset:/slurm/uid_215845/job_312/step_extern
memory:/slurm
memory:/slurm/system
devices:/slurm
devices:/slurm/uid_215845
devices:/slurm/uid_215845/job_312
devices:/slurm/uid_215845/job_312/step_0
devices:/slurm/uid_215845/job_312/step_extern

I did this several times; it's 100% reproducible.
On the other hand, when submitting a job with -N 1, all the cgroups are correctly cleaned up.

> In the meantime, I'm going to see whether there's any difference in how the
> code cleans up the cpuset subsystem vs. the other subsystems.

Thanks!
Kilian
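[Editorial note] The reproduction above can be condensed into a small script — a sketch assuming a cgroup-v1 node with the mount points shown in this ticket; it exits quietly on machines without Slurm:

```shell
#!/bin/sh
# Minimal reproducer for this ticket: run the failing multi-node case,
# then look for job_* cpuset cgroups that survived job teardown.

check_leftovers() {
    # List job-level cpuset cgroups; these should be gone after the job ends.
    find /sys/fs/cgroup/cpuset/slurm -type d -name 'job_*' 2>/dev/null || true
}

if command -v srun >/dev/null 2>&1; then
    srun -N 2 -n 2 sleep 10   # multi-node job: the case that leaves leftovers
    sleep 2                   # give slurmd a moment to finish teardown
    leftovers=$(check_leftovers)
    if [ -n "$leftovers" ]; then
        printf 'LEFTOVER cpuset cgroups:\n%s\n' "$leftovers"
    else
        echo "cpuset cgroups cleaned up"
    fi
else
    echo "srun not available; run this on a cluster node"
fi
```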
Comment 4 Kilian Cavalotti 2017-03-21 07:56:15 MDT
Created attachment 4228 [details]
cgroup.conf
Comment 5 Kilian Cavalotti 2017-03-21 07:56:35 MDT
Created attachment 4229 [details]
slurm.conf
Comment 6 Kilian Cavalotti 2017-03-21 07:57:20 MDT
Created attachment 4230 [details]
slurmd.log
Comment 7 Alejandro Sanchez 2017-03-21 09:01:21 MDT
Kilian, I'm able to reproduce now with these conditions:

1. Multi-node jobs (thanks for narrowing that down) _and_
2. PrologFlags=contain

We don't need more info from your side for now. Thanks for your collaboration; we're going to investigate this further and get back to you.
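[Editorial note] A quick way to check whether a node matches these triggering conditions is to inspect the relevant settings in the live configuration — a hedged sketch using standard scontrol output (comment 14 below adds ProctrackType=proctrack/cgroup to the list); guarded for machines without Slurm:

```shell
#!/bin/sh
# Show the configuration knobs implicated in this ticket:
# PrologFlags (looking for "contain"), ProctrackType, and TaskPlugin.

check_conf() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show config 2>/dev/null \
            | grep -Ei 'PrologFlags|ProctrackType|TaskPlugin' || true
    else
        echo "scontrol not available"
    fi
}

check_conf
```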
Comment 8 Kilian Cavalotti 2017-03-21 09:22:41 MDT
(In reply to Alejandro Sanchez from comment #7)
> Kilian, I'm able to reproduce now with these conditions:
> 
> 1. Multi-node jobs (thanks for narrowing that down) _and_
> 2. PrologFlags=contain
> 
> We don't need more info from your side for now. Thanks for your
> collaboration; we're going to investigate this further and get back to you.

Excellent, thanks!
And I forgot to mention, but I'm sure you noticed: there are some leftovers in the devices subsystem too.

Cheers,
Kilian
Comment 9 Alejandro Sanchez 2017-03-21 09:52:31 MDT
(In reply to Kilian Cavalotti from comment #8)
> And I forgot to mention, but I'm sure you noticed: there are some leftovers
> in the devices subsystem too.

Yes, I did. This is probably related to how these two are cleaned up, since their removal code was introduced together in the commit mentioned above.
Comment 14 Alejandro Sanchez 2017-03-28 05:52:04 MDT
Kilian, just a quick update: in the tests I did locally, proctrack/cgroup is also necessary to reproduce this, and it seems ProctrackType and TaskPlugin race/conflict with each other when cleaning up cgroup resources. We're still looking at ways to address this issue. Thanks.
Comment 15 Kilian Cavalotti 2017-03-28 09:29:14 MDT
(In reply to Alejandro Sanchez from comment #14)
> Kilian, just a quick update: in the tests I did locally, proctrack/cgroup is
> also necessary to reproduce this, and it seems ProctrackType and TaskPlugin
> race/conflict with each other when cleaning up cgroup resources. We're still
> looking at ways to address this issue. Thanks.

Thanks for the update, much appreciated.
Comment 32 Alejandro Sanchez 2017-04-13 10:24:50 MDT
Kilian, I want to update you that I prepared a patch which seems to solve this issue after doing some tests. It is pending review by some team members to make sure it does not introduce any undesired side effects. I'll let you know as soon as we have something more solid. Thanks.
Comment 33 Kilian Cavalotti 2017-04-13 10:36:07 MDT
On 04/13/2017 09:24 AM, bugs@schedmd.com wrote:
> Kilian, I want to update you that I prepared a patch which seems to solve this
> issue after doing some tests. It is pending review by some team members to
> make sure it does not introduce any undesired side effects. I'll let you know
> as soon as we have something more solid. Thanks.

Thanks for the update!

Cheers,
Comment 58 Alejandro Sanchez 2017-04-18 12:26:09 MDT
Kilian, this has now been fixed in the following commit, which will be available in the next 17.02.3 release:

https://github.com/SchedMD/slurm/commit/24e2cb07e8e3

Adding Doug from NERSC to CC since he might be interested in this commit as well. Please let me know if either of you experiences any further cgroup cleanup issues after the patch; otherwise I'll go ahead and mark this as resolved. Thanks!
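[Editorial note] For nodes that accumulated orphaned directories before the fix, a one-time sweep along these lines can clear them — a hedged sketch: in cgroup v1, rmdir succeeds only on a cgroup with no tasks and no child cgroups, so directories belonging to running jobs are never removed. Run as root on each affected node.

```shell
#!/bin/sh
# One-time sweep for uid_*/job_*/step_* cgroup directories orphaned by
# this bug in the two affected hierarchies (cpuset and devices).

sweep_stale() {
    for sub in cpuset devices; do
        # -depth removes step_* children before their job_* parents;
        # rmdir failures (non-empty/busy cgroups) are silently ignored.
        find "/sys/fs/cgroup/$sub/slurm" -depth -mindepth 1 -type d \
            \( -name 'uid_*' -o -name 'job_*' -o -name 'step_*' \) \
            -exec rmdir {} \; 2>/dev/null
    done
    return 0
}

sweep_stale
```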