| Summary: | cpuset cgroups don't seem to be cleaned up after job exits | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | slurmd | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | alex, dmjacobsen, tim |
| Version: | 17.02.1 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8911, https://bugs.schedmd.com/show_bug.cgi?id=9429 | | |
| Site: | Stanford | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 17.02.3 |
| Attachments: | cgroup.conf, slurm.conf, slurmd.log | | |
Hi Kilian. We're gonna take a look at this and will come back to you.

---

Kilian, the automatic cleanup of the task/cgroup cpuset and devices subsystems after steps are done was introduced in the following commit, available since 16.05.5 and 17.02.0pre2: https://github.com/SchedMD/slurm/commit/66beca68217f

I've just built a Slurm 17.02.1 to test whether anything was accidentally broken, but the cleanup is properly done automatically in my testbed without the need of any release agent configuration.

While the job is being executed:

```
$ lscgroup | grep slurm
freezer:/slurm_compute1
freezer:/slurm_compute1/uid_1001
freezer:/slurm_compute1/uid_1001/job_20004
freezer:/slurm_compute1/uid_1001/job_20004/step_0
memory:/slurm_compute1
memory:/slurm_compute1/uid_1001
memory:/slurm_compute1/uid_1001/job_20004
memory:/slurm_compute1/uid_1001/job_20004/step_0
cpuset:/slurm_compute1
cpuset:/slurm_compute1/uid_1001
cpuset:/slurm_compute1/uid_1001/job_20004
cpuset:/slurm_compute1/uid_1001/job_20004/step_0
devices:/slurm_compute1
devices:/slurm_compute1/uid_1001
devices:/slurm_compute1/uid_1001/job_20004
devices:/slurm_compute1/uid_1001/job_20004/step_0
```

After the job finishes:

```
$ lscgroup | grep slurm
freezer:/slurm_compute1
memory:/slurm_compute1
cpuset:/slurm_compute1
devices:/slurm_compute1
```

Your behavior is definitely _not_ expected. I'm wondering:

1. Are you sure these directories weren't created before the upgrade?
2. Can you check the slurmd version on this node with 'slurmd -V'?
3. Is this happening only on this specific node, or on all nodes?
4. Which OS and Linux kernel version is this node running?
5. Could you attach your slurm.conf, cgroup.conf, the slurmd.log messages from the timeframe while a test job is executed until right after it finishes (to catch any cgroup-related info), a job submission example, and the output of 'lscgroup | grep slurm' while the job is being executed and again when it finishes?
In the meantime, I'm gonna see if there's any difference in how the code cleans up the cpuset subsystem vs. how it does so for the other subsystems. Thanks.

---

Hi Alejandro,

Thanks for looking into this. Replies inline below.

(In reply to Alejandro Sanchez from comment #2)
> Your behavior is definitely _not_ expected. I'm wondering:
>
> 1. Are you sure these directories weren't created before the upgrade?

That's a brand new system (we're building our next-gen cluster), which was directly installed with 17.02.

> 2. Can you check the slurmd version on this node 'slurmd -V'?

```
[root@sh-101-37 ~]# slurmd -V
slurm 17.02.1-2
```

> 3. Is this happening in this specific node or in all nodes?

I noticed it on all nodes, actually.

> 4. Which OS and Linux kernel version is being used in this node?

```
# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
# uname -a
Linux sh-101-37.int 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
```

> 5. Could you attach your slurm.conf, cgroup.conf, the slurmd.log messages in
> the timeframe while a test job is executed until right after it finishes (to
> catch up any cgroup related relevant info), a job submission example and the
> output of 'lscgroup | grep slurm' while the job is being executed and again
> when it finishes?

Sure, here they are (next post). To make sure everything was fresh, I rebooted the node right before submitting the job and taking the logs.
And it allowed me to narrow it down a little: I think it only happens when a multi-node job is executed (it doesn't seem to happen with single-node jobs).

Job submission:

```
$ srun -N 2 -n 2 sleep 10
```

While the job is running:

```
[root@sh-101-37 ~]# lscgroup | grep slurm
cpu,cpuacct:/slurm
cpu,cpuacct:/slurm/uid_215845
cpu,cpuacct:/slurm/uid_215845/job_312
cpu,cpuacct:/slurm/uid_215845/job_312/step_0
cpu,cpuacct:/slurm/uid_215845/job_312/step_0/task_0
cpu,cpuacct:/slurm/uid_215845/job_312/step_extern
cpu,cpuacct:/slurm/uid_215845/job_312/step_extern/task_0
freezer:/slurm
freezer:/slurm/uid_215845
freezer:/slurm/uid_215845/job_312
freezer:/slurm/uid_215845/job_312/step_0
freezer:/slurm/uid_215845/job_312/step_extern
cpuset:/slurm
cpuset:/slurm/uid_215845
cpuset:/slurm/uid_215845/job_312
cpuset:/slurm/uid_215845/job_312/step_0
cpuset:/slurm/uid_215845/job_312/step_extern
memory:/slurm
memory:/slurm/uid_215845
memory:/slurm/uid_215845/job_312
memory:/slurm/uid_215845/job_312/step_0
memory:/slurm/uid_215845/job_312/step_0/task_0
memory:/slurm/uid_215845/job_312/step_extern
memory:/slurm/uid_215845/job_312/step_extern/task_0
memory:/slurm/system
devices:/slurm
devices:/slurm/uid_215845
devices:/slurm/uid_215845/job_312
devices:/slurm/uid_215845/job_312/step_0
devices:/slurm/uid_215845/job_312/step_extern
```

After the job is done:

```
[root@sh-101-37 ~]# lscgroup | grep slurm
cpu,cpuacct:/slurm
freezer:/slurm
cpuset:/slurm
cpuset:/slurm/uid_215845
cpuset:/slurm/uid_215845/job_312
cpuset:/slurm/uid_215845/job_312/step_0
cpuset:/slurm/uid_215845/job_312/step_extern
memory:/slurm
memory:/slurm/system
devices:/slurm
devices:/slurm/uid_215845
devices:/slurm/uid_215845/job_312
devices:/slurm/uid_215845/job_312/step_0
devices:/slurm/uid_215845/job_312/step_extern
```

I did it several times; it's 100% reproducible. On the other hand, when submitting a job with -N 1, all the cgroups are correctly cleaned up.
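A quick way to spot leftovers like the ones above is to scan each controller's Slurm hierarchy for surviving `job_*` directories once a node is idle. This is a hypothetical helper, not part of Slurm; the `/sys/fs/cgroup` mount point and the `slurm` base directory match the defaults shown in this report but may differ per site:

```shell
# Sketch: list job_* cgroup directories that survived job exit under a
# cgroup v1 mount point. The controller set and the 'slurm' base directory
# follow the layout seen in this report; both are site-dependent.
find_leftover_cgroups() {
    cgroot="${1:-/sys/fs/cgroup}"
    for ctrl in cpuset devices freezer memory cpu,cpuacct; do
        base="$cgroot/$ctrl/slurm"
        [ -d "$base" ] || continue
        # Any surviving job_<id> directory is a cgroup that was never removed.
        find "$base" -type d -name 'job_*'
    done
}
```

With an output like the one above, this would print the stale `cpuset:/slurm/uid_215845/job_312` and `devices:/.../job_312` entries while the other controllers stay clean.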
> In the meantime, I'm gonna see if there's any difference in how the code
> cleans up cpuset subsystem vs how the code does so for the other subsystems.

Thanks!
Kilian

---

Created attachment 4228 [details]
cgroup.conf
Created attachment 4229 [details]
slurm.conf
Created attachment 4230 [details]
slurmd.log
Kilian, I'm able to reproduce now with these conditions:

1. Multi-node jobs (thanks for narrowing that down) _and_
2. PrologFlags=contain

We don't need more info from your side for now. Thanks for your collaboration; we're gonna investigate this further and come back to you.

---

(In reply to Alejandro Sanchez from comment #7)
> Kilian, I'm able to reproduce now with these conditions:
>
> 1. Multi-node jobs (thanks for narrowing that down) _and_
> 2. PrologFlags=contain
>
> We don't need more info from your side for now.

Excellent, thanks! And I forgot to mention, but I'm sure you noticed: there is some leftover in the devices subsystem too.

Cheers,
Kilian

---

(In reply to Kilian Cavalotti from comment #8)
> And I forgot to mention but I'm sure you noticed, there is some leftover in
> the devices subsystem too.

Yes, I did. This is probably related to how these two subsystems are cleaned up, since their removal code was introduced together in the commit mentioned above.

---

Kilian, just to update you: in the tests I did locally, proctrack/cgroup is also necessary to reproduce, and it seems ProctrackType/TaskPlugin race/conflict with each other when cleaning up cgroup resources. We're still looking at ways to address this issue. Thanks.

---

(In reply to Alejandro Sanchez from comment #14)
> Kilian, just to update you: in the tests I did locally, proctrack/cgroup is
> also necessary to reproduce, and it seems ProctrackType/TaskPlugin
> race/conflict with each other when cleaning up cgroup resources. We're still
> looking at ways to address this issue. Thanks.

Thanks for the update, much appreciated.

---

Kilian, I want to update you that I prepared a patch which seems to solve this issue after some testing. It is pending review by some team members so that it does not introduce any undesired side effects. I'll let you know as soon as we have something more solid. Thanks.
On 04/13/2017 09:24 AM, bugs@schedmd.com wrote:
> Kilian, I want to update you that I prepared a patch which seems to solve this
> issue after some testing. It is pending review by some team members so that it
> does not introduce any undesired side effects. I'll let you know as soon as we
> have something more solid. Thanks.

Thanks for the update! Cheers,

---

Kilian, this has now been fixed in the following commit, which will be available in the next 17.02.3 version when released: https://github.com/SchedMD/slurm/commit/24e2cb07e8e3

Adding Doug from NERSC to CC since he might be interested in this commit as well. Please let me know if any of you experience further cgroup cleanup issues after the patch. Otherwise I'd go ahead and mark this as resolved. Thanks!
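Until a node is running a release with the fix, the stale hierarchies left behind by the bug can be cleared by hand. cgroup v1 directories are removed with plain rmdir, which only succeeds on a cgroup with no tasks and no child cgroups, so removal has to run depth-first. A hedged sketch (not an official Slurm tool; the `/sys/fs/cgroup` path and `slurm` base directory are the defaults seen in this report):

```shell
# Sketch: depth-first removal of stale Slurm cgroups in the cpuset and
# devices controllers, the two left behind by this bug. rmdir on a cgroup
# directory only succeeds once it has no tasks and no children, hence
# -depth so step_* dirs go before their job_* parents.
cleanup_stale_job_cgroups() {
    cgroot="${1:-/sys/fs/cgroup}"
    for ctrl in cpuset devices; do
        base="$cgroot/$ctrl/slurm"
        [ -d "$base" ] || continue
        find "$base" -depth -type d \
            \( -name 'step_*' -o -name 'job_*' -o -name 'uid_*' \) \
            -exec rmdir {} \; 2>/dev/null
    done
}
```

Run this only on a node with no jobs from those users still active; rmdir simply fails (and is silenced here) on any cgroup that still contains tasks.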
Hi! Unless I missed something, I understand that Slurm now automatically cleans up cgroups when jobs exit, without needing to define any release agent in cgroup.conf. It seems to work fine for the most part, except for the cpuset subsystem, where cgroup directories seem to persist after jobs are done.

For instance, on compute node sh-101-37, I can see that those subsystems are defined:

```
[root@sh-101-37 ~]# find /sys/fs/cgroup/ -type d -name slurm
/sys/fs/cgroup/memory/slurm
/sys/fs/cgroup/devices/slurm
/sys/fs/cgroup/cpuset/slurm
/sys/fs/cgroup/freezer/slurm
```

All jobs are done:

```
[root@sh-101-37 ~]# squeue -w localhost
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
```

Yet I can still see job_XXX folders in the cpuset subsystem (but not in the others):

```
[root@sh-101-37 ~]# tree -L 1 /sys/fs/cgroup/*/slurm/uid*
/sys/fs/cgroup/cpuset/slurm/uid_215845
├── cgroup.clone_children
├── cgroup.event_control
├── cgroup.procs
├── cpuset.cpu_exclusive
├── cpuset.cpus
├── cpuset.mem_exclusive
├── cpuset.mem_hardwall
├── cpuset.memory_migrate
├── cpuset.memory_pressure
├── cpuset.memory_spread_page
├── cpuset.memory_spread_slab
├── cpuset.mems
├── cpuset.sched_load_balance
├── cpuset.sched_relax_domain_level
├── job_258
├── job_280
├── job_281
├── job_283
├── job_70
├── job_79
├── notify_on_release
└── tasks
```

Is this expected? Is there still some release agent to define for cpuset?

Thanks!
Kilian
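As background on why a stray child keeps its parent alive: a cgroup v1 directory is removed with an ordinary rmdir, and that call fails while any child cgroup (or attached task) remains, so cleanup must proceed bottom-up (step_* before job_* before uid_*). A minimal illustration of the ordering constraint, using plain directories, which share the must-be-empty rule (hypothetical names, for demonstration only):

```shell
# Illustration: rmdir fails on a directory (or cgroup) that still has a
# child, so removal has to go bottom-up: step_* first, then job_*.
demo=$(mktemp -d)
mkdir -p "$demo/job_312/step_0"

# Removing the parent first fails because step_0 still exists.
rmdir "$demo/job_312" 2>/dev/null && echo "removed" || echo "busy: child still present"

rmdir "$demo/job_312/step_0"   # remove the leaf first
rmdir "$demo/job_312"          # now the parent can go
rm -rf "$demo"
```

The first rmdir prints "busy: child still present"; only after the leaf is gone does the parent removal succeed, which matches the step-then-job removal order Slurm has to follow internally.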