cgroup is not completely cleaning up the files of finished jobs. In the directory /sys/fs/cgroup/system.slice/slurmstepd.scope, records of jobs that have already finished can be observed.

    # ls
                      job_45304707  job_46043148  job_46236561  job_46540674  job_46548677  job_46553239  job_46705405
    cgroup.events     job_45307219  job_46043404  job_46239503  job_46540731  job_46548694  job_46553431  job_46705417
    cgroup.freeze     job_45307228  job_46044131  job_46339702  job_46541132  job_46548988  job_46553474  job_46761341
    cgroup.kill       job_45307237  job_46046151  job_46390017  job_46541170  job_46549130  job_46554100  job_46761501
    cgroup.max.depth  job_45311310  job_46075084  job_46391601  job_46541227  job_46549274  job_46564361  job_46830705

These records can also be observed by running the following command (the output is not complete):

    [root@sn013 slurmstepd.scope]# systemd-cgtop | grep slurmstep
    system.slice/slurmstepd.scope 146 - 36.6G - -
    system.slice/slurmstepd.scope/job_45290719 - - 468.0K - -
    system.slice/slurmstepd.scope/job_45299168 - - 236.0K - -
    system.slice/slurmstepd.scope/job_45304402 - - 460.0K - -
    system.slice/slurmstepd.scope/job_45304696 - - 460.0K - -
    system.slice/slurmstepd.scope/job_45304707 - - 460.0K - -
    system.slice/slurmstepd.scope/job_45307219 - - 460.0K - -
    system.slice/slurmstepd.scope/job_45307228 - - 468.0K - -
    system.slice/slurmstepd.scope/job_45307237 - - 460.0K - -

The processes of the finished jobs are no longer running. To verify this, we ran the following command:

    [root@sn013 slurmstepd.scope]# ps aux | grep slurmstep
    root 1800 0.0 0.0 6868 2816 ? S Feb26 0:00 /usr/sbin/slurmstepd infinity
    root 1899750 0.0 0.0 617208 7392 ? Sl 10:41 0:00 slurmstepd: [46918979.extern]
    root 1899768 0.0 0.0 945604 7744 ? Sl 10:41 0:00 slurmstepd: [46918979.batch]
    root 1900888 0.0 0.0 1163752 15632 ? Sl 10:41 0:03 slurmstepd: [46918979.0]
    root 2048685 0.0 0.0 6412 2112 pts/0 S+ 12:31 0:00 grep --color=auto slurmstep
    [root@sn013 slurmstepd.scope]#

However, cgroup v2 is unable to clean up these records, and this list can continue to grow over time, consuming RAM, as observed in systemd-cgtop. The only way we found to remove them is by stopping the slurmstepd daemon, but this would cancel all tasks on the node.

Any hints on how to solve this?

    [root@sn013 slurmstepd.scope]# systemd-cgtop | grep slurmstep | wc -l
    219
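One way to tell leftover job cgroups from live ones without going through `ps` is the `populated` field in each cgroup's `cgroup.events` file, which cgroup v2 reports as 0 when neither the cgroup nor any of its descendants contains live processes. A minimal sketch, assuming bash; the function name `list_unpopulated_job_cgroups` is hypothetical and the default path is the one from this report:

```shell
#!/usr/bin/env bash
# Print every job_* cgroup under a scope directory whose cgroup.events
# contains the line "populated 0" (no live processes in it or below it).
# NOTE: the function name is illustrative, not part of Slurm.
list_unpopulated_job_cgroups() {
    local scope_dir="${1:-/sys/fs/cgroup/system.slice/slurmstepd.scope}"
    local job_dir
    for job_dir in "${scope_dir}"/job_*; do
        # Skip the literal pattern when the glob matched nothing.
        [ -d "${job_dir}" ] || continue
        [ -r "${job_dir}/cgroup.events" ] || continue
        # -x matches the whole line, so "populated 1" is never a hit.
        if grep -qx 'populated 0' "${job_dir}/cgroup.events"; then
            printf '%s\n' "${job_dir}"
        fi
    done
}
```

Calling `list_unpopulated_job_cgroups | wc -l` on an affected node should give a count comparable to the `systemd-cgtop` listing above, minus the jobs that are still running.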
> Any hints on how to solve this?

We have a similar issue, also using cgroup v2. We have not figured out how to stop it from happening, but the leftover cgroups can be cleaned up on our systems with `rmdir` (as root). Note, however, that `rm -rf` will _not_ work on cgroups, because the cgroup filesystem directories contain a number of files that can't be deleted (and that don't really exist anyway, as they are just interfaces to kernel functionality).

There are usually several levels of nested cgroups underneath the job cgroup, and you need to `rmdir` the most deeply nested cgroup first and then work your way up.

Here is some output from a cleaning script we wrote to deal with this:
```
finding all cgroups for dead job '89993' under '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993/step_0/user/task_0'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993/step_0/user'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993/step_0'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993'
```

The following bash one-liner works for us (though in production we are using something a bit more robust). It needs to be run on the node where the cgroups for dead jobs are located, as a user who can sudo to root in order to do the `rmdir`:
```
cgroup_jobs="$(for cgroup_path in $(find /sys/fs/cgroup/system.slice/slurmstepd.scope -maxdepth 1 -name job_\*); do jobid="$(echo "${cgroup_path}" | cut -f2 -d_)"; echo "${jobid}"; done)"; node_jobs="$(scontrol --json=v0.0.44 listjobs | jq -r '.jobs[] | .job_id')"; dead_jobs="$(comm -2 -3 <(echo "${cgroup_jobs}" | sort) <(echo "${node_jobs}" | sort))"; (while IFS= read -r dead_job ; do dead_job_cgroup_path="/sys/fs/cgroup/system.slice/slurmstepd.scope/job_${dead_job}"; echo "finding all cgroups for dead job '${dead_job}' under '${dead_job_cgroup_path}'"; dead_cgroup_paths="$(find "${dead_job_cgroup_path}" -type d | tac)"; (while IFS= read -r dead_cgroup_path ; do echo "removing cgroup '${dead_cgroup_path}'"; sudo rmdir "${dead_cgroup_path}"; done <<< "${dead_cgroup_paths}"); done <<< "${dead_jobs}")
```

What it does is:
- uses `find` to find all of the `job_*` cgroups under `/sys/fs/cgroup/system.slice/slurmstepd.scope` and extract the job ids from them
- uses `scontrol` to list the jobs currently running on the node
- uses `sort` and `comm` to get the list of cgroup jobs that are not also jobs currently running on the node (i.e. dead jobs)
- uses `find` on each dead cgroup directory to find all of the directories under it, then `tac` to take them in reverse order (i.e. deepest first)
- uses `rmdir` to remove each of the cgroups

If `rmdir` fails, check the contents of the most deeply nested cgroup directory: there are likely still processes running in that cgroup.
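The deepest-first removal step can also be written with `find -depth`, which visits a directory's children before the directory itself. The sketch below is not the production script mentioned above; it is a minimal illustration, assuming bash and GNU `find`, with a hypothetical function name `remove_job_cgroups`. It also skips empty job ids: when the dead-job list is empty, a `<<<` here-string still feeds the `while read` loop one empty line, so the one-liner would try to process the bare path `.../job_`.

```shell
#!/usr/bin/env bash
# Remove the cgroup tree of each given job id, deepest directory first.
# Usage: remove_job_cgroups SCOPE_DIR JOB_ID...
# NOTE: the function name and argument order are illustrative assumptions.
remove_job_cgroups() {
    local scope_dir="$1"
    shift
    local job_id job_dir
    for job_id in "$@"; do
        # Skip empty ids so the bare path ".../job_" is never built.
        [ -n "${job_id}" ] || continue
        job_dir="${scope_dir}/job_${job_id}"
        [ -d "${job_dir}" ] || continue
        # -depth lists children before their parent, the order rmdir
        # needs. A cgroup that cannot be removed is left in place and
        # rmdir prints the reason.
        find "${job_dir}" -depth -type d -exec rmdir -- {} \;
    done
}
```

On a real node this would be run as root with the job ids that `scontrol listjobs` no longer reports, for example `remove_job_cgroups /sys/fs/cgroup/system.slice/slurmstepd.scope 89993`.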
Dear all,

We (NLHPC) noticed this bug more than a year ago, I believe, while running jobs with Slurm 24.11.6.

The main problem it causes us is the memory held by these “zombie jobs” that cgroups retains. If a running job uses all the RAM available to Slurm while cgroups still keeps many completed jobs in memory as zombies, the jobs fail and, in some cases, the compute nodes themselves fail as well.

Evidence:

    [root@mn002 ~]# squeue -w mn002
    JOBID    PARTITION  NAME      USER      ST  TIME        NODES  NODELIST(REASON)
    5357536  main       pi        root      R   16:29       1      mn002
    5064595  main       NPG_Graf  k*****e   R   2-19:06:36  2      mn[002,017]
    5064596  main       NPG_grap  k******e  R   2-19:06:36  2      mn[002,017]

BAD:

    [root@mn002 ~]# systemd-cgtop | grep slurm
    system.slice/slurmd.service 12 - 177.4M - -
    system.slice/slurmstepd.scope 1856 - 129.3G - -
    system.slice/slurmstepd.scope/job_1179424 - - 166.6M - -
    system.slice/slurmstepd.scope/job_1179426 - - 169.0M - -
    system.slice/slurmstepd.scope/job_1179427 - - 168.4M - -
    system.slice/slurmstepd.scope/job_1189620 - - 12.7M - -
    system.slice/slurmstepd.scope/job_1191573 - - 12.8M - -
    system.slice/slurmstepd.scope/job_1223950 - - 49.0M - -
    system.slice/slurmstepd.scope/job_1245792 - - 68.7M - -
    system.slice/slurmstepd.scope/job_1387385 - - 29.2M - -
    system.slice/slurmstepd.scope/job_1387726 - - 18.5M - -
    system.slice/slurmstepd.scope/job_1387806 - - 18.3M - -
    system.slice/slurmstepd.scope/job_1390635 - - 18.0M - -
    system.slice/slurmstepd.scope/job_1390695 - - 25.3M - -
    system.slice/slurmstepd.scope/job_1390747 - - 14.7M - -
    system.slice/slurmstepd.scope/job_1401254 - - 10.8M - -
    system.slice/slurmstepd.scope/job_1402093 - - 15.4M - -
    system.slice/slurmstepd.scope/job_1402397 - - 15.2M - -
    system.slice/slurmstepd.scope/job_1423977 - - 44.3M - -
    system.slice/slurmstepd.scope/job_1424128 - - 45.5M - -
    system.slice/slurmstepd.scope/job_1429921 - - 12.6M - -
    system.slice/slurmstepd.scope/job_1429927 - - 12.8M - -
    system.slice/slurmstepd.scope/job_1429928 - - 12.9M - -
    system.slice/slurmstepd.scope/job_1430316 - - 12.3M - -
    system.slice/slurmstepd.scope/job_1430317 - - 12.3M - -
    system.slice/slurmstepd.scope/job_1430319 - - 12.3M - -
    system.slice/slurmstepd.scope/job_1473145 - - 4.9M - -
    system.slice/slurmstepd.scope/job_1497795 - - 4.9M - -
    system.slice/slurmstepd.scope/job_170931 - - 10.3M - -
    system.slice/slurmstepd.scope/job_1760182 - - 151.7M - -
    system.slice/slurmstepd.scope/job_1879121 - - 24.8M - -
    system.slice/slurmstepd.scope/job_1879889 - - 37.7M - -
    system.slice/slurmstepd.scope/job_2059199 - - 4.9M - -
    system.slice/slurmstepd.scope/job_2060449 - - 4.9M - -
    system.slice/slurmstepd.scope/job_2073587 - - 24.6M - -
    system.slice/slurmstepd.scope/job_2103666 - - 7.3G - -
    system.slice/slurmstepd.scope/job_2161401 - - 118.8M - -
    system.slice/slurmstepd.scope/job_2193711 - - 28.0M - -
    system.slice/slurmstepd.scope/job_2201407 - - 15.6M - -

Regards

On Fri, 8 May 2026 at 8:04 p.m., <bugs@schedmd.com> wrote:
> Joshua C. Randall <jcrandall@alum.mit.edu> changed ticket 22449 <https://support.schedmd.com/show_bug.cgi?id=22449>
> CC added: jcrandall@alum.mit.edu
>
> Comment #1 <https://support.schedmd.com/show_bug.cgi?id=22449#c1> on ticket 22449 from Joshua C. Randall <jcrandall@alum.mit.edu>
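To put a number on the memory these leftover cgroups account for, the `memory.current` file of each job cgroup can be summed; in cgroup v2 it holds the bytes currently charged to the cgroup and its descendants, and it is present when the memory controller is enabled for that cgroup. A rough sketch, assuming bash; `report_job_cgroup_memory` is a hypothetical helper name and the default path is the one used in this ticket:

```shell
#!/usr/bin/env bash
# Print "<bytes><TAB><job dir name>" for each job_* cgroup that exposes
# memory.current, followed by a final "<sum><TAB>total" line.
# NOTE: the function name is illustrative, not part of Slurm.
report_job_cgroup_memory() {
    local scope_dir="${1:-/sys/fs/cgroup/system.slice/slurmstepd.scope}"
    local job_dir bytes total=0
    for job_dir in "${scope_dir}"/job_*; do
        [ -r "${job_dir}/memory.current" ] || continue
        # memory.current holds a single integer (bytes).
        read -r bytes < "${job_dir}/memory.current" || continue
        [ -n "${bytes}" ] || continue
        printf '%s\t%s\n' "${bytes}" "${job_dir##*/}"
        total=$((total + bytes))
    done
    printf '%s\ttotal\n' "${total}"
}
```

Piping the output through `sort -n` puts the largest consumers last, which makes entries such as the 7.3G one in the listing above easy to spot.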
Hi,

With the script you sent, I am still getting the same error message when trying to delete the directory. Could you share the patch you are using in production?

    rmdir: failed to remove '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_5482424/step_batch/user/task_0': Device or resource busy

Regards,
Pablo Flores

On Mon, 11 May 2026 at 10:21 a.m., Pablo Flores Aravena (pflores@nlhpc.cl) wrote:
> Dear all,
> [...]
> --
> NLHPC Support Team | soporte@nlhpc.cl
> National Laboratory for High Performance Computing (NLHPC) | www.nlhpc.cl
> Centro de Modelamiento Matemático (CMM)
> Facultad de Ciencias Físicas y Matemáticas, Universidad de Chile
> Beauchef 851, 6º Piso
> Office phone: +56 2 29784603
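`Device or resource busy` is what `rmdir` is expected to report on a cgroup v2 directory that still has live processes or child cgroups. A small diagnostic sketch for a directory that refuses to go away, assuming bash and GNU `find`; the function name `explain_busy_cgroup` is hypothetical:

```shell
#!/usr/bin/env bash
# Show why a cgroup directory might be busy: its "populated" state,
# the PIDs listed in cgroup.procs, and any child cgroups.
# NOTE: the function name is illustrative, not part of Slurm.
explain_busy_cgroup() {
    local cg="$1"
    local pid name
    echo "cgroup: ${cg}"
    if [ -r "${cg}/cgroup.events" ]; then
        # "populated 1" means live processes exist here or below.
        grep '^populated ' "${cg}/cgroup.events"
    fi
    if [ -r "${cg}/cgroup.procs" ]; then
        while IFS= read -r pid; do
            [ -n "${pid}" ] || continue
            # Resolve the command name; fall back to "?" if ps fails.
            name="$(ps -o comm= -p "${pid}" 2>/dev/null)" || name=''
            printf 'pid %s: %s\n' "${pid}" "${name:-?}"
        done < "${cg}/cgroup.procs"
    fi
    # Child cgroups must be removed before their parent.
    find "${cg}" -mindepth 1 -maxdepth 1 -type d -printf 'child cgroup: %f\n'
}
```

For the failing path above, that would be `explain_busy_cgroup /sys/fs/cgroup/system.slice/slurmstepd.scope/job_5482424/step_batch/user/task_0`. Any PID it prints is a process that is still attached to the cgroup.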