| Summary: | /sys/fs/cgroup/system.slice/slurmstepd.scope | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Pablo Flores <pflores> |
| Component: | slurmstepd | Assignee: | Jacob Jenson <jacob> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | CC: | jcrandall |
| Version: | 24.11.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
|
Description
Pablo Flores
2025-03-27 09:52:49 MDT
> Any hints on how to solve this?
>
We have a similar issue, also using cgroups v2. We have not figured out how to stop it from happening, but the leftover cgroups can be cleaned up on our systems using `rmdir` (as root). Note, however, that `rm -rf` will _not_ work on cgroups, because the directories under `/sys/fs/cgroup` contain a number of files that can't be deleted (they don't really exist; they are just interfaces to kernel functionality).
There are usually several levels of nested cgroups underneath the job cgroup, and you need to `rmdir` the most deeply nested cgroup first and then work your way up.
Here is some output from a cleaning script we wrote to deal with this:
```
finding all cgroups for dead job '89993' under '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993/step_0/user/task_0'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993/step_0/user'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993/step_0'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993'
```
The following bash one-liner works for us (though in production we are using something a bit more robust). This needs to be run on the node where the cgroups for dead jobs are located and run as a user who can sudo to root in order to do the `rmdir`:
```
cgroup_jobs="$(for cgroup_path in $(find /sys/fs/cgroup/system.slice/slurmstepd.scope -maxdepth 1 -name job_\*); do jobid="$(echo "${cgroup_path}" | cut -f2 -d_)"; echo "${jobid}"; done)"; node_jobs="$(scontrol --json=v0.0.44 listjobs | jq -r '.jobs[] | .job_id')"; dead_jobs="$(comm -2 -3 <(echo "${cgroup_jobs}" | sort) <(echo "${node_jobs}" | sort))"; (while IFS= read -r dead_job ; do [ -n "${dead_job}" ] || continue; dead_job_cgroup_path="/sys/fs/cgroup/system.slice/slurmstepd.scope/job_${dead_job}"; echo "finding all cgroups for dead job '${dead_job}' under '${dead_job_cgroup_path}'"; dead_cgroup_paths="$(find "${dead_job_cgroup_path}" -type d | tac)"; (while IFS= read -r dead_cgroup_path ; do echo "removing cgroup '${dead_cgroup_path}'"; sudo rmdir "${dead_cgroup_path}"; done <<< "${dead_cgroup_paths}"); done <<< "${dead_jobs}")
```
What it does is:
- uses `find` to find all of the `job_*` cgroups under `/sys/fs/cgroup/system.slice/slurmstepd.scope` and extract the job ids from them
- uses `scontrol` to list the jobs currently running on the node
- uses `sort` and `comm` to get the list of cgroup jobs that are not also jobs currently running on the node (i.e. dead jobs)
- uses `find` on each dead cgroup directory to find all of the directories under it, then `tac` to take them in reverse order (i.e. deepest first)
- uses `rmdir` to remove each of the cgroups
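The steps above can also be written as small functions, which makes them easier to try out on a scratch directory before touching `/sys/fs/cgroup`. This is only a sketch under stated assumptions: the function names are made up for illustration (they are not part of Slurm), the cgroup root is passed in as an argument, and the ids of the jobs still running on the node are passed as arguments (for example, the output of the `scontrol ... | jq` pipeline in the one-liner).

```sh
# Sketch only; function names are illustrative and not part of Slurm.

# Print the job id of every job_* cgroup directly under the directory $1.
list_cgroup_jobs() {
    for dir in "$1"/job_*; do
        [ -d "$dir" ] || continue      # also skips the unexpanded glob
        printf '%s\n' "${dir##*/job_}"
    done
}

# Read job ids on stdin and print those that are not among the arguments
# (the arguments are the ids of jobs still running on the node).
filter_dead_jobs() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue       # ignore empty lines
        dead=yes
        for live in "$@"; do
            if [ "$id" = "$live" ]; then dead=no; fi
        done
        if [ "$dead" = yes ]; then printf '%s\n' "$id"; fi
    done
}

# Remove one job cgroup tree. -depth makes find list children before their
# parent, so each rmdir only ever sees a cgroup without child cgroups.
remove_job_cgroup() {
    find "$1" -depth -type d -exec rmdir {} \;
}
```

Run as root on the node, the pieces would be chained as `list_cgroup_jobs ROOT | filter_dead_jobs ID...` and each printed id passed to `remove_job_cgroup ROOT/job_ID`, where `ROOT` is the `slurmstepd.scope` path. Unlike the one-liner, an empty list of dead jobs produces no `rmdir` attempts at all.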
If `rmdir` fails, check the contents of the most deeply nested cgroup directory: most likely there are still processes running in that cgroup (their PIDs are listed in its `cgroup.procs` file).
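To see what is keeping a cgroup busy, the `cgroup.procs` file of each cgroup in the job's tree can be inspected. The following is a minimal sketch (the function name is illustrative, not an existing tool): it walks a job cgroup tree and prints every cgroup that still lists PIDs.

```sh
# Sketch only; show_busy_cgroups is an illustrative name.
# For every cgroup under $1 that still contains processes, print
# "<cgroup directory>: <pid> <pid> ...".
show_busy_cgroups() {
    find "$1" -name cgroup.procs | while IFS= read -r procs_file; do
        pids="$(tr '\n' ' ' < "$procs_file")"
        pids="${pids% }"               # drop the trailing space
        if [ -n "$pids" ]; then
            printf '%s: %s\n' "${procs_file%/cgroup.procs}" "$pids"
        fi
    done
}
```

Each reported PID can then be examined with `ps -o pid,stat,cmd -p <pid>`. A cgroup with an empty `cgroup.procs` that still has child cgroups is also refused by `rmdir`, so the children have to be removed first.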
Dear all,
We (NLHPC) noticed this bug more than a year ago, I believe, while running jobs with Slurm 24.11.6.
The main problem it causes us is the memory held by these “zombie jobs”, which the cgroups retain. If a running job uses all the RAM available to Slurm while cgroups still hold memory for many completed jobs, the jobs fail and, in some cases, the compute nodes themselves fail as well.
Evidence:
[root@mn002 ~]# squeue -w mn002
JOBID    PARTITION  NAME      USER      ST  TIME        NODES  NODELIST(REASON)
5357536  main       pi        root      R   16:29       1      mn002
5064595  main       NPG_Graf  k*****e   R   2-19:06:36  2      mn[002,017]
5064596  main       NPG_grap  k******e  R   2-19:06:36  2      mn[002,017]
BAD:
[root@mn002 ~]# systemd-cgtop | grep slurm
system.slice/slurmd.service                 12    -  177.4M  -  -
system.slice/slurmstepd.scope               1856  -  129.3G  -  -
system.slice/slurmstepd.scope/job_1179424   -     -  166.6M  -  -
system.slice/slurmstepd.scope/job_1179426   -     -  169.0M  -  -
system.slice/slurmstepd.scope/job_1179427   -     -  168.4M  -  -
system.slice/slurmstepd.scope/job_1189620   -     -  12.7M   -  -
system.slice/slurmstepd.scope/job_1191573   -     -  12.8M   -  -
system.slice/slurmstepd.scope/job_1223950   -     -  49.0M   -  -
system.slice/slurmstepd.scope/job_1245792   -     -  68.7M   -  -
system.slice/slurmstepd.scope/job_1387385   -     -  29.2M   -  -
system.slice/slurmstepd.scope/job_1387726   -     -  18.5M   -  -
system.slice/slurmstepd.scope/job_1387806   -     -  18.3M   -  -
system.slice/slurmstepd.scope/job_1390635   -     -  18.0M   -  -
system.slice/slurmstepd.scope/job_1390695   -     -  25.3M   -  -
system.slice/slurmstepd.scope/job_1390747   -     -  14.7M   -  -
system.slice/slurmstepd.scope/job_1401254   -     -  10.8M   -  -
system.slice/slurmstepd.scope/job_1402093   -     -  15.4M   -  -
system.slice/slurmstepd.scope/job_1402397   -     -  15.2M   -  -
system.slice/slurmstepd.scope/job_1423977   -     -  44.3M   -  -
system.slice/slurmstepd.scope/job_1424128   -     -  45.5M   -  -
system.slice/slurmstepd.scope/job_1429921   -     -  12.6M   -  -
system.slice/slurmstepd.scope/job_1429927   -     -  12.8M   -  -
system.slice/slurmstepd.scope/job_1429928   -     -  12.9M   -  -
system.slice/slurmstepd.scope/job_1430316   -     -  12.3M   -  -
system.slice/slurmstepd.scope/job_1430317   -     -  12.3M   -  -
system.slice/slurmstepd.scope/job_1430319   -     -  12.3M   -  -
system.slice/slurmstepd.scope/job_1473145   -     -  4.9M    -  -
system.slice/slurmstepd.scope/job_1497795   -     -  4.9M    -  -
system.slice/slurmstepd.scope/job_170931    -     -  10.3M   -  -
system.slice/slurmstepd.scope/job_1760182   -     -  151.7M  -  -
system.slice/slurmstepd.scope/job_1879121   -     -  24.8M   -  -
system.slice/slurmstepd.scope/job_1879889   -     -  37.7M   -  -
system.slice/slurmstepd.scope/job_2059199   -     -  4.9M    -  -
system.slice/slurmstepd.scope/job_2060449   -     -  4.9M    -  -
system.slice/slurmstepd.scope/job_2073587   -     -  24.6M   -  -
system.slice/slurmstepd.scope/job_2103666   -     -  7.3G    -  -
system.slice/slurmstepd.scope/job_2161401   -     -  118.8M  -  -
system.slice/slurmstepd.scope/job_2193711   -     -  28.0M   -  -
system.slice/slurmstepd.scope/job_2201407   -     -  15.6M   -  -
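The per-job memory figures above can also be read straight from the cgroup filesystem, which is easier to script than `systemd-cgtop`. A minimal sketch, assuming the memory controller is enabled for the job cgroups so that each `job_*` directory exposes a `memory.current` file (bytes used by the cgroup and its descendants); the function name is illustrative:

```sh
# Sketch only; report_job_memory is an illustrative name.
# Print "<bytes> <job directory name>" for each job_* cgroup under $1,
# largest first. Directories without a readable memory.current are skipped.
report_job_memory() {
    for dir in "$1"/job_*; do
        [ -r "$dir/memory.current" ] || continue
        printf '%s %s\n' "$(cat "$dir/memory.current")" "${dir##*/}"
    done | sort -rn
}
```

On a node this would be called as `report_job_memory /sys/fs/cgroup/system.slice/slurmstepd.scope`; comparing the listed job ids with the `squeue -w` output shows which entries belong to jobs that have already finished.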
N
Regards
On Fri, 8 May 2026 at 8:04 p.m., <bugs@schedmd.com> wrote:
> Joshua C. Randall <jcrandall@alum.mit.edu> changed ticket 22449
> <https://support.schedmd.com/show_bug.cgi?id=22449>
> What Removed Added
> CC jcrandall@alum.mit.edu
>
> Comment #1 <https://support.schedmd.com/show_bug.cgi?id=22449#c1> on
> ticket 22449 <https://support.schedmd.com/show_bug.cgi?id=22449> from
> Joshua C. Randall <jcrandall@alum.mit.edu>
Hi,

With the script you sent, I still get the same error when trying to delete the directory. Could you share the version you are using in production?

rmdir: failed to remove '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_5482424/step_batch/user/task_0': Device or resource busy

Regards,
Pablo Flores

--
NLHPC Support Team | soporte@nlhpc.cl
National Laboratory for High Performance Computing (NLHPC) | www.nlhpc.cl
Centro de Modelamiento Matemático (CMM)
Facultad de Ciencias Físicas y Matemáticas, Universidad de Chile
Beauchef 851, 6th floor
Office phone: +56 2 29784603
|