Ticket 22449

Summary:	/sys/fs/cgroup/system.slice/slurmstepd.scope
Product:	Slurm	Reporter:	Pablo Flores <pflores>
Component:	slurmstepd	Assignee:	Jacob Jenson <jacob>
Status:	OPEN ---	QA Contact:
Severity:	6 - No support contract
Priority:	---	CC:	jcrandall
Version:	24.11.3
Hardware:	Linux
OS:	Linux
Site:	-Other-	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Pablo Flores 2025-03-27 09:52:49 MDT

The cgroups of finished jobs are not being completely cleaned up.

In the directory /sys/fs/cgroup/system.slice/slurmstepd.scope, records of jobs that have already finished can be observed.


#ls
job_45304707  job_46043148  job_46236561  job_46540674  job_46548677  job_46553239  job_46705405  
cgroup.events                    job_45307219  job_46043404  job_46239503  job_46540731  job_46548694  job_46553431  job_46705417  
cgroup.freeze                    job_45307228  job_46044131  job_46339702  job_46541132  job_46548988  job_46553474  job_46761341  
cgroup.kill                      job_45307237  job_46046151  job_46390017  job_46541170  job_46549130  job_46554100  job_46761501  
cgroup.max.depth                 job_45311310  job_46075084  job_46391601  job_46541227  job_46549274  job_46564361  job_46830705  

These records can also be observed by running the following command (the output is not complete):

[root@sn013 slurmstepd.scope]# systemd-cgtop | grep slurmstep
system.slice/slurmstepd.scope                                            146      -    36.6G        -        -  
system.slice/slurmstepd.scope/job_45290719                                 -      -   468.0K        -        -  
system.slice/slurmstepd.scope/job_45299168                                 -      -   236.0K        -        -  
system.slice/slurmstepd.scope/job_45304402                                 -      -   460.0K        -        -  
system.slice/slurmstepd.scope/job_45304696                                 -      -   460.0K        -        -  
system.slice/slurmstepd.scope/job_45304707                                 -      -   460.0K        -        -  
system.slice/slurmstepd.scope/job_45307219                                 -      -   460.0K        -        -  
system.slice/slurmstepd.scope/job_45307228                                 -      -   468.0K        -        -  
system.slice/slurmstepd.scope/job_45307237                                 -      -   460.0K        -        -  


The processes of the finished jobs are no longer running. To verify this, we ran the following command:

[root@sn013 slurmstepd.scope]# ps aux | grep slurmstep
root        1800  0.0  0.0   6868  2816 ?        S    Feb26   0:00 /usr/sbin/slurmstepd infinity  
root     1899750  0.0  0.0 617208  7392 ?        Sl   10:41   0:00 slurmstepd: [46918979.extern]  
root     1899768  0.0  0.0 945604  7744 ?        Sl   10:41   0:00 slurmstepd: [46918979.batch]  
root     1900888  0.0  0.0 1163752 15632 ?       Sl   10:41   0:03 slurmstepd: [46918979.0]  
root     2048685  0.0  0.0   6412  2112 pts/0    S+   12:31   0:00 grep --color=auto slurmstep  


However, these records are not cleaned up under cgroup v2, and the list keeps growing over time, consuming RAM, as observed in systemd-cgtop.

The only way we found to remove them is by stopping the slurmstepd daemon, but this would cancel all tasks on the node.

Any hints on how to solve this?

[root@sn013 slurmstepd.scope]# systemd-cgtop | grep slurmstep | wc -l
219
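
To put numbers on the growth without `systemd-cgtop`, something like the following could be used. It is only a sketch: it assumes the memory controller is enabled so that each `job_*` cgroup exposes the cgroup v2 file `memory.current`, and the function name `job_cgroup_memory` is made up for illustration. It reports every `job_*` cgroup, including those of jobs that are still running.

```shell
# Hypothetical helper: print "job_<id> <bytes>" for each job_* cgroup under $1,
# followed by a "total <bytes>" line, based on cgroup v2 memory.current.
job_cgroup_memory() (
    scope="$1"
    total=0
    for dir in "$scope"/job_*; do
        # Skip anything without a readable memory.current (including an unmatched glob).
        [ -r "$dir/memory.current" ] || continue
        bytes="$(cat "$dir/memory.current")"
        printf '%s %s\n' "${dir##*/}" "$bytes"
        total=$((total + bytes))
    done
    printf 'total %s\n' "$total"
)

# Usage on a compute node:
#   job_cgroup_memory /sys/fs/cgroup/system.slice/slurmstepd.scope
```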

Comment 1 Joshua C. Randall 2026-05-08 18:04:13 MDT

> Any hints on how to solve this?
> 
We have a similar issue, also using cgroups v2. We have not figured out how to stop it from happening, but the leftover cgroups can be cleaned up on our systems with `rmdir` (as root). Note, however, that `rm -rf` will _not_ work on cgroups, because the directories under `/sys/fs/cgroup` contain a number of files that can't be deleted (and that don't really exist on disk, as they are just interfaces to kernel functionality).

There are usually several levels of nested cgroups underneath the job cgroup, and you need to `rmdir` the most deeply nested cgroup first and then work your way up. 
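
A minimal sketch of that deepest-first order (not the script we run in production): `find -depth` visits a directory's children before the directory itself, so every `rmdir` is attempted on a cgroup whose children have already been removed. The function name `remove_cgroup_tree` is made up for illustration.

```shell
# Hypothetical helper: remove a cgroup subtree bottom-up.
# $1 = top-level cgroup directory, e.g. .../slurmstepd.scope/job_<id>
remove_cgroup_tree() {
    [ -d "$1" ] || return 0
    # -depth: children are handled before their parent, so leaves go first.
    # rmdir, not rm -rf: only the directories themselves can be removed.
    find "$1" -depth -type d -exec rmdir -- {} \;
    # Succeed only if the whole subtree is gone.
    [ ! -e "$1" ]
}

# Usage (as root), for a job that is known to be finished:
#   remove_cgroup_tree /sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993
```

The function returns non-zero if any directory could not be removed, which makes it easy to spot cgroups that are still busy.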

Here is some output from a cleaning script we wrote to deal with this:
```
finding all cgroups for dead job '89993' under '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993/step_0/user/task_0'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993/step_0/user'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993/step_0'
removing cgroup '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993'
```

The following bash one-liner works for us (though in production we are using something a bit more robust). This needs to be run on the node where the cgroups for dead jobs are located and run as a user who can sudo to root in order to do the `rmdir`:
```
cgroup_jobs="$(for cgroup_path in $(find /sys/fs/cgroup/system.slice/slurmstepd.scope -maxdepth 1 -name job_\*); do jobid="$(echo "${cgroup_path}" | cut -f2 -d_)"; echo "${jobid}"; done)"; node_jobs="$(scontrol --json=v0.0.44 listjobs | jq -r '.jobs[] | .job_id')"; dead_jobs="$(comm -2 -3 <(echo "${cgroup_jobs}" | sort) <(echo "${node_jobs}" | sort))"; (while IFS= read -r dead_job ; do dead_job_cgroup_path="/sys/fs/cgroup/system.slice/slurmstepd.scope/job_${dead_job}"; echo "finding all cgroups for dead job '${dead_job}' under '${dead_job_cgroup_path}'"; dead_cgroup_paths="$(find "${dead_job_cgroup_path}" -type d | tac)"; (while IFS= read -r dead_cgroup_path ; do echo "removing cgroup '${dead_cgroup_path}'"; sudo rmdir "${dead_cgroup_path}"; done <<< "${dead_cgroup_paths}"); done <<< "${dead_jobs}")
```

What it does is:
 - uses `find` to find all of the `job_*` cgroups under `/sys/fs/cgroup/system.slice/slurmstepd.scope` and extract the job ids from them
 - uses `scontrol` to list the jobs currently running on the node
 - uses `sort` and `comm` to get the list of cgroup jobs that are not also jobs currently running on the node (i.e. dead jobs)
 - uses `find` on each dead cgroup directory to find all of the directories under it, then `tac` to take them in reverse order (i.e. deepest first)
 - uses `rmdir` to remove each of the cgroups
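
The "which jobs are dead" step can also be sketched as a small function that is easier to read and test than the one-liner. This is an illustration rather than our production script: the name `dead_job_ids` is hypothetical, the list of live job IDs is passed on stdin, and it uses a per-ID `grep -x` lookup instead of `sort` and `comm`.

```shell
# Hypothetical helper: print the IDs of jobs that have a job_* cgroup under $1
# but do not appear in the list of live job IDs read from stdin (one per line).
dead_job_ids() (
    scope="$1"
    live="$(cat)"
    for dir in "$scope"/job_*; do
        [ -d "$dir" ] || continue
        id="${dir##*/job_}"
        # -x: whole-line match, -F: fixed string, so 123 does not match 1234.
        printf '%s\n' "$live" | grep -qxF -- "$id" || printf '%s\n' "$id"
    done
)

# Usage on a compute node, with the same query as the one-liner above:
#   scontrol --json=v0.0.44 listjobs | jq -r '.jobs[] | .job_id' \
#       | dead_job_ids /sys/fs/cgroup/system.slice/slurmstepd.scope
```

Note that an empty list on stdin makes every `job_*` cgroup look dead, so it is worth checking that the `scontrol` query succeeded before acting on the output.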

If `rmdir` fails, check the contents of the most deeply nested cgroup directory: there are likely still processes running in that cgroup.
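
To see which processes are holding a cgroup, one option is to read the cgroup v2 file `cgroup.procs`, which lists the PIDs that belong to each cgroup. The sketch below walks a job's subtree and prints every cgroup that still has PIDs; the function name `list_cgroup_pids` is made up for illustration.

```shell
# Hypothetical helper: for every cgroup under $1 whose cgroup.procs is not
# empty, print "<cgroup path>: <pid> <pid> ...".
list_cgroup_pids() (
    top="$1"
    find "$top" -type f -name cgroup.procs | while IFS= read -r procs; do
        # cgroup.procs holds one PID per line; join them with spaces.
        pids="$(tr '\n' ' ' < "$procs")"
        pids="${pids% }"
        if [ -n "$pids" ]; then
            printf '%s: %s\n' "${procs%/cgroup.procs}" "$pids"
        fi
    done
)

# Usage:
#   list_cgroup_pids /sys/fs/cgroup/system.slice/slurmstepd.scope/job_89993
```

Any PID it reports can then be inspected with `ps -o pid,stat,cmd -p <pid>` to decide whether the job is really finished.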

Comment 2 Pablo Flores 2026-05-11 08:21:38 MDT

Dear all,

We (NLHPC) noticed this bug more than a year ago, I believe, while running
jobs with Slurm 24.11.6.

The main issue this causes us is the memory usage generated by these
“zombie jobs” retained by cgroups. If a running job uses all the RAM
available to Slurm, and cgroups keeps many completed jobs in memory as
zombies, the jobs fail and, in some cases, the compute nodes themselves
also fail.

Evidence:

[root@mn002 ~]# squeue -w mn002
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5357536      main       pi     root  R      16:29      1 mn002
           5064595      main NPG_Graf  k*****e  R 2-19:06:36      2 mn[002,017]
           5064596      main NPG_grap k******e  R 2-19:06:36      2 mn[002,017]

BAD:
[root@mn002 ~]# systemd-cgtop | grep slurm
system.slice/slurmd.service                 12      -   177.4M        -        -
system.slice/slurmstepd.scope             1856      -   129.3G        -        -
system.slice/slurmstepd.scope/job_1179424    -      -   166.6M        -        -
system.slice/slurmstepd.scope/job_1179426    -      -   169.0M        -        -
system.slice/slurmstepd.scope/job_1179427    -      -   168.4M        -        -
system.slice/slurmstepd.scope/job_1189620    -      -    12.7M        -        -
system.slice/slurmstepd.scope/job_1191573    -      -    12.8M        -        -
system.slice/slurmstepd.scope/job_1223950    -      -    49.0M        -        -
system.slice/slurmstepd.scope/job_1245792    -      -    68.7M        -        -
system.slice/slurmstepd.scope/job_1387385    -      -    29.2M        -        -
system.slice/slurmstepd.scope/job_1387726    -      -    18.5M        -        -
system.slice/slurmstepd.scope/job_1387806    -      -    18.3M        -        -
system.slice/slurmstepd.scope/job_1390635    -      -    18.0M        -        -
system.slice/slurmstepd.scope/job_1390695    -      -    25.3M        -        -
system.slice/slurmstepd.scope/job_1390747    -      -    14.7M        -        -
system.slice/slurmstepd.scope/job_1401254    -      -    10.8M        -        -
system.slice/slurmstepd.scope/job_1402093    -      -    15.4M        -        -
system.slice/slurmstepd.scope/job_1402397    -      -    15.2M        -        -
system.slice/slurmstepd.scope/job_1423977    -      -    44.3M        -        -
system.slice/slurmstepd.scope/job_1424128    -      -    45.5M        -        -
system.slice/slurmstepd.scope/job_1429921    -      -    12.6M        -        -
system.slice/slurmstepd.scope/job_1429927    -      -    12.8M        -        -
system.slice/slurmstepd.scope/job_1429928    -      -    12.9M        -        -
system.slice/slurmstepd.scope/job_1430316    -      -    12.3M        -        -
system.slice/slurmstepd.scope/job_1430317    -      -    12.3M        -        -
system.slice/slurmstepd.scope/job_1430319    -      -    12.3M        -        -
system.slice/slurmstepd.scope/job_1473145    -      -     4.9M        -        -
system.slice/slurmstepd.scope/job_1497795    -      -     4.9M        -        -
system.slice/slurmstepd.scope/job_170931     -      -    10.3M        -        -
system.slice/slurmstepd.scope/job_1760182    -      -   151.7M        -        -
system.slice/slurmstepd.scope/job_1879121    -      -    24.8M        -        -
system.slice/slurmstepd.scope/job_1879889    -      -    37.7M        -        -
system.slice/slurmstepd.scope/job_2059199    -      -     4.9M        -        -
system.slice/slurmstepd.scope/job_2060449    -      -     4.9M        -        -
system.slice/slurmstepd.scope/job_2073587    -      -    24.6M        -        -
system.slice/slurmstepd.scope/job_2103666    -      -     7.3G        -        -
system.slice/slurmstepd.scope/job_2161401    -      -   118.8M        -        -
system.slice/slurmstepd.scope/job_2193711    -      -    28.0M        -        -
system.slice/slurmstepd.scope/job_2201407    -      -    15.6M        -        -

Regards


Comment 3 Pablo Flores 2026-05-12 15:27:13 MDT

Hi

With the script you sent, I still get the same error when trying to delete the directory:

rmdir: failed to remove '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_5482424/step_batch/user/task_0': Device or resource busy

Could you share the patch you are using in production?

Regards

Pablo Flores
