Ticket 14945

Summary: jobacct_gather/cgroup plugin leaks memory
Product: Slurm    Reporter: CSC sysadmins <csc-slurm-tickets>
Component: Accounting    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN    QA Contact:
Severity: 3 - Medium Impact
Priority: ---    CC: rjukkara
Version: 21.08.7
Hardware: Linux   
OS: Linux   
Site: CSC - IT Center for Science
Attachments: slurm.conf
Slabinfos
Slurmd logs
openmpi based hello MPI run under valgrind
slurmstepd vg logs

Description CSC sysadmins 2022-09-13 05:09:15 MDT
Created attachment 26749 [details]
slurm.conf

Hi,

I changed the jobacct_gather plugin from linux to cgroup due to this problem: https://bugs.schedmd.com/show_bug.cgi?id=14179

But it seems that the jobacct_gather/cgroup plugin leaks memory on every job step and, what is worse, the memory is unreclaimable. Each simple 128-task MPI hello step leaks >100 MB, and the node will eventually run out of memory.

[root@c1234 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1234 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:      4523352 kB
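For repeated measurements, the shrink-and-drop sequence above can be wrapped in one small helper; the script below is a sketch (not part of the original report) that uses a glob instead of find but follows the same procedure:

```shell
#!/bin/sh
# Sketch: measure how much SUnreclaim drops after forcing a SLUB shrink
# and dropping the page/dentry/inode caches. The writes need root; the
# reads from /proc/meminfo do not.
sunreclaim_kb() {
    awk '/^SUnreclaim:/ {print $2}' /proc/meminfo
}

before=$(sunreclaim_kb)
{
    # Ask every SLUB cache to shrink, then drop the reclaimable caches.
    for f in /sys/kernel/slab/*/shrink; do echo 1000 > "$f"; done
    echo 3 > /proc/sys/vm/drop_caches
} 2>/dev/null
after=$(sunreclaim_kb)
echo "SUnreclaim: ${before} kB -> ${after} kB"
```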


[root@c1234 ~]# perl -pi -e 's,jobacct_gather/cgroup,jobacct_gather/linux,' /etc/slurm/slurm.conf
[root@c1234 ~]# systemctl restart slurmd

# run 10 MPI-hello steps with the linux plugin:
#SBATCH --ntasks-per-node=128
for i in {1..10}; do grep  SU /proc/meminfo; srun ~/mpi/ompi_hello > /dev/null ; done

SUnreclaim:      4521944 kB
SUnreclaim:      4676624 kB
SUnreclaim:      4698292 kB
SUnreclaim:      4716888 kB
SUnreclaim:      4734068 kB
SUnreclaim:      4745232 kB
SUnreclaim:      4743288 kB
SUnreclaim:      4737748 kB
SUnreclaim:      4744592 kB
SUnreclaim:      4758236 kB


[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:      4767132 kB
[root@c1234 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1234 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:      4540208 kB

# Memory usage went back to its original value

[root@c1234 ~]# perl -pi -e 's,jobacct_gather/linux,jobacct_gather/cgroup,' /etc/slurm/slurm.conf
[root@c1234 ~]# systemctl restart slurmd
[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:      4545180 kB

# run 10 MPI-hello steps with the cgroup plugin:

#SBATCH --ntasks-per-node=128
for i in {1..10}; do grep  SU /proc/meminfo; srun ~/mpi/ompi_hello > /dev/null ; done

SUnreclaim:      4548192 kB
SUnreclaim:      4805052 kB
SUnreclaim:      4936492 kB
SUnreclaim:      5065328 kB
SUnreclaim:      5192264 kB
SUnreclaim:      5313412 kB
SUnreclaim:      5419656 kB
SUnreclaim:      5517932 kB
SUnreclaim:      5635436 kB
SUnreclaim:      5757160 kB

[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:      5875204 kB
[root@c1234 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1234 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:      5625956 kB

# Unreclaimable memory usage grew by over 1 GB
Comment 1 Albert Gil 2022-09-14 12:13:56 MDT
Hi Tommi,


Thanks for the detailed example.
Could you post which kernel version you are using?

Thanks,
Albert
Comment 2 CSC sysadmins 2022-09-15 00:19:47 MDT
Hi,

This is a fairly standard RHEL 8.6 compute node:

kernel-4.18.0-372.19.1.el8_6.x86_64
mlnx-ofa_kernel-5.6-OFED.5.6.1.0.3.1.rhel8u6.x86_64

I also tested with an RHEL 8.5-based image, so this does not seem to be a recent regression.

BR,
Tommi
Comment 3 CSC sysadmins 2022-09-15 00:46:59 MDT
I forgot to ask: is it possible to attach valgrind to slurmstepd? I already checked slurmd for memory leaks, but that did not reveal anything interesting.
Comment 4 CSC sysadmins 2022-09-22 02:25:24 MDT
I tried without the following parameters, which could have caused this leak, but SUnreclaim memory still grows.

LaunchParameters=slurmstepd_memlock
SLURMD_OPTIONS="-M"

-Tommi
Comment 6 Albert Gil 2022-09-29 09:06:42 MDT
Hi Tommi,

So far I have not been able to reproduce your issue.

> Forgot to ask if it's possible to attach valgrind to slurmstepd, I already
> tried to check slurmd memleaks but that did not reveal anything interesting?

Yes, although it is meant to be used only by developers, there is a way to run slurmstepd under valgrind.
You have to manually edit the src/slurmd/common/slurmstepd_init.h file, uncomment the desired SLURMSTEPD_MEMCHECK mode, rebuild Slurm, and restart slurmd.
Then, whenever slurmd starts a new slurmstepd, it launches it under valgrind and saves the valgrind output into a file /tmp/slurmstepd_valgrind_$jobid.$step.

Please note that this is not meant to be used in any production environment.

> I tried without these parameters which could cause this leak but SUnreclaim
> memory still grows. 
> 
> LaunchParameters=slurmstepd_memlock
> SLURMD_OPTIONS="-M"

Good try.

As mentioned, I am not able to reproduce it; it seems tied to your environment.
If you are able to reproduce the issue in a test environment and can get the valgrind output, please share it.
Also attach the slurmd logs.
I'll keep investigating in the meantime.

Thanks,
Albert
Comment 7 Albert Gil 2022-09-29 09:52:55 MDT
Tommi,

If you can reproduce the issue, could you monitor the node with "slabtop" and attach output showing which caches are growing/leaking?
And please attach slurmd logs.

Regards,
Albert
Comment 10 Albert Gil 2022-10-04 09:53:38 MDT
Tommi,

Some extra questions:
- Could you reproduce the problem without MPI?
  - That is, running the same number of tasks, but with a simple Linux command like sleep or hostname.
- Could you attach your cgroup.conf and the output of "scontrol show config"?

Thanks,
Albert
Comment 11 CSC sysadmins 2022-10-04 09:59:28 MDT
Hi,

We do have a service break ongoing so I'll answer questions later on this week.


cgroup.conf:
CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainDevices=yes
ConstrainCores=yes
#Taskaffinity is handled by cpusets
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainKmemSpace=no
AllowedKmemSpace=250131589120
MinKmemSpace=200000
AllowedSwapSpace=0
Comment 12 Albert Gil 2022-10-04 10:28:03 MDT
Hi Tommi,

> We do have a service break ongoing so I'll answer questions later on this
> week.

Sorry about that; I hope you can restore service soon and easily!

> cgroup.conf:
> ConstrainKmemSpace=no
> AllowedKmemSpace=250131589120
> MinKmemSpace=200000

I'm looking for clues, so could you try commenting out the above options and see if you can still reproduce the issue?

> CgroupMountpoint=/sys/fs/cgroup
> CgroupAutomount=yes

These two are totally OK and unrelated to your issue, but I would also recommend commenting them out in general.

> CgroupReleaseAgentDir="/etc/slurm/cgroup"

Also unrelated, but I recommend removing this one because it has been deprecated since 17.11, and from 22.05 on the daemon will exit with a fatal error if it is still present in the config.

Regards,
Albert
Comment 13 CSC sysadmins 2022-10-07 01:05:04 MDT
Hi,

I cleaned up cgroup.conf and tested with orterun instead of srun; it seems that the massive leak happens only with srun. I also tried with ConstrainRAMSpace=no, with the same result.

echo "running with orterun"
for i in {1..10}; do grep  SU /proc/meminfo; orterun ~/mpi/ompi_hello > /dev/null ; done

echo "running with srun"
for i in {1..10}; do grep  SU /proc/meminfo; srun ~/mpi/ompi_hello > /dev/null ; done

running with orterun
SUnreclaim:      3313232 kB
SUnreclaim:      3340752 kB
SUnreclaim:      3349004 kB
SUnreclaim:      3356880 kB
SUnreclaim:      3361200 kB
SUnreclaim:      3363324 kB
SUnreclaim:      3366520 kB
SUnreclaim:      3370948 kB
SUnreclaim:      3372088 kB
SUnreclaim:      3373188 kB
running with srun
SUnreclaim:      3374832 kB
SUnreclaim:      3498664 kB
SUnreclaim:      3616456 kB
SUnreclaim:      3731868 kB
SUnreclaim:      3845192 kB
SUnreclaim:      3953904 kB
SUnreclaim:      4060096 kB
SUnreclaim:      4176536 kB
SUnreclaim:      4288844 kB
SUnreclaim:      4397744 kB




[root@c1336 ~]# grep -v ^# /etc/slurm/cgroup.conf

ConstrainDevices=yes
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes



 Active / Total Objects (% used)    : 9146172 / 9184470 (99,6%)
 Active / Total Slabs (% used)      : 208823 / 208823 (100,0%)
 Active / Total Caches (% used)     : 181 / 263 (68,8%)
 Active / Total Size (% used)       : 4564604,39K / 4573236,27K (99,8%)
 Minimum / Average / Maximum Object : 0,01K / 0,50K / 10,00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
2165696 2162367  99%    0,50K  33839       64   1082848K kmalloc-512
1142652 1139488  99%    0,09K  27206       42    108824K kmalloc-96
1060144 1059712  99%    2,00K  66259       16   2120288K kmalloc-2k
568064 565148  99%    0,03K   4438      128     17752K kmalloc-32
405960 398510  98%    0,13K   6766       60     54128K kernfs_node_cache
315490 315490 100%    0,23K   4507       70     72112K vm_area_struct
302260 299189  98%    0,02K   1778      170      7112K lsm_inode_cache
Comment 14 Albert Gil 2022-10-07 05:20:17 MDT
Hi Tommi,

Thanks for the effort.

> I cleaned cgroup.conf and tested with orterun intead of srun, seems that
> massive leak happens only with srun. Tried also with ConstrainRAMSpace=no,
> same result.
> 
> echo "running with orterun"
> for i in {1..10}; do grep  SU /proc/meminfo; orterun ~/mpi/ompi_hello >
> /dev/null ; done
> 
> echo "running with srun"
> for i in {1..10}; do grep  SU /proc/meminfo; srun ~/mpi/ompi_hello >
> /dev/null ; done
> 
> running with orterun
> SUnreclaim:      3313232 kB
> SUnreclaim:      3340752 kB
> SUnreclaim:      3349004 kB
> SUnreclaim:      3356880 kB
> SUnreclaim:      3361200 kB
> SUnreclaim:      3363324 kB
> SUnreclaim:      3366520 kB
> SUnreclaim:      3370948 kB
> SUnreclaim:      3372088 kB
> SUnreclaim:      3373188 kB
> running with srun
> SUnreclaim:      3374832 kB
> SUnreclaim:      3498664 kB
> SUnreclaim:      3616456 kB
> SUnreclaim:      3731868 kB
> SUnreclaim:      3845192 kB
> SUnreclaim:      3953904 kB
> SUnreclaim:      4060096 kB
> SUnreclaim:      4176536 kB
> SUnreclaim:      4288844 kB
> SUnreclaim:      4397744 kB

Just to be sure:
- I assume that this is a batch script submitted with a command like "sbatch -N1 -n 128", right?
  - For both, with srun and orterun, right?
- Do you get the same results using salloc instead of sbatch?

Could you also try using "hostname" or "sleep 10" instead of ~/mpi/ompi_hello?
I want to completely rule out MPI as the cause.


>  Active / Total Objects (% used)    : 9146172 / 9184470 (99,6%)
>  Active / Total Slabs (% used)      : 208823 / 208823 (100,0%)
>  Active / Total Caches (% used)     : 181 / 263 (68,8%)
>  Active / Total Size (% used)       : 4564604,39K / 4573236,27K (99,8%)
>  Minimum / Average / Maximum Object : 0,01K / 0,50K / 10,00K
> 
>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 2165696 2162367  99%    0,50K  33839       64   1082848K kmalloc-512
> 1142652 1139488  99%    0,09K  27206       42    108824K kmalloc-96
> 1060144 1059712  99%    2,00K  66259       16   2120288K kmalloc-2k
>  568064  565148  99%    0,03K   4438      128     17752K kmalloc-32
>  405960  398510  98%    0,13K   6766       60     54128K kernfs_node_cache
>  315490  315490 100%    0,23K   4507       70     72112K vm_area_struct
>  302260  299189  98%    0,02K   1778      170      7112K lsm_inode_cache

I think that we'll need more debug info.
Could you run a batch like this as root/sudoer to obtain more detailed slabinfo snapshots:

for i in {0..9}; do sudo cat /proc/slabinfo > ./slabinfo.${i}; srun ./mpi_hello > /dev/null ; done
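Once a pair of snapshots exists, the per-cache growth can be summarized with a short awk pass; the snippet below is a sketch (not from the ticket) run on two tiny synthetic snapshots laid out in slabinfo column order (name, active_objs, num_objs, objsize, ...):

```shell
# Synthetic stand-ins for two slabinfo snapshots; a real /proc/slabinfo
# has two header lines, which can be skipped with FNR > 2.
cat > /tmp/slab.0 <<'EOF'
kmalloc-512  1000 1000  512 8 1
kmalloc-2k    500  500 2048 2 1
EOF
cat > /tmp/slab.1 <<'EOF'
kmalloc-512  3000 3000  512 8 1
kmalloc-2k   1500 1500 2048 2 1
EOF

# $1 = cache name, $2 = active objects, $4 = object size in bytes;
# print every cache whose active byte count grew between the snapshots.
awk 'NR==FNR {base[$1] = $2 * $4; next}
     {d = $2 * $4 - base[$1]; if (d > 0) printf "%-14s grew %d kB\n", $1, d / 1024}' \
    /tmp/slab.0 /tmp/slab.1
```

On the synthetic data this reports 1000 kB of growth for kmalloc-512 and 2000 kB for kmalloc-2k, mirroring the pattern later seen in the attached snapshots.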

Thanks,
Albert
Comment 16 CSC sysadmins 2022-10-07 06:36:06 MDT
Created attachment 27168 [details]
Slabinfos
Comment 17 CSC sysadmins 2022-10-07 06:36:40 MDT
> - I assume that this is a batch script submitted with a command like "sbatch
> -N1 -n 128", right?
>   - For both, with srun and orterun, right?

Yes, it was the same job; the first 10 steps were orterun runs and the second 10 were srun runs.

> - Do you get the same results using salloc instead of sbatch?

Yes, salloc + srun leaks as well.

> Could you also try using "hostname" or "sleep 10" instead of
> ~/mpi/ompi_hello?
> I want to completely rule out MPI as the cause.

With srun hostname, memory usage looks quite stable. I also tested an mpich hello_mpi, and it caused the leak as well.

> I think that we'll need more debug info.
> Could you run a batch like this as root/sudoer to obtain more detailed
> slabinfo:
> 
> for i in {0..9}; do sudo cat /proc/slabinfo > ./slabinfo.${i}; srun
> ./mpi_hello > /dev/null ; done

Output attached; the biggest differences look to be in kmalloc-512/kmalloc-2k.

Best Regards,
Tommi
Comment 19 Albert Gil 2022-10-07 06:54:57 MDT
Tommi,

> > - I assume that this is a batch script submitted with a command like "sbatch
> > -N1 -n 128", right?
> >   - For both, with srun and orterun, right?
> 
> Yes, it was the same job; the first 10 steps were orterun runs and the second
> 10 were srun runs.
> 
> > - Do you get the same results using salloc instead of sbatch?
> 
> Yes, salloc + srun leaks as well

Thanks for the confirmation.


> > Could you also try using "hostname" or "sleep 10" instead of
> > ~/mpi/ompi_hello?
> > I want to completely rule out MPI as the cause.
> 
> With srun hostname, memory usage looks quite stable. I also tested an mpich
> hello_mpi, and it caused the leak as well.

Maybe "hostname" is too quick to trigger the issue; could you try sleep or some non-MPI command that takes a few seconds of computation?


> > I think that we'll need more debug info.
> > Could you run a batch like this as root/sudoer to obtain more detailed
> > slabinfo:
> > 
> > for i in {0..9}; do sudo cat /proc/slabinfo > ./slabinfo.${i}; srun
> > ./mpi_hello > /dev/null ; done
> 
> Output attached; the biggest differences look to be in kmalloc-512/kmalloc-2k.

I need to look deeper on them.
As an extra check, although your kernel is not impacted by this old issue (https://bugzilla.redhat.com/show_bug.cgi?id=1507149), could you check whether you can still reproduce the issue if you boot your node with cgroup.memory=nokmem (GRUB_CMDLINE_LINUX)?
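For reference, a sketch of how that boot flag would be set on a RHEL-style node (the file path and grub2-mkconfig step are assumptions; verify for your distro):

```shell
# /etc/default/grub fragment: keep the existing flags and append nokmem
GRUB_CMDLINE_LINUX="<existing flags> cgroup.memory=nokmem"

# then regenerate the grub config and reboot (RHEL 8 path assumed):
# grub2-mkconfig -o /boot/grub2/grub.cfg && reboot
```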

Also, could you attach your slurmd logs?
Comment 22 CSC sysadmins 2022-10-10 03:55:44 MDT
Adding cgroup.memory=nokmem boot option did not make any difference.

Running sleep does not trigger leak:
for i in {1..10}; do grep  SU /proc/meminfo; srun sleep 15 > /dev/null ; done

SUnreclaim:      2917308 kB
SUnreclaim:      3094960 kB
SUnreclaim:      3113032 kB
SUnreclaim:      3128524 kB
SUnreclaim:      3143072 kB
SUnreclaim:      3150400 kB
SUnreclaim:      3147296 kB
SUnreclaim:      3141252 kB
SUnreclaim:      3141104 kB
SUnreclaim:      3150368 kB

At first glance it looks like the leak is present, but memory usage goes back to the original level when the slabs are shrunk.

[root@c2362 ~]#  for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
SUnreclaim:      2920240 kB
Comment 23 CSC sysadmins 2022-10-10 03:56:33 MDT
Created attachment 27190 [details]
Slurmd logs
Comment 24 Albert Gil 2022-10-10 09:01:01 MDT
Hi Tommi,

> Adding cgroup.memory=nokmem boot option did not make any difference.

Thanks for the check.

> Running sleep does not trigger leak:
> for i in {1..10}; do grep  SU /proc/meminfo; srun sleep 15 > /dev/null ; done
> 
> SUnreclaim:      2917308 kB
> SUnreclaim:      3094960 kB
> SUnreclaim:      3113032 kB
> SUnreclaim:      3128524 kB
> SUnreclaim:      3143072 kB
> SUnreclaim:      3150400 kB
> SUnreclaim:      3147296 kB
> SUnreclaim:      3141252 kB
> SUnreclaim:      3141104 kB
> SUnreclaim:      3150368 kB
> 
> At first glance it looks like the leak is present, but memory usage goes back
> to the original level when the slabs are shrunk.
> 
> [root@c2362 ~]#  for f in $(find /sys/kernel/slab -name shrink -type f) ;do
> echo 1000 > $f ; done
> SUnreclaim:      2920240 kB

Yes, I assume that is not a leak but the kernel operating normally and taking its time to free some space.

So, it seems that triggering it requires a specific combination of MPI + Slurm + cgroup.
None of them individually seems to trigger it.

What MPI version are you using?

I assume that the srun command is not involved, but maybe we can run it under valgrind, just in case it gives us a clue.
By the way, have you been able to run valgrind on the stepd as explained in comment 6?

Thanks,
Albert
Comment 25 CSC sysadmins 2022-10-11 01:47:37 MDT
> What MPI version are you using?

"mpich/4.0.1" and "openmpi/4.1.2".

> I assume that the srun command is not related, but maybe we can run it under
> valgrind, just in case we get some clue.

I'll attach the logs; I ran one 128-rank openmpi hello job with it.

==655483== LEAK SUMMARY:
==655483==    definitely lost: 1,926 bytes in 10 blocks
==655483==    indirectly lost: 22,239 bytes in 155 blocks
==655483==      possibly lost: 2,200,260 bytes in 20,600 blocks
==655483==    still reachable: 61,840 bytes in 584 blocks
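As a side note, the "definitely lost" totals across a set of such logs (e.g. the /tmp/slurmstepd_valgrind_* files) can be tallied with a short awk pass; a sketch, demonstrated on an inline sample in valgrind's LEAK SUMMARY format:

```shell
# Inline sample standing in for one or more valgrind log files.
cat > /tmp/vg.sample <<'EOF'
==655483== LEAK SUMMARY:
==655483==    definitely lost: 1,926 bytes in 10 blocks
==655483==    indirectly lost: 22,239 bytes in 155 blocks
EOF

# Strip the thousands separators and sum the "definitely lost" byte counts.
awk -F': ' '/definitely lost/ {v = $2; gsub(/,/, "", v); sub(/ bytes.*/, "", v); total += v}
            END {printf "definitely lost total: %d bytes\n", total}' /tmp/vg.sample
```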


> By the way, have you been able to reproduce it with valgrind for stepd as
> explained in comment 6?

I'll try to recompile slurmd today if nothing more urgent lands on my desk.

-Tommi
Comment 26 CSC sysadmins 2022-10-11 01:48:24 MDT
Created attachment 27209 [details]
openmpi based hello MPI run under valgrind
Comment 27 CSC sysadmins 2022-10-11 03:02:27 MDT
Created attachment 27210 [details]
slurmstepd vg logs

It seems one also needs to add the --enable-memory-leak-debug configure option.
Comment 29 Caden Ellis 2023-07-28 14:02:26 MDT
Tommi, 

Sorry it has been so long. I am taking this over and making a last effort to reproduce it before we close this out.

Have you upgraded Slurm since filing this ticket, and are you still seeing the error?
Are you using cgroup v1 or v2?

Using 23.02 with impi 4.1.5 and cgroup v2, I am not seeing the error.

Caden
Comment 30 Caden Ellis 2023-08-25 14:38:03 MDT
Feel free to reach out if this continues to be an issue.

Caden
Comment 31 Roope Jukkara 2023-09-26 02:39:05 MDT
Hello,
Looks like we're still hitting this issue with Slurm version 22.05 while using cgroup/v2.

[root@c1101 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1101 ~]# echo 3 > /proc/sys/vm/drop_caches 
[root@c1101 ~]# grep SUn /proc/meminfo 
SUnreclaim:       559296 kB

# test job
SUnreclaim:       565396 kB
SUnreclaim:       920844 kB
SUnreclaim:      1076020 kB
SUnreclaim:      1213476 kB
SUnreclaim:      1344720 kB
SUnreclaim:      1459300 kB
SUnreclaim:      1553724 kB
SUnreclaim:      1680260 kB
SUnreclaim:      1795104 kB
SUnreclaim:      1918156 kB

[root@c1101 ~]# grep SUn /proc/meminfo 
SUnreclaim:      2035480 kB

[root@c1101 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1101 ~]# echo 3 > /proc/sys/vm/drop_caches 
[root@c1101 ~]# grep SUn /proc/meminfo 
SUnreclaim:      1770208 kB

Best regards,
Roope Jukkara
Comment 32 Roope Jukkara 2023-09-26 02:40:14 MDT
(In reply to Roope Jukkara from comment #31)

> Looks like we're still hitting this issue with Slurm version 22.05 while
> using cgroup/v2

To clarify, issue also happens with cgroup/v1.

Best,
Roope Jukkara
Comment 33 CSC sysadmins 2023-09-26 06:54:37 MDT
Hi,

I think this bug report can be closed because it is not Slurm's fault. Using the cgroup plugin merely triggers this leak, which seems to be in the Mellanox OFED 5.6 drivers. I just ran this test case on a node with Mellanox OFED 5.8 LTS, and unreclaimable memory does not grow quickly like it did with MOFED 5.6.

It could be this fix: 

https://docs.nvidia.com/networking/display/MLNXOFEDv583070LTS/Bug+Fixes+in+This+Version

3229002
Description: Creating and deleting MRs, caused a kernel slab cache leak issue.
Keywords: RDMA, Cache

Discovered in Release: 5.7-1.0.2.0
Fixed in Release: 5.8-1.0.1.1

Best Regards,
Tommi Tervo
CSC