| Summary: | jobacct_gather/cgroup plugin leaks memory | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | CSC sysadmins <csc-slurm-tickets> |
| Component: | Accounting | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | rjukkara |
| Version: | 21.08.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | CSC - IT Center for Science | Alineos Sites: | --- |
| Attachments: | slurm.conf, Slabinfos, Slurmd logs, openmpi based hello MPI run under valgrind, slurmstepd vg logs | | |
Description
CSC sysadmins
2022-09-13 05:09:15 MDT
Hi Tommi, Thanks for the detailed example. Could you post which kernel version you are using? Thanks, Albert

Hi, This is a fairly standard RHEL 8.6 compute node: kernel-4.18.0-372.19.1.el8_6.x86_64, mlnx-ofa_kernel-5.6-OFED.5.6.1.0.3.1.rhel8u6.x86_64. I also tested with a RHEL 8.5-based image, but this does not seem to be a recent regression. BR, Tommi

I forgot to ask whether it's possible to attach valgrind to slurmstepd; I already checked slurmd for memory leaks, but that did not reveal anything interesting. I also tried without these parameters, which could have caused the leak, but SUnreclaim memory still grows:

LaunchParameters=slurmstepd_memlock
SLURMD_OPTIONS="-M"

-Tommi

Hi Tommi, So far I've not been able to reproduce your issue.

> Forgot to ask if it's possible to attach valgrind to slurmstepd, I already
> tried to check slurmd memleaks but that did not reveal anything interesting?

Yes, although it's meant to be used only by developers, there is a way to run slurmstepds under valgrind. You have to manually edit the src/slurmd/common/slurmstepd_init.h file and uncomment the desired SLURMSTEPD_MEMCHECK, rebuild Slurm, and restart slurmd. Then, when slurmd starts a new slurmstepd, it starts within a valgrind call, saving the valgrind output into a file /tmp/slurmstepd_valgrind_$jobid.$step. Please note that this is not meant to be used in any production environment.

> I tried without these parameters which could cause this leak but SUnreclaim
> memory still grows.
>
> LaunchParameters=slurmstepd_memlock
> SLURMD_OPTIONS="-M"

Good try. As mentioned, I'm not able to reproduce it; it seems tied to your environment. If you are able to reproduce the issue in a test environment and can get the valgrind output, please share it, and attach the slurmd logs as well. I'll keep investigating, though. Thanks, Albert

Tommi, If you can reproduce the issue, could you monitor the node with "slabtop" and note which caches are growing/leaking? And please attach the slurmd logs.
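For readers following along, the rebuild procedure Albert describes can be summarized as a shell recipe. This is a sketch only: the source-tree path and install prefix are assumptions, the exact SLURMSTEPD_MEMCHECK value to uncomment is in the header itself, and Tommi later notes that the --enable-memory-leak-debug configure option is also needed.

```shell
# Developer-only: run slurmstepd under valgrind (not for production use).
cd slurm-21.08.7/                         # assumed source tree location
# 1) Uncomment the desired SLURMSTEPD_MEMCHECK define:
${EDITOR:-vi} src/slurmd/common/slurmstepd_init.h
# 2) Rebuild with leak debugging enabled (prefix is an assumption):
./configure --prefix=/opt/slurm --enable-memory-leak-debug
make -j"$(nproc)" && make install
# 3) Restart slurmd; each new slurmstepd then runs under valgrind and
#    writes its report to /tmp/slurmstepd_valgrind_$jobid.$step
systemctl restart slurmd
```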
Regards, Albert

Tommi, Some extra questions:
- Could you reproduce the problem without MPI? That is, running the same number of tasks, but with a simple Linux command like sleep or hostname.
- Could you attach your cgroup.conf and the output of "scontrol show config"?
Thanks, Albert

Hi, We have a service break ongoing, so I'll answer the questions later this week.

cgroup.conf:
CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainDevices=yes
ConstrainCores=yes
#Taskaffinity is handled by cpusets
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainKmemSpace=no
AllowedKmemSpace=250131589120
MinKmemSpace=200000
AllowedSwapSpace=0

Hi Tommi,

> We do have a service break ongoing so I'll answer questions later on this
> week.

Sorry about that; I hope you can restore it soon and easily!

> cgroup.conf:
> ConstrainKmemSpace=no
> AllowedKmemSpace=250131589120
> MinKmemSpace=200000

I'm looking for clues, so could you try commenting out the above options and see if you still reproduce the issue?

> CgroupMountpoint=/sys/fs/cgroup
> CgroupAutomount=yes

These two are totally fine and unrelated to your issue, but I would also recommend commenting them out in general.

> CgroupReleaseAgentDir="/etc/slurm/cgroup"

Also unrelated, but I recommend removing this one because it has been deprecated since 17.11, and from 22.05 on the daemon will fatal if it is still present in the config.

Regards, Albert

Hi,
I cleaned up cgroup.conf and tested with orterun instead of srun; it seems that the massive leak happens only with srun. I also tried with ConstrainRAMSpace=no, with the same result.
echo "running with orterun"
for i in {1..10}; do grep SU /proc/meminfo; orterun ~/mpi/ompi_hello > /dev/null ; done
echo "running with srun"
for i in {1..10}; do grep SU /proc/meminfo; srun ~/mpi/ompi_hello > /dev/null ; done
running with orterun
SUnreclaim: 3313232 kB
SUnreclaim: 3340752 kB
SUnreclaim: 3349004 kB
SUnreclaim: 3356880 kB
SUnreclaim: 3361200 kB
SUnreclaim: 3363324 kB
SUnreclaim: 3366520 kB
SUnreclaim: 3370948 kB
SUnreclaim: 3372088 kB
SUnreclaim: 3373188 kB
running with srun
SUnreclaim: 3374832 kB
SUnreclaim: 3498664 kB
SUnreclaim: 3616456 kB
SUnreclaim: 3731868 kB
SUnreclaim: 3845192 kB
SUnreclaim: 3953904 kB
SUnreclaim: 4060096 kB
SUnreclaim: 4176536 kB
SUnreclaim: 4288844 kB
SUnreclaim: 4397744 kB
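The per-step growth makes the contrast clearer. A quick sketch for computing successive deltas from output like the above (the `deltas` helper name is made up for illustration; the sample values are the first four srun figures):

```shell
# Print the difference between consecutive SUnreclaim samples (kB).
deltas() { awk '{ if (prev != "") print $1 - prev; prev = $1 }'; }

# First four srun samples from the run above:
printf '%s\n' 3374832 3498664 3616456 3731868 | deltas
# srun grows by roughly 110-125 MB per step, while the orterun samples,
# fed through the same filter, grow by only a few MB per step.
```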
[root@c1336 ~]# grep -v ^# /etc/slurm/cgroup.conf
ConstrainDevices=yes
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes
Active / Total Objects (% used) : 9146172 / 9184470 (99,6%)
Active / Total Slabs (% used) : 208823 / 208823 (100,0%)
Active / Total Caches (% used) : 181 / 263 (68,8%)
Active / Total Size (% used) : 4564604,39K / 4573236,27K (99,8%)
Minimum / Average / Maximum Object : 0,01K / 0,50K / 10,00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
2165696 2162367 99% 0,50K 33839 64 1082848K kmalloc-512
1142652 1139488 99% 0,09K 27206 42 108824K kmalloc-96
1060144 1059712 99% 2,00K 66259 16 2120288K kmalloc-2k
568064 565148 99% 0,03K 4438 128 17752K kmalloc-32
405960 398510 98% 0,13K 6766 60 54128K kernfs_node_cache
315490 315490 100% 0,23K 4507 70 72112K vm_area_struct
302260 299189 98% 0,02K 1778 170 7112K lsm_inode_cache
Hi Tommi, Thanks for the effort.

> I cleaned cgroup.conf and tested with orterun instead of srun; it seems that
> the massive leak happens only with srun. I also tried with
> ConstrainRAMSpace=no, with the same result.
> [test loops and SUnreclaim output quoted from the previous comment snipped]

Just to be sure:
- I assume that this is a batch script submitted with a command like "sbatch -N1 -n 128", right?
- For both srun and orterun, right?
- Do you get the same results using salloc instead of sbatch?

Could you also try using "hostname" or "sleep 10" instead of ~/mpi/ompi_hello? I want to completely rule out MPI as the cause.

> [slabtop output quoted from the previous comment snipped]

I think we'll need more debug info. Could you run a batch like this as root/sudoer to obtain more detailed slabinfo:

for i in {0..9}; do sudo cat /proc/slabinfo > ./slabinfo.${i}; srun ./mpi_hello > /dev/null ; done

Thanks, Albert

Created attachment 27168 [details]
Slabinfos
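To pinpoint which caches account for the growth across snapshots like these, the slabinfo dumps can be diffed by cache name. A minimal sketch (the `slab_growth` helper and the sample numbers are illustrative, not from the attachment; a real /proc/slabinfo has more columns, but the cache name and active-object count are the first two):

```shell
# Rank slab caches by growth in active objects between two snapshots.
slab_growth() {
  join <(awk 'NR>2 {print $1, $2}' "$1" | sort) \
       <(awk 'NR>2 {print $1, $2}' "$2" | sort) |
    awk '$3 > $2 {print $3 - $2, $1}' | sort -rn
}

# Tiny illustrative snapshots (two header lines, then name/active pairs):
printf 'slabinfo - version: 2.1\n# name <active_objs> ...\nkmalloc-512 100\nkmalloc-2k 50\n' > /tmp/slabinfo.0
printf 'slabinfo - version: 2.1\n# name <active_objs> ...\nkmalloc-512 900\nkmalloc-2k 60\n' > /tmp/slabinfo.9
slab_growth /tmp/slabinfo.0 /tmp/slabinfo.9
```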
> - I assume that this is a batch script submitted with a command like "sbatch
>   -N1 -n 128", right?
> - For both, with srun and orterun, right?

Yes, it was the same job; the first 10 steps were orteruns and the second 10 steps were sruns.

> - Do you get the same results using salloc instead of sbatch?

Yes, salloc + srun leaks as well.

> Could you also try using "hostname" or "sleep 10" instead of
> ~/mpi/ompi_hello? I want to completely rule out MPI as the cause.

With srun hostname, memory usage looks quite stable. I also tested an mpich hello_mpi and it caused the leak as well.

> I think that we'll need more debug info.
> Could you run a batch like this as root/sudoer to obtain more detailed
> slabinfo:
>
> for i in {0..9}; do sudo cat /proc/slabinfo > ./slabinfo.${i}; srun
> ./mpi_hello > /dev/null ; done

Output attached; the biggest differences look to be in kmalloc-512/kmalloc-2k.

Best Regards, Tommi

Tommi,

> Yes, it was the same job; the first 10 steps were orteruns and the second 10
> steps were sruns.
> Yes, salloc + srun leaks as well.

Thanks for the confirmation.

> With srun hostname, memory usage looks quite stable. I also tested an mpich
> hello_mpi and it caused the leak as well.

Maybe "hostname" is too quick to trigger the issue; could you try sleep or some non-MPI command that takes a few seconds of computation?

> Output attached; the biggest differences look to be in kmalloc-512/kmalloc-2k.

I need to look deeper into them. As an extra check, although your kernel is not impacted by this old issue (https://bugzilla.redhat.com/show_bug.cgi?id=1507149), could you check whether you still reproduce the issue if you boot your node with cgroup.memory=nokmem (GRUB_CMDLINE_LINUX)? Also, could you attach your slurmd logs?

Adding the cgroup.memory=nokmem boot option did not make any difference.
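For reference, the boot-parameter change Albert asked for looks roughly like this on a RHEL 8 node. This is a sketch: the file paths assume a BIOS boot with grub2 (EFI systems use a different grub.cfg location), and the existing command-line contents are placeholders.

```shell
# /etc/default/grub — append the parameter to the kernel command line:
#   GRUB_CMDLINE_LINUX="... cgroup.memory=nokmem"
# Then regenerate the grub config and reboot the node:
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
```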
Running sleep does not trigger the leak:
for i in {1..10}; do grep SU /proc/meminfo; srun sleep 15 > /dev/null ; done
SUnreclaim: 2917308 kB
SUnreclaim: 3094960 kB
SUnreclaim: 3113032 kB
SUnreclaim: 3128524 kB
SUnreclaim: 3143072 kB
SUnreclaim: 3150400 kB
SUnreclaim: 3147296 kB
SUnreclaim: 3141252 kB
SUnreclaim: 3141104 kB
SUnreclaim: 3150368 kB
At first glance it looks like the leak is present, but memory usage returns to its original level when the slabs are shrunk:
[root@c2362 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
SUnreclaim: 2920240 kB
Created attachment 27190 [details]
Slurmd logs
Hi Tommi,

> Adding the cgroup.memory=nokmem boot option did not make any difference.

Thanks for the check.

> Running sleep does not trigger the leak:
> for i in {1..10}; do grep SU /proc/meminfo; srun sleep 15 > /dev/null ; done
> [SUnreclaim output snipped]
>
> At first glance it looks like the leak is present, but memory usage returns
> to its original level when the slabs are shrunk:
>
> [root@c2362 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do
> echo 1000 > $f ; done
> SUnreclaim: 2920240 kB

Yes, I assume that's not a leak but the kernel doing normal operation and taking its time to free some space. So it seems that triggering it needs the specific combination of MPI + Slurm + cgroup; none of them individually seems to trigger it.

What MPI version are you using? I assume that the srun command itself is not involved, but maybe we can run it under valgrind, just in case we get some clue. By the way, have you been able to reproduce it with valgrind for the stepd as explained in comment 6?

Thanks, Albert

> What MPI version are you using?

"mpich/4.0.1" and "openmpi/4.1.2".

> I assume that the srun command is not related, but maybe we can run it under
> valgrind, just in case we get some clue.

I'll attach the logs; I ran one 128-rank openmpi hello job with it.

==655483== LEAK SUMMARY:
==655483==    definitely lost: 1,926 bytes in 10 blocks
==655483==    indirectly lost: 22,239 bytes in 155 blocks
==655483==      possibly lost: 2,200,260 bytes in 20,600 blocks
==655483==    still reachable: 61,840 bytes in 584 blocks

> By the way, have you been able to reproduce it with valgrind for stepd as
> explained in comment 6?

I'll try to recompile slurmd today if nothing more urgent appears on my desk.

-Tommi

Created attachment 27209 [details]
openmpi based hello MPI run under valgrind
Created attachment 27210 [details]
slurmstepd vg logs
It seems that one also needs to add the --enable-memory-leak-debug configure option.
Tommi, Sorry it has been so long. I am taking this over and making a last effort to reproduce before we close this out. Have you upgraded Slurm since this ticket was opened, and are you still seeing the error? Are you using cgroup v1 or v2? Using 23.02 with impi 4.1.5 and cgroup v2, I am not seeing the error. Caden

Feel free to reach out if this continues to be an issue. Caden

Hello, It looks like we're still hitting this issue with Slurm version 22.05 while using cgroup/v2:

[root@c1101 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1101 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1101 ~]# grep SUn /proc/meminfo
SUnreclaim: 559296 kB
# test job
SUnreclaim: 565396 kB
SUnreclaim: 920844 kB
SUnreclaim: 1076020 kB
SUnreclaim: 1213476 kB
SUnreclaim: 1344720 kB
SUnreclaim: 1459300 kB
SUnreclaim: 1553724 kB
SUnreclaim: 1680260 kB
SUnreclaim: 1795104 kB
SUnreclaim: 1918156 kB
[root@c1101 ~]# grep SUn /proc/meminfo
SUnreclaim: 2035480 kB
[root@c1101 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1101 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1101 ~]# grep SUn /proc/meminfo
SUnreclaim: 1770208 kB

Best regards, Roope Jukkara

(In reply to Roope Jukkara from comment #31)
> Looks like we're still hitting this issue with Slurm version 22.05 while
> using cgroup/v2

To clarify, the issue also happens with cgroup/v1.

Best, Roope Jukkara

Hi, I think this bug report can be closed, because it's not Slurm's fault. Using the cgroup plugin merely triggers this leak, which seems to be in the Mellanox OFED 5.6 drivers. I just ran this test case on a node with Mellanox OFED 5.8 LTS, and unreclaimable memory does not grow fast like it did with MOFED 5.6. It could be this fix: https://docs.nvidia.com/networking/display/MLNXOFEDv583070LTS/Bug+Fixes+in+This+Version

3229002 — Description: Creating and deleting MRs caused a kernel slab cache leak issue. Keywords: RDMA, Cache. Discovered in Release: 5.7-1.0.2.0. Fixed in Release: 5.8-1.0.1.1.

Best Regards, Tommi Tervo, CSC