Created attachment 26749 [details]
slurm.conf

Hi,

I changed the jobacct_gather plugin from linux to cgroup due to this problem:
https://bugs.schedmd.com/show_bug.cgi?id=14179

But it seems that the jobacct_gather/cgroup plugin leaks memory on every job step, and worse, the memory is unreclaimable. Each simple 128-rank MPI hello step leaks >100 MB, so the node will eventually run out of memory.

[root@c1234 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1234 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:     4523352 kB
[root@c1234 ~]# perl -pi -e 's,jobacct_gather/cgroup,jobacct_gather/linux,' /etc/slurm/slurm.conf
[root@c1234 ~]# systemctl restart slurmd

# run 10 MPI-hello steps with the linux plugin:
#SBATCH --ntasks-per-node=128
for i in {1..10}; do grep SU /proc/meminfo; srun ~/mpi/ompi_hello > /dev/null ; done

SUnreclaim:     4521944 kB
SUnreclaim:     4676624 kB
SUnreclaim:     4698292 kB
SUnreclaim:     4716888 kB
SUnreclaim:     4734068 kB
SUnreclaim:     4745232 kB
SUnreclaim:     4743288 kB
SUnreclaim:     4737748 kB
SUnreclaim:     4744592 kB
SUnreclaim:     4758236 kB

[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:     4767132 kB
[root@c1234 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1234 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:     4540208 kB
# Memory usage went back to the original value

[root@c1234 ~]# perl -pi -e 's,jobacct_gather/linux,jobacct_gather/cgroup,' /etc/slurm/slurm.conf
[root@c1234 ~]# systemctl restart slurmd
[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:     4545180 kB

# run 10 MPI-hello steps with the cgroup plugin:
#SBATCH --ntasks-per-node=128
for i in {1..10}; do grep SU /proc/meminfo; srun ~/mpi/ompi_hello > /dev/null ; done

SUnreclaim:     4548192 kB
SUnreclaim:     4805052 kB
SUnreclaim:     4936492 kB
SUnreclaim:     5065328 kB
SUnreclaim:     5192264 kB
SUnreclaim:     5313412 kB
SUnreclaim:     5419656 kB
SUnreclaim:     5517932 kB
SUnreclaim:     5635436 kB
SUnreclaim:     5757160 kB

[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:     5875204 kB
[root@c1234 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1234 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1234 ~]# grep SUn /proc/meminfo
SUnreclaim:     5625956 kB
# Unreclaimable memory usage grew by over 1 GB
Hi Tommi,

Thanks for the detailed example. Could you post what kernel version you are using?

Thanks,
Albert
Hi,

This is a quite standard RHEL 8.6 compute node:
kernel-4.18.0-372.19.1.el8_6.x86_64
mlnx-ofa_kernel-5.6-OFED.5.6.1.0.3.1.rhel8u6.x86_64

I also tested with a RHEL 8.5 based image, so this does not seem to be a recent regression.

BR,
Tommi
I forgot to ask: is it possible to attach valgrind to slurmstepd? I already checked slurmd for memory leaks, but that did not reveal anything interesting.
I tried without these parameters, which I thought could be causing the leak, but SUnreclaim memory still grows:

LaunchParameters=slurmstepd_memlock
SLURMD_OPTIONS="-M"

-Tommi
Hi Tommi,

So far I've not been able to reproduce your issue.

> Forgot to ask if it's possible to attach valgrind to slurmstepd, I already
> tried to check slurmd memleaks but that did not reveal anything interesting?

Yes, although it's meant to be used only by developers, there is a way to run slurmstepds under valgrind. You have to manually edit the src/slurmd/common/slurmstepd_init.h file and uncomment the desired SLURMSTEPD_MEMCHECK value, rebuild Slurm, and restart slurmd. Then, when slurmd starts a new slurmstepd, the stepd runs within a valgrind call that saves the valgrind output into a file, /tmp/slurmstepd_valgrind_$jobid.$step. Please note that this is not meant to be used in any production environment.

> I tried without these parameters which could cause this leak but SUnreclaim
> memory still grows.
>
> LaunchParameters=slurmstepd_memlock
> SLURMD_OPTIONS="-M"

Good try. As mentioned, I'm not able to reproduce it, so it seems tied to your environment. If you are able to reproduce the issue in a test environment and can get the valgrind output, please share it. Also attach the slurmd logs. I'll keep investigating, though.

Thanks,
Albert
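For illustration, the uncomment step described above could be scripted. The exact form of the commented-out define is an assumption here (a dummy header file is used), so verify the pattern against the real src/slurmd/common/slurmstepd_init.h before relying on it:

```shell
# Demonstration on a dummy header only -- the real define in
# src/slurmd/common/slurmstepd_init.h may look different.
mkdir -p /tmp/stepd_demo
cat > /tmp/stepd_demo/slurmstepd_init.h <<'EOF'
/* #define SLURMSTEPD_MEMCHECK 1 */
EOF

# Uncomment the MEMCHECK define (assumed C block-comment style).
sed -i 's|/\* *\(#define SLURMSTEPD_MEMCHECK 1\) *\*/|\1|' \
    /tmp/stepd_demo/slurmstepd_init.h

cat /tmp/stepd_demo/slurmstepd_init.h
```

After editing the real header in the Slurm source tree, rebuild and reinstall Slurm and restart slurmd so new stepds pick up the valgrind wrapper.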
Tommi,

If you can reproduce the issue, could you monitor the node with "slabtop" and attach the output showing which caches are growing/leaking? And please attach the slurmd logs.

Regards,
Albert
Tommi,

Some extra questions:

- Could you reproduce the problem without MPI? That is, running the same amount of tasks, but with a simple Linux command like sleep or hostname.
- Could you attach your cgroup.conf and the output of "scontrol show config"?

Thanks,
Albert
Hi,

We have a service break ongoing, so I'll answer the questions later this week.

cgroup.conf:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainDevices=yes
ConstrainCores=yes
#Taskaffinity is handled by cpusets
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainKmemSpace=no
AllowedKmemSpace=250131589120
MinKmemSpace=200000
AllowedSwapSpace=0
Hi Tommi,

> We do have a service break ongoing so I'll answer questions later on this
> week.

Sorry about that. I hope you can restore it soon and easily!

> cgroup.conf:
> ConstrainKmemSpace=no
> AllowedKmemSpace=250131589120
> MinKmemSpace=200000

I'm looking for clues, so could you try commenting out the above options and see if you still reproduce the issue?

> CgroupMountpoint=/sys/fs/cgroup
> CgroupAutomount=yes

These two are totally fine and unrelated to your issue, but I would also recommend commenting them out in general.

> CgroupReleaseAgentDir="/etc/slurm/cgroup"

Also unrelated, but I recommend removing this one because it has been deprecated since 17.11 and the daemon will fatal if it is still present in the config from 22.05 on.

Regards,
Albert
Hi,

I cleaned up cgroup.conf and tested with orterun instead of srun; it seems that the massive leak happens only with srun. I also tried with ConstrainRAMSpace=no, with the same result.

echo "running with orterun"
for i in {1..10}; do grep SU /proc/meminfo; orterun ~/mpi/ompi_hello > /dev/null ; done

echo "running with srun"
for i in {1..10}; do grep SU /proc/meminfo; srun ~/mpi/ompi_hello > /dev/null ; done

running with orterun
SUnreclaim:     3313232 kB
SUnreclaim:     3340752 kB
SUnreclaim:     3349004 kB
SUnreclaim:     3356880 kB
SUnreclaim:     3361200 kB
SUnreclaim:     3363324 kB
SUnreclaim:     3366520 kB
SUnreclaim:     3370948 kB
SUnreclaim:     3372088 kB
SUnreclaim:     3373188 kB
running with srun
SUnreclaim:     3374832 kB
SUnreclaim:     3498664 kB
SUnreclaim:     3616456 kB
SUnreclaim:     3731868 kB
SUnreclaim:     3845192 kB
SUnreclaim:     3953904 kB
SUnreclaim:     4060096 kB
SUnreclaim:     4176536 kB
SUnreclaim:     4288844 kB
SUnreclaim:     4397744 kB

[root@c1336 ~]# grep -v ^# /etc/slurm/cgroup.conf
ConstrainDevices=yes
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes

 Active / Total Objects (% used)    : 9146172 / 9184470 (99,6%)
 Active / Total Slabs (% used)      : 208823 / 208823 (100,0%)
 Active / Total Caches (% used)     : 181 / 263 (68,8%)
 Active / Total Size (% used)       : 4564604,39K / 4573236,27K (99,8%)
 Minimum / Average / Maximum Object : 0,01K / 0,50K / 10,00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
2165696 2162367  99%    0,50K  33839       64   1082848K kmalloc-512
1142652 1139488  99%    0,09K  27206       42    108824K kmalloc-96
1060144 1059712  99%    2,00K  66259       16   2120288K kmalloc-2k
 568064  565148  99%    0,03K   4438      128     17752K kmalloc-32
 405960  398510  98%    0,13K   6766       60     54128K kernfs_node_cache
 315490  315490 100%    0,23K   4507       70     72112K vm_area_struct
 302260  299189  98%    0,02K   1778      170      7112K lsm_inode_cache
Hi Tommi,

Thanks for the effort.

> I cleaned cgroup.conf and tested with orterun intead of srun, seems that
> massive leak happens only with srun. Tried also with ConstrainRAMSpace=no,
> same result.
>
> echo "running with orterun"
> for i in {1..10}; do grep SU /proc/meminfo; orterun ~/mpi/ompi_hello > /dev/null ; done
>
> echo "running with srun"
> for i in {1..10}; do grep SU /proc/meminfo; srun ~/mpi/ompi_hello > /dev/null ; done
>
> running with orterun
> SUnreclaim:     3313232 kB
> SUnreclaim:     3340752 kB
> SUnreclaim:     3349004 kB
> SUnreclaim:     3356880 kB
> SUnreclaim:     3361200 kB
> SUnreclaim:     3363324 kB
> SUnreclaim:     3366520 kB
> SUnreclaim:     3370948 kB
> SUnreclaim:     3372088 kB
> SUnreclaim:     3373188 kB
> running with srun
> SUnreclaim:     3374832 kB
> SUnreclaim:     3498664 kB
> SUnreclaim:     3616456 kB
> SUnreclaim:     3731868 kB
> SUnreclaim:     3845192 kB
> SUnreclaim:     3953904 kB
> SUnreclaim:     4060096 kB
> SUnreclaim:     4176536 kB
> SUnreclaim:     4288844 kB
> SUnreclaim:     4397744 kB

Just to be sure:

- I assume that this is a batch script submitted with a command like "sbatch -N1 -n 128", right?
- For both, with srun and orterun, right?
- Do you get the same results using salloc instead of sbatch?

Could you also try using "hostname" or "sleep 10" instead of ~/mpi/ompi_hello? I want to totally rule out this being MPI related.
> Active / Total Objects (% used)    : 9146172 / 9184470 (99,6%)
> Active / Total Slabs (% used)      : 208823 / 208823 (100,0%)
> Active / Total Caches (% used)     : 181 / 263 (68,8%)
> Active / Total Size (% used)       : 4564604,39K / 4573236,27K (99,8%)
> Minimum / Average / Maximum Object : 0,01K / 0,50K / 10,00K
>
>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 2165696 2162367  99%    0,50K  33839       64   1082848K kmalloc-512
> 1142652 1139488  99%    0,09K  27206       42    108824K kmalloc-96
> 1060144 1059712  99%    2,00K  66259       16   2120288K kmalloc-2k
>  568064  565148  99%    0,03K   4438      128     17752K kmalloc-32
>  405960  398510  98%    0,13K   6766       60     54128K kernfs_node_cache
>  315490  315490 100%    0,23K   4507       70     72112K vm_area_struct
>  302260  299189  98%    0,02K   1778      170      7112K lsm_inode_cache

I think that we'll need more debug info. Could you run a batch like this as root/sudoer to obtain more detailed slabinfo:

for i in {0..9}; do sudo cat /proc/slabinfo > ./slabinfo.${i}; srun ./mpi_hello > /dev/null ; done

Thanks,
Albert
Created attachment 27168 [details]
Slabinfos
> > - I assume that this is a batch script submitted with a command like "sbatch
> > -N1 -n 128", right?
> > - For both, with srun and orterun, right?

Yes, it was the same job; the first 10 steps were orterun and the second 10 steps were srun.

> > - Do you get the same results using salloc instead of sbatch?

Yes, salloc + srun leaks as well.

> > Could you also try using "hostname" or "sleep 10" instead of
> > ~/mpi/ompi_hello?
> > I want to totally discard this to be MPI related.

With srun hostname, memory usage looks quite stable. I also tested an mpich hello_mpi, and it caused the leak as well.

> > I think that we'll need more debug info.
> > Could you run a bacth like this as root/sudoer to obtain more detailed
> > slabinfo:
> >
> > for i in {0..9}; do sudo cat /proc/slabinfo > ./slabinfo.${i}; srun
> > ./mpi_hello > /dev/null ; done

Output attached; the biggest differences look to be in kmalloc-512/kmalloc-2k.

Best Regards,
Tommi
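The per-cache growth reported above can be computed by diffing consecutive /proc/slabinfo snapshots. A minimal sketch follows; the two sample snapshots are fabricated for illustration (in practice the inputs would be the slabinfo.N dumps from the loop in the previous comment):

```shell
# Create two tiny fabricated slabinfo-style snapshots: two header lines,
# then "name active_objs num_objs objsize ..." columns, as in /proc/slabinfo.
cat > /tmp/slabinfo.before <<'EOF'
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize>
kmalloc-512 100 100 512
kmalloc-96 50 50 96
EOF
cat > /tmp/slabinfo.after <<'EOF'
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize>
kmalloc-512 300 300 512
kmalloc-96 60 60 96
EOF

# Rank caches by active-object growth between the two snapshots:
# pass 1 (NR==FNR) records the "before" counts, pass 2 prints the deltas.
awk 'NR==FNR { if (FNR > 2) before[$1] = $2; next }
     FNR > 2 { delta = $2 - before[$1];
               if (delta > 0) printf "%-20s %10d\n", $1, delta }' \
    /tmp/slabinfo.before /tmp/slabinfo.after | sort -k2 -rn
```

On the fabricated data this lists kmalloc-512 first (delta 200), matching the kind of kmalloc-512/kmalloc-2k growth seen in the attached dumps.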
Tommi,

> > > - I assume that this is a batch script submitted with a command like "sbatch
> > > -N1 -n 128", right?
> > > - For both, with srun and orterun, right?
>
> Yes, it was same job, first 10 steps was orteruns and second 10 steps were
> from sruns.
>
> > > - Do you get the same results using salloc instead of sbatch?
>
> Yes, salloc + srun leaks as well

Thanks for the confirmation.

> > > Could you also try using "hostname" or "sleep 10" instead of
> > > ~/mpi/ompi_hello?
> > > I want to totally discard this to be MPI related.
>
> With srun hostname memory usage looks quite stable. Tested also mpich
> hello_mpi and it caused leak as well.

Maybe "hostname" is too quick to trigger the issue; could you try sleep or some non-MPI command that takes a few seconds of computation?

> > > I think that we'll need more debug info.
> > > Could you run a bacth like this as root/sudoer to obtain more detailed
> > > slabinfo:
> > >
> > > for i in {0..9}; do sudo cat /proc/slabinfo > ./slabinfo.${i}; srun
> > > ./mpi_hello > /dev/null ; done
>
> Ouput attached, biggest differences looks to be on kmalloc-512/kmalloc-2k

I need to look deeper into them.

As an extra check, although your kernel is not impacted by this old issue (https://bugzilla.redhat.com/show_bug.cgi?id=1507149), could you check whether you still reproduce the issue if you boot your node with cgroup.memory=nokmem (GRUB_CMDLINE_LINUX)?

Also, could you attach your slurmd logs?
Adding the cgroup.memory=nokmem boot option did not make any difference.

Running sleep does not trigger the leak:

for i in {1..10}; do grep SU /proc/meminfo; srun sleep 15 > /dev/null ; done

SUnreclaim:     2917308 kB
SUnreclaim:     3094960 kB
SUnreclaim:     3113032 kB
SUnreclaim:     3128524 kB
SUnreclaim:     3143072 kB
SUnreclaim:     3150400 kB
SUnreclaim:     3147296 kB
SUnreclaim:     3141252 kB
SUnreclaim:     3141104 kB
SUnreclaim:     3150368 kB

At first glance it looks like a leak is present, but memory usage returns to the original level when the slabs are shrunk:

[root@c2362 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
SUnreclaim:     2920240 kB
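The shrink-and-drop sequence used repeatedly in this thread can be bundled into a small helper. A sketch under stated assumptions: the function name is invented here, it requires root and a kernel exposing /sys/kernel/slab, so it is only defined below, not invoked:

```shell
# Hypothetical helper wrapping the reclaim sequence from this thread:
# shrink every slab cache, drop caches, and report SUnreclaim before/after.
# Requires root; only use on nodes where dropping caches is acceptable.
reclaim_slabs() {
    grep SUnreclaim /proc/meminfo        # baseline reading
    for f in $(find /sys/kernel/slab -name shrink -type f); do
        echo 1000 > "$f"                 # ask each cache to shrink
    done
    echo 3 > /proc/sys/vm/drop_caches    # drop pagecache, dentries, inodes
    grep SUnreclaim /proc/meminfo        # post-reclaim reading
}
```

Running `reclaim_slabs` as root before and after a batch of job steps distinguishes a true leak from reclaimable slab growth: if SUnreclaim returns to its baseline, the growth was cache pressure rather than leaked memory.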
Created attachment 27190 [details]
Slurmd logs
Hi Tommi,

> Adding cgroup.memory=nokmem boot option did not make any difference.

Thanks for the check.

> Running sleep does not trigger leak:
> for i in {1..10}; do grep SU /proc/meminfo; srun sleep 15 > /dev/null ; done
>
> SUnreclaim:     2917308 kB
> SUnreclaim:     3094960 kB
> SUnreclaim:     3113032 kB
> SUnreclaim:     3128524 kB
> SUnreclaim:     3143072 kB
> SUnreclaim:     3150400 kB
> SUnreclaim:     3147296 kB
> SUnreclaim:     3141252 kB
> SUnreclaim:     3141104 kB
> SUnreclaim:     3150368 kB
>
> For a first glance it looks that leak is present but memory usage will go to
> original level when slabs are shrink.
>
> [root@c2362 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do
> echo 1000 > $f ; done
> SUnreclaim:     2920240 kB

Yes, I assume that that's not a leak but the kernel doing normal operation and taking its time to free some space.

So it seems that to trigger it we need a specific combination of MPI + Slurm + cgroup; none of them individually seems to trigger it.

What MPI version are you using?

I assume that the srun command is not related, but maybe we can run it under valgrind, just in case we get some clue. By the way, have you been able to reproduce it with valgrind for the stepd as explained in comment 6?

Thanks,
Albert
> What MPI version are you using?

"mpich/4.0.1" and "openmpi/4.1.2".

> I assume that the srun command is not related, but maybe we can run it under
> valgrind, just in case we get some clue.

I'll attach the logs; I ran one 128-rank openmpi hello job under it.

==655483== LEAK SUMMARY:
==655483==    definitely lost: 1,926 bytes in 10 blocks
==655483==    indirectly lost: 22,239 bytes in 155 blocks
==655483==      possibly lost: 2,200,260 bytes in 20,600 blocks
==655483==    still reachable: 61,840 bytes in 584 blocks

> By the way, have you been able to reproduce it with valgrind for stepd as
> explained in comment 6?

I'll try to recompile slurmd today if nothing more urgent appears on my desk.

-Tommi
Created attachment 27209 [details]
openmpi based hello MPI run under valgrind
Created attachment 27210 [details]
slurmstepd vg logs

It seems that one also needs to add the --enable-memory-leak-debug configure option.
Tommi,

Sorry it has been so long. I am taking this over and making a last effort to reproduce before we close this out.

Have you upgraded Slurm since this ticket was opened, and are you still seeing the error? Are you using cgroup v1 or v2?

Using 23.02 with impi 4.1.5 and cgroup/v2, I am not seeing the error.

Caden
Feel free to reach out if this continues to be an issue.

Caden
Hello,

It looks like we're still hitting this issue with Slurm version 22.05 while using cgroup/v2:

[root@c1101 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1101 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1101 ~]# grep SUn /proc/meminfo
SUnreclaim:      559296 kB

# test job
SUnreclaim:      565396 kB
SUnreclaim:      920844 kB
SUnreclaim:     1076020 kB
SUnreclaim:     1213476 kB
SUnreclaim:     1344720 kB
SUnreclaim:     1459300 kB
SUnreclaim:     1553724 kB
SUnreclaim:     1680260 kB
SUnreclaim:     1795104 kB
SUnreclaim:     1918156 kB

[root@c1101 ~]# grep SUn /proc/meminfo
SUnreclaim:     2035480 kB
[root@c1101 ~]# for f in $(find /sys/kernel/slab -name shrink -type f) ;do echo 1000 > $f ; done
[root@c1101 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@c1101 ~]# grep SUn /proc/meminfo
SUnreclaim:     1770208 kB

Best regards,
Roope Jukkara
(In reply to Roope Jukkara from comment #31)
> Looks like we're still hitting this issue with Slurm version 22.05 while
> using cgroup/v2

To clarify, the issue also happens with cgroup/v1.

Best,
Roope Jukkara
Hi,

I think this bug report can be closed, because it is not Slurm's fault. Using the cgroup plugin merely triggers this leak, which seems to be in the Mellanox OFED 5.6 drivers. I just ran this test case on a node with Mellanox OFED 5.8 LTS, and unreclaimable memory does not grow fast like it did with MOFED 5.6.

It could be this fix:
https://docs.nvidia.com/networking/display/MLNXOFEDv583070LTS/Bug+Fixes+in+This+Version

3229002
Description: Creating and deleting MRs caused a kernel slab cache leak issue.
Keywords: RDMA, Cache
Discovered in Release: 5.7-1.0.2.0
Fixed in Release: 5.8-1.0.1.1

Best Regards,
Tommi Tervo
CSC