Tommi,

Can you please attach your slurmd log of the node where you tested this?

Thanks,
--Nate

Created attachment 12395 [details]
slurmd log
Test which I ran:
[ttervo@c1 ~]$ srun -n1 -p small --mem-per-cpu=100k --pty $SHELL
[ttervo@c1 ~]$ ./memtest
Enter number of int(4 byte) you want to allocate:360000000
Allocating 1440000000 bytes......
Filling int into memory.....
Sleep 60 seconds......
Free memory.
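The memtest source is not attached to this ticket, so the following is only a sketch of what such a malloc test presumably looks like, reconstructed from the messages above: read a count of ints, allocate them, touch every element so the pages become resident, sleep, then free. The actual program may differ.

/* Hypothetical reconstruction of the "memtest" helper used in this report. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long n = 0;

    printf("Enter number of int(4 byte) you want to allocate:");
    fflush(stdout);
    if (scanf("%ld", &n) != 1 || n <= 0)
        return 1;

    printf("Allocating %ld bytes......\n", n * (long) sizeof(int));
    int *buf = malloc((size_t) n * sizeof(int));
    if (!buf) {
        perror("malloc");
        return 1;
    }

    /* Writing to every element forces the pages to become resident (RSS). */
    printf("Filling int into memory.....\n");
    for (long i = 0; i < n; i++)
        buf[i] = (int) i;

    printf("Sleep 60 seconds......\n");
    sleep(60);

    printf("Free memory.\n");
    free(buf);
    return 0;
}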
(In reply to Tommi Tervo from comment #0)
> JobAcctGatherParams = UsePss,OverMemoryKill
> AccountingStorageEnforce = associations,limits,qos

Please also set "MemLimitEnforce=yes" in your slurm.conf.

----- bugs@schedmd.com wrote:
> https://bugs.schedmd.com/show_bug.cgi?id=8140
>
> --- Comment #3 from Nate Rini <nate@schedmd.com> ---
> Please also set "MemLimitEnforce=yes" in your slurm.conf.

Like I wrote, it is deprecated/removed in 19.05. It is in the config file but ignored:

[root@slurmctl ~]# grep -i memlim /etc/slurm/slurm.conf
MemLimitEnforce=YES
[root@slurmctl ~]# systemctl restart slurmctld
[root@slurmctl ~]# scontrol show config |grep -i memlim
[root@slurmctl ~]# echo $?
1

Is this a Cray Aries cluster?

(In reply to Nate Rini from comment #7)
> Is this a Cray Aries cluster?

No, CentOS 7 cluster.

-Tommi

Tommi,

I believe I have confirmed the bug. I will work on a patchset.

Thanks,
--Nate

Tommi,
Does this node have swap enabled?
> cat /proc/swaps
Thanks,
--Nate
> Does this node have swap enabled?
> > cat /proc/swaps
Hi,
Yes it has swap but swappiness seems to be zero:
[ttervo@c1 ~]$ cat /proc/sys/vm/swappiness
0
[ttervo@c1 ~]$ free
              total        used        free      shared  buff/cache   available
Mem:        2046892      165196     1705348       23160      176348     1706428
Swap:       1048572        2560     1046012
[ttervo@c1 ~]$ cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       1048572 2560    -2
(In reply to Tommi Tervo from comment #13)
> Yes it has swap but swappiness seems to be zero:
> [ttervo@c1 ~]$ cat /proc/swaps
> Used=2560

Looks like it is still getting used. In my testing, I found that the rlimit was set on the process and all the memory above the requested allocation went to swap. Since swapped out pages don't count against the memory RSS usage, Slurm was not killing the processes.

Is it possible to call this and try again?
> swapoff /dev/dm-1

Thanks,
--Nate

> Is it possible to call this and try again?
> > swapoff /dev/dm-1
Hi,
It did not have any effect:
[root@c1 ~]# swapoff -a
[root@c1 ~]# free
              total        used        free      shared  buff/cache   available
Mem:        2046892      173868     1681852       33400      191172     1685220
Swap:             0           0           0
[root@c1 ~]# logout
[ttervo@c1 ~]$ srun -n1 -p small --mem-per-cpu=100k --pty $SHELL
[ttervo@c1 ~]$ ./memtest
Enter number of int(4 byte) you want to allocate:360000000
Allocating 1440000000 bytes......
Filling int into memory.....
Sleep 60 seconds......
Free memory.
[ttervo@c1 ~]$ exit
exit
[ttervo@c1 ~]$ sacct -j 53 -oreqmem,maxrss
    ReqMem     MaxRSS
---------- ----------
       1Mc
       1Mc          0
       1Mc   1407896K
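Regarding the point above that swapped-out pages do not count against RSS: the sketch below is a minimal, standalone illustration (not part of Slurm's accounting code) of how a process's resident versus swapped memory can be read from /proc/self/status on Linux; pages that have been swapped out move from VmRSS to VmSwap, which is why an RSS-based sampler can miss them.

/* Print this process's own VmRSS and VmSwap counters from /proc/self/status.
 * Illustration only; assumes a Linux kernel that exposes the VmSwap field. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    char line[256];

    if (!fp) {
        perror("fopen");
        return 1;
    }

    while (fgets(line, sizeof(line), fp)) {
        /* Keep only the resident-set and swap usage lines. */
        if (!strncmp(line, "VmRSS:", 6) || !strncmp(line, "VmSwap:", 7))
            fputs(line, stdout);
    }

    fclose(fp);
    return 0;
}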
(In reply to Tommi Tervo from comment #15)
> > Is it possible to call this and try again?
> > > swapoff /dev/dm-1
> It did not have any effect:

Thanks for verifying. Working on a patch set now.

Tommi,

A patch is undergoing review; please tell me if you need it sooner than the normal review process allows.

Thanks,
--Nate

(In reply to Nate Rini from comment #24)
> Tommi,
>
> A patch is undergoing review; please tell me if you need it sooner than the
> normal review process allows.

Hi,

I could apply it to my test environment for additional testing.

Thanks,
Tommi

(In reply to Nate Rini from comment #22)
> Created attachment 12526 [details]
> patch

Please give it a try on your test system.

Created attachment 12544 [details]
slurmd log with patch 12526
Hi,
I have bad news: the patch did not help. I verified that the build is using the patched source code:
[root@slurmctl slurm-19.05.4]# grep -A1 'clone slurmctld config' /root/rpmbuild/BUILD/slurm-19.05.4/src/slurmd/slurmd/slurmd.c
/* clone slurmctld config into slurmd config */
conf->job_acct_oom_kill = slurmctld_conf.job_acct_oom_kill;
[ttervo@c1 ~]$ srun -n1 -p small --mem-per-cpu=100k --pty $SHELL
[ttervo@c1 ~]$ ./memtest
Enter number of int(4 byte) you want to allocate:360000000
Allocating 1440000000 bytes......
Filling int into memory.....
Sleep 60 seconds......
Free memory.
[ttervo@c1 ~]$ exit
[ttervo@c1 ~]$ sacct -j 56 -o maxrss,reqmem
    MaxRSS     ReqMem
---------- ----------
                  1Mc
      394K        1Mc
  1408230K        1Mc
Tommi,

Can you please verify that the slurmd daemon was fully restarted on the test node?
> [2019-12-12T10:15:24.354] error: Error binding slurm stream socket: Address already in use
> [2019-12-12T10:15:24.354] error: Unable to bind listen port (*:6818): Address already in use

Can you please verify that the cgroups are mounted on the node in the expected location?
> [2019-12-12T10:15:47.202] [55.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/cpuset' entry '/sys/fs/cgroup/cpuset/slurm/system' properties: No such file or directory
> [2019-12-12T10:15:47.202] [55.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/memory' entry '/sys/fs/cgroup/memory/slurm/system' properties: No such file or directory

While the job is sleeping for 60 seconds, can you please call gcore against the attached slurmstepd and provide 't a a bt full' output from gdb?

Thanks,
--Nate

(In reply to Nate Rini from comment #28)
> Tommi,
>
> Can you please verify that the slurmd daemon was fully restarted on the test
> node?
> > [2019-12-12T10:15:24.354] error: Error binding slurm stream socket: Address already in use
> > [2019-12-12T10:15:24.354] error: Unable to bind listen port (*:6818): Address already in use

Doh, it seems that systemctl could not stop the old slurmd and I did not catch that in the verbose log; I only looked for the updated slurmd version string. After kill -9 `pidof slurmd` and systemctl start slurmd, OverMemoryKill works fine on my test system. Thanks for the fix.

Best Regards,
Tommi Tervo
CSC

Tommi,

This is now fixed upstream by 4edf4a5898a2944. Please reply if you have any questions or issues.

Thanks,
--Nate
Created attachment 12387 [details]
slurm and cgroup.confs

Hi,

I tried to test the old way of setting a memory limit instead of cgroup memory limits (which are a bit problematic for us), but on my test environment I could not get it working. I have set up the OverMemoryKill parameter and accounting, but a simple malloc test program can allocate more memory than the limit:

JobAcctGatherParams = UsePss,OverMemoryKill
AccountingStorageEnforce = associations,limits,qos

Here is an example run:

[ttervo@c1 ~]$ sacct -j 45 -omaxrss,reqmem,elapsed,exitcode
    MaxRSS     ReqMem    Elapsed ExitCode
---------- ---------- ---------- --------
                  1Mc   00:04:25      0:0
         0        1Mc   00:04:25      0:0
  1407988K        1Mc   00:04:25      0:0

There is also outdated information in the slurm.conf man page; it conflicts with the release notes:

man slurm.conf:
> MemLimitEnforce
> If set to yes then Slurm will terminate the job if it exceeds the value requested using the
> --mem-per-cpu option of salloc/sbatch/srun. This is useful in combination with
> JobAcctGatherParams=OverMemoryKill.

RELEASE_NOTES:
> NOTE: MemLimitEnforce parameter has been removed and the functionality that was provided
> with it has been merged into a JobAcctGatherParams.