Tommi,

Can you please attach your slurmd log of the node where you tested this?

Thanks,
--Nate

Created attachment 12395 [details]
slurmd log
Test which I ran:
[ttervo@c1 ~]$ srun -n1 -p small --mem-per-cpu=100k --pty $SHELL
[ttervo@c1 ~]$ ./memtest
Enter number of int(4 byte) you want to allocate:360000000
Allocating 1440000000 bytes......
Filling int into memory.....
Sleep 60 seconds......
Free memory.
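The memtest source is not attached to this ticket, so the following is only a sketch of what such a malloc test presumably looks like, reconstructed from the messages above: read a count of ints, allocate them, touch every element so the pages become resident, sleep, then free. The actual program may differ.

/* Hypothetical reconstruction of the "memtest" helper used in this report. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long n = 0;

    printf("Enter number of int(4 byte) you want to allocate:");
    fflush(stdout);
    if (scanf("%ld", &n) != 1 || n <= 0)
        return 1;

    printf("Allocating %ld bytes......\n", n * (long) sizeof(int));
    int *buf = malloc((size_t) n * sizeof(int));
    if (!buf) {
        perror("malloc");
        return 1;
    }

    /* Writing to every element forces the pages to become resident (RSS). */
    printf("Filling int into memory.....\n");
    for (long i = 0; i < n; i++)
        buf[i] = (int) i;

    printf("Sleep 60 seconds......\n");
    sleep(60);

    printf("Free memory.\n");
    free(buf);
    return 0;
}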
(In reply to Tommi Tervo from comment #0)
> JobAcctGatherParams = UsePss,OverMemoryKill
> AccountingStorageEnforce = associations,limits,qos

Please also set "MemLimitEnforce=yes" in your slurm.conf.

----- bugs@schedmd.com wrote:
> https://bugs.schedmd.com/show_bug.cgi?id=8140
>
> --- Comment #3 from Nate Rini <nate@schedmd.com> ---
> Please also set "MemLimitEnforce=yes" in your slurm.conf.

Like I wrote, it is deprecated/removed in 19.05. It is in the config file but ignored:

[root@slurmctl ~]# grep -i memlim /etc/slurm/slurm.conf
MemLimitEnforce=YES
[root@slurmctl ~]# systemctl restart slurmctld
[root@slurmctl ~]# scontrol show config |grep -i memlim
[root@slurmctl ~]# echo $?
1

Is this a Cray Aries cluster?

(In reply to Nate Rini from comment #7)
> Is this a Cray Aries cluster?

No, CentOS 7 cluster.

-Tommi

Tommi,

I believe I have confirmed the bug. I will work on a patchset.

Thanks,
--Nate

Tommi,
Does this node have swap enabled?
> cat /proc/swaps
Thanks,
--Nate
> Does this node have swap enabled?
> > cat /proc/swaps
Hi,
Yes it has swap but swappiness seems to be zero:
[ttervo@c1 ~]$ cat /proc/sys/vm/swappiness
0
[ttervo@c1 ~]$ free
              total        used        free      shared  buff/cache   available
Mem:        2046892      165196     1705348       23160      176348     1706428
Swap:       1048572        2560     1046012
[ttervo@c1 ~]$ cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       1048572 2560    -2
(In reply to Tommi Tervo from comment #13)
> Yes it has swap but swappiness seems to be zero:
> [ttervo@c1 ~]$ cat /proc/swaps
> Used=2560

Looks like it is still getting used. In my testing, I found that the rlimit was set on the process and all the memory above the requested allocation went to swap. Since swapped out pages don't count against the memory RSS usage, Slurm was not killing the processes.

Is it possible to call this and try again?
> swapoff /dev/dm-1

Thanks,
--Nate

> Is it possible to call this and try again?
> > swapoff /dev/dm-1
Hi,
It did not have any effect:
[root@c1 ~]# swapoff -a
[root@c1 ~]# free
              total        used        free      shared  buff/cache   available
Mem:        2046892      173868     1681852       33400      191172     1685220
Swap:             0           0           0
[root@c1 ~]# logout
[ttervo@c1 ~]$ srun -n1 -p small --mem-per-cpu=100k --pty $SHELL
[ttervo@c1 ~]$ ./memtest
Enter number of int(4 byte) you want to allocate:360000000
Allocating 1440000000 bytes......
Filling int into memory.....
Sleep 60 seconds......
Free memory.
[ttervo@c1 ~]$ exit
exit
[ttervo@c1 ~]$ sacct -j 53 -oreqmem,maxrss
    ReqMem     MaxRSS
---------- ----------
       1Mc
       1Mc          0
       1Mc   1407896K
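Regarding the point above that swapped-out pages do not count against RSS: the sketch below is a minimal, standalone illustration (not part of Slurm's accounting code) of how a process's resident versus swapped memory can be read from /proc/self/status on Linux; pages that have been swapped out move from VmRSS to VmSwap, which is why an RSS-based sampler can miss them.

/* Print this process's own VmRSS and VmSwap counters from /proc/self/status.
 * Illustration only; assumes a Linux kernel that exposes the VmSwap field. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    char line[256];

    if (!fp) {
        perror("fopen");
        return 1;
    }

    while (fgets(line, sizeof(line), fp)) {
        /* Keep only the resident-set and swap usage lines. */
        if (!strncmp(line, "VmRSS:", 6) || !strncmp(line, "VmSwap:", 7))
            fputs(line, stdout);
    }

    fclose(fp);
    return 0;
}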
(In reply to Tommi Tervo from comment #15)
> > Is it possible to call this and try again?
> > > swapoff /dev/dm-1
> It did not have any effect:

Thanks for verifying. Working on a patch set now.

Tommi,

A patch is undergoing review; please tell me if you need it sooner than the normal review process allows.

Thanks,
--Nate

(In reply to Nate Rini from comment #24)
> Tommi,
>
> A patch is undergoing review; please tell me if you need it sooner than the
> normal review process allows.

Hi,

I could apply it to my test environment for additional testing.

Thanks,
Tommi

(In reply to Nate Rini from comment #22)
> Created attachment 12526 [details]
> patch

Please give it a try on your test system.

Created attachment 12544 [details]
slurmd log with patch 12526
Hi,
I have bad news: the patch did not help. I verified that the build is using the patched source code:
[root@slurmctl slurm-19.05.4]# grep -A1 'clone slurmctld config' /root/rpmbuild/BUILD/slurm-19.05.4/src/slurmd/slurmd/slurmd.c
/* clone slurmctld config into slurmd config */
conf->job_acct_oom_kill = slurmctld_conf.job_acct_oom_kill;
[ttervo@c1 ~]$ srun -n1 -p small --mem-per-cpu=100k --pty $SHELL
[ttervo@c1 ~]$ ./memtest
Enter number of int(4 byte) you want to allocate:360000000
Allocating 1440000000 bytes......
Filling int into memory.....
Sleep 60 seconds......
Free memory.
[ttervo@c1 ~]$ exit
[ttervo@c1 ~]$ sacct -j 56 -o maxrss,reqmem
    MaxRSS     ReqMem
---------- ----------
                  1Mc
      394K        1Mc
  1408230K        1Mc
Tommi,

Can you please verify that the slurmd daemon was fully restarted on the test node?
> [2019-12-12T10:15:24.354] error: Error binding slurm stream socket: Address already in use
> [2019-12-12T10:15:24.354] error: Unable to bind listen port (*:6818): Address already in use

Can you please verify that the cgroups are mounted on the node in the expected location?
> [2019-12-12T10:15:47.202] [55.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/cpuset' entry '/sys/fs/cgroup/cpuset/slurm/system' properties: No such file or directory
> [2019-12-12T10:15:47.202] [55.0] debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/memory' entry '/sys/fs/cgroup/memory/slurm/system' properties: No such file or directory

While the job is sleeping for 60 seconds, can you please call gcore against the attached slurmstepd and provide 't a a bt full' output from gdb?

Thanks,
--Nate

(In reply to Nate Rini from comment #28)
> Tommi,
>
> Can you please verify that the slurmd daemon was fully restarted on the test
> node?
> > [2019-12-12T10:15:24.354] error: Error binding slurm stream socket: Address already in use
> > [2019-12-12T10:15:24.354] error: Unable to bind listen port (*:6818): Address already in use

Doh, it seems that systemctl could not stop the old slurmd and I did not catch that in the verbose log; I only looked for the updated slurmd version string. After kill -9 `pidof slurmd` and systemctl start slurmd, OverMemoryKill works fine on my test system. Thanks for the fix.

Best Regards,
Tommi Tervo
CSC

Tommi,

This is now fixed upstream by 4edf4a5898a2944. Please reply if you have any questions or issues.

Thanks,
--Nate
Created attachment 12387 [details]
slurm and cgroup.confs

Hi,

I tried to test the old way of setting a memory limit instead of cgroup memory limits (which are a bit problematic for us), but on my test environment I could not get it working. I have set up the OverMemoryKill parameter and accounting, but a simple malloc test program can allocate more memory than the limit:

JobAcctGatherParams = UsePss,OverMemoryKill
AccountingStorageEnforce = associations,limits,qos

Here is an example run:

[ttervo@c1 ~]$ sacct -j 45 -omaxrss,reqmem,elapsed,exitcode
    MaxRSS     ReqMem    Elapsed ExitCode
---------- ---------- ---------- --------
                  1Mc   00:04:25      0:0
         0        1Mc   00:04:25      0:0
  1407988K        1Mc   00:04:25      0:0

There is also outdated information in the slurm.conf man page; it conflicts with the release notes:

man slurm.conf:
> MemLimitEnforce
> If set to yes then Slurm will terminate the job if it exceeds the value requested using the
> --mem-per-cpu option of salloc/sbatch/srun. This is useful in combination with
> JobAcctGatherParams=OverMemoryKill.

RELEASE_NOTES:
> NOTE: MemLimitEnforce parameter has been removed and the functionality that was provided
> with it has been merged into a JobAcctGatherParams.