Hi, I'm trying to set up cgroup support to prevent user jobs from using any swap at all. It mostly works, but not completely. I have the following /etc/slurm/cgroup.conf:

# general
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

# task/cgroup plugin
TaskAffinity=yes            # requires hwloc
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes      # prevent jobs from using swap space
AllowedRAMSpace=100         # in %
AllowedSwapSpace=0          # in %

and in slurm.conf:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

The memory limitation part seems to work pretty well overall, but I can still find jobs that make use of swap space:

# pwd
/cgroup/memory/slurm/uid_15248/job_322107/step_4294967294

# grep 322107 /var/log/slurm/slurmd.log
[2014-09-09T12:44:01.613] Launching batch job 322107 for UID 15248
[2014-09-09T12:44:01.630] [322107] checkpoint/blcr init
[2014-09-09T12:44:01.663] [322107] task/cgroup: /slurm/uid_15248/job_322107: alloc=20000MB mem.limit=20000MB memsw.limit=20000MB
[2014-09-09T12:44:01.663] [322107] task/cgroup: /slurm/uid_15248/job_322107/step_4294967294: alloc=20000MB mem.limit=20000MB memsw.limit=20000MB

And then I have:

memory.limit_in_bytes            20971520000
memory.usage_in_bytes            20851781632
memory.max_usage_in_bytes        20853907456
memory.failcnt                   0
memory.memsw.limit_in_bytes      20971520000
memory.memsw.usage_in_bytes      20970475520
memory.memsw.max_usage_in_bytes  20971520000
memory.memsw.failcnt             103751

So it looks like the memsw limit has been hit a number of times, and yet the process is still running.

# cat cgroup.procs
14812
14827
14831
14834
14838

# grep VmSwap /proc/14834/status
VmSwap:   119328 kB

This PID definitely uses some swap. So, I was wondering if this is all normal, or if there is a way to really prevent a user process from using any swap at all. There is a memory.swappiness control file in the cgroup, but I don't think it can be set from Slurm.

Thanks.
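Since Slurm itself doesn't expose memory.swappiness, one workaround would be to set it from a slurmd prolog script once the job cgroup exists. A rough sketch, assuming the /cgroup/memory/slurm/uid_<uid>/job_<jobid> layout shown above; the cgroup_path helper is hypothetical and the prolog environment variables (SLURM_JOB_UID, SLURM_JOB_ID) may need checking against your Slurm version:

```shell
#!/bin/bash
# Hypothetical prolog sketch: set memory.swappiness=0 on the job's
# memory cgroup so the kernel avoids swapping that cgroup's pages out.

cgroup_path() {
    # Build the job cgroup path from a uid and a job id,
    # following the hierarchy layout shown in the slurmd log above.
    local uid=$1 jobid=$2
    echo "/cgroup/memory/slurm/uid_${uid}/job_${jobid}"
}

dir=$(cgroup_path "$SLURM_JOB_UID" "$SLURM_JOB_ID")
# Only write if the cgroup actually exists (it won't outside a job).
if [ -d "$dir" ]; then
    echo 0 > "$dir/memory.swappiness"
fi
```

Note that even with swappiness at 0, some kernels may still swap under heavy global memory pressure, so this reduces rather than strictly forbids swap use.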
Hi Kilian, I am working on this and will update you later on. David
Hi Kilian,

Slurm sets the limits correctly. Since memory.memsw.limit_in_bytes indicates the combined memory and swap limit, if you set AllowedSwapSpace=0 then the two values should indeed be equal. From Slurm's perspective everything is all right. It is more difficult for me to tell you why your kernel allowed some swap space use regardless. I am running a heavy memory benchmark with settings like yours, but I don't see swap being used. I see the memory limit being hit a few times:

>cat memory.failcnt
3191527

but not the swap limit. I am running on:

>cat /etc/redhat-release
CentOS release 6.5 (Final)
>uname -a
Linux prometeo 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

David
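The relation between the two limits can be illustrated with a little arithmetic: memsw.limit is roughly mem.limit scaled by (100 + AllowedSwapSpace) / 100, so with AllowedSwapSpace=0 the two limits coincide. A sketch (the function is illustrative, not Slurm's exact internal formula):

```shell
#!/bin/bash
# Illustrative relation between mem.limit and memsw.limit:
# memsw_limit = mem_limit * (100 + AllowedSwapSpace) / 100

memsw_limit() {
    local mem_limit=$1 allowed_swap_pct=$2
    echo $(( mem_limit * (100 + allowed_swap_pct) / 100 ))
}

# With AllowedSwapSpace=0, the combined limit equals the RAM limit,
# matching the 20971520000-byte values in the cgroup above.
memsw_limit 20971520000 0
```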
Hi David,

Thanks for looking into it.

(In reply to David Bigagli from comment #2)
> Slurm sets the limits correctly. Since the
> memory.memsw.limit_in_bytes
> indicates the combined memory and swap limit if you set AllowedSwapSpace=0
> then the values should indeed be equal. From Slurm perspective the things
> are all right.

Yes, memory.limit_in_bytes and memory.memsw.limit_in_bytes are the same value, which is good.

> It is more difficult for me tell you why your kernel allowed some swap space
> use regardless... I am running a heavy memory benchmark with settings like
> your but I don't see swap being in use. I see the memory limit being hit few
> times
>
> >cat memory.failcnt
> 3191527
>
> but not the swap.

I think what happens here is that some memory pages get swapped out to disk while the overall usage is still under the limit, perhaps due to memory pressure from other jobs in different cgroups. So we have memsw.usage < limit and memory.usage < limit, but memsw.usage > memory.usage. I guess I'm looking for a way to ensure that memsw.usage stays equal to memory.usage at all times, but I'm not sure that's even possible.

> I am running on:
>
> >cat /etc/redhat-release
> CentOS release 6.5 (Final)
> >uname -a
> Linux prometeo 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014
> x86_64 x86_64 x86_64 GNU/Linux
>
> David
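The swap actually in use by a cgroup can be read directly from the two usage counters: it is memory.memsw.usage_in_bytes minus memory.usage_in_bytes, which is why memsw.usage can exceed memory.usage even with both counters under the limit. A quick sketch, using the counter values from the job above:

```shell
#!/bin/bash
# Swap in use by a memory cgroup = combined memory+swap counter
# minus the RAM-only counter.

swap_in_use() {
    # $1 = memory.memsw.usage_in_bytes, $2 = memory.usage_in_bytes
    echo $(( $1 - $2 ))
}

# Values from the job_322107 cgroup shown earlier:
swap_in_use 20970475520 20851781632   # -> 118693888 bytes (~113 MiB)
```

The result is on the same order as the VmSwap figure reported for PID 14834, which is consistent with a few pages having been pushed to swap while total usage stayed under the combined limit.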
What OS and kernel version do you have? David
(In reply to David Bigagli from comment #4)
> What OS and kernel version do you have?

Oh sorry, forgot about that:

Red Hat Enterprise Linux Server release 6.5 (Santiago)
Linux 2.6.32-431.23.3.el6.x86_64 #1 SMP Wed Jul 16 06:12:23 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
Thanks for the info. I used CentOS 6.5, which is equivalent. I suggest we close this ticket as not a Slurm problem.

David
(In reply to David Bigagli from comment #6)
> Thanks for the info. I used CentOS 6.5 which is equivalent.
> I suggest we close this ticket as not a Slurm problem.

That sounds OK; it indeed looks more like a kernel/OS issue. Thanks.