Ticket 17233

Summary: cgroup v2 swap limit problem
Product: Slurm Reporter: Lloyd Brown <lloyd_brown>
Component: slurmd Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 22.05.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=17849
Site: BYU - Brigham Young University Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 23.02.6
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: cgroup v2 fix to swap limits

Description Lloyd Brown 2023-07-19 10:02:42 MDT
Created attachment 31302 [details]
cgroup v2 fix to swap limits

I believe there may be a bug in how the cgroups v2 task plugin handles swap limits.  Let me explain.

In cgroups v1, the memory cgroup for a process has two relevant files: "memory.limit_in_bytes", which limits the amount of RAM allocated, and "memory.memsw.limit_in_bytes", which limits the amount of RAM+swap allocated.  Therefore, when we have "AllowedSwapSpace=5" (meaning 5% of the job's memory) in our "cgroup.conf" file, "memory.memsw.limit_in_bytes" should be 105% of the value in "memory.limit_in_bytes".  This appears to be how it works.

In cgroups v2, the corresponding files are "memory.max", which is still the limit of RAM allocated, and "memory.swap.max", which is now the limit of *just* the swap allocated.  Therefore, with the same "AllowedSwapSpace=5" from the previous example, the value in "memory.swap.max" should only be 5% of "memory.max", not 105%.

Reference: [1]


When I run a simple "salloc --mem=10G ..." and get onto a RHEL9 node running cgroups v2, I see the following values; notice that "memory.swap.max" is 105% of "memory.max" instead of the expected 5%:

bash-5.1$ cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_58632929/memory.swap.max 
11274289152
bash-5.1$ cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_58632929/memory.max 
10737418240
bash-5.1$ 

In this scenario, the process is allowed to use 10G of memory and 10.5G of swap (105% of memory), or 20.5G total.

When I run a simple "eat_mem" process [2] inside that "salloc", it allocates RAM until it hits "memory.max".  It then slows down as it starts allocating swap (because swap is slow), but it doesn't get OOM-killed until it fills up the system's swap, which happens before it reaches swap = 1.05*memory.  On the other hand, if I manually write 512MB into "memory.swap.max", it gets OOM-killed after allocating the 10G of RAM I requested plus the 512MB of swap I specified.  That's the behavior I was expecting from Slurm.

We are running v22.05.7 currently, but based on a couple of lines in github ([3], [4]), I think this is still a problem in master.

There may be a better way to handle this, but after I applied the patch I'm attaching, I got these values for a similar "salloc":

bash-5.1$ cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_58632987/memory.swap.max
536870912
bash-5.1$ cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_58632987/memory.max
10737418240
bash-5.1$ 


As always, I'm happy to discuss, explain, etc.  And if I'm wrong about something, please let me know; I'm always happy to learn.

Lloyd





[1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt, look for "memory.swap.max"
[2] We're happy to share that code if you like, but it's basically just an infinitely growing malloc/realloc, with periodic output, that we use to test whether memory limits are being enforced.
[3] v1 plugin: https://github.com/SchedMD/slurm/blob/master/src/plugins/cgroup/v1/cgroup_v1.c#L1006-L1010
[4] v2 plugin: https://github.com/SchedMD/slurm/blob/master/src/plugins/cgroup/v2/cgroup_v2.c#L1610-L1613 - notice this uses the same "limits->memsw_limit_in_bytes" as the v1 plugin
Comment 2 Felip Moll 2023-07-24 09:44:59 MDT
Hi Lloyd, you are right. We knew about this but overlooked the detail.
The behavior of cgroup/v1 is indeed different from cgroup/v2.
--
In cgroup/v1 there's one limit for RSS, and one limit for RSS+SWAP.
In cgroup/v2 there's one limit for RSS, and one limit for SWAP.

In cgroup/v1 there's Swappiness.
In cgroup/v2 swappiness does not exist.

In cgroup/v1 AllowedSwapSpace=0 means that RAM+Swap will be limited to AllowedRAMSpace
In cgroup/v2 AllowedSwapSpace=0 means that the job cannot use swap.

We should:
a) Document these differences and how it all works in cgroup.conf, and possibly on the cgroups web page.
b) Apply a solution like your patch. But what I would really like is to modify the task/cgroup memory plugin so that it does not do the calculations assuming the plugins have memory+swap interfaces, but instead calculates the Swap and the Memory separately. Then the specific plugins, like cgroup/v1, would combine these values as needed. In that case, cgroup/v2 wouldn't need any modification.

Your patch is correct, but it modifies cgroup/v2 while leaving v1-specific logic in the task/cgroup memory plugin.
The functions to modify in b) are:

task_cgroup_memory.c: swap_limit_in_bytes() and _memcg_initialize()
cgroup_v1.c: cgroup_p_constrain_set()

Does my reasoning make sense to you? I can work on this patch, but if you feel comfortable and want to dedicate time I can wait for your contribution.
Comment 3 Lloyd Brown 2023-07-25 07:15:27 MDT
Felip,

TBH, it sounds like you have a better understanding of the differences than I do.  I'm content using my simple patch for the time being, if you want to take the time to do a better, more thorough/correct job.  I'm afraid that once the immediate problem is handled, I have enough other tasks that have to take higher priority, that I'll probably leave the details up to you.

Lloyd
Comment 4 Felip Moll 2023-07-25 11:02:12 MDT
(In reply to Lloyd Brown from comment #3)
> Felip,
> 
> TBH, it sounds like you have a better understanding of the differences than
> I do.  I'm content using my simple patch for the time being, if you want to
> take the time to do a better, more thorough/correct job.  I'm afraid that
> once the immediate problem is handled, I have enough other tasks that have
> to take higher priority, that I'll probably leave the details up to you.
> 
> Lloyd

Great, that's not a problem. Your patch is functionally correct, so you can use it in the meantime. I will add this to my queue and work on it asap.

Thanks Lloyd.
Comment 5 Felip Moll 2023-10-06 07:03:07 MDT
Hi Lloyd,

I included the idea of your fix in the next 23.02.6 release; I think it must be fixed in this release, with bigger changes in master to follow if needed, and I will open an internal bug for that. I put your name on the commit.

Thanks for reporting.

commit ec54427c0b12e4a21309426b2c877d38efad28a6
Author:     Lloyd Brown <lloyd_brown@byu.edu>
AuthorDate: Fri Oct 6 14:53:14 2023 +0200
Commit:     Felip Moll <felip.moll@schedmd.com>
CommitDate: Fri Oct 6 14:57:19 2023 +0200

    Fix incorrect memory.swap.max setting in cgroup/v2
    
    In cgroup/v2 there are two independent limits, memory.max and
    memory.swap.max. In cgroup/v1 the swap limit was memory+swap, so it
    needed to be set to memory.limit_in_bytes + a percentage, so
    AllowedSwapSpace=5 implied setting memory.memsw_limit_in_bytes to 105%
    of memory.limit_in_bytes.
    
    In cgroup/v2 we need to separate these values, and just make memory.swap.max
    to be a 5% of memory.max, not 105%.
    
    Bug 17233