Summary: | fix rss in sacct to include tmpfs and match memcg behaviour | ||
---|---|---|---|
Product: | Slurm | Reporter: | Robin Humble <robin.humble+slurm> |
Component: | Accounting | Assignee: | Tim Wickberg <tim> |
Status: | OPEN --- | QA Contact: | |
Severity: | C - Contributions | ||
Priority: | --- | CC: | csamuel, scrosby |
Version: | 21.08.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | -Other- | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Attachments: | make accounting rss include tmpfs usage to match memcg behaviour |
Created attachment 24712
make accounting rss include tmpfs usage to match memcg behaviour

Hi,

We've been using a patch to fix rss in sacct from memcg for a few years.

The issue is that the kernel's memcg includes rss+tmpfs in its OOM decision making, but rss in sacct doesn't include tmpfs, so the two don't always line up. If jobs use a lot of shared memory or write plain files to /dev/shm, they can be killed by the OOM killer, but the rss that sacct reports leaves out that usage, so it's confusing why the job was killed.

Here's a patch to fix it.

TBH I kinda thought I'd submitted this patch a few years ago. Hopefully this isn't a dup.

cheers,
robin
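To illustrate the mismatch described above, here is a rough sketch in Python. It is not the attached patch and not Slurm code. It assumes the cgroup v1 `memory.stat` field names `rss` and `shmem` (falling back to the cgroup v2 name `anon`), and the function names and sample numbers are made up for illustration.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" lines.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def rss_including_tmpfs(stats):
    """Return anonymous memory plus shmem/tmpfs bytes.

    Assumption: cgroup v1 names ('rss', 'shmem'); if 'rss' is absent,
    fall back to the cgroup v2 name 'anon'.
    """
    anon = stats.get("rss", stats.get("anon", 0))
    return anon + stats.get("shmem", 0)


# Hypothetical sample: 512 MiB of anonymous memory, 1 GiB written to /dev/shm.
SAMPLE = """\
cache 1073741824
rss 536870912
shmem 1073741824
mapped_file 0
"""

if __name__ == "__main__":
    stats = parse_memory_stat(SAMPLE)
    print("rss only:       ", stats["rss"])
    print("rss + tmpfs/shm:", rss_including_tmpfs(stats))
```

With a 1 GiB memory limit, a job like the one in this sample would look comfortably under the limit if only `rss` were reported, even though the memcg is charged for roughly 1.5 GiB.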