| Summary: | fix rss in sacct to include tmpfs and match memcg behaviour | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Robin Humble <robin.humble+slurm> |
| Component: | Accounting | Assignee: | Tim Wickberg <tim> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | C - Contributions | ||
| Priority: | --- | CC: | csamuel, scrosby |
| Version: | 21.08.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | make accounting rss include tmpfs usage to match memcg behaviour | ||
Created attachment 24712
make accounting rss include tmpfs usage to match memcg behaviour

Hi,

We've been using a patch for a few years to fix the rss that sacct reports so that it matches memcg.

The issue is that the kernel's memcg includes rss+tmpfs in its OOM decision making, but rss in sacct doesn't include tmpfs, so the two don't always line up. If jobs use a lot of shared mem or write plain files to /dev/shm, they can be killed by OOM, but the rss in sacct is too low to account for it, so it's confusing why the job was killed.

Here's a patch to fix it. TBH I kinda thought I'd submitted this patch a few years ago. Hopefully this isn't a dup.

cheers,
robin
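
To illustrate the mismatch described above, here is a rough sketch (not the attached patch, which changes Slurm's accounting code) of how rss alone can understate what memcg charges to a job. It assumes a cgroup v1 style `memory.stat` with `rss` and `shmem` fields; those field names, the sample numbers, and the helper names `parse_memory_stat` and `charged_bytes` are illustrative assumptions only.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines (memory.stat style) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def charged_bytes(stats, anon_key="rss", shmem_key="shmem"):
    """Return anonymous memory plus tmpfs/shared memory, in bytes.

    The key names are assumptions; pass different ones if the
    kernel in use reports these counters under other names.
    """
    return stats.get(anon_key, 0) + stats.get(shmem_key, 0)


# Hypothetical job: 2 GiB of anonymous memory plus 1 GiB written to /dev/shm.
SAMPLE = """cache 1073741824
rss 2147483648
shmem 1073741824
"""

stats = parse_memory_stat(SAMPLE)
rss_only = stats["rss"]
with_tmpfs = charged_bytes(stats)
print(f"rss only:    {rss_only / 2**30:.1f} GiB")
print(f"rss + tmpfs: {with_tmpfs / 2**30:.1f} GiB")
```

With a 2.5 GiB memory limit, this hypothetical job would look fine going by rss alone (2.0 GiB) while memcg sees 3.0 GiB charged, which is the confusing situation the report describes.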