| Summary: | Inaccurate memory limit messages from task/cgroup | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | David Gloe <david.gloe> |
| Component: | Limits | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 14.11.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
This has been fixed, along with documentation on how to enable the memsw subsystem in the kernel (it is turned off by default on Debian systems, and apparently on your distro as well). See the cgroup.html page on our website (it was just updated). The subsystem does not appear to be strictly needed, but you might want to take a look. In any case, you should no longer see this error when the subsystem isn't enabled.
After every srun on our systems updated with the latest code, we're getting inaccurate memory limit messages regarding the OOM killer:

    galaxy:~ # srun -N 1 /bin/hostname
    nid00331
    slurmd[nid00331]: Exceeded step memory limit at some point. oom-killer likely killed a process.
    slurmd[nid00331]: Exceeded job memory limit at some point. oom-killer likely killed a process.

We believe it's due to the task_cgroup_memory_check_oom function in src/plugins/task/cgroup/task_cgroup_memory.c:

    xcgroup_get_uint64_param(&step_memory_cg, "memory.memsw.failcnt", &memory_memsw_failcnt);

On our systems we don't have the memory.memsw.failcnt parameter:

    # ls /dev/mcgroup/slurm
    tasks memory.failcnt cgroup.procs memory.stat notify_on_release memory.force_empty
    cgroup.event_control memory.use_hierarchy cgroup.clone_children memory.swappiness
    memory.usage_in_bytes memory.move_charge_at_immigrate memory.max_usage_in_bytes
    memory.oom_control memory.limit_in_bytes memory.numa_stat memory.soft_limit_in_bytes
    uid_29597

task_cgroup_memory_check_oom should check the return value of xcgroup_get_uint64_param to verify that the parameter was correctly retrieved, and if not simply check memory.failcnt instead.
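The fallback suggested in the report could look roughly like the following. This is a minimal standalone sketch, not the change that was actually committed to Slurm: it reads the cgroup files directly with stdio instead of calling xcgroup_get_uint64_param, and the helper names read_uint64_param and read_failcnt are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not a Slurm API): read an unsigned 64-bit counter
 * from <dir>/<param>. Returns 0 on success, -1 if the file is missing
 * or its content cannot be parsed. */
static int read_uint64_param(const char *dir, const char *param,
                             uint64_t *value)
{
    char path[4096];
    FILE *fp;
    int len;
    int rc = -1;

    assert(dir && param && value);

    len = snprintf(path, sizeof(path), "%s/%s", dir, param);
    if (len < 0 || len >= (int)sizeof(path))
        return -1;              /* path would have been truncated */

    fp = fopen(path, "r");
    if (!fp)
        return -1;              /* e.g. memsw accounting not enabled */

    if (fscanf(fp, "%" SCNu64, value) == 1)
        rc = 0;
    fclose(fp);
    return rc;
}

/* Hypothetical helper: prefer memory.memsw.failcnt, and fall back to
 * memory.failcnt when the memsw parameter could not be retrieved.
 * Returns 0 if either counter was read, -1 if neither was. */
static int read_failcnt(const char *dir, uint64_t *failcnt)
{
    if (read_uint64_param(dir, "memory.memsw.failcnt", failcnt) == 0)
        return 0;
    return read_uint64_param(dir, "memory.failcnt", failcnt);
}
```

A caller would pass the cgroup directory (for example the /dev/mcgroup/slurm path from the listing above) and report an exceeded limit only when read_failcnt returns 0 and the counter is greater than zero, so that a missing parameter is never mistaken for an OOM event.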