Ticket 507 - Inaccurate memory limit messages from task/cgroup
Status: RESOLVED FIXED
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 14.11.x
Hardware: Linux
: 3 - Medium Impact
Assignee: Danny Auble
Reported: 2013-11-11 08:58 MST by David Gloe
Modified: 2013-11-12 01:22 MST (History)
Site: CRAY
Description David Gloe 2013-11-11 08:58:20 MST
After every srun on our systems updated with the latest code, we're getting inaccurate memory limit messages regarding the OOM killer:
galaxy:~ # srun -N 1 /bin/hostname 
nid00331
slurmd[nid00331]: Exceeded step memory limit at some point. oom-killer likely killed a process.
slurmd[nid00331]: Exceeded job memory limit at some point. oom-killer likely killed a process.

We believe the cause is the task_cgroup_memory_check_oom function in src/plugins/task/cgroup/task_cgroup_memory.c, which reads memory.memsw.failcnt without checking whether the read succeeded:
xcgroup_get_uint64_param(&step_memory_cg,
    "memory.memsw.failcnt",
    &memory_memsw_failcnt);

On our systems we don't have the memory.memsw.failcnt parameter:

# ls /dev/mcgroup/slurm
tasks                            memory.failcnt
cgroup.procs                     memory.stat
notify_on_release                memory.force_empty
cgroup.event_control             memory.use_hierarchy
cgroup.clone_children            memory.swappiness
memory.usage_in_bytes            memory.move_charge_at_immigrate
memory.max_usage_in_bytes        memory.oom_control
memory.limit_in_bytes            memory.numa_stat
memory.soft_limit_in_bytes       uid_29597

task_cgroup_memory_check_oom should check the return value of xcgroup_get_uint64_param to verify that the parameter was correctly retrieved, and if not simply check memory.failcnt instead.
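
The fallback suggested above could look roughly like the sketch below. This is not the Slurm code and not the fix that was committed: read_cgroup_uint64 is a hypothetical stand-in for xcgroup_get_uint64_param that simply reads a number from a file, and get_failcnt is an illustrative name. It only shows the idea of checking the return value and falling back to memory.failcnt when memory.memsw.failcnt cannot be read.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for xcgroup_get_uint64_param(): read one unsigned
 * integer from <cg_path>/<param>. Returns 0 on success, -1 on any failure
 * (path too long, file missing, or contents not a number). */
int read_cgroup_uint64(const char *cg_path, const char *param,
                       uint64_t *value)
{
    char file[4096];
    FILE *fp;
    int n, rc;

    assert(cg_path && param && value);

    n = snprintf(file, sizeof(file), "%s/%s", cg_path, param);
    if (n < 0 || (size_t)n >= sizeof(file))
        return -1;

    fp = fopen(file, "r");
    if (!fp)
        return -1; /* e.g. memsw accounting is not enabled in the kernel */

    rc = (fscanf(fp, "%" SCNu64, value) == 1) ? 0 : -1;
    fclose(fp);
    return rc;
}

/* Illustrative helper: prefer memory.memsw.failcnt, and fall back to
 * memory.failcnt if the memsw parameter could not be read. Returns 0 and
 * stores the counter in *failcnt on success, -1 if neither is readable
 * (in which case *failcnt is left untouched). */
int get_failcnt(const char *cg_path, uint64_t *failcnt)
{
    if (read_cgroup_uint64(cg_path, "memory.memsw.failcnt", failcnt) == 0)
        return 0;
    return read_cgroup_uint64(cg_path, "memory.failcnt", failcnt);
}
```

With a helper like this, the OOM message would only be printed when get_failcnt returns 0 and the stored counter is greater than zero, so a missing parameter is never mistaken for an exceeded limit.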
Comment 1 Danny Auble 2013-11-12 01:22:00 MST
This has been fixed, and documentation has been added on how to enable the memsw subsystem in the kernel (it is turned off by default on Debian systems, and apparently on your distro as well).  See the cgroup.html page on our website (it was just updated).  It doesn't appear to be strictly needed, but you might want to take a look.

In any case you shouldn't see this error anymore if the subsystem isn't enabled.