Ticket 507

Summary: Inaccurate memory limit messages from task/cgroup
Product: Slurm Reporter: David Gloe <david.gloe>
Component: Limits Assignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 14.11.x   
Hardware: Linux   
OS: Linux   
Site: CRAY

Description David Gloe 2013-11-11 08:58:20 MST
After every srun on our systems updated with the latest code, we're getting inaccurate memory limit messages regarding the OOM killer:
galaxy:~ # srun -N 1 /bin/hostname 
nid00331
slurmd[nid00331]: Exceeded step memory limit at some point. oom-killer likely killed a process.
slurmd[nid00331]: Exceeded job memory limit at some point. oom-killer likely killed a process.

We believe it's due to the task_cgroup_memory_check_oom function in src/plugins/task/cgroup/task_cgroup_memory.c. 
xcgroup_get_uint64_param(&step_memory_cg,
    "memory.memsw.failcnt",
    &memory_memsw_failcnt);

On our systems we don't have the memory.memsw.failcnt parameter:

# ls /dev/mcgroup/slurm
tasks                            memory.failcnt
cgroup.procs                     memory.stat
notify_on_release                memory.force_empty
cgroup.event_control             memory.use_hierarchy
cgroup.clone_children            memory.swappiness
memory.usage_in_bytes            memory.move_charge_at_immigrate
memory.max_usage_in_bytes        memory.oom_control
memory.limit_in_bytes            memory.numa_stat
memory.soft_limit_in_bytes       uid_29597

task_cgroup_memory_check_oom should check the return value of xcgroup_get_uint64_param to verify that the parameter was correctly retrieved, and if not simply check memory.failcnt instead.
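The suggested fallback can be sketched in standalone C as follows. This is only an illustration under stated assumptions, not the actual Slurm patch: read_uint64_param and get_failcnt are hypothetical helpers that read the cgroup files directly from a directory path, standing in for xcgroup_get_uint64_param and the logic in task_cgroup_memory_check_oom.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not the Slurm API): read an unsigned 64-bit value
 * from the file <cg_dir>/<param>.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed. */
int read_uint64_param(const char *cg_dir, const char *param, uint64_t *value)
{
    char path[4096];
    FILE *fp;
    int n, rc = -1;

    n = snprintf(path, sizeof(path), "%s/%s", cg_dir, param);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* path did not fit in the buffer */

    fp = fopen(path, "r");
    if (!fp)
        return -1;              /* e.g. memsw accounting not enabled */

    if (fscanf(fp, "%" SCNu64, value) == 1)
        rc = 0;
    fclose(fp);
    return rc;
}

/* Hypothetical helper: prefer memory.memsw.failcnt, and fall back to
 * memory.failcnt only when the memsw counter could not be read.
 * Returns 0 on success, -1 if neither counter is available. */
int get_failcnt(const char *cg_dir, uint64_t *failcnt)
{
    if (read_uint64_param(cg_dir, "memory.memsw.failcnt", failcnt) == 0)
        return 0;
    return read_uint64_param(cg_dir, "memory.failcnt", failcnt);
}
```

With helpers like these, a caller would print the "Exceeded ... memory limit" message only when get_failcnt returns 0 and the counter is greater than zero, so a missing memory.memsw.failcnt file no longer produces a spurious warning.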
Comment 1 Danny Auble 2013-11-12 01:22:00 MST
This has been fixed, along with documentation on how to enable the memsw subsystem in the kernel (it is turned off by default on Debian systems and, it appears, on your distro as well).  See the cgroup.html page on our website (it was just updated).  Enabling it doesn't appear to be strictly needed, but you might want to take a look.

In any case you shouldn't see this error anymore if the subsystem isn't enabled.