Ticket 2648

Summary: NoOverMemoryKill does not always work
Product: Slurm
Reporter: Tomasz Janowski <t.j>
Component: slurmstepd
Assignee: Jacob Jenson <jacob>
Status: RESOLVED WONTFIX
QA Contact:
Severity: 6 - No support contract
Priority: ---
CC: t.j
Version: 15.08.8
Hardware: Linux
OS: Linux
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---

Description Tomasz Janowski 2016-04-18 12:20:39 MDT
Hello SLURM developers,

I was very happy to upgrade to a newer version of SLURM because of a new feature: a slurm.conf option that prevents SLURM from killing a job that goes over its memory limit. This is extremely useful when cgroups are used to constrain memory usage. SLURM tends to double-count memory when the jobacct_gather/linux plugin is used (forked processes sharing copy-on-write pages). Since jobacct_gather/cgroup does not produce reliable results when other cgroup plugins are active, often reporting almost no memory usage for a job that consumes gigabytes of RAM, the NoOverMemoryKill option with jobacct_gather/linux seemed to be a natural solution: the kernel kills a job that exceeds its memory limit, while jobacct_gather/linux provides reasonable data for statistical purposes. The greatest benefit is that cgroups allow a grace amount of memory in swap space, so jobs that exceed their memory limit can be swapped out to some degree and only very serious offenders are killed.
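
For readers unfamiliar with this kind of setup, the fragment below is a minimal sketch of what such a configuration might look like. It is not the reporter's actual configuration, which is not included in this ticket; the parameter names are standard slurm.conf and cgroup.conf options, but the chosen values (in particular the swap percentage) are assumptions for illustration only.

```
# slurm.conf (illustrative values, not taken from this ticket)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
# Gather accounting data from /proc rather than from cgroups
JobAcctGatherType=jobacct_gather/linux
# Ask Slurm not to kill jobs based on its own memory accounting
JobAcctGatherParams=NoOverMemoryKill

# cgroup.conf (illustrative values, not taken from this ticket)
CgroupAutomount=yes
# Let the kernel enforce the RAM limit through the memory cgroup
ConstrainRAMSpace=yes
# Also constrain swap, with a grace amount on top of the allocation
ConstrainSwapSpace=yes
# Assumed example: allow swap up to 10% of the allocated memory
AllowedSwapSpace=10
```

With a setup along these lines, enforcement is left to the kernel's memory cgroup, while jobacct_gather/linux is used only for reporting.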

The NoOverMemoryKill option works as expected for a program started directly from a batch script, but a job started via srun is still killed. The kill happens at an unpredictable time, anywhere from a few minutes to ten minutes after the offending job starts.

Would it be possible to update all the components of SLURM to respect NoOverMemoryKill as expected?

Thanks!
Tomasz
Comment 1 Jacob Jenson 2017-11-03 10:46:31 MDT
This version of Slurm is no longer supported.