Ticket 3562 - Exceeded job memory limit at some point ?
Summary: Exceeded job memory limit at some point ?
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 16.05.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-03-09 11:45 MST by NYU HPC Team
Modified: 2017-03-24 04:44 MDT

See Also:
Site: NYU
Version Fixed: 17.11


Attachments
slurm general conf and cgroup conf (10.00 KB, application/x-tar)
2017-03-09 16:58 MST, NYU HPC Team
slurmd log from when we were seeing the error (3.07 MB, application/gzip)
2017-03-09 18:56 MST, NYU HPC Team

Description NYU HPC Team 2017-03-09 11:45:26 MST
Hi Slurm experts, 

We are using Slurm 16.05.4, and encountered similar issues as seen in these two threads:
https://github.com/chaos/slurm/issues/54
https://bugs.schedmd.com/show_bug.cgi?id=3214

We wonder if there are solutions, or if upgrading our Slurm installation to a newer version is the way to go. Thank you very much!


Regards,
Wensheng
Comment 1 Tim Wickberg 2017-03-09 14:28:31 MST
Can you attach your slurm.conf and cgroup.conf files?

The memory enforcement varies subtly based on a few of those settings, and it'd be good to know which is in effect.

I will say that 17.02 introduces a new JOB_OOM state to try to highlight when jobs are stopped due to memory consumption issues, and that may be worth upgrading for, depending on your site's concerns.

- Tim
Comment 2 NYU HPC Team 2017-03-09 16:58:48 MST
Created attachment 4182 [details]
slurm general conf and cgroup conf

Sorry Tim, slurm.conf and cgroup.conf are in the attached tar file. Thank you!
Comment 3 Tim Wickberg 2017-03-09 18:09:56 MST
It looks like you do have task/cgroup and the associated settings in cgroup.conf set correctly for memory limit enforcement. That should strictly prevent the jobs from using more than their requested amount of memory.

Can you attach the slurmd.log from one of the nodes? I'm curious whether there are messages there that would indicate why this can happen. If you're able to get it from a node where this occurred, at the time you saw the issue, even better.

I will note that, if you're using RHEL7 or a related distribution, you may want to move to the latest 16.05 release or even 17.02. There are some fixes in there related to problematic interactions with systemd and cgroups, although I don't think that is a likely cause of this issue.

- Tim
Comment 4 NYU HPC Team 2017-03-09 18:56:12 MST
Created attachment 4183 [details]
slurmd log from when we were seeing the error

Hi Tim,

I have just attached a node's slurmd log. In the job (jobid=154096) I copied a 5 GB file from Lustre to the node's local disk.

[deng@log-1 dd_test]$ ls -lh $SCRATCH/dd_test/test11
-rw-rw-r-- 1 deng deng 5.0G Feb 16 20:49 /scratch/deng/dd_test/test11

[deng@log-1 dd_test]$ ls -l slurm-154096.out 
-rw-rw-r-- 1 deng deng 212 Mar  9 20:40 slurm-154096.out

[deng@log-1 dd_test]$ more slurm-154096.out 
Hostname: c26-01
cp /scratch/deng/dd_test/test11 /state/partition1/job-154096/staging-OaGg
slurmstepd: error: Exceeded job memory limit at some point.
slurmstepd: error: Exceeded job memory limit at some point.

[deng@log-1 dd_test]$ sacct -j 154096
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
154096          myLTest        c26 test_acco+          1  COMPLETED      0:0 
154096.batch      batch            test_acco+          1  COMPLETED      0:0 
154096.exte+     extern            test_acco+          1  COMPLETED      0:0 
154096.0             cp            test_acco+          1  COMPLETED      0:0 


Below is the job script used with the sbatch command -
[deng@log-1 dd_test]$ cat run-onefilecopying.sh 
#!/bin/bash

#SBATCH --job-name=myLTest
#SBATCH --nodes=1 --tasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=5GB
#SBATCH --time=1:30:00

if [ "$SLURM_JOBTMP" == "" ]; then
    export SLURM_JOBTMP=/state/partition1/$USER/$$
    mkdir -p $SLURM_JOBTMP
fi
export STAGING_DIR=$(mktemp -d $SLURM_JOBTMP/staging-XXXX)

echo
echo "Hostname: $(hostname)"

echo "cp $SCRATCH/dd_test/test11 $STAGING_DIR"
srun cp $SCRATCH/dd_test/test11 $STAGING_DIR

rm -rf $SLURM_JOBTMP/*


------------------------------------------------------------
Thank you!
Comment 5 Tim Wickberg 2017-03-09 19:38:05 MST
I think I can better explain this now. The job is being killed when it runs over the cgroup memory limit, and the exit codes should reflect this; the 17.02 release, with its new JOB_OOM state, tries to make this situation much clearer.

I believe the cause for your trivial job running out of memory is that the kernel memory used by Lustre when servicing I/O to the process ("KMem") is included in the limit by default, and this can add up quickly even though the process itself is using very little RAM.

17.02 adds a mechanism to disable enforcement of this KMem limit; it looks like you may want those options in order to address this issue. The relevant setting in cgroup.conf that I believe you'd want is 'ConstrainKmemSpace=no'.

(See https://slurm.schedmd.com/cgroup.conf.html ; that page is the same as 'man cgroup.conf' on the 17.02 release.)

Bug 2748 has a bit more background on the introduction of that option.
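
For reference, the suggested change would be a one-line addition to cgroup.conf (a sketch, valid on 17.02 or later only; the rest of the file stays as in the attached config):

```
# cgroup.conf fragment (17.02+): keep the existing Constrain* settings and add
ConstrainKmemSpace=no
```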

- Tim
Comment 7 NYU HPC Team 2017-03-12 11:01:09 MDT
Hi Tim, I downloaded the latest stable version tarball slurm-17.02.1-2.tar.bz2, and wonder if it includes the feature requested in
https://bugs.schedmd.com/show_bug.cgi?id=3460 (a bug report turned into a feature request). Thank you!
Comment 8 Tim Wickberg 2017-03-12 12:07:23 MDT
(In reply to NYU HPC Team from comment #7)
> Hi Tim, I downloaded the latest stable version tarball
> slurm-17.02.1-2.tar.bz2, and wonder if it includes (an issue reporting
> turned to) the feature requested in
> https://bugs.schedmd.com/show_bug.cgi?id=3460 . Thank you!

No, it does not. When/if we tackle that, the bug will be updated, and eventually marked as resolved/fixed with the release the functionality has been added to. It'd be in 17.11 at the earliest (although you would likely be able to back-port the patch to 17.02 if desired).

Just to make sure I haven't inadvertently raised your expectations with regards to bug 3460: while we do try to address a number of customer-requested enhancements with each major release (especially ones that multiple sites have expressed an interest in), we cannot commit to resolving those issues on a specific timeline (if at all). With each release, our developers' priority is completing sponsored development work we've committed to, which does carry with it a specific timeline for completion as well as functional requirements. If that is something NYU may be interested in, please let me know and we can discuss how that process works.

- Tim
Comment 9 javicacheiro 2017-03-13 05:18:18 MDT
Hi Tim, Wensheng,

Just in case it is useful, I have also been looking at the cause of these messages, and this is what I have found:

There is a different interpretation of what memory means in the context of Slurm and in the context of cgroups.

Looking at the sbatch man page, you would expect the memory limit to refer to RSS memory:

  --mem Specify  the  real  memory required per node in MegaBytes... In both cases memory use is based upon the job’s Resident Set Size (RSS).

But in cgroups this translates into setting the following memory limit for the step:

   memory.limit_in_bytes

According to the cgroups documentation, this sets the maximum amount of user memory (including file cache).

So RSS + cache is included in the effective limit imposed by cgroups, not just RSS as the --mem option suggests.

The good news is that in most cases this is not a problem because, according to the cgroups documentation:

    When a cgroup goes over its limit, we first try
    to reclaim memory from the cgroup so as to make space for the new
    pages that the cgroup has touched. If the reclaim is unsuccessful,
    an OOM routine is invoked to select and kill the bulkiest task in the
    cgroup.

So the file cache will first be reclaimed, the cgroup will drop back below the limit, and nothing will actually happen (the OOM killer will not be invoked).

In our case, I would say that in most cases the "exceeded job memory limit" messages do not reflect a real problem in the job (actually everything works fine and the job terminates successfully).

I think it would be useful if additional information could be included in the Slurm log message, indicating whether the "exceeded job memory limit" reflects that cgroups has triggered a memory reclaim or that cgroups is actually invoking the OOM killer.

Also, the Slurm man page could be updated to reflect the fact that, when using cgroups, not only RSS is considered.

Regards,
Javier
Comment 11 NYU HPC Team 2017-03-14 07:51:09 MDT
Hi Javier, Tim:

A test instance of 17.02 is set up here. I submitted the same job script with sbatch, and saw error messages like these:
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: c27-05: task 0: Out Of Memory
srun: Terminating job step 5.0
slurmstepd: error: Exceeded job memory limit at some point.

Also, when I replaced the 'cp' command with a dd command like the following:
$ dd bs=4M iflag=direct oflag=direct if=<inputfile> of=<outputfile>
the job ran to completion without any error messages.
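
One way to make such staging robust is to wrap the copy in a small helper (a sketch; stage_copy and its fallback are hypothetical, not part of the original script or of Slurm):

```shell
# Copy with direct I/O so the file data bypasses the page cache and is not
# charged to the job's memory cgroup; fall back to a buffered copy where
# O_DIRECT is unsupported (e.g. tmpfs). stage_copy is a hypothetical helper.
stage_copy() {
    src=$1; dst=$2
    dd bs=4M iflag=direct oflag=direct if="$src" of="$dst" 2>/dev/null ||
        dd bs=4M if="$src" of="$dst" 2>/dev/null
}

# In the job script above, this would replace the plain cp, e.g.:
#   srun stage_copy "$SCRATCH/dd_test/test11" "$STAGING_DIR/test11"
```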


Thank you!
Comment 16 Alejandro Sanchez 2017-03-15 11:45:18 MDT
(In reply to javicacheiro from comment #9)
> Hi Tim, Wensheng,
> 
> Just in case it can be useful, I have been also looking at the cause of
> these messages and this is what I have found:
> 
> There is a different interpretation of what memory means in the context of
> slurm and in the context of cgroups. 
> 
> Looking at sbatch man page you expect that the memory limit refers to RSS
> memory:
> 
>   --mem Specify  the  real  memory required per node in MegaBytes... In both
> cases memory use is based upon the job’s Resident Set Size (RSS).
> 
> But in cgropus this translates into setting the following cgroup memory
> limit for the step:
> 
>    memory.limit_in_bytes
> 
> According to cgroups documentation this sets the maximum amount of user
> memory (including file cache).
> 
> So RSS+Cache is included in the effective limit imposed by cgroups and not
> just RSS as expected by the --mem option.

I agree with that, and the documentation has to be changed to reflect that the memory tracked through task/cgroup is the virtual memory size (vsz in 'ps' terms), and _not_ just the resident set size (rss). If I'm not wrong, vsz = rss + cache (+ swap), and if you look at memory.stat while a step is increasing its memory footprint, you'll notice that the OOM is effectively triggered when vsz (and not just rss) reaches the configured limit. The (+ swap) component can be controlled with these three parameters:

ConstrainSwapSpace=yes
AllowedSwapSpace=X
MaxSwapPercent=0 <- if you don't want your step to use swap space

> The good news is that in most cases this is not a problem because according
> to cgroups documentation:
> 
>     When a cgroup goes over its limit, we first try
>     to reclaim memory from the cgroup so as to make space for the new
>     pages that the cgroup has touched. If the reclaim is unsuccessful,
>     an OOM routine is invoked to select and kill the bulkiest task in the
>     cgroup.
> 
> So the file cache will first be reclaimed and the cgroup will go again below
> the limit and nothing will actually happen (the OOM Killer will not be
> invoked).
> 
> In our case, I would say that in most cases the "exceeded job memory limit"
> messages do not reflect a real problem in the job (actually everything works
> fine and the jobs terminates successfully).

When the "exceeded job memory limit" message is shown but the program is not killed, I think it's because the program's rss + cache has reached the configured limit, but it is using swap space. I have to double-check this, though.

> I think, it would be useful if additional information could be included in
> the slurm log message indicating if the "exceeded job memory limit" reflects
> cgroups has triggered a memory reclaim or cgroups is actually invoking the
> OOM Killer.

That would be a good enhancement idea.

> Also the slurm man page could be updated to refect the fact that, when using
> cgroups, not only RSS is considered.
> 
> Regards,
> Javier

I'm making a note to myself to change the documentation tomorrow morning.
Comment 18 Alejandro Sanchez 2017-03-15 12:23:07 MDT
(In reply to Alejandro Sanchez from comment #16)
> (In reply to javicacheiro from comment #9)
> > In our case, I would say that in most cases the "exceeded job memory limit"
> > messages do not reflect a real problem in the job (actually everything works
> > fine and the jobs terminates successfully).
> 
> When the "exceeded job memory limit" message is shown but the program is not
> killed, I think it's because the program rss + cache has reached the
> configured limit, but is using swap space. But have to double check this.

I've just checked: if a step's rss + cache reaches the limit and it then starts using swap, the message is still not logged. It is only logged when the step is OOM-killed.
Comment 19 NYU HPC Team 2017-03-15 12:28:35 MDT
As previously attached, our cgroup.conf is as follows:
$ cat cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
CgroupReleaseAgentDir="/opt/slurm/etc/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0
MaxSwapPercent=0
ConstrainDevices=yes
TaskAffinity=yes


I added these two lines (below) to cgroup.conf in our test instance. I tried new jobs; this does not change anything in the job's standard output, as we all expected.
< AllowedSwapSpace=0
< MaxSwapPercent=0


Thank you!
Comment 20 NYU HPC Team 2017-03-15 12:31:35 MDT
The original cgroup.conf is as follows:
$ cat /opt/slurm/etc/cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
CgroupReleaseAgentDir="/opt/slurm/etc/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
TaskAffinity=yes

Okay, I will enable swap space and see. Thank you!
Comment 21 Alejandro Sanchez 2017-03-15 12:37:41 MDT
Do you mind describing your current concern with this bug? Sorry, but different things have been discussed between you and Tim (enhancements, swap config, KMem) and different bugs have been referenced. Tim is in a meeting now, so I'm taking this bug in the meantime, but I'm not sure what your main problem is right now, or what cgroups behavior you are experiencing that does not seem to work properly for you. Thanks.
Comment 23 NYU HPC Team 2017-03-15 12:48:40 MDT
Thank you, Alejandro! The file-copying job is a simple example; we can easily check and see that the copy succeeded. My main concern is that this message:
slurmstepd: error: Exceeded job memory limit at some point.

could appear in other circumstances. We want to convince our users that their jobs finished okay and that they should trust the results. What can we do to achieve that?
Comment 24 Alejandro Sanchez 2017-03-15 12:54:17 MDT
(In reply to NYU HPC Team from comment #23)
> Thank you Alejandro! The file copying job is a simple example. We can check
> easily and see that the copying is success. My main concern is that this
> message:
> slurmstepd: error: Exceeded job memory limit at some point.
> 
> could appear in other circumstances. We want to convince our users that
> their jobs are finished okay, and they should trust the results. What can we
> do to archive that?

They can use 

$ sacct -j <jobid> -o jobid,jobname,state,exitcode,derivedexitcode

as described in bug #3214, comment #4 (let me know if you can't see that bug), to decompose the states, exit codes, and derived exit codes of the job, the batch step, and the rest of the steps, to know exactly whether they should trust their results. Would that be sufficient to convince them?
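
A sketch of how that check could be automated for users (job_ok is a hypothetical helper; the printf sample stands in for real `sacct -j <jobid> -n -P -o State,ExitCode` output):

```shell
# Trust a job only if every step reports COMPLETED with exit code 0:0;
# any other state or exit code (e.g. OUT_OF_MEMORY on 17.02+) means the
# OOM killer really fired. job_ok reads sacct's '|'-delimited output.
job_ok() {
    awk -F'|' '$1 != "COMPLETED" || $2 != "0:0" {bad = 1} END {exit bad}'
}

# Sample parsable output for a job like 154096 (three steps, all clean):
printf 'COMPLETED|0:0\nCOMPLETED|0:0\nCOMPLETED|0:0\n' |
    job_ok && echo "results can be trusted"
```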
Comment 25 NYU HPC Team 2017-03-15 13:11:21 MDT
Yes, I can read bug #3214. I am still new to Slurm and learning. Is it possible to run input/output copying in a SPANK plugin and/or a prolog/epilog script under the 'root' account, so that the memory used for file copying is not charged to users' jobs? Thanks!
Comment 32 Alejandro Sanchez 2017-03-16 09:25:15 MDT
(In reply to NYU HPC Team from comment #25)
> Yes I can read bug #3214. I am still new to Slurm and learning. Is it
> possible to run input/output copying in SPANK plugin and/or prolog/epilog
> script under 'root' account, so that the memory used for file copying will
> not be charged to users' jobs? Thanks!

Memory consumed by prolog/epilog scripts and/or SPANK plugins won't be charged to users' jobs.
Comment 34 Alejandro Sanchez 2017-03-22 05:51:43 MDT
Is there anything else we can assist you with on this bug?
Comment 35 NYU HPC Team 2017-03-22 07:43:53 MDT
No, it is clear. Thank you!