Ticket 13407

Summary: RAM limitation in Slurm jobs
Product: Slurm
Reporter: Matt Morgan <morga129>
Component: Limits
Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED INFOGIVEN
Severity: 2 - High Impact
Priority: ---
CC: nick, tripiana
Version: 20.11.8
Hardware: Linux
OS: Linux
Site: Miami University Oxford Ohio
Attachments: slurmdlog
slurmconf

Description Matt Morgan 2022-02-10 13:45:46 MST
In a batch job I set the memory limit and monitor it on the node with ulimit:

#!/bin/bash
# to be submitted by: sbatch slurm_job.txt
#SBATCH --time=1:00:00
#SBATCH --nodes=1 --ntasks-per-node=24
#SBATCH --job-name=hello
#SBATCH --partition=batch
#SBATCH --mem=60GB

cd $SLURM_SUBMIT_DIR
ulimit -a
module load anaconda-python3
python py_mem.py

The job's output correctly reports the requested limit; however, the program crashes when it tries to allocate more than about 26GB of RAM. This is not specific to Python: other programs such as Matlab crash in a similar fashion at the same barrier when allocating memory. Interactive jobs with salloc show the same behavior.

This is the program:
[muellej@mualhplp01:Slurm_transition2021] $ more py_mem.py
import numpy as np
import contextlib
#requires at least 32 GB
with contextlib.redirect_stdout(None):mya=np.random.rand(65000,65000)


[muellej@mualhplp01:Slurm_transition2021] $ more slurm-1141.out
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 380029
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) 62914560
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) 30000000
file locks                      (-x) unlimited
Please execute: source /software/python/anaconda3/etc/profile.d/conda.sh
Traceback (most recent call last):
  File "py_mem.py", line 4, in <module>
    with contextlib.redirect_stdout(None):mya=np.random.rand(65000,65000)
  File "mtrand.pyx", line 1154, in numpy.random.mtrand.RandomState.rand
  File "mtrand.pyx", line 420, in numpy.random.mtrand.RandomState.random_sample
  File "_common.pyx", line 256, in numpy.random._common.double_fill
MemoryError: Unable to allocate 31.5 GiB for an array with shape (65000, 65000) and data type float64
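The 31.5 GiB figure in the traceback checks out: a 65000 x 65000 array of float64 needs 65000 * 65000 * 8 bytes. A quick sanity check (plain arithmetic, not part of the original job):

```python
# Sanity-check the allocation size NumPy reports in the traceback:
# a 65000 x 65000 array of float64 (8 bytes per element).
rows, cols = 65000, 65000
nbytes = rows * cols * 8      # total bytes requested
gib = nbytes / 2**30          # convert bytes to GiB
print(f"{gib:.1f} GiB")       # matches the 31.5 GiB in the MemoryError
```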
Comment 1 Carlos Tripiana Montes 2022-02-11 01:40:03 MST
Hi Matt,

Would you mind attaching your slurm.conf file, plus the slurmctld and slurmd logs from the node running this job?

We need to set the right framework up to properly address your issue.

Additionally, take a look at the system logs around the time of the job and check whether the OOM killer was triggered.

Thanks,
Carlos.
Comment 2 Matt Morgan 2022-02-14 12:13:06 MST
Created attachment 23472 [details]
slurmdlog
Comment 3 Matt Morgan 2022-02-14 12:13:25 MST
Created attachment 23473 [details]
slurmconf
Comment 4 Matt Morgan 2022-02-14 12:15:52 MST
Hey Carlos,

I'm not seeing anything in the logs concerning OOM, but I have attached the files you requested. If you need more info on the jobs that were run, I can provide that as well. Thank you for your time and understanding.

-Matt
Comment 5 Carlos Tripiana Montes 2022-02-15 01:13:15 MST
Hi Matt,

I think you need to check _why_ the _total_ amount of addressable space (AKA virtual memory) is lower than the data size (AKA maximum memory size):

max memory size         (kbytes, -m) 62914560
virtual memory          (kbytes, -v) 30000000

If you look at the virtual memory limit, 30000000KiB is around 28.61GiB. This is the _fuzzy value_ behind the "about 26GB of RAM" you stated in the description.
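For reference, here is the unit conversion spelled out (a back-of-the-envelope check, not Slurm output):

```python
# Convert the two ulimit values (reported in KiB) to GiB (1 GiB = 2**20 KiB).
vm_limit_kib = 30_000_000          # virtual memory  (ulimit -v)
mem_limit_kib = 62_914_560         # max memory size (ulimit -m)
print(f"{vm_limit_kib / 2**20:.2f} GiB")   # ~28.61 GiB addressable space
print(f"{mem_limit_kib / 2**20:.2f} GiB")  # exactly 60.00 GiB, i.e. --mem=60GB
```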

As per the slurm.conf provided, you are propagating ALL, which implies AS, DATA, and STACK [1][2][3]. DATA is then restricted by the "--mem=60GB" parameter; this is the memory space for dynamic allocation. AS covers all usable memory addresses, including stack, heap, contexts, _everything_ [4]. If you subtract all this "overhead" from 28.61GiB, that's why you can't go higher than ~26GiB.

Again, because you are using PropagateResourceLimits=ALL without any "Except" [5], I think you are inheriting AS, DATA, and STACK from the login nodes _or similar_, where users have a restricted amount of resources. I'd suggest setting the virtual memory limit to unlimited there, given that you already have the max memory size limited. That should still prevent users from eating memory, and because Slurm uses DefMemPerCPU, "--mem", --exclusive, etc. to set the amount of memory for a job, I think it should be enough.
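As a sketch, one way to avoid inheriting the login node's address-space limit is an Except list in slurm.conf (hypothetical fragment, to be adapted to your setup):

```
# slurm.conf: propagate all resource limits except the address-space limit,
# so a restricted 'ulimit -v' on the submit host is not inherited by jobs.
PropagateResourceLimitsExcept=AS
```

Alternatively, raise the limit at its origin (e.g. ulimit -v unlimited on the login nodes) and keep propagating ALL.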

Have a look at this and play a bit with the config. I think this is the problem you're facing.

Cheers,
Carlos.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_AS
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_DATA
[3] https://slurm.schedmd.com/slurm.conf.html#OPT_STACK
[4] https://pubs.opengroup.org/onlinepubs/9699919799/functions/setrlimit.html
[5] https://slurm.schedmd.com/slurm.conf.html#OPT_PropagateResourceLimitsExcept
Comment 6 Carlos Tripiana Montes 2022-02-17 02:14:12 MST
Hi Matt,

Even though this is a Sev-2, I'm going to close the issue as infogiven for now.

I believe I've spotted your issue correctly, since I've had no further urgent communication from your side after my reply.

If this is not the case, please feel free to reopen the bug.

Cheers,
Carlos.