Ticket 3460

Summary: Slurm kills a process but does not kill the job when the requested memory is insufficient
Product: Slurm Reporter: NYU HPC Team <hpc-staff>
Component: Other    Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN    QA Contact: ---
Severity: 5 - Enhancement    
Priority: ---    
Version: 16.05.4   
Hardware: Linux   
OS: Linux   
Site: NYU
Attachments: slurm.conf, cgroup.conf, a.out and the .C source code

Description NYU HPC Team 2017-02-10 12:11:34 MST
Created attachment 4040 [details]
slurm.conf, cgroup.conf, a.out and the .C source code

Hello Slurm experts:

We want to run six a.out processes in a job, each allocating 3 GB of memory (18 GB in total):
- when we request 20 GB of memory, all six processes run happily;
- when we request 17 GB, one process is killed, the other five keep running, and the job itself is not killed.

We expect the job to be killed in that case. We would like to know whether we are missing some configuration or whether some command-line option should be enabled.

There are four files in the attachment:
slurm.conf, cgroup.conf, a.out, and the .C source code for a.out

Below is the job submission script and how we submit the job.

------------------------------------
$ cat run-memst.sh
#!/bin/bash
#
#SBATCH --nodes=1 --ntasks=1
#SBATCH --cpus-per-task=6
##SBATCH --mem=20GB
#SBATCH --mem=17GB
#SBATCH --job-name=mymtest

srun /bin/bash -c "
for((i=1; i<=6; i++)); do
    echo "i="\$i
    ./a.out 3 > \$i.log 2>&1 &
done
wait
"

$ sbatch run-memst.sh
Submitted batch job 32126
-------------------------------------

Thank you very much,
Wensheng
Comment 1 Tim Wickberg 2017-02-10 15:57:28 MST
Hey guys -

I'm marking this down to Sev3, as I don't think it's leading to sporadic outages on your system. Please see https://www.schedmd.com/support.php for an overview of the respective levels.

Unfortunately, this is a limitation of the current task/cgroup plugin. OOM is enforced automatically by the Linux kernel: when a process in the job malloc()s more memory than is available, that specific process is immediately killed.

Right now - you'd need to modify the job script to detect the abnormal termination of an individual process within the job, and then have the job script terminate early.
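The workaround Tim describes can be sketched in bash. This is an illustrative sketch, not the attached script: it replaces the bare `wait` with a loop over `wait -n` (bash 4.3+), which returns the exit status of the next background child to finish, so the step can stop as soon as any worker dies. The simulated failing worker stands in for an OOM-killed `./a.out`; the PIDs and the failure condition are made up for the demo.

```shell
#!/bin/bash
# Sketch: stop the step as soon as one background worker fails,
# rather than letting the surviving workers run to completion.
pids=()
for i in 1 2 3; do
    # stand-in for './a.out 3'; worker 2 simulates an OOM-killed process
    ( sleep 0.1; [ "$i" -ne 2 ] ) &
    pids+=($!)
done

failed=0
for _ in "${pids[@]}"; do
    # 'wait -n' (bash 4.3+) reaps the next child to exit and
    # returns its status, so the first failure surfaces immediately
    if ! wait -n; then
        failed=1
        echo "a worker failed; terminating the remaining workers" >&2
        kill "${pids[@]}" 2>/dev/null
        break
    fi
done
echo "failed=$failed"
```

In a real job script you would replace the flag with `exit 1` in the failure branch, so the srun step (and with it the batch job) terminates instead of merely reporting.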

I do think this is a reasonable enhancement to request, although it is not something we can address immediately. I could see a lot of utility in a configuration option allowing sites to immediately terminate the entire job if it hits OOM, on the assumption that the user will not be monitoring and handling that appropriately internally. If you'd like, I will re-mark this bug to reflect that.

- Tim
Comment 2 Tim Wickberg 2017-03-07 18:49:20 MST
Reclassifying as Sev5.

It'd be nice to have a config option that takes the new JOB_OOM state in 17.02 and cancels the entire job (rather than just a single step).