Ticket 3460

Summary: Slurm kills a process but does not kill the job when the requested memory is insufficient
Product: Slurm Reporter: NYU HPC Team <hpc-staff>
Component: Other    Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN    QA Contact: ---
Severity: 5 - Enhancement    
Priority: ---    
Version: 16.05.4   
Hardware: Linux   
OS: Linux   
Site: NYU
Attachments: slurm.conf, cgroup.conf, a.out and the .C source code

Description NYU HPC Team 2017-02-10 12:11:34 MST
Created attachment 4040 [details]
slurm.conf, cgroup.conf, a.out and the .C source code

Hello Slurm experts:

We want to run six a.out processes in a job, each allocating 3 GB of memory (18 GB in total):
- when we request 20 GB of memory, all six processes run happily;
- when we request 17 GB, one process is killed, the other five keep running, and the job itself is not killed.

We expect the job to be killed in that case. We would like to know whether we are missing some configuration or whether some command-line option should be enabled.

There are four files in the attachment:
slurm.conf, cgroup.conf, a.out, and the .C source code for a.out

Below is the job submission script and how we submit the job.

------------------------------------
$ cat run-memst.sh
#!/bin/bash
#
#SBATCH --nodes=1 --ntasks=1
#SBATCH --cpus-per-task=6
##SBATCH --mem=20GB
#SBATCH --mem=17GB
#SBATCH --job-name=mymtest

srun /bin/bash -c "
for((i=1; i<=6; i++)); do
    echo "i="\$i
    ./a.out 3 > \$i.log 2>&1 &
done
wait
"

$ sbatch run-memst.sh
Submitted batch job 32126
-------------------------------------

Thank you very much,
Wensheng
Comment 1 Tim Wickberg 2017-02-10 15:57:28 MST
Hey guys -

I'm marking this down to Sev3, as I don't think it's leading to sporadic outages on your system. Please see https://www.schedmd.com/support.php for an overview of the respective levels.

Unfortunately, this is a limitation of the current task/cgroup plugin. OOM is enforced automatically by the Linux kernel: when a process in the job malloc()s more memory than is available, that specific process is immediately killed.

Right now - you'd need to modify the job script to detect the abnormal termination of an individual process within the job, and then have the job script terminate early.
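The workaround Tim describes can be sketched in bash. This is an illustrative sketch, not the attached script: it replaces the bare `wait` with a loop over `wait -n` (bash 4.3+), which returns the exit status of the next background child to finish, so the step can stop as soon as any worker dies. The simulated failing worker stands in for an OOM-killed `./a.out`; the PIDs and the failure condition are made up for the demo.

```shell
#!/bin/bash
# Sketch: stop the step as soon as one background worker fails,
# rather than letting the surviving workers run to completion.
pids=()
for i in 1 2 3; do
    # stand-in for './a.out 3'; worker 2 simulates an OOM-killed process
    ( sleep 0.1; [ "$i" -ne 2 ] ) &
    pids+=($!)
done

failed=0
for _ in "${pids[@]}"; do
    # 'wait -n' (bash 4.3+) reaps the next child to exit and
    # returns its status, so the first failure surfaces immediately
    if ! wait -n; then
        failed=1
        echo "a worker failed; terminating the remaining workers" >&2
        kill "${pids[@]}" 2>/dev/null
        break
    fi
done
echo "failed=$failed"
```

In a real job script you would replace the flag with `exit 1` in the failure branch, so the srun step (and with it the batch job) terminates instead of merely reporting.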

I do think this is a reasonable enhancement to request, although it is not something we can address immediately. I could see a lot of utility in a configuration option allowing sites to immediately terminate the entire job if it hits OOM, on the assumption that the user will not be monitoring and handling that appropriately internally. If you'd like, I will re-mark this bug to reflect that.

- Tim
Comment 2 Tim Wickberg 2017-03-07 18:49:20 MST
Reclassifying as Sev5.

It'd be nice to have a config option that takes the new JOB_OOM state in 17.02 and cancels the entire job (rather than just a single step).