Ticket 10434

Summary: Job with --mem-per-cpu doesn't get started.
Product: Slurm
Component: Scheduling
Version: 20.02.6
Severity: 4 - Minor Issue
Status: RESOLVED DUPLICATE
Reporter: Marcin Stolarek <cinek>
Assignee: Marshall Garey <marshall>
CC: bmundim
Hardware: Linux
OS: Linux
Site: SciNet

Description Marcin Stolarek 2020-12-14 05:43:34 MST
I have found an issue. A user has the highest priority on Niagara, but his job doesn't run; it is stuck in the queue waiting for resources.
I simplified the script, tested on our TDS, and could reproduce the problem.
For example the following batch script:

>#!/bin/bash -l
>#SBATCH --ntasks=1
>#SBATCH --time=00-11:30
>#SBATCH --ntasks=1
>#SBATCH --nodes=1
>#SBATCH --cpus-per-task=1
>#SBATCH --mem-per-cpu=14000
>#SBATCH -J memory_per_cpu_test
>#SBATCH -A scinet
>#SBATCH --mail-type=ALL
>#SBATCH --output=slurm-%j.out
> 
>echo "#########################################"
>echo " SLURM submission batch script stdout "
>echo "#########################################"
> 
>source ~/scripts/sbatch_job_envs.sh 
> 
>env > env_${SLURM_JOB_ID}
> 
>srun -l hostname
>sleep 600

is very similar to the user's. However, it gets stuck in the queue:

>squeue
>   JOBID PARTITION       NAME                 USER              ACCOUNT ST TIME_LIMIT       TIME  TIME_LEFT           START_TIME   CPUS   PRIORITY  NODES NODELIST(REASON)
>    5008   compute memory_per              bmundim               scinet PD   11:30:00       0:00   11:30:00                  N/A      1      22739      1 (Resources)

This happens even though the TDS cluster is free of jobs. If I comment out the following
line:

>#SBATCH --mem-per-cpu=14000

the job runs normally. Do you know why? I will attach the slurm.conf and cgroup.conf in a minute.
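For a job pending with reason (Resources), one quick sanity check is whether the per-node memory request exceeds what Slurm believes the node has. A minimal sketch, with hypothetical numbers: `mem_per_cpu` and `cpus_per_task` come from the batch script above, and `real_memory` stands in for the RealMemory value that `scontrol show node <nodename>` would report:

```shell
# Hypothetical values: take RealMemory from `scontrol show node <nodename>`
# and the request from the batch script's #SBATCH lines.
mem_per_cpu=14000      # MB, from #SBATCH --mem-per-cpu=14000
cpus_per_task=1        # from #SBATCH --cpus-per-task=1
real_memory=192000     # MB, example RealMemory for the node

requested=$((mem_per_cpu * cpus_per_task))
if [ "$requested" -gt "$real_memory" ]; then
    echo "request exceeds node memory: the job can never start here"
else
    echo "memory request fits the node; look elsewhere (e.g. a scheduler bug)"
fi
```

If the request fits the node, as it does here, the pending state points at something other than the hardware, which is what this ticket turned out to be.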

Thanks,
Bruno.
Comment 4 Marshall Garey 2020-12-14 15:29:55 MST
Hi Bruno,

This looks like a duplicate of bug 9724, fixed by commit 49a7d7f9fb, but only in 20.11. You can likely apply that commit to 20.02. Throw out the NEWS file hunk - it won't apply cleanly and you don't need it anyway. You can get the patch file here:

https://github.com/SchedMD/slurm/commit/49a7d7f9fb.patch

Can you apply this patch and let us know if it fixes the problem?
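Skipping the NEWS hunk while applying the rest can be done with `git apply --exclude`. A sketch, assuming you run from the top of your 20.02 Slurm source tree (the real commands are commented out; below them is a self-contained demonstration of `--exclude` on a throwaway repo):

```shell
# From the top of the Slurm 20.02 source tree:
#   curl -LO https://github.com/SchedMD/slurm/commit/49a7d7f9fb.patch
#   git apply --exclude=NEWS 49a7d7f9fb.patch   # drop the NEWS hunk

# Throwaway demonstration that --exclude skips one file's hunks:
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
printf 'code v1\n' > sched.c
printf 'old news\n' > NEWS
git add . && git -c user.email=a@b -c user.name=a commit -qm base
printf 'code v2\n' > sched.c      # modify both files...
printf 'new news\n' > NEWS
git diff > fix.patch              # ...and capture the change as a patch
git checkout -q -- .              # reset the working tree
git apply --exclude=NEWS fix.patch
cat sched.c                       # patched: "code v2"
cat NEWS                          # untouched: "old news"
```

`git am` would also work on a GitHub `/commit/<sha>.patch` file, but `git apply --exclude=NEWS` keeps the skip explicit.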
Comment 5 Marshall Garey 2021-01-08 08:40:54 MST
I'm closing this as a duplicate of bug 9724. Let us know if you have any issues with the patch.

*** This ticket has been marked as a duplicate of ticket 9724 ***