Ticket 10434 - Job with --mem-per-cpu doesn't get started.
Summary: Job with --mem-per-cpu doesn't get started.
Status: RESOLVED DUPLICATE of ticket 9724
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-12-14 05:43 MST by Marcin Stolarek
Modified: 2021-01-08 08:40 MST
CC: 1 user

See Also:
Site: SciNet
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Marcin Stolarek 2020-12-14 05:43:34 MST
I have found an issue. A user has the highest priority on Niagara, but his job doesn't run; it is stuck in the queue waiting for resources.
I simplified the script, tested it on our TDS, and was able to reproduce the problem.
For example the following batch script:

>#!/bin/bash -l
>#SBATCH --ntasks=1
>#SBATCH --time=00-11:30
>#SBATCH --ntasks=1
>#SBATCH --nodes=1
>#SBATCH --cpus-per-task=1
>#SBATCH --mem-per-cpu=14000
>#SBATCH -J memory_per_cpu_test
>#SBATCH -A scinet
>#SBATCH --mail-type=ALL
>#SBATCH --output=slurm-%j.out
> 
>echo "#########################################"
>echo " SLURM submission batch script stdout "
>echo "#########################################"
> 
>source ~/scripts/sbatch_job_envs.sh 
> 
>env > env_${SLURM_JOB_ID}
> 
>srun -l hostname
>sleep 600

is very similar to the user's, yet it gets stuck in the queue:

>squeue
>   JOBID PARTITION       NAME                 USER              ACCOUNT ST TIME_LIMIT       TIME  TIME_LEFT           START_TIME   CPUS   PRIORITY  NODES NODELIST(REASON)
>    5008   compute memory_per              bmundim               scinet PD   11:30:00       0:00   11:30:00                  N/A      1      22739      1 (Resources)

This happens even though the TDS cluster is free of jobs. If I comment out the following line:

>#SBATCH --mem-per-cpu=14000

the job runs normally. Do you know why? I will attach the slurm.conf and cgroup.conf in a minute.
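For context, the memory the scheduler must find on a node for this job is simply --mem-per-cpu times the CPUs allocated there. A minimal sketch of that arithmetic (the helper name is mine, not a Slurm API):

```python
def required_node_memory_mb(mem_per_cpu_mb, cpus_per_task, ntasks_per_node):
    """Memory (MB) a single node must provide for this job's allocation."""
    return mem_per_cpu_mb * cpus_per_task * ntasks_per_node

# The job above: 1 node, 1 task, 1 CPU per task, 14000 MB per CPU.
print(required_node_memory_mb(14000, 1, 1))  # 14000
```

So the pending job only needs 14000 MB on one node, which the idle TDS nodes should easily satisfy; that is what makes the (Resources) state look wrong.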

Thanks,
Bruno.
Comment 4 Marshall Garey 2020-12-14 15:29:55 MST
Hi Bruno,

This looks like a duplicate of bug 9724, fixed by commit 49a7d7f9fb, but only in 20.11. You can likely apply that commit to 20.02. Drop the hunk that touches the NEWS file; it won't apply cleanly and you don't need it anyway. You can get the patchfile here:

https://github.com/SchedMD/slurm/commit/49a7d7f9fb.patch
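If it helps, `git apply --exclude` can skip the NEWS hunk automatically rather than editing the patchfile by hand. A self-contained sketch in a throwaway repo (the file names here are illustrative, not the real Slurm tree):

```shell
set -e
cd "$(mktemp -d)" && git init -q
printf 'old\n' > sched.c
printf 'news-old\n' > NEWS
# A patch touching two files, as the upstream commit does
cat > fix.patch <<'EOF'
--- a/sched.c
+++ b/sched.c
@@ -1 +1 @@
-old
+new
--- a/NEWS
+++ b/NEWS
@@ -1 +1 @@
-news-old
+news-new
EOF
# Apply everything except hunks touching NEWS
git apply --exclude=NEWS fix.patch
```

Against the real tree, the equivalent would be `git apply --exclude=NEWS 49a7d7f9fb.patch` from the root of your 20.02 source checkout.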

Can you apply this patch and let us know if it fixes the problem?
Comment 5 Marshall Garey 2021-01-08 08:40:54 MST
I'm closing this as a duplicate of bug 9724. Let us know if you have any issues with the patch.

*** This ticket has been marked as a duplicate of ticket 9724 ***