Ticket 10434

Summary: Job with --mem-per-cpu doesn't get started.
Product: Slurm
Component: Scheduling
Version: 20.02.6
Severity: 4 - Minor Issue
Status: RESOLVED DUPLICATE
Reporter: Marcin Stolarek <cinek>
Assignee: Marshall Garey <marshall>
CC: bmundim
Hardware: Linux
OS: Linux
Site: SciNet

Description Marcin Stolarek 2020-12-14 05:43:34 MST
I have found an issue. A user has the highest priority on Niagara, but his job doesn't run; it is stuck in the queue waiting for resources.
I simplified the script, tested on our TDS, and could reproduce the problem.
For example the following batch script:

>#!/bin/bash -l
>#SBATCH --ntasks=1
>#SBATCH --time=00-11:30
>#SBATCH --ntasks=1
>#SBATCH --nodes=1
>#SBATCH --cpus-per-task=1
>#SBATCH --mem-per-cpu=14000
>#SBATCH -J memory_per_cpu_test
>#SBATCH -A scinet
>#SBATCH --mail-type=ALL
>#SBATCH --output=slurm-%j.out
> 
>echo "#########################################"
>echo " SLURM submission batch script stdout "
>echo "#########################################"
> 
>source ~/scripts/sbatch_job_envs.sh 
> 
>env > env_${SLURM_JOB_ID}
> 
>srun -l hostname
>sleep 600

is very similar to the user's. However, it gets stuck in the queue:

>squeue
>   JOBID PARTITION       NAME                 USER              ACCOUNT ST TIME_LIMIT       TIME  TIME_LEFT           START_TIME   CPUS   PRIORITY  NODES NODELIST(REASON)
>    5008   compute memory_per              bmundim               scinet PD   11:30:00       0:00   11:30:00                  N/A      1      22739      1 (Resources)

This happens even though the TDS cluster is free of jobs. If I comment out the following
line:

>#SBATCH --mem-per-cpu=14000

the job runs normally. Do you know why? I will attach the slurm.conf and cgroup.conf in a minute.
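For a job pending with reason (Resources), one quick sanity check is whether the per-node memory request exceeds what Slurm believes the node has. A minimal sketch, with hypothetical numbers: `mem_per_cpu` and `cpus_per_task` come from the batch script above, and `real_memory` stands in for the RealMemory value that `scontrol show node <nodename>` would report:

```shell
# Hypothetical values: take RealMemory from `scontrol show node <nodename>`
# and the request from the batch script's #SBATCH lines.
mem_per_cpu=14000      # MB, from #SBATCH --mem-per-cpu=14000
cpus_per_task=1        # from #SBATCH --cpus-per-task=1
real_memory=192000     # MB, example RealMemory for the node

requested=$((mem_per_cpu * cpus_per_task))
if [ "$requested" -gt "$real_memory" ]; then
    echo "request exceeds node memory: the job can never start here"
else
    echo "memory request fits the node; look elsewhere (e.g. a scheduler bug)"
fi
```

If the request fits the node, as it does here, the pending state points at something other than the hardware, which is what this ticket turned out to be.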

Thanks,
Bruno.
Comment 4 Marshall Garey 2020-12-14 15:29:55 MST
Hi Bruno,

This looks like a duplicate of bug 9724, fixed by commit 49a7d7f9fb, but only in 20.11. You can likely apply that commit to 20.02. Throw out the NEWS file hunk - it won't apply cleanly and you don't need it anyway. You can get the patch file here:

https://github.com/SchedMD/slurm/commit/49a7d7f9fb.patch

Can you apply this patch and let us know if it fixes the problem?
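Skipping the NEWS hunk while applying the rest can be done with `git apply --exclude`. A sketch, assuming you run from the top of your 20.02 Slurm source tree (the real commands are commented out; below them is a self-contained demonstration of `--exclude` on a throwaway repo):

```shell
# From the top of the Slurm 20.02 source tree:
#   curl -LO https://github.com/SchedMD/slurm/commit/49a7d7f9fb.patch
#   git apply --exclude=NEWS 49a7d7f9fb.patch   # drop the NEWS hunk

# Throwaway demonstration that --exclude skips one file's hunks:
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
printf 'code v1\n' > sched.c
printf 'old news\n' > NEWS
git add . && git -c user.email=a@b -c user.name=a commit -qm base
printf 'code v2\n' > sched.c      # modify both files...
printf 'new news\n' > NEWS
git diff > fix.patch              # ...and capture the change as a patch
git checkout -q -- .              # reset the working tree
git apply --exclude=NEWS fix.patch
cat sched.c                       # patched: "code v2"
cat NEWS                          # untouched: "old news"
```

`git am` would also work on a GitHub `/commit/<sha>.patch` file, but `git apply --exclude=NEWS` keeps the skip explicit.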
Comment 5 Marshall Garey 2021-01-08 08:40:54 MST
I'm closing this as a duplicate of bug 9724. Let us know if you have any issues with the patch.

*** This ticket has been marked as a duplicate of ticket 9724 ***