Ticket 10434 - Job with --mem-per-cpu doesn't get started.
Summary: Job with --mem-per-cpu doesn't get started.
Status: RESOLVED DUPLICATE of ticket 9724
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-12-14 05:43 MST by Marcin Stolarek
Modified: 2021-01-08 08:40 MST
CC: 1 user

See Also:
Site: SciNet
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Marcin Stolarek 2020-12-14 05:43:34 MST
I have found an issue. A user has the highest priority on Niagara, but his job doesn't run; it is stuck in the queue waiting for resources.
I simplified the script, tested it on our TDS, and was able to reproduce the problem.
For example the following batch script:

>#!/bin/bash -l
>#SBATCH --ntasks=1
>#SBATCH --time=00-11:30
>#SBATCH --ntasks=1
>#SBATCH --nodes=1
>#SBATCH --cpus-per-task=1
>#SBATCH --mem-per-cpu=14000
>#SBATCH -J memory_per_cpu_test
>#SBATCH -A scinet
>#SBATCH --mail-type=ALL
>#SBATCH --output=slurm-%j.out
> 
>echo "#########################################"
>echo " SLURM submission batch script stdout "
>echo "#########################################"
> 
>source ~/scripts/sbatch_job_envs.sh 
> 
>env > env_${SLURM_JOB_ID}
> 
>srun -l hostname
>sleep 600

is very similar to the user's, yet it gets stuck in the queue:

>squeue
>   JOBID PARTITION       NAME                 USER              ACCOUNT ST TIME_LIMIT       TIME  TIME_LEFT           START_TIME   CPUS   PRIORITY  NODES NODELIST(REASON)
>    5008   compute memory_per              bmundim               scinet PD   11:30:00       0:00   11:30:00                  N/A      1      22739      1 (Resources)

This happens even though the TDS cluster is free of jobs. If I comment out the following line:

>#SBATCH --mem-per-cpu=14000

the job runs normally. Do you know why? I will attach the slurm.conf and cgroup.conf in a minute.
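For context, the memory the scheduler must find on a node for this job is simply --mem-per-cpu times the CPUs allocated there. A minimal sketch of that arithmetic (the helper name is mine, not a Slurm API):

```python
def required_node_memory_mb(mem_per_cpu_mb, cpus_per_task, ntasks_per_node):
    """Memory (MB) a single node must provide for this job's allocation."""
    return mem_per_cpu_mb * cpus_per_task * ntasks_per_node

# The job above: 1 node, 1 task, 1 CPU per task, 14000 MB per CPU.
print(required_node_memory_mb(14000, 1, 1))  # 14000
```

So the pending job only needs 14000 MB on one node, which the idle TDS nodes should easily satisfy; that is what makes the (Resources) state look wrong.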

Thanks,
Bruno.
Comment 4 Marshall Garey 2020-12-14 15:29:55 MST
Hi Bruno,

This looks like a duplicate of bug 9724, fixed by commit 49a7d7f9fb, but only in 20.11. You can likely apply that commit to 20.02. Drop the hunk that touches the NEWS file; it won't apply cleanly and you don't need it anyway. You can get the patchfile here:

https://github.com/SchedMD/slurm/commit/49a7d7f9fb.patch
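If it helps, `git apply --exclude` can skip the NEWS hunk automatically rather than editing the patchfile by hand. A self-contained sketch in a throwaway repo (the file names here are illustrative, not the real Slurm tree):

```shell
set -e
cd "$(mktemp -d)" && git init -q
printf 'old\n' > sched.c
printf 'news-old\n' > NEWS
# A patch touching two files, as the upstream commit does
cat > fix.patch <<'EOF'
--- a/sched.c
+++ b/sched.c
@@ -1 +1 @@
-old
+new
--- a/NEWS
+++ b/NEWS
@@ -1 +1 @@
-news-old
+news-new
EOF
# Apply everything except hunks touching NEWS
git apply --exclude=NEWS fix.patch
```

Against the real tree, the equivalent would be `git apply --exclude=NEWS 49a7d7f9fb.patch` from the root of your 20.02 source checkout.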

Can you apply this patch and let us know if it fixes the problem?
Comment 5 Marshall Garey 2021-01-08 08:40:54 MST
I'm closing this as a duplicate of bug 9724. Let us know if you have any issues with the patch.

*** This ticket has been marked as a duplicate of ticket 9724 ***