Ticket 21766 - Jobs oversubscribing when resources should be allocated
Summary: Jobs oversubscribing when resources should be allocated
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 23.02.7
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-01-08 07:55 MST by sdavis2
Modified: 2025-01-08 07:56 MST

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (4.34 KB, text/plain)
2025-01-08 07:55 MST, sdavis2
oversubscribe test submit (633 bytes, text/x-sh)
2025-01-08 07:56 MST, sdavis2

Description sdavis2 2025-01-08 07:55:47 MST
Created attachment 40309 [details]
slurm.conf

I searched around for a similar issue and haven't been able to find one, so apologies if this has been discussed before.
We have a small cluster (14 nodes) and are running into an oversubscription issue that seems like it shouldn't be happening.
The nodes in the partition I'm testing on each have 256 GB of RAM and 80 cores.
It's configured this way:
PartitionName="phyq" MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=FORCE:4 PreemptMode=OFF MaxMemPerNode=240000 DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL Nodes=phygrid[01-04]
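
(Editor's note: a minimal sketch, not part of the original report, for confirming the partition and node limits the controller is actually using; phygrid01 is just one of the nodes listed above.)

# Show the OverSubscribe, MaxMemPerNode, and DefMemPerCPU values in effect for the partition
scontrol show partition phyq
# Show the CPUs and RealMemory configured for one of the nodes
scontrol show node phygrid01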

Our slurm.conf is set like this:
SelectType=select/linear
SelectTypeParameters=CR_Memory
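
(Editor's note: a quick sanity check, not from the original report, that the running controller loaded these select plugin settings.)

# List the SelectType and SelectTypeParameters the running slurmctld is using
scontrol show config | grep -i select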

The job submitted is simply this:
#!/bin/bash
#SBATCH --job-name=test_oversubscription    # Job name
#SBATCH --output=test_oversubscription%j.out # Output file
#SBATCH --error=test_oversubscription.err  # Error file
#SBATCH --mem=150G                         # Request 150 GB memory
#SBATCH --ntasks=1                         # Number of tasks
#SBATCH --cpus-per-task=60                  # CPUs per task
#SBATCH --time=00:05:00                    # Run for 5 minutes
#SBATCH --partition=phyq       # Replace with your partition name

# Display allocated resources
echo "Job running on node(s): $SLURM_NODELIST"
echo "Requested CPUs: $SLURM_CPUS_ON_NODE"
echo "Requested memory: $SLURM_MEM_PER_NODE MB"

# Simulate workload
sleep 300

In my mind this should submit to nodes 1, 2, 3, and 4, and then when I submit a 5th job it should sit in Pending and start once the first job ends. Instead, when I send the 5th job it goes straight to node 1. When a real job does this, performance drops sharply because jobs are sharing resources even though those resources were explicitly requested.
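
(Editor's note: a minimal reproduction sketch, assuming the batch script above is saved as test_oversubscription.sh; the filename is illustrative, and the expected vs. observed behavior in the comments restates the reporter's description.)

# Submit five copies of the test job to the 4-node phyq partition
for i in 1 2 3 4 5; do
    sbatch test_oversubscription.sh
done
# Expected: the 5th job shows PENDING; observed: it starts alongside job 1 on phygrid01
squeue -p phyq -o "%.10i %.9T %.6D %R"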

Am I missing something painfully obvious? 

Thanks for any help/advice.
Steve Davis
Comment 1 sdavis2 2025-01-08 07:56:18 MST
Created attachment 40310 [details]
oversubscribe test submit