Created attachment 40309 [details] slurm.conf

I searched around for a similar issue and haven't been able to find one, but sorry if this has been discussed before. We have a small cluster (14 nodes) and are running into an oversubscribe issue that seems like it shouldn't be there. The partition I'm testing on has 256 GB of RAM and 80 cores. It's configured this way:

PartitionName="phyq" MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=FORCE:4 PreemptMode=OFF MaxMemPerNode=240000 DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL Nodes=phygrid[01-04]

Our slurm.conf is set like this:

SelectType=select/linear
SelectTypeParameters=CR_Memory

The job submitted is simply this:

#!/bin/bash
#SBATCH --job-name=test_oversubscription      # Job name
#SBATCH --output=test_oversubscription%j.out  # Output file
#SBATCH --error=test_oversubscription.err     # Error file
#SBATCH --mem=150G                            # Request 150 GB memory
#SBATCH --ntasks=1                            # Number of tasks
#SBATCH --cpus-per-task=60                    # CPUs per task
#SBATCH --time=00:05:00                       # Run for 5 minutes
#SBATCH --partition=phyq                      # Partition name

# Display allocated resources
echo "Job running on node(s): $SLURM_NODELIST"
echo "Requested CPUs: $SLURM_CPUS_ON_NODE"
echo "Requested memory: $SLURM_MEM_PER_NODE MB"

# Simulate workload
sleep 300

My expectation is that the first four jobs land on nodes 1, 2, 3, and 4, and that a 5th job sits in Pending until the first job ends. Instead, when I submit the 5th job it starts right away on node 1. When a real job does this, performance drops sharply because it ends up sharing resources that were explicitly requested. Am I missing something painfully obvious?

Thanks for any help/advice.

Steve Davis
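For reference, a quick way to double-check what the controller actually applied (just a sketch; the grep patterns below are my own and no output is included here):

scontrol show partition phyq | grep -iE 'oversubscribe|maxmem|defmem'
scontrol show config | grep -iE '^selecttype'

# While jobs are running, per-node allocations can be checked as well:
scontrol show node phygrid01 | grep -iE 'cpualloc|allocmem|realmemory'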
Created attachment 40310 [details] oversubscribe test submit
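A minimal reproduction sketch using the attached submit script (assuming it is saved as test_oversubscription.sh; the filename is my assumption):

# Submit five copies of the test job to the phyq partition.
for i in 1 2 3 4 5; do
    sbatch test_oversubscription.sh
done

# Expected: jobs 1-4 each take one of phygrid[01-04] and job 5 pends (PD);
# observed: the 5th job starts immediately on the same node as the first.
squeue -p phyq -o "%.10i %.2t %.15M %.6D %R"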