Ticket 13059

Summary: Job submission over overlapping partitions
Product: Slurm    Reporter: Paolo Oliveri <paul>
Component: Scheduling    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID    QA Contact:
Severity: 6 - No support contract
Priority: ---
Version: - Unsupported Older Versions
Hardware: Linux
OS: Linux
Site: -Other-
Attachments: Slurm configuration for our cluster

Description Paolo Oliveri 2021-12-17 09:57:57 MST
Created attachment 22727 [details]
Slurm configuration for our cluster

Hello,
We have a cluster of 12 compute nodes plus 1 master node, running CentOS 7.7 and Slurm 18.08.8.

I know that we should upgrade to a supported version, but we cannot do that right now, and in any case, before upgrading, we would like to know whether there are reasons other than the Slurm version behind this bug.

We configured several partitions and QOSes on the system, and some partitions share nodes with others.
Every node has 144 threads, 4 of which are reserved for the system, leaving 140 available for running jobs.

The preemption method is suspend with gang scheduling, together with QOS-based prioritization.
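For context, a preemption setup like the one described usually corresponds to slurm.conf entries along these lines (a sketch only; the exact values for our cluster are in the attached configuration):

```
# Preempt lower-priority jobs based on their QOS; preempted jobs are
# suspended and gang-scheduled (time-sliced) rather than cancelled.
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
```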

The problem is that we are experiencing node overloads: when a node is already running 140 jobs from one partition, jobs submitted through another partition still start on it, even though Slurm should not accept any more work for that node.

For example, the node hpc-01 is fully allocated by jobs from the "normal" partition:

scontrol show node hpc-01

NodeName=hpc-01 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=140 CPUTot=144 CPULoad=117.54
   AvailableFeatures=infiniband
   ActiveFeatures=infiniband
   Gres=(null)
   NodeAddr=hpc-01 NodeHostName=hpc-01 Version=18.08
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 
   RealMemory=515633 AllocMem=430080 FreeMem=484762 Sockets=4 Boards=1
   CoreSpecCount=2 CPUSpecList=106-107,142-143 MemSpecLimit=6144
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=interactive,normal,fast,big 
   BootTime=2021-07-30T18:27:35 SlurmdStartTime=2021-07-30T18:29:19
   CfgTRES=cpu=144,mem=515633M,billing=144
   AllocTRES=cpu=140,mem=420G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
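To make the node state concrete: hpc-01 has no schedulable CPUs left, so any new allocation there should pend. A quick arithmetic check using only the figures from the scontrol output above:

```python
# Values taken from the scontrol output for hpc-01.
cpu_tot = 144          # CPUTot
core_spec_count = 2    # CoreSpecCount (specialized cores reserved for the system)
threads_per_core = 2   # ThreadsPerCore
cpu_alloc = 140        # CPUAlloc

# CPUs reserved for the system: specialized cores times threads per core
# (matches the 4 entries in CPUSpecList=106-107,142-143).
spec_cpus = core_spec_count * threads_per_core

schedulable = cpu_tot - spec_cpus   # CPUs usable by jobs
free = schedulable - cpu_alloc      # CPUs still unallocated

print(schedulable, free)  # 140 0
```

So any additional job on hpc-01, from whatever partition, should have to wait for resources.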

If we try to submit a new job to hpc-01 through the "normal" partition, it pends as expected:

salloc -n 1 --partition normal --qos normal --nodelist hpc-01 --no-shell

salloc: Pending job allocation 1469203
salloc: job 1469203 queued and waiting for resources

If we instead submit a new job through another partition, "big", which overlaps with the "normal" partition on nodes such as hpc-01:

salloc -n 1 --partition big --qos low --nodelist hpc-01 --no-shell

salloc: Granted job allocation 1469204
salloc: Waiting for resource configuration
salloc: Nodes hpc-01 are ready for job

Do you know whether there is a problem in our Slurm configuration? I attached it for inspection.

Sincerely,
Paolo Oliveri