I'm currently testing Slurm 22.05.03 and seeing some inconsistent behavior. The node in this case has 32 physical cores with hyperthreading, so 64 logical CPUs. In slurm.conf we have essentially these settings:

GresTypes=gpu
TaskPlugin=task/cgroup,task/affinity
DefMemPerCPU=512
DefMemPerGPU=16384
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE,CR_CORE_DEFAULT_DIST_BLOCK

NodeName=gpu[50] Feature="gpu,avx2,naples,ampere" CPUs=64 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=126976 Gres=gpu:a5000:4

PartitionName=gpu3 Nodes=gpu[50] DefCpuPerGPU=16 OverSubscribe=NO

The idea is to slice the machine into "GPU slots", so if I just do

$ salloc -p gpu3 -G 1

I get one GPU and a quarter of the CPU resources. Since CR_ONE_TASK_PER_CORE is set, the maximum number of tasks should be 32, and if I request more I get an error message (as it should be):

$ salloc -p gpu3 -N 1 -n 64
salloc: error: Job submit/allocate failed: Requested node configuration is not available

However, if I request all 4 GPUs and then check TASKS_PER_NODE, I see 64:

$ salloc -p gpu3 -N 1 -G 4
$ env | grep TASKS
SLURM_TASKS_PER_NODE=64

Also possibly related: if I increase the DefMemPerCPU setting to 1024, I cannot allocate 4 GPUs:

$ salloc -p gpu3 -N 1 -G 4
salloc: error: Job submit/allocate failed: Requested node configuration is not available
$ salloc -p gpu3 -N 1 -G 4 --mem=120G
salloc: Granted job allocation 9

I've tested with lower values and the breaking point is at 992, i.e. RealMemory/64/2.
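For reference, the intended "GPU slot" sizing follows directly from the defaults in the config above; a quick sketch (variable names are mine, values taken from the config):

```shell
# Each GPU pulls in DefCpuPerGPU logical CPUs and DefMemPerGPU MiB by default,
# so a 4-GPU request should claim the whole node's CPU count.
def_cpu_per_gpu=16
def_mem_per_gpu=16384   # MiB
gpus=4
echo "cpus=$(( gpus * def_cpu_per_gpu )) mem_mib=$(( gpus * def_mem_per_gpu ))"
# -> cpus=64 mem_mib=65536 (i.e. 64G)
```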
Sorry, some typos got into the latter part of the text while I was editing. Since I'm reposting that part, I might as well add some more info.

Also possibly related: if I increase the DefMemPerCPU setting to 1024, I cannot allocate 4 GPUs without the --mem flag:

$ salloc -p gpu3 -N 1 -G 4
salloc: error: Job submit/allocate failed: Requested node configuration is not available
$ salloc -p gpu3 -N 1 -G 4 --mem=120G
salloc: Granted job allocation 9

Note that the first command should request 64G, which fails, yet I can request 120G explicitly. I've tested with lower values and the breaking point is at 992, i.e. RealMemory/64/2. So it seems the first command is actually checking for 128G instead of 64G. Yet the final allocation has the correct amount; here with DefMemPerCPU=992:

$ salloc -p gpu3 -N 1 -G 4
salloc: Granted job allocation 10
$ scontrol show jobid=10
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=64G,node=1,billing=64
   CpusPerTres=gpu:16
   MemPerTres=gpu:16384
   TresPerJob=gres:gpu:4
[...]

The behavior is the same if I remove DefMemPerGPU from the config file and keep only DefMemPerCPU, so it does not seem to be related to that. The only difference is that the allocated memory then clearly comes from the DefMemPerCPU setting, since I get 62G instead of the 64G above:
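To make the breaking-point observation explicit, here is the arithmetic (my reconstruction from the numbers above, not from Slurm source; the hypothesis is that the availability check tests twice the memory that the final allocation reports):

```shell
real_memory=126976                              # MiB, the node's RealMemory
logical_cpus=64
# Largest DefMemPerCPU that still fits if the check doubles the request:
echo $(( real_memory / logical_cpus / 2 ))      # -> 992
# With DefMemPerCPU=1024 the doubled check would exceed RealMemory:
def_mem_per_cpu=1024
echo $(( logical_cpus * def_mem_per_cpu * 2 ))  # -> 131072 MiB, i.e. 128G > 126976
```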
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=62G,node=1,billing=64
   CpusPerTres=gpu:16
   TresPerJob=gres:gpu:4

But again with DefMemPerCPU=1024:

$ salloc -v -p gpu3 -N 1 -G 4
salloc: defined options
salloc: -------------------- --------------------
salloc: gpus                : 4
salloc: nodes               : 1
salloc: partition           : gpu3
salloc: verbose             : 1
salloc: -------------------- --------------------
salloc: end of defined options
salloc: select/cray_aries: init: Cray/Aries node selection plugin loaded
salloc: select/linear: init: Linear node selection plugin loaded with argument 4372
salloc: select/cons_res: common_init: select/cons_res loaded
salloc: select/cons_tres: common_init: select/cons_tres loaded
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 12 has been revoked.
$ scontrol show jobid=12
JobId=12 JobName=interactive
   JobState=FAILED Reason=BadConstraints Dependency=(null)
   Scheduler=Main
   Partition=gpu3 AllocNode:Sid=login2:173281
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=N/A CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1G,node=1,billing=1
   CpusPerTres=gpu:16
   TresPerJob=gres:gpu:4
[...]
I've looked through the closed bug reports and found https://bugs.schedmd.com/show_bug.cgi?id=14223

I think the suggestion there, to manually set the number of CPUs to the number of physical cores, takes care of the inconsistencies I'm seeing here, i.e. changing these lines in my slurm.conf to:

DefMemPerCPU=1024

NodeName=gpu[50] Feature="gpu,avx2,naples,ampere" CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=126976 Gres=gpu:a5000:4

PartitionName=gpu3 Nodes=gpu[50] DefCpuPerGPU=8 OverSubscribe=NO

I suppose configuring resources as physical cores is an explicit way of forcing CR_ONE_TASK_PER_CORE?
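For completeness, the revised numbers still tile the node exactly; a quick sanity check (variable names are mine, values from the revised config):

```shell
# With CPUs counted as physical cores, DefCpuPerGPU drops to 8, so a
# 4-GPU job again consumes the whole node's cores.
gpus=4
def_cpu_per_gpu=8
cores=$(( gpus * def_cpu_per_gpu ))
echo "cores=${cores}"                          # -> cores=32
# Default memory at DefMemPerCPU=1024 stays well within RealMemory=126976:
def_mem_per_cpu=1024
echo "mem_mib=$(( cores * def_mem_per_cpu ))"  # -> mem_mib=32768
```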
As much as I enjoy my own company, I did file this bug in the hope of getting some feedback ;-)
Stefan, Jess Arrington <jess@schedmd.com> was supposed to have contacted you about this last week. Did you get a direct email from Jess about this? Jacob
No, I didn't receive any email. But I do get and delete spam on that account, so I could have accidentally deleted it. Could you ask Jess to resend it, please? I'll add the email address to my whitelist.