Ticket 16655 - Scheduler considers the gpu resource request to be invalid when it should not
Status: RESOLVED DUPLICATE of ticket 14153
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 22.05.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-05-04 20:41 MDT by Pascal
Modified: 2023-07-11 04:02 MDT
CC List: 1 user

See Also:
Site: Pawsey


Attachments
slurm.conf (37.46 KB, text/plain), 2023-05-09 20:21 MDT, Sam Yates
gres.conf (11.32 KB, text/plain), 2023-05-09 20:22 MDT, Sam Yates
cli_filter.lua (4.54 KB, text/plain), 2023-05-16 20:06 MDT, Sam Yates

Description Pascal 2023-05-04 20:41:30 MDT
Hello,

We are running into an issue with the request for CPU resources when also requesting GPU resources, specifically with the use of --cpus-per-gpu.

For context, we are running an 8-GPU node with shared access. The partition configuration is as follows
(node definitions omitted for brevity).
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=17152 TotalNodes=134 SelectTypeParameters=CR_SOCKET_MEMORY
   JobDefaults=(null)
   DefMemPerCPU=1840 MaxMemPerCPU=3680
   TRES=cpu=17152,mem=32830000M,node=134,billing=137216,gres/gpu=1072
   TRESBillingWeights=CPU=2,Mem=0.24G,gres/GPU=128

So we allocate memory based on the number of cores requested (whether through tasks or Slurm 'cpus'). The desired behaviour for non-exclusive GPU partition allocations is that 8 cores are allocated for each GPU allocated on the node. Testing shows that we do get a minimum of 8 cores in an allocation, but depending on the --cpus-per-task or --cpus-per-gpu options in the request, we otherwise get either only 8 cores or all 64 cores.

Ultimately, we think the optimal desired behaviour can be achieved through the use of a Slurm CLI filter, so that the number of cores and the allocated memory depend solely on the number of GPUs allocated on the node, possibly by enforcing --cpus-per-gpu and --mem-per-gpu settings.
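To illustrate, a minimal sketch of the kind of cli_filter rule we have in mind (this is an assumption-laden sketch, not our attached cli_filter.lua: the partition names, the 8-cores-per-GPU ratio, and the per-GPU memory figure of 8 x DefMemPerCPU = 14720M are illustrative only):

```lua
-- Hypothetical sketch of a cli_filter.lua rule: tie CPUs and memory to the
-- GPU request on the shared GPU partitions. Partition names, the 8 cores/GPU
-- ratio, and the memory figure (8 x DefMemPerCPU=1840M) are assumptions.
function slurm_cli_pre_submit(options, pack_offset)
    local part = options["partition"]
    if part ~= "gpu" and part ~= "gpu-dev" then
        return slurm.SUCCESS
    end
    local gpus = tonumber(options["gpus"])
    if gpus == nil or gpus < 1 then
        -- Also rejects the zero-GPU allocations noted below.
        slurm.log_error("jobs in %s must request at least one GPU", part)
        return slurm.ERROR
    end
    -- Override whatever the user supplied so cores and memory
    -- follow the GPU count.
    options["cpus-per-gpu"] = "8"
    options["mem-per-gpu"] = "14720M"
    return slurm.SUCCESS
end
```

This runs client-side at submission time, so scheduler-side limits would still be needed to enforce the policy strictly.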

The test runs below use the following script to report allocated cores and GPUs; each allocation is for a single node (and a single task), --cpu-bind=none and --gpu-bind=none are used in the srun invocation to disable any binding to a subset of the allocated resources.

#!/usr/bin/env bash
# Report the cores and NUMA domains in this process's CPU binding mask,
# and the PCI bus IDs of the visible GPUs.
mask=$(hwloc-bind --get)
echo "cores $(hwloc-calc -I core "$mask")"
echo "NUMA domains $(hwloc-calc -I NUMAnode "$mask")"
/opt/rocm/bin/rocm-smi --showbus | grep '^GPU'

We see the same behaviour on both the gpu and gpu-dev partitions. Except where relevant, salloc and srun output has been omitted.

## Tests without specifying -c/--cpus-per-task or --cpus-per-gpu

Here we get eight cores no matter what. A zero-GPU allocation succeeds (it would be best if it did not) and is granted eight cores.

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --gpus=0 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7
NUMA domains 0
WARNING: No AMD GPUs specified

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --gpus=1 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7
NUMA domains 0
GPU[0]          : PCI Bus: 0000:D1:00.0

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --gpus=2 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7
NUMA domains 0
GPU[0]          : PCI Bus: 0000:D1:00.0
GPU[1]          : PCI Bus: 0000:D6:00.0

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --gpus=4 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7
NUMA domains 0
GPU[0]          : PCI Bus: 0000:C9:00.0
GPU[1]          : PCI Bus: 0000:CE:00.0
GPU[2]          : PCI Bus: 0000:D1:00.0
GPU[3]          : PCI Bus: 0000:D6:00.0

## Tests with --cpus-per-gpu=8

This gives the desired result for one GPU (that is --ntasks=1 --gpus-per-task=1 --cpus-per-task=8 is fine) but is otherwise weird:

A request for 2 or 6 GPUs fails with an unavailable node configuration error.
A request for 3 or 4 GPUs allocates 33 cores comprising the first two NUMA domains (4 CCXs) and the first core of the next.
+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --cpus-per-gpu=8 --gpus=1 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7
NUMA domains 0
GPU[0]          : PCI Bus: 0000:D1:00.0

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --cpus-per-gpu=8 --gpus=2 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
salloc: error: Job submit/allocate failed: Requested node configuration is not available

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --cpus-per-gpu=8 --gpus=3 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
NUMA domains 0,2,3
GPU[0]		: PCI Bus: 0000:D1:00.0
GPU[1]		: PCI Bus: 0000:D9:00.0
GPU[2]		: PCI Bus: 0000:DE:00.0

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --cpus-per-gpu=8 --gpus=4 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
NUMA domains 0,2,3
GPU[0]          : PCI Bus: 0000:C1:00.0
GPU[1]          : PCI Bus: 0000:D1:00.0
GPU[2]          : PCI Bus: 0000:D9:00.0
GPU[3]          : PCI Bus: 0000:DE:00.0

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --cpus-per-gpu=8 --gpus=6 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
salloc: error: Job submit/allocate failed: Requested node configuration is not available

The key issue here is the failure of resource requests that should definitely be satisfiable.

## Tests with --cpus-per-task
Here the tests are run with an explicit request for eight times as many cores as GPUs. A request for 1 GPU and eight cores grants eight cores; requests for 2 GPUs and 16 cores or 4 GPUs and 32 cores grant all 64 cores.

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --cpus-per-task=8 --gpus=1 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7
NUMA domains 0
GPU[0]          : PCI Bus: 0000:D1:00.0

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --cpus-per-task=16 --gpus=2 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
NUMA domains 0,1,2,3
GPU[0]          : PCI Bus: 0000:D1:00.0
GPU[1]          : PCI Bus: 0000:D6:00.0

+ salloc -n 1 -N 1 -p gpu-dev -A pawsey0001-gpu --cpus-per-task=32 --gpus=4 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none ./report
cores 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
NUMA domains 0,1,2,3
GPU[0]          : PCI Bus: 0000:C9:00.0
GPU[1]          : PCI Bus: 0000:CE:00.0
GPU[2]          : PCI Bus: 0000:D1:00.0
GPU[3]          : PCI Bus: 0000:D6:00.0


It is not obvious why Slurm behaves so oddly with --cpus-per-gpu, nor why with --cpus-per-task we sometimes get more CPUs than requested.
Comment 1 Ben Roberts 2023-05-05 10:19:29 MDT
Hi Pascal,

This looks like it might be related to a similar issue we've seen.  The ticket shows that you are using 22.05.2.  Can I have you confirm that this is correct?  I'd also like to have you send a current copy of your slurm.conf and gres.conf files, if possible.  

Thanks,
Ben
Comment 4 Sam Yates 2023-05-09 20:21:43 MDT
Created attachment 30194 [details]
slurm.conf
Comment 5 Sam Yates 2023-05-09 20:22:20 MDT
Created attachment 30195 [details]
gres.conf
Comment 6 Sam Yates 2023-05-09 20:23:05 MDT
(In reply to Ben Roberts from comment #1)
> This looks like it might be related to a similar issue we've seen.  The
> ticket shows that you are using 22.05.2.  Can I have you confirm that this
> is correct?  I'd also like to have you send a current copy of your
> slurm.conf and gres.conf files, if possible.  

I've attached our slurm.conf and gres.conf, and can confirm that we're running 22.05.2.

Cheers,
Sam
Comment 7 Marcin Stolarek 2023-05-12 03:42:03 MDT
Could you please share the cli_filter.lua script? Maybe I'm missing some detail, but I can't reproduce the issue with 22.05.2 and similar config.
For instance for the case of 2 GPUs I get:

>[root@slurmctl slurm-sources]# salloc -n 1 -N 1 -p gpu-dev  --cpus-per-gpu=8 --gpus=2 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none scontrol show job -d | grep Nodes=
>salloc: Granted job allocation 10969
>   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>     Nodes=test01 CPU_IDs=0-15 Mem=29440 GRES=gpu:2(IDX:4-5)
>salloc: Relinquishing job allocation 10969
>salloc: Job allocation 10969 has been revoked.
>[root@slurmctl slurm-sources]# salloc -V
>slurm 22.05.2


cheers,
Marcin
Comment 8 Sam Yates 2023-05-16 20:06:46 MDT
Created attachment 30329 [details]
cli_filter.lua
Comment 9 Marcin Stolarek 2023-05-18 05:46:51 MDT
Pascal, Sam,

I was able to reproduce the issue on Slurm 22.05.2 and confirmed that it's already fixed in Slurm 22.05.7 (by commit 8e1231028dc).

Could you please verify if the issue is resolved by Slurm update?

cheers,
Marcin
Comment 10 Marcin Stolarek 2023-05-29 06:14:12 MDT
Any update from your side? If I don't hear back, I'll go ahead and close this ticket as a duplicate of ticket 14153.

cheers,
Marcin
Comment 11 Pascal 2023-05-29 19:28:38 MDT
Hi Marcin,
So we haven't tested a newer version of Slurm, primarily because Setonix currently has an older version "blessed" by Cray. We are looking into installing newer versions decoupled from the Cray-blessed version, but that is in the future.

Feel free to close the ticket.
Cheers,
Pascal
Comment 12 Marcin Stolarek 2023-05-30 00:09:33 MDT
I'm marking the ticket as a duplicate of Bug 14153.

*** This ticket has been marked as a duplicate of ticket 14153 ***