Ticket 15522

Summary: Only half the requested CPU cores are available when asking for a single GPU (GRES) resource
Product: Slurm Reporter: hpc-admin
Component: slurmstepd    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: marshall
Version: 22.05.5   
Hardware: Linux   
OS: Linux   
Site: Ghent
Attachments: slurm config file

Description hpc-admin 2022-11-30 09:04:17 MST
Created attachment 27960 [details]
slurm config file

Hi,


When submitting a job to a GPU cluster, we don't understand why the step_0 cgroup only gets 16 cores instead of the expected 32.


This is the job submission command:

/usr/bin/salloc --reservation=maintenance2022Q4 --cpus-per-gpu=32 --gres=gpu:1 --job-name=INTERACTIVE --mail-type=NONE --nodes=1 --ntasks-per-node=32 --ntasks=32 --time=3-00:00:00 /usr/bin/srun --chdir=/user/gent/400/vsc40003 --cpu-bind=none --export=USER,HOME,TERM --mem=0 --mpi=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l

salloc: Granted job allocation 40271421
salloc: Waiting for resource configuration
salloc: Nodes node3303.joltik.os are ready for job


Which then yields:

[vsc40003@node3303 ~]$ nproc
16
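(Worth noting: nproc reports the CPUs in the calling process's affinity mask, not the number of cores on the node, which is why it shows 16 inside the step while the node has 32. A quick way to cross-check from inside the step, using standard Linux tools:)

```shell
# nproc counts the CPUs in the caller's affinity mask, not all online CPUs,
# so inside the step it reports 16 even though the node has 32 cores.
nproc                                      # CPUs usable by this process
grep Cpus_allowed_list /proc/self/status   # the kernel's view of the same mask
```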

Looking at the job's info I see:

TRES=cpu=32,mem=262080M,node=1,billing=33,gres/gpu=1


Looking at the cgroups, I see:

[root@node3303 job_40271421]# cat cpuset.cpus
0-31


The same holds for the step_extern cgroup.

But

[root@node3303 step_0]# cat cpuset.cpus
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30


Somehow this job step only gets the even-numbered cores. Is this expected, or do we need to configure something differently?

When not asking for any GPUs, we do see that 32 cores are assigned to this job step.
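(To compare the two cpusets precisely, the cpuset list syntax seen above can be expanded into individual CPU ids with a small helper — a sketch, with the values copied from the output in this ticket:)

```shell
#!/bin/sh
# Sketch: expand a cgroup cpuset list (the "0-31" / "0,2,4,..." syntax) into
# one CPU id per line, so a step's cpuset can be diffed against the job's
# cpuset or against any Cores= entries in gres.conf.
expand_cpuset() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"
    done
}

expand_cpuset "0-31" | wc -l                                        # 32 cores
expand_cpuset "0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30" | wc -l  # 16 cores
```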

Our cgroup config is:


AllowedSwapSpace=0
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes


-- Andy
Comment 2 Marcin Stolarek 2022-12-01 06:33:07 MST
Andy,

I can't easily reproduce the behavior. Could you please attach the output of `lstopo-no-graphics` and your gres.conf?

cheers,
Marcin
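(For context: one common reason a GPU step is confined to a specific core list is a per-GPU Cores= binding in gres.conf, which tells Slurm which cores are local to each GPU — presumably why the gres.conf is being requested here. A hypothetical fragment, with the device path and core list invented for illustration, not taken from this site's config:)

```
# Hypothetical gres.conf: Cores= declares which cores are local to the GPU;
# task binding can then confine steps that use this GPU to that core list.
NodeName=node3303 Name=gpu File=/dev/nvidia0 Cores=0-15
```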
Comment 3 Marcin Stolarek 2022-12-08 07:03:57 MST
Could you please take a look at the last comment?

cheers,
Marcin
Comment 4 hpc-admin 2022-12-13 07:15:37 MST
Hi,


We're in the process of changing the config, and will see if this gets fixed.


-- Andy
Comment 5 Marcin Stolarek 2022-12-20 09:40:16 MST
Any update from your side?
Comment 6 hpc-admin 2022-12-20 09:47:00 MST
Hi,

We're trying the upstream/slurm-22.05 branch to see if this works out better, but so far no luck, as far as we know.

-- Andy
Comment 7 Marcin Stolarek 2022-12-21 02:26:27 MST
Is this ticket effectively a duplicate of Bug 15614?
Comment 8 Marcin Stolarek 2023-01-12 02:03:39 MST
Is there anything else I can help you with in the bug report?
Comment 9 hpc-admin 2023-01-13 04:04:29 MST
This is OK now; you can close this ticket.