Ticket 15522

Summary: Only half the requested CPU cores are available when asking for a single GPU (GRES) resource
Product: Slurm Reporter: hpc-admin
Component: slurmstepd    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: marshall
Version: 22.05.5   
Hardware: Linux   
OS: Linux   
Site: Ghent
Attachments: slurm config file

Description hpc-admin 2022-11-30 09:04:17 MST
Created attachment 27960 [details]
slurm config file

Hi,


When submitting a job to a GPU cluster, we don't understand why the step_0 cgroup only gets 16 cores instead of the expected 32.


This is the job submission command:

/usr/bin/salloc --reservation=maintenance2022Q4 --cpus-per-gpu=32 --gres=gpu:1 --job-name=INTERACTIVE --mail-type=NONE --nodes=1 --ntasks-per-node=32 --ntasks=32 --time=3-00:00:00 /usr/bin/srun --chdir=/user/gent/400/vsc40003 --cpu-bind=none --export=USER,HOME,TERM --mem=0 --mpi=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l

salloc: Granted job allocation 40271421
salloc: Waiting for resource configuration
salloc: Nodes node3303.joltik.os are ready for job


Which then yields:

[vsc40003@node3303 ~]$ nproc
16
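(Worth noting: nproc reports the CPUs in the calling process's affinity mask, not the number of cores on the node, which is why it shows 16 inside the step while the node has 32. A quick way to cross-check from inside the step, using standard Linux tools:)

```shell
# nproc counts the CPUs in the caller's affinity mask, not all online CPUs,
# so inside the step it reports 16 even though the node has 32 cores.
nproc                                      # CPUs usable by this process
grep Cpus_allowed_list /proc/self/status   # the kernel's view of the same mask
```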

Looking at the job's info I see:

TRES=cpu=32,mem=262080M,node=1,billing=33,gres/gpu=1


Looking at the cgroups, I see:

[root@node3303 job_40271421]# cat cpuset.cpus
0-31


The same holds for the step_extern cgroup.

But

[root@node3303 step_0]# cat cpuset.cpus
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30


Somehow this job step only gets the even-numbered cores. Is this expected, or do we need to configure something differently?

When not asking for any GPUs, we do see that 32 cores are assigned to this job step.
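(To compare the two cpusets precisely, the cpuset list syntax seen above can be expanded into individual CPU ids with a small helper — a sketch, with the values copied from the output in this ticket:)

```shell
#!/bin/sh
# Sketch: expand a cgroup cpuset list (the "0-31" / "0,2,4,..." syntax) into
# one CPU id per line, so a step's cpuset can be diffed against the job's
# cpuset or against any Cores= entries in gres.conf.
expand_cpuset() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"
    done
}

expand_cpuset "0-31" | wc -l                                        # 32 cores
expand_cpuset "0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30" | wc -l  # 16 cores
```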

Our cgroup config is:


AllowedSwapSpace=0
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes


-- Andy
Comment 2 Marcin Stolarek 2022-12-01 06:33:07 MST
Andy,

I can't easily reproduce the behavior. Could you please attach the output of `lstopo-no-graphics` and your gres.conf?

cheers,
Marcin
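(For context: one common reason a GPU step is confined to a specific core list is a per-GPU Cores= binding in gres.conf, which tells Slurm which cores are local to each GPU — presumably why the gres.conf is being requested here. A hypothetical fragment, with the device path and core list invented for illustration, not taken from this site's config:)

```
# Hypothetical gres.conf: Cores= declares which cores are local to the GPU;
# task binding can then confine steps that use this GPU to that core list.
NodeName=node3303 Name=gpu File=/dev/nvidia0 Cores=0-15
```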
Comment 3 Marcin Stolarek 2022-12-08 07:03:57 MST
Could you please take a look at the last comment?

cheers,
Marcin
Comment 4 hpc-admin 2022-12-13 07:15:37 MST
Hi,


We're in the process of changing the config, and will see if this gets fixed.


-- Andy
Comment 5 Marcin Stolarek 2022-12-20 09:40:16 MST
Any update from your side?
Comment 6 hpc-admin 2022-12-20 09:47:00 MST
Hi,

We're trying the upstream/slurm-22.05 branch to see if this works out better, but so far no luck, as far as we know.

-- Andy
Comment 7 Marcin Stolarek 2022-12-21 02:26:27 MST
Is this ticket effectively a duplicate of Bug 15614?
Comment 8 Marcin Stolarek 2023-01-12 02:03:39 MST
Is there anything else I can help you with in the bug report?
Comment 9 hpc-admin 2023-01-13 04:04:29 MST
This is OK now; you can close this ticket.