Ticket 15522 - Only half the requested CPU cores are available when asking for a single GPU (GRES) resource
Summary: Only half the requested CPU cores are available when asking for a single GPU ...
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 22.05.5
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-11-30 09:04 MST by hpc-ops
Modified: 2023-01-15 22:40 MST

See Also:
Site: Ghent


Attachments
slurm config file (3.03 KB, text/plain)
2022-11-30 09:04 MST, hpc-ops

Description hpc-ops 2022-11-30 09:04:17 MST
Created attachment 27960 [details]
slurm config file

Hi,


When submitting a job to a GPU cluster, we don't quite understand why the step_0 cgroup only gets 16 cores instead of the expected 32.


This is the job submission command:

/usr/bin/salloc --reservation=maintenance2022Q4 --cpus-per-gpu=32 --gres=gpu:1 --job-name=INTERACTIVE --mail-type=NONE --nodes=1 --ntasks-per-node=32 --ntasks=32 --time=3-00:00:00 /usr/bin/srun --chdir=/user/gent/400/vsc40003 --cpu-bind=none --export=USER,HOME,TERM --mem=0 --mpi=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l

salloc: Granted job allocation 40271421
salloc: Waiting for resource configuration
salloc: Nodes node3303.joltik.os are ready for job


Which then yields:

[vsc40003@node3303 ~]$ nproc
16

Looking at the job's info I see:

TRES=cpu=32,mem=262080M,node=1,billing=33,gres/gpu=1


Looking at the cgroups, I see:

[root@node3303 job_40271421]# cat cpuset.cpus
0-31


Likewise for step_extern.

But

[root@node3303 step_0]# cat cpuset.cpus
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30


Somehow this job step only gets the even-numbered cores. Is this expected, or do we need to configure something differently?

When not asking for any GPUs, we do see that 32 cores are assigned to this job step.
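A quick way to sanity-check what `nproc` reports against the cgroup files is to expand the `cpuset.cpus` list and count the CPUs it covers. A minimal sketch, assuming bash (`count_cpus` is a hypothetical helper, not part of Slurm):

```shell
# Expand a cgroup cpuset string such as "0-31" or "0,2,4-6"
# into the number of CPUs it covers.
count_cpus() {
  local total=0 part
  local -a parts
  IFS=',' read -ra parts <<< "$1"
  for part in "${parts[@]}"; do
    if [[ "$part" == *-* ]]; then
      # Range like "4-6": add (end - start + 1) CPUs.
      total=$(( total + ${part#*-} - ${part%-*} + 1 ))
    else
      # Single CPU id.
      total=$(( total + 1 ))
    fi
  done
  echo "$total"
}

count_cpus "0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30"   # step_0 cpuset -> 16
count_cpus "0-31"                                          # job cgroup cpuset -> 32
```

This confirms the numbers in the report: the job cgroup spans 32 CPUs, but the step_0 cpuset only covers 16, which matches what `nproc` prints inside the step.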

Our cgroup config is:


AllowedSwapSpace=0
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
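For context, one configuration that can produce a restricted step cpuset like the one above is a gres.conf that pins each GPU to a subset of CPUs via `Cores=`; with `ConstrainCores=yes`, a step using that GPU may then be bound to those CPUs only. A hypothetical sketch (not the reporter's actual gres.conf, which was never attached):

```
# Hypothetical gres.conf entry: Cores= tells Slurm which CPUs are
# local to this GPU, and task binding may be restricted accordingly.
Name=gpu File=/dev/nvidia0 Cores=0-15
```

On nodes where the kernel interleaves CPU numbering across sockets or hyperthreads, a "half the CPUs" set can show up as the even-numbered IDs (0,2,4,...), which is consistent with the step_0 cpuset observed here.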


-- Andy
Comment 2 Marcin Stolarek 2022-12-01 06:33:07 MST
Andy,

I can't easily reproduce the behavior. Could you please attach the output of `lstopo-no-graphics` and your gres.conf?

cheers,
Marcin
Comment 3 Marcin Stolarek 2022-12-08 07:03:57 MST
Could you please take a look at the last comment?

cheers,
Marcin
Comment 4 hpc-ops 2022-12-13 07:15:37 MST
Hi,


We're in the process of changing the config, and will see if this gets fixed.


-- Andy
Comment 5 Marcin Stolarek 2022-12-20 09:40:16 MST
Any update from your side?
Comment 6 hpc-ops 2022-12-20 09:47:00 MST
Hi,

We're trying the upstream/slurm-22.05 branch to see if this works out better, but so far no luck, as far as we know.

-- Andy
Comment 7 Marcin Stolarek 2022-12-21 02:26:27 MST
Is this ticket effectively a duplicate of Bug 15614?
Comment 8 Marcin Stolarek 2023-01-12 02:03:39 MST
Is there anything else I can help you with in the bug report?
Comment 9 hpc-ops 2023-01-13 04:04:29 MST
This is OK now; you can close this ticket.