Ticket 22430 - Hyperthreads get oversubscribed with CR_ONE_TASK_PER_CORE and cpus-per-task=1
Summary: Hyperthreads get oversubscribed with CR_ONE_TASK_PER_CORE and cpus-per-task=1
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 24.11.3
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-03-25 11:17 MDT by Frank Otto
Modified: 2025-03-25 11:17 MDT

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Frank Otto 2025-03-25 11:17:06 MDT
On nodes with hyperthreading, when CR_ONE_TASK_PER_CORE is active and --cpus-per-task=1 is used, we find that cores/threads are allocated sub-optimally: fewer physical cores are used than expected, and individual hyperthreads get assigned TWO tasks each.


Environment: Slurm 24.11.3, RedHat Linux 9.5

Relevant bits from slurm.conf:

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
TaskPlugin=task/affinity,task/cgroup
PrologFlags=Contain,Alloc,X11
NodeName=DEFAULT Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=122880

Contents of cgroup.conf:

ConstrainCores=yes
ConstrainRAMSpace=yes
AllowedRamSpace=101
ConstrainSwapSpace=yes
AllowedSwapSpace=0
ConstrainDevices=no
MinRAMSpace=32
MemorySwappiness=5

The CPU numbering is as follows:

NUMA node0 CPU(s):      0-11,24-35
NUMA node1 CPU(s):      12-23,36-47

i.e. CPUs 0-11 are the first hyperthreads of the 12 cores on the first socket, 12-23 the first hyperthreads on the second socket, 24-35 the second hyperthreads on the first socket, and 36-47 the second hyperthreads on the second socket.
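
For reference, the mapping between logical CPU id and (socket, core, thread) implied by that numbering can be written down explicitly. The following is just an illustrative Python sketch of the layout described above (2 sockets x 12 cores x 2 threads), not anything taken from Slurm itself:

# Illustrative sketch of the CPU numbering described above:
# 2 sockets x 12 cores x 2 threads, with the second hyperthreads in a second block.
SOCKETS, CORES_PER_SOCKET = 2, 12
CORES_TOTAL = SOCKETS * CORES_PER_SOCKET  # 24

def cpu_to_topology(cpu):
    """Map a logical CPU id (0-47) to (socket, core within socket, thread)."""
    thread, index = divmod(cpu, CORES_TOTAL)      # CPUs 0-23 -> thread 0, 24-47 -> thread 1
    socket, core = divmod(index, CORES_PER_SOCKET)
    return socket, core, thread

# CPU 0 and CPU 24 are the two hyperthreads of core 0 on socket 0, etc.
assert cpu_to_topology(0)  == (0, 0, 0)
assert cpu_to_topology(24) == (0, 0, 1)
assert cpu_to_topology(12) == (1, 0, 0)
assert cpu_to_topology(36) == (1, 0, 1)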


Reproducer:

$ srun --cpu-bind=verbose --ntasks=8 --cpus-per-task=1 true
cpu-bind=MASK - node-k01e-001, task  0  0 [232069]: mask 0x1 set
cpu-bind=MASK - node-k01e-001, task  1  1 [232070]: mask 0x1000 set
cpu-bind=MASK - node-k01e-001, task  2  2 [232071]: mask 0x2 set
cpu-bind=MASK - node-k01e-001, task  3  3 [232072]: mask 0x2000 set
cpu-bind=MASK - node-k01e-001, task  4  4 [232073]: mask 0x1000 set
cpu-bind=MASK - node-k01e-001, task  5  5 [232074]: mask 0x1 set
cpu-bind=MASK - node-k01e-001, task  6  6 [232075]: mask 0x2000 set
cpu-bind=MASK - node-k01e-001, task  7  7 [232076]: mask 0x2 set

As you can see, only CPUs 0, 1, 12 and 13 are used, and each of them twice. That is, the first hyperthread of the first two cores on each socket receives two tasks. In effect it behaves more like cpus-per-task=0.5 than cpus-per-task=1.
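
Decoding those hex masks into CPU ids makes the oversubscription explicit. A quick illustrative Python sketch (the masks are copied verbatim from the srun output above):

from collections import Counter

# cpu-bind masks printed by srun above, one per task
masks = [0x1, 0x1000, 0x2, 0x2000, 0x1000, 0x1, 0x2000, 0x2]

def mask_to_cpus(mask):
    """Return the CPU ids whose bits are set in an affinity mask."""
    return [bit for bit in range(mask.bit_length()) if mask >> bit & 1]

usage = Counter(cpu for mask in masks for cpu in mask_to_cpus(mask))
print(sorted(usage.items()))   # [(0, 2), (1, 2), (12, 2), (13, 2)]: 4 CPUs, each bound to two tasks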

The same happens when run via sbatch, where we can also check which CPUs got allocated to the job overall:
$ sbatch --ntasks=8 --cpus-per-task=1 --wrap='taskset -p $$; srun --cpu-bind=verbose true'
pid 235376's current affinity mask: f00f00f00f
cpu-bind=MASK - node-k01e-001, task  0  0 [235263]: mask 0x1 set
cpu-bind=MASK - node-k01e-001, task  1  1 [235264]: mask 0x1000 set
cpu-bind=MASK - node-k01e-001, task  2  2 [235265]: mask 0x2 set
cpu-bind=MASK - node-k01e-001, task  3  3 [235266]: mask 0x2000 set
cpu-bind=MASK - node-k01e-001, task  4  4 [235267]: mask 0x1000 set
cpu-bind=MASK - node-k01e-001, task  5  5 [235268]: mask 0x1 set
cpu-bind=MASK - node-k01e-001, task  6  6 [235269]: mask 0x2000 set
cpu-bind=MASK - node-k01e-001, task  7  7 [235270]: mask 0x2 set

As you can see, only a fraction of the allocated CPUs are actually used: the job was allocated 8 physical cores (16 logical CPUs), but the tasks are bound to only 4 logical CPUs, again two tasks each.
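
Decoding the job-level affinity mask reported by taskset (again just an illustrative sketch) shows how large the unused part of the allocation is:

# Job affinity mask reported by taskset above
alloc = 0xf00f00f00f
allocated = [bit for bit in range(alloc.bit_length()) if alloc >> bit & 1]
print(allocated)
# [0, 1, 2, 3, 12, 13, 14, 15, 24, 25, 26, 27, 36, 37, 38, 39]
# i.e. 16 logical CPUs = 8 physical cores (both hyperthreads), yet only CPUs 0, 1, 12, 13 receive tasks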


In contrast, if we also pass --ntasks-per-core=1 (which should be the default behaviour when CR_ONE_TASK_PER_CORE is set), then the CPU allocation is more reasonable:

$ srun --cpu-bind=verbose --ntasks=8 --cpus-per-task=1 --ntasks-per-core=1 true
cpu-bind=MASK - node-k01e-001, task  0  0 [233414]: mask 0x1000001 set
cpu-bind=MASK - node-k01e-001, task  1  1 [233415]: mask 0x1000001000 set
cpu-bind=MASK - node-k01e-001, task  2  2 [233416]: mask 0x2000002 set
cpu-bind=MASK - node-k01e-001, task  3  3 [233417]: mask 0x2000002000 set
cpu-bind=MASK - node-k01e-001, task  4  4 [233418]: mask 0x4000004 set
cpu-bind=MASK - node-k01e-001, task  5  5 [233419]: mask 0x4000004000 set
cpu-bind=MASK - node-k01e-001, task  6  6 [233420]: mask 0x8000008 set
cpu-bind=MASK - node-k01e-001, task  7  7 [233421]: mask 0x8000008000 set

Now each task gets assigned its own physical core (with both hyperthreads). We would expect that with CR_ONE_TASK_PER_CORE active one wouldn't need to pass --ntasks-per-core=1 explicitly, but as shown above the two cases behave differently. (Also, what we would really wish for in this case is for each task to be bound only to the first hyperthread of its core, but that's perhaps a separate discussion.)
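
Decoding these masks the same way (illustrative sketch only) confirms that each task now gets exactly one physical core, i.e. a pair of hyperthreads 24 apart:

# cpu-bind masks from the --ntasks-per-core=1 run above, one per task
masks = [0x1000001, 0x1000001000, 0x2000002, 0x2000002000,
         0x4000004, 0x4000004000, 0x8000008, 0x8000008000]
for mask in masks:
    cpus = [bit for bit in range(mask.bit_length()) if mask >> bit & 1]
    print(cpus)
# [0, 24], [12, 36], [1, 25], [13, 37], [2, 26], [14, 38], [3, 27], [15, 39]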


Thanks,
Frank