| Summary: | Bad CPU binding on single task/multicore | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | IDRIS System Team <gensyshpe> |
| Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | Ben Roberts <ben> |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | remi.lacroix |
| Version: | 20.02.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=10474 | | |
| Site: | GENCI - Grand Equipement National de Calcul Intensif | Slinky Site: | --- |
Am I correct that 20.02 makes use of cons_tres? Could you please share your gres.conf?

cheers,
Marcin

Yes, you are right, we use cons_tres:

$ scontrol show config | grep cons_tres
SelectType = select/cons_tres

There is only one line in our gres.conf:

NodeName=r14i6n0 Name=gpu File=/dev/nvidia[0-3]

We just checked on a non-GPU node and we indeed have no problems there.

(In reply to Marcin Stolarek from comment #1)
> Am I correct that 20.02 makes use of cons_tres?
>
> Could you please share your gres.conf?
>
> cheers,
> Marcin

There is logic in cons_tres that spreads a job across the sockets required by the GPU (GRES) devices allocated to it. This logic tries to accomplish two things:
1) Allocate CPUs on the socket closest to the allocated GRES.
2) Spread the job across sockets if more sockets have to be used (for instance because of currently running jobs and no request for enforce-binding) while not all GRES are used - to allow other jobs to run on the closest socket, potentially using the remaining GRES.
Your configuration (gres.conf) doesn't specify the closest socket (the Cores= option), so all sockets are marked as required and the job gets spread across them.
If you change the config to:
>NodeName=r14i6n0 Name=gpu File=/dev/nvidia[0-1] Cores=0-19
>NodeName=r14i6n0 Name=gpu File=/dev/nvidia[2-3] Cores=20-39
You should get all CPUs on the same socket. Is there any reason why you decided to skip Cores= in gres.conf?
I'll have to look closer into the code to understand whether this should be treated as a bug or documented as cons_tres-specific behavior with regard to how -m applies to the job allocation.
cheers,
Marcin
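As an aside, whether an affinity mask like the ones reported in this ticket stays on one socket can be checked mechanically. Below is only a sketch (bash): the 20-cores-per-socket layout is assumed from the node described here, and `one_socket` is a hypothetical helper, not a Slurm tool.

```shell
# Sketch (assumption: cores 0-19 on socket 0, cores 20-39 on socket 1,
# matching the node in this ticket). one_socket checks whether a
# core-affinity list such as "0-4,20-24" fits within a single socket.
CORES_PER_SOCKET=20

one_socket() {
    local mask=$1 socket=-1 items item first last c s
    IFS=',' read -ra items <<< "$mask"
    for item in "${items[@]}"; do
        first=${item%-*}; last=${item#*-}   # "0-4" -> 0 and 4; "7" -> 7 and 7
        for c in "$first" "$last"; do
            s=$((c / CORES_PER_SOCKET))     # socket owning this core
            if [ "$socket" -lt 0 ]; then
                socket=$s
            elif [ "$socket" -ne "$s" ]; then
                echo no; return
            fi
        done
    done
    echo yes
}

one_socket 0-4,20-24   # the bad binding reported here: spans both sockets
one_socket 0-9         # the expected binding: stays on socket 0
```

With the misbehaving binding from this ticket ("0-4,20-24") the helper answers "no"; with the expected one ("0-9") it answers "yes".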
We changed our configuration and the CPU binding is now good. Thanks!
The gres configuration comes from our production version (v18.08), where we never succeeded in configuring the CPU/GPU affinity. That's why the Cores= parameter was missing from the file.
After reading the NOTE for the Cores parameter in the gres.conf documentation, we are a little bit confused about the values to put when using hyperthreading. The lstopo command reports:
Machine (187GB total)
  Package L#0
    NUMANode L#0 (93GB)
    L3 L#0 (28MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0
        PU L#1
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2
        PU L#3
      [...]
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38
        PU L#39
  Package L#1
    NUMANode L#1 (94GB)
    L3 L#1 (28MB)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40
        PU L#41
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42
        PU L#43
      [...]
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
        PU L#78
        PU L#79
Should we use the PU numbering or the physical numbering? Should we list only the first thread of each core? With PU numbering and first threads only, the configuration would be:
NodeName=r14i6n0 Name=gpu File=/dev/nvidia[0-1] Cores=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NodeName=r14i6n0 Name=gpu File=/dev/nvidia[2-3] Cores=40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
(In reply to Marcin Stolarek from comment #3)
> There is logic in cons_tres that spreads a job across the sockets required by
> the GPU (GRES) devices allocated to it. This logic tries to accomplish two things:
> 1) Allocate CPUs on the socket closest to the allocated GRES.
> 2) Spread the job across sockets if more sockets have to be used (for instance
> because of currently running jobs and no request for enforce-binding) while
> not all GRES are used - to allow other jobs to run on the closest socket,
> potentially using the remaining GRES.
>
> Your configuration (gres.conf) doesn't specify the closest socket (the Cores=
> option), so all sockets are marked as required and the job gets spread
> across them.
>
> If you change the config to:
> >NodeName=r14i6n0 Name=gpu File=/dev/nvidia[0-1] Cores=0-19
> >NodeName=r14i6n0 Name=gpu File=/dev/nvidia[2-3] Cores=20-39
>
> You should get all CPUs on the same socket. Is there any reason why you
> decided to skip Cores= in gres.conf?
> I'll have to look closer into the code to understand whether this should be
> treated as a bug or documented as cons_tres-specific behavior with regard to
> how -m applies to the job allocation.
>
> cheers,
> Marcin
> we are a little bit confused about the values to put when using hyperthreading.

We're aware that it's not an obvious part of Slurm configuration. Slurm 20.11 comes with slightly improved code handling it, but the main thing is that Cores= is expected to be a list of cores on the closest socket, numbered by *abstract Slurm IDs*, not by physical core numbering.

Let me share an updated version of the NOTE we're working on in an internal ticket for docs improvement:

>NOTE: Since Slurm must be able to perform resource management on heterogeneous
>clusters having various processing unit numbering schemes, a logical core index
>must be specified instead of the physical core index. That logical core
>index might not correspond to your physical core index number. Core 0 will be
>the first core on the first socket, while core 1 will be the second core on the
>first socket. This numbering coincides with the logical core number (Core L#)
>seen in "lstopo -l" command output.

That's why I used Cores=0-19 for the first socket/package and Cores=20-39 for the second one.

Let me know if it's clearer now.

cheers,
Marcin

Ok! Just to be sure: does it apply to v20.02 too?

(In reply to Marcin Stolarek from comment #5)
> > we are a little bit confused about the values to put when using hyperthreading.
>
> We're aware that it's not an obvious part of Slurm configuration. Slurm 20.11
> comes with slightly improved code handling it, but the main thing is that
> Cores= is expected to be a list of cores on the closest socket, numbered by
> *abstract Slurm IDs*, not by physical core numbering.
> [...]
> That's why I used Cores=0-19 for the first socket/package and Cores=20-39 for
> the second one.
>
> cheers,
> Marcin

> Ok! Just to be sure: does it apply to v20.02 too?

Yes - it has always worked this way. However, in 20.02 --gpu-bind=closest doesn't work correctly in certain situations, specifically when lstopo core IDs are not in sequence - for instance, odd cores on one socket and even cores on the other.

Did you consider using Autodetect=NVML[1] so slurmd can get this set for you? This requires the gpu/nvml plugin, which can be built only if nvml-devel was installed on the node where you compiled the sources.

cheers,
Marcin
[1] https://slurm.schedmd.com/gres.html#Configuration

We considered it when we installed Slurm v18.08 last year, but we don't remember why we chose not to use it. Currently nvml-devel is missing on our compilation node.

(In reply to Marcin Stolarek from comment #7)
> [...]
> Did you consider using Autodetect=NVML[1] so slurmd can get this set for
> you? This requires the gpu/nvml plugin, which can be built only if nvml-devel
> was installed on the node where you compiled the sources.
>
> cheers,
> Marcin
> [1] https://slurm.schedmd.com/gres.html#Configuration

> [...] but we don't remember why we chose not to use it [...]

Keep in mind that it remains an option for the future. Having NVML enabled adds the `--gpu-freq` feature on devices that support frequency switching through NVML calls.

To better document the reported behavior we've updated our slurm.conf documentation in commit 4b176846522132a[1]. With that said, I'll go ahead and mark this case as information given. Should you have any questions, please don't hesitate to reopen.

cheers,
Marcin
[1] https://github.com/SchedMD/slurm/commit/4b176846522132a460a2719935c6d3968e154738
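To make the logical-index rule from the NOTE concrete, here is a small sketch that derives the Cores= ranges for gres.conf from Slurm's abstract core IDs. It assumes the two-socket, 20-cores-per-socket node discussed in this ticket; the node name and device paths are simply the ones used above.

```shell
# Sketch: Slurm's abstract core IDs run sequentially across sockets
# (socket 0 -> cores 0..19, socket 1 -> cores 20..39 on this node),
# regardless of the kernel's physical/PU numbering shown by lstopo.
CORES_PER_SOCKET=20

# Cores= range for socket 0 (GPUs 0-1 on this node)
first=$((0 * CORES_PER_SOCKET)); last=$((1 * CORES_PER_SOCKET - 1))
echo "NodeName=r14i6n0 Name=gpu File=/dev/nvidia[0-1] Cores=${first}-${last}"

# Cores= range for socket 1 (GPUs 2-3 on this node)
first=$((1 * CORES_PER_SOCKET)); last=$((2 * CORES_PER_SOCKET - 1))
echo "NodeName=r14i6n0 Name=gpu File=/dev/nvidia[2-3] Cores=${first}-${last}"
```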
We have noticed an inconsistency in Slurm 20.02.4 CPU allocation behavior compared to Slurm 18.08.8.

Configuration:

NodeName=r1i0n0 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=191752

Processors [0-19,40-59] have the same physical id. Processors [20-39,60-79] have the same physical id. Processors (0, 40), (1, 41), (2, 42) and so on have the same core id.

This issue appears specifically when requesting only one task with multiple CPUs:

* Slurm 20.02.4 allocates CPUs on both sockets of the node, which is not what we expect, especially when "-m block:block:block" is specified:

$ srun -A sos@gpu -n 1 -c 10 --gres=gpu:1 --hint=nomultithread -m block:block:block ~/binding_mpi.exe
srun: job 190 queued and waiting for resources
srun: job 190 has been allocated resources
Hello from level 1: rank= 0, thread level 1= -1, on r14i6n0. (core affinity = 0-4,20-24)

* Slurm 18.08.8 allocates contiguous CPUs as expected:

$ srun -A sos@gpu -n 1 -c 10 --gres=gpu:1 --hint=nomultithread -m block:block:block ~/binding_mpi.exe
srun: job 556203 queued and waiting for resources
srun: job 556203 has been allocated resources
Hello from level 1: rank= 0, thread level 1= -1, on r14i5n8. (core affinity = 0-9)

The issue cannot be reproduced with Slurm 20.02.4 when requesting multiple tasks:

$ srun -A sos@gpu -n 2 -c 10 --gres=gpu:1 --hint=nomultithread -m block:block:block ~/binding_mpi.exe
srun: job 194 queued and waiting for resources
srun: job 194 has been allocated resources
Hello from level 1: rank= 1, thread level 1= -1, on r14i6n0. (core affinity = 10-19)
Hello from level 1: rank= 0, thread level 1= -1, on r14i6n0. (core affinity = 0-9)

In this case, both tasks have been allocated contiguous CPUs and are packed on one socket as expected.
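The sibling relationship described above ((0, 40), (1, 41), and so on sharing a core id) can be expressed as a tiny helper. This is only an illustration: the `sibling` function name and the offset of 40 are assumptions taken from this node's layout.

```shell
# Sketch: on this node the two hardware threads of physical core N are
# logical CPUs N and N+40 (40 cores, 2 threads per core).
NCORES=40

sibling() {
    local cpu=$1
    if [ "$cpu" -lt "$NCORES" ]; then
        echo $((cpu + NCORES))   # first thread -> second thread
    else
        echo $((cpu - NCORES))   # second thread -> first thread
    fi
}

sibling 0    # prints 40
sibling 59   # prints 19 (CPU 59 is the second thread of core 19)
```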