Ticket 9670 - CPU binding with nomultithread and exclusive options
Summary: CPU binding with nomultithread and exclusive options
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact: Brian Christiansen
URL:
Depends on:
Blocks:
 
Reported: 2020-08-26 04:08 MDT by IDRIS System Team
Modified: 2021-01-12 13:54 MST
CC: 2 users

See Also:
Site: IDRIS
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---


Attachments
v2 (1.02 KB, patch)
2020-09-28 04:27 MDT, Marcin Stolarek
Details | Diff

Description IDRIS System Team 2020-08-26 04:08:03 MDT
Hi!

We noticed strange behavior with CPU binding in Slurm v20.02.4 when using the nomultithread and exclusive options.

Configuration:

NodeName=r1i0n0 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=191752
Processor pairs (0, 40), (1, 41), (2, 42), and so on share the same core id.

SelectType              = select/cons_tres
SelectTypeParameters    = CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = (null type)
TaskAffinity            = no
ConstrainCores          = yes

1. We request no multithreading, but logical CPUs from the same physical core are used:

srun -A xyz -n 1 -c 40 --cpu-bind=verbose --hint=nomultithread --exclusive hostname
srun: job 119 queued and waiting for resources
srun: job 119 has been allocated resources
cpu-bind-threads=MASK - r1i0n0, task  0  0 [3653]: mask 0xfffff00000fffff set
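For illustration (this sketch is not part of the original report), the mask above can be decoded to show the problem: on this node, logical CPUs k and k+40 are the two hardware threads of core k, and the mask selects both threads of every core it touches, despite --hint=nomultithread.

```python
# Decode the cpu-bind mask reported for job 119. On this node,
# logical CPUs k and k+40 are the two hardware threads of core k
# (40 physical cores, 2 threads per core).
mask = 0xFFFFF00000FFFFF
cpus = {i for i in range(80) if (mask >> i) & 1}

# Map each bound logical CPU to its physical core id.
cores_used = {cpu % 40 for cpu in cpus}

# Cores where *both* hardware threads are bound; this should be
# empty under --hint=nomultithread.
both_threads = [c for c in sorted(cores_used) if c in cpus and (c + 40) in cpus]

print(len(cpus))          # 40 logical CPUs bound
print(len(cores_used))    # only 20 distinct physical cores
print(len(both_threads))  # all 20 cores have both threads bound
```

So the 40 requested CPUs land on only 20 physical cores, each with both hyperthreads bound.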

scontrol show job 119
JobId=119 JobName=hostname
   UserId=user01(10000) GroupId=grp01(10000) MCS_label=N/A
   Priority=255796 Nice=0 Account=xyz QOS=qos_cpu-t3
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-08-25T17:29:58 EligibleTime=2020-08-25T17:29:58
   AccrueTime=2020-08-25T17:29:58
   StartTime=2020-08-25T17:29:58 EndTime=2020-08-25T17:29:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-25T17:29:58
   Partition=cpu_p1 AllocNode:Sid=front3:79906
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r1i0n0
   BatchHost=r1i0n0
   NumNodes=1 NumCPUs=80 NumTasks=1 CPUs/Task=40 ReqB:S:C:T=0:0:*:1
   TRES=cpu=80,mem=80G,energy=65,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=hostname
   WorkDir=/path/to/work/dir/
   Power=
   MailUser=(null) MailType=NONE

2. We request multiple tasks, but CPUs overlap between tasks:

srun -A xyz -n 2 -c 20 --cpu-bind=verbose --hint=nomultithread --exclusive hostname
srun: job 123 queued and waiting for resources
srun: job 123 has been allocated resources
cpu-bind=MASK - r1i0n0, task  0  0 [45214]: mask 0xfffff set
cpu-bind=MASK - r1i0n0, task  1  1 [45215]: mask 0x17ffff set
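For illustration (again, a sketch added for clarity, not from the original report), intersecting the two task masks quantifies the overlap:

```python
# Task masks reported for job 123. The two tasks should receive
# disjoint sets of cores, but task 1's mask almost entirely
# overlaps task 0's.
task0 = 0xFFFFF    # logical CPUs 0-19
task1 = 0x17FFFF   # logical CPUs 0-18 plus 20

shared = [i for i in range(80) if ((task0 & task1) >> i) & 1]
print(len(shared))  # 19 logical CPUs bound to both tasks
```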

JobId=123 JobName=hostname
   UserId=user01(10000) GroupId=grp01(10000) MCS_label=N/A
   Priority=255558 Nice=0 Account=xyz QOS=qos_cpu-t3
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-08-26T09:07:03 EligibleTime=2020-08-26T09:07:03
   AccrueTime=2020-08-26T09:07:03
   StartTime=2020-08-26T09:07:03 EndTime=2020-08-26T09:07:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-26T09:07:03
   Partition=cpu_p1 AllocNode:Sid=front3:79645
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r1i0n0
   BatchHost=r1i0n0
   NumNodes=1 NumCPUs=80 NumTasks=2 CPUs/Task=20 ReqB:S:C:T=0:0:*:1
   TRES=cpu=80,mem=80G,energy=111,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=hostname
   WorkDir=/path/to/work/dir/
   Power=
   MailUser=(null) MailType=NONE


Also, in the previous examples, shouldn't the TRES memory be twice the reported value (80 CPUs x 2G RAM per CPU = 160G)?


Using the whole node without the exclusive option seems to be fine (no multithreading, no overlap, and the correct amount of memory):

srun -A xyz -n 2 -c 20 --cpu-bind=verbose --hint=nomultithread hostname
srun: job 121 queued and waiting for resources
srun: job 121 has been allocated resources
cpu-bind=MASK - r1i0n0, task  0  0 [65293]: mask 0xfffff set
cpu-bind=MASK - r1i0n0, task  1  1 [65295]: mask 0xfffff00000 set
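By contrast, these two masks pass both checks used above: they are disjoint, and each uses only one hardware thread per core (again assuming the (k, k+40) sibling layout described in the configuration). A sketch:

```python
# Task masks reported for job 121: the expected behavior.
task0 = 0xFFFFF        # logical CPUs 0-19 (cores 0-19, first thread)
task1 = 0xFFFFF00000   # logical CPUs 20-39 (cores 20-39, first thread)

assert task0 & task1 == 0  # no CPU shared between the tasks
for mask in (task0, task1):
    cpus = {i for i in range(80) if (mask >> i) & 1}
    # One hardware thread per core: no bound CPU's sibling (k+40)
    # is also bound.
    assert all((c + 40) not in cpus for c in cpus)
print("disjoint masks, one thread per core")
```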

scontrol show job 121
JobId=121 JobName=hostname
   UserId=user01(10000) GroupId=grp01(10000) MCS_label=N/A
   Priority=255677 Nice=0 Account=xyz QOS=qos_cpu-t3
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-08-25T17:50:30 EligibleTime=2020-08-25T17:50:30
   AccrueTime=2020-08-25T17:50:30
   StartTime=2020-08-25T17:50:30 EndTime=2020-08-25T17:50:30 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-25T17:50:30
   Partition=cpu_p1 AllocNode:Sid=front3:79906
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r1i0n0
   BatchHost=r1i0n0
   NumNodes=1 NumCPUs=80 NumTasks=2 CPUs/Task=20 ReqB:S:C:T=0:0:*:1
   TRES=cpu=80,mem=160G,energy=117,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=hostname
   WorkDir=/path/to/work/dir/
   Power=
   MailUser=(null) MailType=NONE

Thanks for your help!
Comment 4 Marcin Stolarek 2020-09-02 10:49:44 MDT
I was able to reproduce the reported behavior. I have a patch for the wrong binding (the main issue reported here) that is under our internal review.

For the memory question, the case is at least worth checking; to keep the discussion structured, I'll open a separate bug report and CC you there.

cheers,
Marcin
Comment 5 IDRIS System Team 2020-09-21 05:34:23 MDT
Hi!

Any news on the patch?
Comment 11 Marcin Stolarek 2020-09-28 04:27:54 MDT
Created attachment 16061 [details]
v2

The issue is resolved by other changes on the master branch (slurm-20.11 to be). We're still discussing how best to address it on slurm-20.02. Could you please apply the attached patch and confirm that it solves the issue for you?

cheers,
Marcin
Comment 12 IDRIS System Team 2020-10-20 03:43:06 MDT
The patch fixes this issue, but we noticed another binding problem (see #10019).
Comment 13 Marcin Stolarek 2020-10-20 05:01:16 MDT
Focusing on this case: are you OK with closing it as "information given", with only the local fix delivered?

As I mentioned, this is fixed in 20.11 by other work that substantially improved the handling of --threads-per-core for steps inside an allocation. We're close to the 20.11 release, and the attached patch, besides being a fix, is also a behavior change that would be specific to late releases of 20.02; that may ultimately cause more confusion than it resolves for the wider range of users.

Let me know your thoughts.

cheers,
Marcin
Comment 14 Marcin Stolarek 2020-11-13 07:32:37 MST
Can you share your thoughts on the closure suggestion from comment 13? If there is no reply, I'll close the case as "information given".

cheers,
Marcin
Comment 16 IDRIS System Team 2020-12-03 07:16:00 MST
Hi!

Can the patch also be applied to versions later than 20.02.4? If so, we agree to close this case.

(In reply to Marcin Stolarek from comment #14)
> Can you share your thoughts on the closure suggestion from comment 13? In
> case of no reply I'll cluse the case as "information given".
> 
> cheers,
> Marcin
Comment 17 Marcin Stolarek 2020-12-03 08:03:21 MST
The patch should be easy to apply locally; it's very simple, and I don't expect any code changes in this area in upcoming minor releases of 20.02.
If it doesn't apply, you can always reopen the bug and I'll prepare an appropriate patch for you.

We just don't want to make any changes in 20.02, since the code is subject to a larger rewrite in 20.11 and we want to avoid frequent changes in the same area.

Does that make sense to you?

cheers,
Marcin
Comment 18 IDRIS System Team 2020-12-03 08:13:25 MST
Ok! The case can be closed.

(In reply to Marcin Stolarek from comment #17)
> The patch should be easy to apply locally - it's very simple and I don't
> expect any code changes in this area in up-coming minor releases of 20.02.
> If it doesn't apply you can always reopen the bug and I'll prepare an
> appropriate patch for you.
> 
> We just don't want to make any changes on 20.02, since the code is subjected
> to a larger rewrite in 20.11 and we want to avoid frequent changes in the
> same area.
> 
> Does that make sense for you?
> 
> cheers,
> Marcin