Hi! We noticed strange CPU-binding behavior in Slurm 20.02.4 when using the --hint=nomultithread and --exclusive options together.

Configuration:

NodeName=r1i0n0 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=191752

Logical processors (0, 40), (1, 41), (2, 42), and so on have the same core id.

SelectType           = select/cons_tres
SelectTypeParameters = CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin           = task/affinity,task/cgroup
TaskPluginParam      = (null type)
TaskAffinity         = no
ConstrainCores       = yes

1. We request no multithreading, but logical CPUs from the same physical core are used:

srun -A xyz -n 1 -c 40 --cpu-bind=verbose --hint=nomultithread --exclusive hostname
srun: job 119 queued and waiting for resources
srun: job 119 has been allocated resources
cpu-bind-threads=MASK - r1i0n0, task  0  0 [3653]: mask 0xfffff00000fffff set

scontrol show job 119
JobId=119 JobName=hostname
   UserId=user01(10000) GroupId=grp01(10000) MCS_label=N/A
   Priority=255796 Nice=0 Account=xyz QOS=qos_cpu-t3
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-08-25T17:29:58 EligibleTime=2020-08-25T17:29:58
   AccrueTime=2020-08-25T17:29:58
   StartTime=2020-08-25T17:29:58 EndTime=2020-08-25T17:29:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-25T17:29:58
   Partition=cpu_p1 AllocNode:Sid=front3:79906
   ReqNodeList=(null) ExcNodeList=(null) NodeList=r1i0n0 BatchHost=r1i0n0
   NumNodes=1 NumCPUs=80 NumTasks=1 CPUs/Task=40 ReqB:S:C:T=0:0:*:1
   TRES=cpu=80,mem=80G,energy=65,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=hostname
   WorkDir=/path/to/work/dir/
   Power=
   MailUser=(null) MailType=NONE
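The reported mask can be decoded to see exactly which logical CPUs the task was bound to. A minimal sketch in plain Python (not Slurm code; the only assumption is the topology stated above, where logical CPUs n and n + 40 are the two threads of core n):

```python
# Decode a Slurm cpu-bind mask into the set of logical CPU ids it covers.
def mask_to_cpus(mask):
    return {cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1}

# Mask reported for job 119 (-n 1 -c 40 --hint=nomultithread --exclusive).
cpus = mask_to_cpus(0xFFFFF00000FFFFF)

# With nomultithread, no core should contribute both of its threads,
# i.e. no pair (n, n + 40) should appear together in the mask.
sibling_pairs = {(c, c + 40) for c in cpus if c + 40 in cpus}

print(sorted(cpus))       # CPUs 0-19 and 40-59
print(len(sibling_pairs)) # 20 cores bound with both threads: the reported bug
```

The decoded mask covers logical CPUs 0-19 and 40-59, which are the two hyperthreads of the same 20 physical cores on socket 0, confirming that the binding ignores --hint=nomultithread here.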
2. We request multiple tasks, but CPUs overlap between tasks:

srun -A xyz -n 2 -c 20 --cpu-bind=verbose --hint=nomultithread --exclusive hostname
srun: job 123 queued and waiting for resources
srun: job 123 has been allocated resources
cpu-bind=MASK - r1i0n0, task  0  0 [45214]: mask 0xfffff set
cpu-bind=MASK - r1i0n0, task  1  1 [45215]: mask 0x17ffff set

JobId=123 JobName=hostname
   UserId=user01(10000) GroupId=grp01(10000) MCS_label=N/A
   Priority=255558 Nice=0 Account=xyz QOS=qos_cpu-t3
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-08-26T09:07:03 EligibleTime=2020-08-26T09:07:03
   AccrueTime=2020-08-26T09:07:03
   StartTime=2020-08-26T09:07:03 EndTime=2020-08-26T09:07:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-26T09:07:03
   Partition=cpu_p1 AllocNode:Sid=front3:79645
   ReqNodeList=(null) ExcNodeList=(null) NodeList=r1i0n0 BatchHost=r1i0n0
   NumNodes=1 NumCPUs=80 NumTasks=2 CPUs/Task=20 ReqB:S:C:T=0:0:*:1
   TRES=cpu=80,mem=80G,energy=111,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=hostname
   WorkDir=/path/to/work/dir/
   Power=
   MailUser=(null) MailType=NONE

Also, in the previous examples, shouldn't the memory be twice the TRES value shown (80 CPUs x 2G RAM per CPU = 160G)?
Using the whole node without the --exclusive option seems to be fine (no multithreading, no overlap, and the correct amount of memory):

srun -A xyz -n 2 -c 20 --cpu-bind=verbose --hint=nomultithread hostname
srun: job 121 queued and waiting for resources
srun: job 121 has been allocated resources
cpu-bind=MASK - r1i0n0, task  0  0 [65293]: mask 0xfffff set
cpu-bind=MASK - r1i0n0, task  1  1 [65295]: mask 0xfffff00000 set

scontrol show job 121
JobId=121 JobName=hostname
   UserId=user01(10000) GroupId=grp01(10000) MCS_label=N/A
   Priority=255677 Nice=0 Account=xyz QOS=qos_cpu-t3
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-08-25T17:50:30 EligibleTime=2020-08-25T17:50:30
   AccrueTime=2020-08-25T17:50:30
   StartTime=2020-08-25T17:50:30 EndTime=2020-08-25T17:50:30 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-25T17:50:30
   Partition=cpu_p1 AllocNode:Sid=front3:79906
   ReqNodeList=(null) ExcNodeList=(null) NodeList=r1i0n0 BatchHost=r1i0n0
   NumNodes=1 NumCPUs=80 NumTasks=2 CPUs/Task=20 ReqB:S:C:T=0:0:*:1
   TRES=cpu=80,mem=160G,energy=117,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=hostname
   WorkDir=/path/to/work/dir/
   Power=
   MailUser=(null) MailType=NONE

Thanks for your help!
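For contrast, the masks from job 121 can be checked the same way. A small sketch in plain Python (not Slurm code; again assuming the stated topology where logical CPUs n and n + 40 are the two threads of core n):

```python
# Decode a Slurm cpu-bind mask into the set of logical CPU ids it covers.
def mask_to_cpus(mask):
    return {cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1}

task0 = mask_to_cpus(0xFFFFF)       # task 0 of job 121: CPUs 0-19
task1 = mask_to_cpus(0xFFFFF00000)  # task 1 of job 121: CPUs 20-39

# The two masks are disjoint, and neither contains both threads of a core
# (the sibling of logical CPU c is c + 40 on this node).
assert not (task0 & task1)
assert all(c + 40 not in cpus for cpus in (task0, task1) for c in cpus)
print("binding respects nomultithread and does not overlap")
```

Here each task gets 20 distinct physical cores (cores 0-19 and 20-39) and only one thread per core, which is the expected behavior and matches the mem=160G shown in TRES.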
I was able to reproduce the reported behavior. I have a patch for the wrong binding (which is the main issue reported here) that is under our internal review. The memory question is at least worth checking; however, to keep the discussion structured, I'll open a separate bug report and CC you there.

cheers,
Marcin
Hi! Any news on the patch?
Created attachment 16061 [details]
v2

The issue is resolved by other changes on the master branch (slurm-20.11 to be). We're still discussing how to best address it on slurm-20.02. Could you please apply the attached patch and confirm that it solves the issue for you?

cheers,
Marcin
The patch fixes this issue, but we noticed another binding problem (see bug #10019).
Focusing on this case: are you OK with closing it as "information given", with the fix delivered as a local patch only? As I mentioned, this is fixed in 20.11 by other work that substantially improved the handling of --threads-per-core for the management of steps inside an allocation. We're close to the release of 20.11, and the attached patch, besides being a fix, is also a behavior change that would be specific only to late releases of 20.02, which may ultimately cause more confusion than it resolves for the wide range of users.

Let me know your thoughts.

cheers,
Marcin
Can you share your thoughts on the closure suggestion from comment 13? In case of no reply I'll close the case as "information given".

cheers,
Marcin
Hi! Can the patch also be applied to versions >20.02.4? If so, we agree to close this case.

(In reply to Marcin Stolarek from comment #14)
> Can you share your thoughts on the closure suggestion from comment 13? In
> case of no reply I'll close the case as "information given".
>
> cheers,
> Marcin
The patch should be easy to apply locally - it's very simple, and I don't expect any code changes in this area in upcoming minor releases of 20.02. If it doesn't apply, you can always reopen the bug and I'll prepare an appropriate patch for you.

We just don't want to make any changes on 20.02, since the code is subject to a larger rewrite in 20.11 and we want to avoid frequent changes in the same area.

Does that make sense to you?

cheers,
Marcin
Ok! The case can be closed.

(In reply to Marcin Stolarek from comment #17)
> The patch should be easy to apply locally - it's very simple, and I don't
> expect any code changes in this area in upcoming minor releases of 20.02.
> If it doesn't apply, you can always reopen the bug and I'll prepare an
> appropriate patch for you.
>
> We just don't want to make any changes on 20.02, since the code is subject
> to a larger rewrite in 20.11 and we want to avoid frequent changes in the
> same area.
>
> Does that make sense to you?
>
> cheers,
> Marcin