Created attachment 3016 [details]
Slurm.conf

We are testing task affinity on our Cray XC30 (Ivy Bridge nodes, 2 sockets per node, 12 cores per socket, 2 threads per core) with both the task/affinity and task/cgroup plugins enabled (TaskPlugin = affinity,cgroup,cray). See our configuration file, attached. We would like to have good default settings so that most users can just run with srun -n <# of tasks> ./a.out. One of the things we want to do is make --hint=nomultithread the default, as most of our workload does not benefit from hyperthreading on our Cray XC30. Could you please let us know if there is any way we can set this as the default for all jobs?

In addition, the --ntasks-per-socket option still does not work; I wonder if you could point us to how we can fix it.

Thanks,
Zhengji
Hi Zhengji - You marked this as a "Sev 2 - High Impact" issue. Is this actively preventing jobs from running on your system, or would you mind changing this to a lower priority?

> We are testing the task affinity on our Cray XC30 (Ivy bridge node, 2
> sockets per node, 12 cores per socket, 2 threads per core) with both
> task/affinity and task/cgroup plugins enabled (TaskPlugin =
> affinity,cgroup,cray). See our configuration file attached. We would like to
> have a good default setting so that most of the users just run with srun -n
> <# of tasks> ./a.out. One of the things we want to do is to make the
> --hint=nomultithread to default as most of the workload does not get benefit
> from using hyperthreading on our Cray XC30. Could you please let us know if
> there is any way we can set this to default for all jobs?

There's no way to set a default hint through slurm.conf. You could potentially set an appropriate SLURM_HINT environment variable for your users, which may accomplish what you want. There's also a SelectTypeParameters option, CR_ONE_TASK_PER_CORE, that may do what you're looking for. Please see http://slurm.schedmd.com/slurm.conf.html for some notes on how it works.

> In addition, the --ntasks-per-socket option still does not work, I wonder if
> you could point to us how we can fix it.

Can you elaborate on "does not work"? I don't see an obvious issue with it, and the regression suite does cover that option. One thing that can help highlight how the CPU affinity is being set up is --cpu_bind=verbose.
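To illustrate the environment-variable approach: a site could export SLURM_HINT for all users from a login script. The path and mechanism below are assumptions for illustration, not a SchedMD-recommended setup:

```shell
# Hypothetical site-wide profile snippet (e.g. /etc/profile.d/slurm_defaults.sh;
# the path is an assumption -- use whatever your site's login shells source).
# srun and sbatch pick up SLURM_HINT from the environment, which approximates
# a cluster-wide --hint=nomultithread default.
export SLURM_HINT=nomultithread

# A user who does want hyperthreading can still override it per job:
#   srun --hint=multithread -n 48 ./a.out
echo "$SLURM_HINT"
```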
Dear Tim,

Thanks very much for your prompt help. Sure, it is OK to set this bug to a lower priority. I had a different understanding of "High Impact": we are working on what we should set as defaults on our systems in the near future, and it will affect all of our users, so I considered this bug high impact. I did not find something like an "urgency level" in your bug system, which could also be a useful metric for calculating the priority of a ticket (I thought this was not urgent, but had high impact).

So you have answered my first two questions. It looks like the way to set some sbatch and srun default options is to set the corresponding environment variables. I will try those.

By --ntasks-per-socket does not work, I mean:

zz217@nid00033:~/tests/affinity> srun -n 8 --ntasks-per-socket=4 --cpu_bind=cores xthi.intel
Hello from rank 0, thread 0, on nid00033. (core affinity = 0,24)
Hello from rank 1, thread 0, on nid00033. (core affinity = 1,25)
Hello from rank 6, thread 0, on nid00033. (core affinity = 6,30)
Hello from rank 7, thread 0, on nid00033. (core affinity = 7,31)
Hello from rank 2, thread 0, on nid00033. (core affinity = 2,26)
Hello from rank 3, thread 0, on nid00033. (core affinity = 3,27)
Hello from rank 4, thread 0, on nid00033. (core affinity = 4,28)
Hello from rank 5, thread 0, on nid00033. (core affinity = 5,29)

I wanted the first 4 tasks bound to the first socket and the remaining 4 tasks bound to the second socket, but it does not do that. All the tasks were bound to the first socket.
Here is the output of --cpu_bind=verbose:

zz217@nid00033:~/tests/affinity> srun -n 8 --ntasks-per-socket=4 --cpu_bind=cores,verbose xthi.intel
cpu_bind=MASK - nid00033, task 6 6 [41890]: mask 0x40000040 set
cpu_bind=MASK - nid00033, task 5 5 [41889]: mask 0x20000020 set
cpu_bind=MASK - nid00033, task 0 0 [41884]: mask 0x1000001 set
cpu_bind=MASK - nid00033, task 3 3 [41887]: mask 0x8000008 set
cpu_bind=MASK - nid00033, task 4 4 [41888]: mask 0x10000010 set
cpu_bind=MASK - nid00033, task 2 2 [41886]: mask 0x4000004 set
cpu_bind=MASK - nid00033, task 1 1 [41885]: mask 0x2000002 set
cpu_bind=MASK - nid00033, task 7 7 [41891]: mask 0x80000080 set
Hello from rank 0, thread 0, on nid00033. (core affinity = 0,24)
Hello from rank 1, thread 0, on nid00033. (core affinity = 1,25)
Hello from rank 4, thread 0, on nid00033. (core affinity = 4,28)
Hello from rank 5, thread 0, on nid00033. (core affinity = 5,29)
Hello from rank 6, thread 0, on nid00033. (core affinity = 6,30)
Hello from rank 7, thread 0, on nid00033. (core affinity = 7,31)
Hello from rank 2, thread 0, on nid00033. (core affinity = 2,26)
Hello from rank 3, thread 0, on nid00033. (core affinity = 3,27)

We currently have TaskPlugin = cgroup,cray on our production systems, but we wanted to use the options that come with the task/affinity plugin, so we are testing TaskPlugin = affinity,cgroup,cray now. One of the options that we wanted to use is --ntasks-per-socket.

In addition, currently we have:

SelectTypeParameters = CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_CORE_DEFAULT_DIST_BLOCK

If we want to try out CR_ONE_TASK_PER_CORE as you suggested, what would the SelectTypeParameters be? As you may already know, we need to support a shared partition as well on our system (which may need CR_SOCKET_MEMORY).

Thanks,
Zhengji
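A side note on reading the masks above: each hex mask is a bitmap over the node's 48 logical CPUs, and on this node CPU N and CPU N+24 are the two hardware threads of the same core, which is why every task reports two CPUs. A minimal decoding sketch (the helper name is ours, for illustration; it is not part of Slurm):

```shell
# Decode a Slurm cpu_bind mask into the logical CPUs it selects.
# decode_mask is a hypothetical helper, not a Slurm command.
decode_mask() {
  local mask=$(( $1 )) cpu list=""
  for (( cpu = 0; cpu < 48; cpu++ )); do   # 48 logical CPUs on this node
    if (( (mask >> cpu) & 1 )); then
      list+="${list:+,}$cpu"
    fi
  done
  echo "$list"
}

decode_mask 0x1000001   # task 0: CPUs 0,24 -> core 0 plus its HT sibling
decode_mask 0x2000002   # task 1: CPUs 1,25 -> core 1 plus its HT sibling
```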
I just tried to use SLURM_HINT=nomultithread, but it does not seem to work as expected (if it worked, I should not see the high-numbered CPUs (>24) in the program output), while the command-line option --hint=nomultithread seems to work only when the --hint option appears as the last option on the srun command line. Could you please let me know what the issue could be?

Thanks,
Zhengji

zz217@nid00033:~/tests/affinity> export SLURM_HINT=nomultithread; srun -n 8 --ntasks-per-socket=4 --cpu_bind=cores,verbose xthi.intel
cpu_bind=MASK - nid00033, task 0 0 [43237]: mask 0x1000001 set
cpu_bind=MASK - nid00033, task 3 3 [43240]: mask 0x8000008 set
cpu_bind=MASK - nid00033, task 4 4 [43241]: mask 0x10000010 set
cpu_bind=MASK - nid00033, task 6 6 [43243]: mask 0x40000040 set
cpu_bind=MASK - nid00033, task 7 7 [43244]: mask 0x80000080 set
cpu_bind=MASK - nid00033, task 2 2 [43239]: mask 0x4000004 set
cpu_bind=MASK - nid00033, task 5 5 [43242]: mask 0x20000020 set
cpu_bind=MASK - nid00033, task 1 1 [43238]: mask 0x2000002 set
Hello from rank 2, thread 0, on nid00033. (core affinity = 2,26)
Hello from rank 3, thread 0, on nid00033. (core affinity = 3,27)
Hello from rank 0, thread 0, on nid00033. (core affinity = 0,24)
Hello from rank 1, thread 0, on nid00033. (core affinity = 1,25)
Hello from rank 4, thread 0, on nid00033. (core affinity = 4,28)
Hello from rank 5, thread 0, on nid00033. (core affinity = 5,29)
Hello from rank 6, thread 0, on nid00033. (core affinity = 6,30)
Hello from rank 7, thread 0, on nid00033. (core affinity = 7,31)

# --hint=nomultithread does not work if it appears in front of other srun options.
zz217@nid00033:~/tests/affinity> unset SLURM_HINT; srun -n 8 --ntasks-per-socket=4 --hint=nomultithread --cpu_bind=cores,verbose xthi.intel
cpu_bind=MASK - nid00033, task 7 7 [43331]: mask 0x80000080 set
cpu_bind=MASK - nid00033, task 0 0 [43324]: mask 0x1000001 set
cpu_bind=MASK - nid00033, task 1 1 [43325]: mask 0x2000002 set
cpu_bind=MASK - nid00033, task 5 5 [43329]: mask 0x20000020 set
cpu_bind=MASK - nid00033, task 2 2 [43326]: mask 0x4000004 set
cpu_bind=MASK - nid00033, task 3 3 [43327]: mask 0x8000008 set
cpu_bind=MASK - nid00033, task 6 6 [43330]: mask 0x40000040 set
cpu_bind=MASK - nid00033, task 4 4 [43328]: mask 0x10000010 set
Hello from rank 0, thread 0, on nid00033. (core affinity = 0,24)
Hello from rank 4, thread 0, on nid00033. (core affinity = 4,28)
Hello from rank 1, thread 0, on nid00033. (core affinity = 1,25)
Hello from rank 2, thread 0, on nid00033. (core affinity = 2,26)
Hello from rank 3, thread 0, on nid00033. (core affinity = 3,27)
Hello from rank 5, thread 0, on nid00033. (core affinity = 5,29)
Hello from rank 6, thread 0, on nid00033. (core affinity = 6,30)
Hello from rank 7, thread 0, on nid00033. (core affinity = 7,31)

# --hint works if it appears at the end of the other srun options:
zz217@nid00033:~/tests/affinity> unset SLURM_HINT; srun -n 8 --ntasks-per-socket=4 --cpu_bind=cores,verbose --hint=nomultithread xthi.intel
cpu_bind=MASK - nid00033, task 2 2 [43508]: mask 0x4 set
cpu_bind=MASK - nid00033, task 0 0 [43506]: mask 0x1 set
cpu_bind=MASK - nid00033, task 4 4 [43510]: mask 0x10 set
cpu_bind=MASK - nid00033, task 7 7 [43513]: mask 0x80 set
cpu_bind=MASK - nid00033, task 6 6 [43512]: mask 0x40 set
cpu_bind=MASK - nid00033, task 5 5 [43511]: mask 0x20 set
cpu_bind=MASK - nid00033, task 3 3 [43509]: mask 0x8 set
cpu_bind=MASK - nid00033, task 1 1 [43507]: mask 0x2 set
Hello from rank 0, thread 0, on nid00033. (core affinity = 0)
Hello from rank 1, thread 0, on nid00033. (core affinity = 1)
Hello from rank 2, thread 0, on nid00033. (core affinity = 2)
Hello from rank 3, thread 0, on nid00033. (core affinity = 3)
Hello from rank 4, thread 0, on nid00033. (core affinity = 4)
Hello from rank 5, thread 0, on nid00033. (core affinity = 5)
Hello from rank 6, thread 0, on nid00033. (core affinity = 6)
Hello from rank 7, thread 0, on nid00033. (core affinity = 7)
zz217@nid00033:~/tests/affinity>
Dear Tim,

Could you please let me know what actions SchedMD would like to take on the SLURM_HINT=nomultithread problem (I reported in my last update to this bug that this environment variable failed to make the srun command use only physical cores)? It is very important for us to know whether SchedMD will be fixing this bug soon, so we can decide our next step. I would appreciate it if you could update us at your earliest convenience.

I would like to let you know that what we really need is the capability of enabling hyperthreading on demand only (e.g., using --hint=multithread on the srun command line). We hope the srun command works with the physical cores only by default (given that hyperthreading is enabled in the BIOS all the time). If you could make SLURM_HINT=nomultithread work for us so that we can use that environment variable to set our default, that would be great, but we are happy to pursue other approaches if available as well. Actually, I am wondering if it would be a good idea to add something like SbatchDefaultCommand to the Slurm config support, so that we could use it to set the default srun command line for batch jobs?

Thanks,
Zhengji
(In reply to Zhengji Zhao from comment #2)
> Dear Tim,
>
> Thanks very much for your prompt help. Sure it is OK to set this bug to a
> lower priority. I had different understanding about the "High Impact". We
> are working on what we should set as default on our systems in a short
> future, and it will affect all of our users, so I considered this bug as a
> high impact. I did not find something like "urgency level" in your bug
> system, which could be also a useful metric to calculate the priority of a
> ticket. (I thought this is not urgent, but has high impact).

Importance is something that can be set, but we don't actively sort based on that. We try to respond to everything promptly, but the Impact levels are tied to specific contractual obligations and have specific response times.

> So you have answered my first two questions. Looks like the way to set some
> sbatch and srun default options is to set the corresponding environment
> variables. I will try those.
>
> By --ntasks-per-socket does not work, I mean
> I wanted to make the first 4 tasks bind to first socket, and the rest 4
> tasks bind to the second socket, but it does not do that. All the tasks were
> bound to the first socket.

--ntasks-per-socket does not impact the layout, just the calculation of how many tasks to launch in total; the task distribution is what impacts the layout. For block distribution this appears to be working correctly. Launching fewer tasks than the number of allocated cores does introduce some ambiguity into the result, although this is still working as defined.

> We currently have (TaskPlugin = cgroup,cray) in our production systems, but
> we wanted to use the options that come with the task/affinity plugin, so we
> are testing (TaskPlugin = affinity,cgroup,cray) now. One of the options that
> we wanted to use is the --ntasks-per-socket.

Again, --ntasks-per-socket does not directly influence affinity.
> In addition, Currently we have
>
> SelectTypeParameters =
> CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_CORE_DEFAULT_DIST_BLOCK
>
> if we want to try out CR_ONE_TASK_PER_CORE as you suggested, what would be
> the SelectTypeParameters ? As you may already know we need support shared
> partition as well on our system (which may need CR_SOCKET_MEMORY).

It's an additional flag you can add:

SelectTypeParameters=CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
(In reply to Zhengji Zhao from comment #3)
> I just tried to use SLURM_HINT=nomultithread but it seems not work as
> expected (if this works, I should not see the high number cores (>24) in the
> program output), while the command line option --hint=nomultithread seems to
> work only when the --hint option appears as the last option of the srun
> command line option. Could you please let me know what could be the issue?
>
> zz217@nid00033:~/tests/affinity> export SLURM_HINT=nomultithread; srun -n 8
> --ntasks-per-socket=4 --cpu_bind=cores,verbose xthi.intel
> cpu_bind=MASK - nid00033, task 0 0 [43237]: mask 0x1000001 set
> cpu_bind=MASK - nid00033, task 3 3 [43240]: mask 0x8000008 set
> cpu_bind=MASK - nid00033, task 4 4 [43241]: mask 0x10000010 set
> cpu_bind=MASK - nid00033, task 6 6 [43243]: mask 0x40000040 set
> cpu_bind=MASK - nid00033, task 7 7 [43244]: mask 0x80000080 set
> cpu_bind=MASK - nid00033, task 2 2 [43239]: mask 0x4000004 set
> cpu_bind=MASK - nid00033, task 5 5 [43242]: mask 0x20000020 set
> cpu_bind=MASK - nid00033, task 1 1 [43238]: mask 0x2000002 set
> Hello from rank 2, thread 0, on nid00033. (core affinity = 2,26)
> Hello from rank 3, thread 0, on nid00033. (core affinity = 3,27)
> Hello from rank 0, thread 0, on nid00033. (core affinity = 0,24)
> Hello from rank 1, thread 0, on nid00033. (core affinity = 1,25)
> Hello from rank 4, thread 0, on nid00033. (core affinity = 4,28)
> Hello from rank 5, thread 0, on nid00033. (core affinity = 5,29)
> Hello from rank 6, thread 0, on nid00033. (core affinity = 6,30)
> Hello from rank 7, thread 0, on nid00033. (core affinity = 7,31)
>
> #--hint=nomultithread does not work if appears infront of other srun options.
> zz217@nid00033:~/tests/affinity> unset SLURM_HINT; srun -n 8
> --ntasks-per-socket=4 --hint=nomultithread --cpu_bind=cores,verbose
> xthi.intel
> cpu_bind=MASK - nid00033, task 7 7 [43331]: mask 0x80000080 set
> cpu_bind=MASK - nid00033, task 0 0 [43324]: mask 0x1000001 set
> cpu_bind=MASK - nid00033, task 1 1 [43325]: mask 0x2000002 set
> cpu_bind=MASK - nid00033, task 5 5 [43329]: mask 0x20000020 set
> cpu_bind=MASK - nid00033, task 2 2 [43326]: mask 0x4000004 set
> cpu_bind=MASK - nid00033, task 3 3 [43327]: mask 0x8000008 set
> cpu_bind=MASK - nid00033, task 6 6 [43330]: mask 0x40000040 set
> cpu_bind=MASK - nid00033, task 4 4 [43328]: mask 0x10000010 set
> Hello from rank 0, thread 0, on nid00033. (core affinity = 0,24)
> Hello from rank 4, thread 0, on nid00033. (core affinity = 4,28)
> Hello from rank 1, thread 0, on nid00033. (core affinity = 1,25)
> Hello from rank 2, thread 0, on nid00033. (core affinity = 2,26)
> Hello from rank 3, thread 0, on nid00033. (core affinity = 3,27)
> Hello from rank 5, thread 0, on nid00033. (core affinity = 5,29)
> Hello from rank 6, thread 0, on nid00033. (core affinity = 6,30)
> Hello from rank 7, thread 0, on nid00033. (core affinity = 7,31)
>
> #--hint works if it appears in the end of the other srun options:
> zz217@nid00033:~/tests/affinity> unset SLURM_HINT; srun -n 8
> --ntasks-per-socket=4 --cpu_bind=cores,verbose --hint=nomultithread
> xthi.intel
> cpu_bind=MASK - nid00033, task 2 2 [43508]: mask 0x4 set
> cpu_bind=MASK - nid00033, task 0 0 [43506]: mask 0x1 set
> cpu_bind=MASK - nid00033, task 4 4 [43510]: mask 0x10 set
> cpu_bind=MASK - nid00033, task 7 7 [43513]: mask 0x80 set
> cpu_bind=MASK - nid00033, task 6 6 [43512]: mask 0x40 set
> cpu_bind=MASK - nid00033, task 5 5 [43511]: mask 0x20 set
> cpu_bind=MASK - nid00033, task 3 3 [43509]: mask 0x8 set
> cpu_bind=MASK - nid00033, task 1 1 [43507]: mask 0x2 set
> Hello from rank 0, thread 0, on nid00033. (core affinity = 0)
> Hello from rank 1, thread 0, on nid00033. (core affinity = 1)
> Hello from rank 2, thread 0, on nid00033. (core affinity = 2)
> Hello from rank 3, thread 0, on nid00033. (core affinity = 3)
> Hello from rank 4, thread 0, on nid00033. (core affinity = 4)
> Hello from rank 5, thread 0, on nid00033. (core affinity = 5)
> Hello from rank 6, thread 0, on nid00033. (core affinity = 6)
> Hello from rank 7, thread 0, on nid00033. (core affinity = 7)
> zz217@nid00033:~/tests/affinity>

Are all three of these commands being executed within an existing allocation, or are they making the allocation request separately? If within an allocation, can you provide the 'salloc' command used to acquire the resources? Settings from there are inherited by the srun command, and may explain some of the behavior seen here.

(In reply to Zhengji Zhao from comment #4)
> Dear Tim,
>
> Could you please let me know about the actions that SchedMD would like to
> take with the problem of the SLURM_HINT=nomultithread (I reported in the
> last update to this bug that this env failed to allow the srun command to
> use only physical cores)? It is very important for us to know if SchedMD
> will be fixing this bug or not soon, so we can decide our next step. I
> appreciate if you could update us at your earliest convenience.

I'm still deciphering some of the output you sent; something does appear to be working oddly, at least on my test system, and I need to work through whether it's intentional behavior or a bug.

> I would like to let you know that what we really need is the capability of
> enabling hyperthreading by demand only (e.g., using --hint=multithread on
> the srun command line). We hope the srun command works with the physical
> cores only by default (provided the hypreading is enabled in BIOS all the
> time). If you could make SLURM_HINT=nomultithread work for us so that we can
> use that env set our default, that would be great, but we are happy to
> pursue other approaches if available as well. Actually I am wondering if it
> is a good idea to add something like SbatchDefaultCommand into the Slurm
> config support so that we can use it to set the default srun command line
> for the batch jobs?

We're unlikely to pursue that; such defaults can be set elsewhere, and the interaction between various settings with such a command would further confuse things. I believe CR_ONE_TASK_PER_CORE gets you most of what you're after, although I'm still working through the examples you've sent to try to understand if there's a bug there or I'm misinterpreting the output.
Hi Tim,

Thanks for getting back to me promptly. I am attaching the code, xthi.c (compiled with an Intel compiler like this: mpiicc -openmp xthi.c), which I used to print out the CPU bindings in my tests, in case it is helpful. It is a code provided by Cray, and we have been using it to test CPU affinity on our Cray systems, as its output is easier to read than the raw CPU masks or looking directly into the /proc/self/status file. I will test CR_ONE_TASK_PER_CORE and will update you. Meanwhile, if you have any update about SLURM_HINT, please let me know.

Thanks,
Zhengji
Created attachment 3025 [details]
xthi.c file
I am able to achieve the desired behaviour by adding the SelectTypeParameters value of "CR_ONE_TASK_PER_CORE" to those you already have set (as Tim suggested in comment #6), PLUS either setting the environment variable SLURM_CPU_BIND=cores or using the job option --cpu_bind=cores.

Given that environment, if a user wants to run one task per thread, he would need to use the following two options:

--cpu_bind=thread --ntasks-per-core=2

What I would like to propose for our next release (16.05, due out in May) is: if the cluster is configured with CR_ONE_TASK_PER_CORE and the user does not specify --ntasks-per-core=# with a value larger than 1, then by default bind tasks to cores. That would eliminate the need for the SLURM_CPU_BIND=cores environment variable or the --cpu_bind=cores option. In order to bind to threads, the user would only need to specify --ntasks-per-core=2 (--cpu_bind=thread would then become the default). Does that sound acceptable?
We've come up with what I believe is a better solution. The first part, which you can do with Slurm version 15.08, is to configure CR_ONE_TASK_PER_CORE as described in prior comments.

The second part required making changes to RPCs to send the ntasks-per-socket information to the task binding plugin. The task/affinity plugin was modified to support the --ntasks-per-socket option (that information isn't available to the plugin in Slurm version 15.08). The commit with that change is here:
https://github.com/SchedMD/slurm/commit/31aa3244b55bcf6fafe8d76a2c3b8047afeac6e3

Finally, you should be aware that if the tasks to be launched can be "nicely" mapped onto the allocated resources, all tasks by default get bound to all allocated resources. For example, if a job is allocated an entire node or socket and wants to launch 2 tasks, then each task would get bound to one of the sockets on a 2-socket node. If there are 3 tasks to be launched on that same 2-socket node, then each task can access all threads by default. There is a --cpu_bind option to override the default, by binding tasks to cores for example, even if that leaves cores idle.
Here are some logs demonstrating some of this on a node with 2 sockets, 6 cores per socket, and 2 threads per core:

$ srun --cpu_bind=verbose -n4 --ntasks-per-socket=2 -m block:block hostname
cpu_bind=MASK - smd-server, task 0 0 [24774]: mask 0x555555 set
cpu_bind=MASK - smd-server, task 1 1 [24775]: mask 0x555555 set
cpu_bind=MASK - smd-server, task 3 3 [24777]: mask 0xaaaaaa set
cpu_bind=MASK - smd-server, task 2 2 [24776]: mask 0xaaaaaa set
smd-server
smd-server
smd-server
smd-server

$ srun --cpu_bind=verbose -n4 --ntasks-per-socket=3 -m block:block hostname
cpu_bind=MASK - smd-server, task 0 0 [24790]: mask 0x555555 set
cpu_bind=MASK - smd-server, task 1 1 [24791]: mask 0x555555 set
cpu_bind=MASK - smd-server, task 2 2 [24792]: mask 0x555555 set
cpu_bind=MASK - smd-server, task 3 3 [24793]: mask 0xaaaaaa set
smd-server
smd-server
smd-server
smd-server

$ srun --cpu_bind=verbose -n4 --ntasks-per-socket=4 -m block:block hostname
cpu_bind=MASK - smd-server, task 0 0 [24816]: mask 0x555555 set
cpu_bind=MASK - smd-server, task 1 1 [24817]: mask 0x555555 set
cpu_bind=MASK - smd-server, task 2 2 [24818]: mask 0x555555 set
cpu_bind=MASK - smd-server, task 3 3 [24819]: mask 0x555555 set
smd-server
smd-server
smd-server
smd-server

$ srun --cpu_bind=verbose,core -n4 --ntasks-per-socket=4 -m block:block hostname
cpu_bind=MASK - smd-server, task 0 0 [24847]: mask 0x1001 set
cpu_bind=MASK - smd-server, task 1 1 [24848]: mask 0x4004 set
cpu_bind=MASK - smd-server, task 2 2 [24849]: mask 0x10010 set
cpu_bind=MASK - smd-server, task 3 3 [24850]: mask 0x40040 set
smd-server
smd-server
smd-server
smd-server

$ srun --cpu_bind=verbose,core -n4 --ntasks-per-socket=3 -m block:block hostname
cpu_bind=MASK - smd-server, task 0 0 [24873]: mask 0x1001 set
cpu_bind=MASK - smd-server, task 1 1 [24874]: mask 0x4004 set
cpu_bind=MASK - smd-server, task 2 2 [24875]: mask 0x10010 set
cpu_bind=MASK - smd-server, task 3 3 [24876]: mask 0x2002 set
smd-server
smd-server
smd-server
smd-server

$ srun --cpu_bind=verbose,thread -n4 --ntasks-per-socket=3 -m block:block hostname
cpu_bind=MASK - smd-server, task 0 0 [25039]: mask 0x1 set
cpu_bind=MASK - smd-server, task 1 1 [25040]: mask 0x4 set
cpu_bind=MASK - smd-server, task 2 2 [25041]: mask 0x10 set
cpu_bind=MASK - smd-server, task 3 3 [25042]: mask 0x2 set
smd-server
smd-server
smd-server
smd-server

$ srun --cpu_bind=verbose,thread -n4 --ntasks-per-socket=2 -m block:block hostname
cpu_bind=MASK - smd-server, task 0 0 [25077]: mask 0x1 set
cpu_bind=MASK - smd-server, task 1 1 [25078]: mask 0x4 set
cpu_bind=MASK - smd-server, task 2 2 [25079]: mask 0x2 set
cpu_bind=MASK - smd-server, task 3 3 [25080]: mask 0x8 set
smd-server
smd-server
smd-server
smd-server
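One way to sanity-check the socket masks in these logs: on this test node the logical CPUs alternate between the sockets, so socket 0 owns the even CPUs and socket 1 the odd ones. A sketch that rebuilds the expected masks from that numbering (the helper name and the numbering assumption are ours, for illustration):

```shell
# Build the CPU mask for one socket, assuming logical CPUs are numbered
# cyclically across sockets (CPU i belongs to socket i % nsock).
# socket_mask is a hypothetical helper used only to check the logs above.
socket_mask() {   # usage: socket_mask <socket> <nsockets> <total_cpus>
  local sock=$1 nsock=$2 ncpu=$3 cpu mask=0
  for (( cpu = sock; cpu < ncpu; cpu += nsock )); do
    (( mask |= 1 << cpu ))
  done
  printf '0x%x\n' "$mask"
}

socket_mask 0 2 24   # matches tasks 0-1 in the first log: 0x555555
socket_mask 1 2 24   # matches tasks 2-3 in the first log: 0xaaaaaa
```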
Thanks a lot for the comment. I will reply to your comment 17 shortly. My update crashed for some reason; this is what I wanted to send in response to your comment 16.

Dear Moe,

Yes, this sounds great! Your proposed change for the next Slurm release appears to be exactly what we want for our Cray XC30 (Ivy Bridge, 2 threads/core) and Cray XC40 (Haswell, 2 threads/core). For our next Cray XC40 (KNL nodes, 4 threads per core), it is possible that we may want the default to be 4 threads/core, so I hope to make sure that in the May release (with your proposed change in place) we will still be able to set threads as the default and just use --ntasks-per-core=1 to request not to use the hyperthreads (i.e., to work with cores only). Our goal is to make the default easy for most of the workload/users (as easy as just doing srun -n #tasks ./a.out), while the non-default is not too inconvenient (e.g., needing at most one extra flag/option). In addition, the default setting should not restrict us from, or fail on, extra/complicated task/thread/memory bindings beyond the default.

***********

Just to confirm, the following config/settings (only the relevant ones) were what you used to achieve what I want now:

TaskPlugin = affinity,cgroup,cray
SelectTypeParameters=CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
export SLURM_CPU_BIND=cores (or srun --cpu_bind=cores ...) for now, but in the May release this will be removed.

I will test this setting with our current Slurm install, try to reproduce the CPU bindings you observed first, and update you. Just to give you a heads-up, we may need your help with memory binding as well, which is very important for us to be able to work with the high-bandwidth memory on KNL.

Thanks a lot!
Zhengji
Just noticed I was actually addressing your comment 13...
(In reply to Zhengji Zhao from comment #17)
> Thanks a lot for the comment. I will reply to your comments 17 shortly.
>
> Just to confirm, the following config/setting (only the relevant
> config/setting) was what you used to achieve what I want now:
>
> TaskPlugin = affinity,cgroup,cray
>
> SelectTypeParameters=CR_SOCKET_MEMORY,OTHER_CONS_RES,
> CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
>
> export SLURM_CPU_BIND=cores (or srun --cpu_bind=cores ...) for now but in
> the May release this will be removed.

This is correct.

I did just make another change to Slurm version 16.05 to support this. Previously, TaskPluginParams would specify the one and only task binding that the system would support. I changed that so the configuration parameter specifies the default task binding: it will only be used if the user fails to specify the binding, and any user CPU binding option will now override the configuration parameter rather than generate an error. This additional change is here:
https://github.com/SchedMD/slurm/commit/a01e6562edc1040bc3cee37fd96cade269b12ff4

We plan to tag a pre-release of Slurm version 16.05 on Thursday if you have a test system to work with.
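For reference, with those changes in place the configuration might look like the fragment below. This is a sketch assembled from this thread, not a tested configuration; check the exact parameter values against the 16.05 slurm.conf man page:

```
# slurm.conf fragment (sketch, assuming the 16.05 behavior described above)
TaskPlugin=affinity,cgroup,cray
# With the a01e656 change, this specifies the *default* binding; a user's
# --cpu_bind option overrides it instead of generating an error.
TaskPluginParams=Cores
SelectTypeParameters=CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
```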
(In reply to Comment 19)

Yes, we have a test system, so we can test the pre-release of Slurm 16.05. Please let us know how to get it (perhaps our system admin, Doug, already knows the place to download it, just in case). I am looking forward to testing your new changes, which appear to be exactly what I asked for when I opened this bug! I will get back to you after testing (it may take some days).

(In reply to Comment 16)

It is great to see that the change you have made makes --ntasks-per-socket work as I wanted! This will definitely meet our needs on the Ivy Bridge and Haswell systems.

However, for KNL nodes, we need to be able to achieve the same or similar control over the NUMA nodes, where the number of sockets does not equal the number of NUMA nodes; i.e., on a KNL node there is only a single socket but multiple NUMA domains/nodes. Just to give you a heads-up, I am attaching two files which contain the numactl --hardware output for the two (flat) memory configurations (Quadrant and Sub-NUMA Cluster (SNC) modes) on KNL nodes. I hope we can use a --ntasks-per-numanode to control the number of tasks bound to each NUMA node, and also bind memory to the NUMA node of choice. Note that the HBM appears as multiple NUMA nodes (SNC mode) or a single NUMA node (Quadrant mode).

Thanks,
Zhengji
Created attachment 3036 [details]
Numactl --hardware output for the Quadrant Flat memory configuration on a KNL node
Created attachment 3037 [details]
Numactl --hardware output for the SNC Flat memory mode on a KNL node
(In reply to Zhengji Zhao from comment #20)
> (In reply to Comment 19)
>
> Yes, we have a test system, so we can test the pre-release of Slurm 16.5.
> Please let us know how to get it (perhaps our system admin, Doug know the
> place to download already, just in case). I am looking forward to testing
> your new changes, which appear to be exactly what I asked for when I opened
> this bug! I will get back to you after testing (it may take some days).

See: http://www.schedmd.com/#repos
Then look in the second section: "Download the latest development version of Slurm"

> (In reply to Comment 16)
> It is great to see that the change you have made makes the
> --ntasks-per-socket work as I wanted! This will definitely meet our needs on
> Ivey Bridge, and Haswell systems.
>
> However, for KNL nodes, we need to be able to achieve the same or similar
> control over the numa nodes where the number of sockets does not equal to
> the number of numa nodes, i.e., on a KNL node, there is only a single socket
> but multiple numa domains/nodes. Just to give you a heads-up, I am attaching
> two files, which contain the numactl --hardware command output for the two
> (flat) memory configurations (Qudrant and Sub NUMA Cluster (SNC) modes) on
> KNL nodes. I hope we can do --ntasks-per-numanode to control the number of
> tasks bound to each numa node and also can bind memory to the numa node of
> choice. Note that the HBM appears as multiple numa nodes (SNC mode) or a
> single numa node (in Quadrant mode).

I would recommend opening a separate ticket for KNL, as there is more development work required. Slurm manages processor allocation layouts at the level of baseboards, NUMA nodes, sockets, cores, and threads. Slurm is also dependent upon the number of cores per NUMA node being uniform within a single node, which is not the case in KNL quad mode. Slurm lacks a concept of core pairs as on KNL. Slurm also lacks a --ntasks-per-numanode option.
Slurm is recording the KNL NUMA nodes as sockets, which seems to work best right now. It's a work in progress...
Thanks a lot for the link. We will test it on our test system.

Yes, it makes sense to open a separate ticket for KNL. I will open a new ticket once our needs/requirements for memory binding (along with CPU binding) become more solid and specific. We have just gotten access to early KNL nodes, so we should be able to gain some experience soon. It sounds great that Slurm treats the KNL NUMA nodes as sockets.

Thanks,
Zhengji
(In reply to Zhengji Zhao from comment #24)
> Thanks a lot for the link. We will test it on our test system.

Do you have any updated information on this?
Thanks a lot for following up; I really appreciate it. Our system admin has been completely overbooked with many other duties recently (also conferences, etc.), so I have been waiting for him to install the new version on our test system. Once he is back to his regular work (he is still out of town), we can resume the testing. We now have some experience on the KNL white-box nodes as well, so I will soon get back to you with a more specific idea of what we want to do with task/thread/memory affinity. Thanks, Zhengji
Do you have any updates on this ticket?
Thanks for checking on this. I was on vacation the last two weeks. I will check the status and get back to you as soon as I can. Thanks again, I really appreciate your help with this. Zhengji
(In reply to Zhengji Zhao from comment #28) > Thanks for checking on this. I was on vacation last two weeks. I will check > the status, and will get back to you as soon as I can. Any update on this?
I am really sorry for the long delay in getting back to you, and I really appreciate you following up with us on this bug. Unfortunately, since we have to keep our XC30 test system (called Alva, where we have been doing our task affinity testing) in sync with our production system, Edison, for the moment (per our system admin Doug Jacobson, I suppose this is to support an ongoing system upgrade), we have not been able to experiment with the new Slurm version yet. I think our testing has to wait until our Cori test system (called Gerty) comes back with an upgraded CLE version (Rhine/Redwood). I am now gathering requirements from our application readiness team members (who are working on the KNL white boxes) about their task/thread/memory affinity needs. For now we are basically using KMP_AFFINITY to control affinity on a single KNL white box. At a very high level, we hope to have srun manage task/thread/memory affinity without needing KMP_AFFINITY on our Cori KNL system later. I believe we will need srun to support all KMP_AFFINITY options (compact, scatter, balanced, none, explicit). While we use srun to control task/thread/memory affinity, we still want to be able to use KMP_AFFINITY if we choose to. I will get back to you with further updates. Zhengji
(In reply to Zhengji Zhao from comment #30) > I am really sorry for the long delay in getting back to you and really > appreciate you for following up with us on this bug. No problem. I'd rather be waiting for you than the other way around. > I am gathering the requirement from our application readiness team members > (who are working on the KNL white boxes) about their task/thread/memory > affinity need now. We are basically using the KMP_AFFINITY to control the > affinity for now on single KNL white box now. At a very high level, we hope > to have srun to manage the task/thread/memory affinity without needing to > use the KMP_AFFINITY on our Cori KNL system later. I believe we will need > srun to support all KMP_AFFINITY options (compact, scatter, balanced, none, > explicit). While we use the srun to control the task/thread/memory > affinity, we still hope to be able to use KMP_AFFINITY if we want to. Slurm's -m/--distribution option supports all of these options and more. It can be controlled via the command line option or environment variables. The user can specify task distribution options of cyclic, block, fcyclic (cyclic task IDs, filling the resource), plus blocks of user controlled sizes (plane). These options are available to control layout at the node, socket/NUMA, core, and thread levels. More information here: http://slurm.schedmd.com/mc_support.html
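As a rough illustration of what the block and cyclic distributions above do, consider mapping task IDs onto sockets. This is a simplified model for intuition only, not Slurm's actual implementation (which also handles fcyclic, plane sizes, and lower levels):

```python
def distribute(ntasks, nsockets, mode):
    # Simplified model of srun -m/--distribution at the socket level:
    # "block" fills one socket completely before moving to the next;
    # "cyclic" hands out tasks round-robin across sockets.
    if mode == "block":
        per_socket = ntasks // nsockets
        return [task // per_socket for task in range(ntasks)]
    if mode == "cyclic":
        return [task % nsockets for task in range(ntasks)]
    raise ValueError("unknown distribution: " + mode)

print(distribute(8, 2, "block"))   # [0, 0, 0, 0, 1, 1, 1, 1]
print(distribute(8, 2, "cyclic"))  # [0, 1, 0, 1, 0, 1, 0, 1]
```

The same choice can be made per level with, e.g., `srun -m block:cyclic` (node level, then socket level), or via the SLURM_DISTRIBUTION environment variable.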
Created attachment 3402 [details] Slurm configuration file
Dear Moe,

We finally have a test system that is set up with Slurm 16.05 with task/affinity enabled and your suggested SelectTypeParameters:

CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE

See the slurm.conf file attached at 2016-08-10 15:57 PDT. I immediately ran into two problems on our dual-socket Haswell nodes (16 cores per socket, 32 cores in total, 64 logical cores or threads (CPUs) in total).

1) When running with only 32 tasks per node (i.e., not using hyperthreads), Slurm binds a single task to two CPUs from two different physical cores, while we want it to bind each task to the two CPUs that belong to the same physical core. This is a demonstration of the problem:

srun --cpu_bind=verbose,cores --mem_bind=verbose,local -n32 ./xthi.intel 2>&1 |sort -nk4,6
Hello from rank 0 thread 0 on nid00021 (core affinity = 0,1)
cpu_bind=MASK - nid00021, task 0 0 [34296]: mask 0x3 set
cpu_bind=MASK - nid00021, task 1 1 [34297]: mask 0xc set
cpu_bind=MASK - nid00021, task 2 2 [34298]: mask 0x30 set
cpu_bind=MASK - nid00021, task 3 3 [34299]: mask 0xc0 set
cpu_bind=MASK - nid00021, task 4 4 [34300]: mask 0x300 set
cpu_bind=MASK - nid00021, task 5 5 [34301]: mask 0xc00 set
cpu_bind=MASK - nid00021, task 6 6 [34302]: mask 0x3000 set
cpu_bind=MASK - nid00021, task 7 7 [34303]: mask 0xc000 set
cpu_bind=MASK - nid00021, task 8 8 [34304]: mask 0x30000 set
cpu_bind=MASK - nid00021, task 9 9 [34305]: mask 0xc0000 set
cpu_bind=MASK - nid00021, task 10 10 [34306]: mask 0x300000 set
cpu_bind=MASK - nid00021, task 11 11 [34307]: mask 0xc00000 set
cpu_bind=MASK - nid00021, task 12 12 [34308]: mask 0x3000000 set
cpu_bind=MASK - nid00021, task 13 13 [34309]: mask 0xc000000 set
cpu_bind=MASK - nid00021, task 14 14 [34310]: mask 0x30000000 set
cpu_bind=MASK - nid00021, task 15 15 [34311]: mask 0xc0000000 set
cpu_bind=MASK - nid00021, task 16 16 [34312]: mask 0x300000000 set
cpu_bind=MASK - nid00021, task 17 17 [34313]: mask 0xc00000000 set
cpu_bind=MASK - nid00021, task 18 18 [34314]: mask 0x3000000000 set
cpu_bind=MASK - nid00021, task 19 19 [34315]: mask 0xc000000000 set
cpu_bind=MASK - nid00021, task 20 20 [34316]: mask 0x30000000000 set
cpu_bind=MASK - nid00021, task 21 21 [34317]: mask 0xc0000000000 set
cpu_bind=MASK - nid00021, task 22 22 [34318]: mask 0x300000000000 set
cpu_bind=MASK - nid00021, task 23 23 [34319]: mask 0xc00000000000 set
cpu_bind=MASK - nid00021, task 24 24 [34320]: mask 0x3000000000000 set
cpu_bind=MASK - nid00021, task 25 25 [34321]: mask 0xc000000000000 set
cpu_bind=MASK - nid00021, task 26 26 [34322]: mask 0x30000000000000 set
cpu_bind=MASK - nid00021, task 27 27 [34323]: mask 0xc0000000000000 set
cpu_bind=MASK - nid00021, task 28 28 [34324]: mask 0x300000000000000 set
cpu_bind=MASK - nid00021, task 29 29 [34325]: mask 0xc00000000000000 set
cpu_bind=MASK - nid00021, task 30 30 [34326]: mask 0x3000000000000000 set
cpu_bind=MASK - nid00021, task 31 31 [34327]: mask 0xc000000000000000 set
mem_bind=LOC - nid00021, task 0 0 [34296]: mask 0x1 set
mem_bind=LOC - nid00021, task 1 1 [34297]: mask 0x1 set
mem_bind=LOC - nid00021, task 2 2 [34298]: mask 0x1 set
mem_bind=LOC - nid00021, task 3 3 [34299]: mask 0x1 set
mem_bind=LOC - nid00021, task 4 4 [34300]: mask 0x1 set
mem_bind=LOC - nid00021, task 5 5 [34301]: mask 0x1 set
mem_bind=LOC - nid00021, task 6 6 [34302]: mask 0x1 set
mem_bind=LOC - nid00021, task 7 7 [34303]: mask 0x1 set
mem_bind=LOC - nid00021, task 8 8 [34304]: mask 0x2 set
mem_bind=LOC - nid00021, task 9 9 [34305]: mask 0x2 set
mem_bind=LOC - nid00021, task 10 10 [34306]: mask 0x2 set
mem_bind=LOC - nid00021, task 11 11 [34307]: mask 0x2 set
mem_bind=LOC - nid00021, task 12 12 [34308]: mask 0x2 set
mem_bind=LOC - nid00021, task 13 13 [34309]: mask 0x2 set
mem_bind=LOC - nid00021, task 14 14 [34310]: mask 0x2 set
mem_bind=LOC - nid00021, task 15 15 [34311]: mask 0x2 set
mem_bind=LOC - nid00021, task 16 16 [34312]: mask 0x1 set
mem_bind=LOC - nid00021, task 17 17 [34313]: mask 0x1 set
mem_bind=LOC - nid00021, task 18 18 [34314]: mask 0x1 set
mem_bind=LOC - nid00021, task 19 19 [34315]: mask 0x1 set
mem_bind=LOC - nid00021, task 20 20 [34316]: mask 0x1 set
mem_bind=LOC - nid00021, task 21 21 [34317]: mask 0x1 set
mem_bind=LOC - nid00021, task 22 22 [34318]: mask 0x1 set
mem_bind=LOC - nid00021, task 23 23 [34319]: mask 0x1 set
mem_bind=LOC - nid00021, task 24 24 [34320]: mask 0x2 set
mem_bind=LOC - nid00021, task 25 25 [34321]: mask 0x2 set
mem_bind=LOC - nid00021, task 26 26 [34322]: mask 0x2 set
mem_bind=LOC - nid00021, task 27 27 [34323]: mask 0x2 set
mem_bind=LOC - nid00021, task 28 28 [34324]: mask 0x2 set
mem_bind=LOC - nid00021, task 29 29 [34325]: mask 0x2 set
mem_bind=LOC - nid00021, task 30 30 [34326]: mask 0x2 set
mem_bind=LOC - nid00021, task 31 31 [34327]: mask 0x2 set
Hello from rank 1 thread 0 on nid00021 (core affinity = 2,3)
Hello from rank 2 thread 0 on nid00021 (core affinity = 4,5)
Hello from rank 3 thread 0 on nid00021 (core affinity = 6,7)
Hello from rank 4 thread 0 on nid00021 (core affinity = 8,9)
Hello from rank 5 thread 0 on nid00021 (core affinity = 10,11)
Hello from rank 6 thread 0 on nid00021 (core affinity = 12,13)
Hello from rank 7 thread 0 on nid00021 (core affinity = 14,15)
Hello from rank 8 thread 0 on nid00021 (core affinity = 16,17)
Hello from rank 9 thread 0 on nid00021 (core affinity = 18,19)
Hello from rank 10 thread 0 on nid00021 (core affinity = 20,21)
Hello from rank 11 thread 0 on nid00021 (core affinity = 22,23)
Hello from rank 12 thread 0 on nid00021 (core affinity = 24,25)
Hello from rank 13 thread 0 on nid00021 (core affinity = 26,27)
Hello from rank 14 thread 0 on nid00021 (core affinity = 28,29)
Hello from rank 15 thread 0 on nid00021 (core affinity = 30,31)
Hello from rank 16 thread 0 on nid00021 (core affinity = 32,33)
Hello from rank 17 thread 0 on nid00021 (core affinity = 34,35)
Hello from rank 18 thread 0 on nid00021 (core affinity = 36,37)
Hello from rank 19 thread 0 on nid00021 (core affinity = 38,39)
Hello from rank 20 thread 0 on nid00021 (core affinity = 40,41)
Hello from rank 21 thread 0 on nid00021 (core affinity = 42,43)
Hello from rank 22 thread 0 on nid00021 (core affinity = 44,45)
Hello from rank 23 thread 0 on nid00021 (core affinity = 46,47)
Hello from rank 24 thread 0 on nid00021 (core affinity = 48,49)
Hello from rank 25 thread 0 on nid00021 (core affinity = 50,51)
Hello from rank 26 thread 0 on nid00021 (core affinity = 52,53)
Hello from rank 27 thread 0 on nid00021 (core affinity = 54,55)
Hello from rank 28 thread 0 on nid00021 (core affinity = 56,57)
Hello from rank 29 thread 0 on nid00021 (core affinity = 58,59)
Hello from rank 30 thread 0 on nid00021 (core affinity = 60,61)
Hello from rank 31 thread 0 on nid00021 (core affinity = 62,63)

We need the binding to look like this:

Hello from rank 0 thread 0 on nid00021 (core affinity = 0,32)
Hello from rank 1 thread 0 on nid00021 (core affinity = 1,33)
Hello from rank 2 thread 0 on nid00021 (core affinity = 2,34)
Hello from rank 3 thread 0 on nid00021 (core affinity = 3,35)
...

instead of the above:

Hello from rank 0 thread 0 on nid00021 (core affinity = 0,1)
Hello from rank 1 thread 0 on nid00021 (core affinity = 2,3)
Hello from rank 2 thread 0 on nid00021 (core affinity = 4,5)
Hello from rank 3 thread 0 on nid00021 (core affinity = 6,7)

2) It seems I can no longer run with hyperthreads; the following srun commands all return the error "More processors requested than permitted":

srun --cpu_bind=verbose -n64 --ntasks-per-core=2 ./xthi.intel
srun: error: Unable to create job step: More processors requested than permitted

srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 ./xthi.intel
srun: error: Unable to create job step: More processors requested than permitted

srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 --hint=multithread ./xthi.intel
srun: error: Unable to create job step: More processors requested than permitted

Could you please take a look?
I appreciate any advice that helps us fix these two problems. Thanks, Zhengji
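The difference between the observed and desired bindings can be written as mask arithmetic. This is an illustrative Python sketch, not Slurm code; it assumes the CPU numbering from this system, where a physical core's two hyperthreads are CPUs i and i+32:

```python
def observed_mask(task):
    # What Slurm produced: two adjacent CPU IDs (e.g. 0,1 for task 0),
    # which are two *different* physical cores under this numbering.
    return 0b11 << (2 * task)

def desired_mask(task, ncores=32):
    # What we want: both hyperthreads of one physical core,
    # i.e. CPU `task` and its sibling CPU `task + ncores`.
    return (1 << task) | (1 << (task + ncores))

print(hex(observed_mask(0)))  # 0x3           (CPUs 0,1)
print(hex(desired_mask(0)))   # 0x100000001   (CPUs 0,32)
```

Task 0's observed mask 0x3 matches the cpu_bind=MASK output above, while the desired mask covers CPU 0 and its sibling CPU 32.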
The first problem (not binding to the proper threads) should be fixed in the following commit from a few days ago:

https://github.com/SchedMD/slurm/commit/f36c4ee53763689c822ad524fa7b3f1853f5f9e6

This bug fix will be in Slurm version 16.05.4, which we plan to release tomorrow. I will investigate the second problem soon.
Thanks so much for the quick reply! I am glad the first problem already has a fix. I will ask our system admin to install it. In case you prefer the scontrol show config output (I did not see SelectTypeParameters in the slurm.conf file), I am including it here. Looking forward to hearing from you soon. Zhengji

zz217@gert01:~/affinity/hsw> scontrol show config
Configuration data as of 2016-08-10T16:14:31
AccountingStorageBackupHost = gert01-144
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost = gertque01-144
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,bb/cray
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/cray
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 1
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = 128.55.144.82
BackupController = gertque01
BatchStartTimeout = 10 sec
BOOT_TIME = 2016-08-10T11:40:38
BurstBufferType = burst_buffer/cray
CacheGroups = 0
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = gerty
CompleteWait = 300 sec
ControlAddr = ctlnet1
ControlMachine = ctlnet1
CoreSpecPlugin = core_spec/cray
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand
CryptoType = crypto/munge
DebugFlags = Backfill,BurstBuffer
DefMemPerNode = UNLIMITED
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = ANY
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = craynetwork,hbm
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Different Ours=0x6d885c88 Slurmctld=0xeb6d56cd
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 600 sec
JobAcctGatherFrequency = 0
JobAcctGatherType = jobacct_gather/cgroup
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/cncu
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 0
JobSubmitPlugins = cray,lua
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 1
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = SCRATCH:1000000,gscratch1:1000000,project:1000000,projecta:1000000,projectb:1000000,dna:1000000
LicensesUsed = dna:0/1000000,projectb:0/1000000,projecta:0/1000000,project:0/1000000,gscratch1:0/1000000,SCRATCH:0/1000000
MailProg = /bin/mail
MaxArraySize = 65000
MaxJobCount = 500000
MaxJobId = 2147418112
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MemLimitEnforce = Yes
MessageTimeout = 60 sec
MinJobAge = 300 sec
MpiDefault = openmpi
MpiParams = ports=63001-64000
MsgAggregationParams = (null)
NEXT_JOB_ID = 187
NodeFeaturesPlugins = knl_cray
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = REQUEUE
PreemptType = preempt/qos
PriorityParameters = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 128-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 184320
PriorityWeightFairShare = 1440
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 253440
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cray
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = Alloc,Contain
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = /usr/sbin/capmc_resume
ResumeRate = 300 nodes/min
ResumeTimeout = 1800 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = route/default
SallocDefaultCommand = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --mpi=none --cpu_bind=none $SHELL
SchedulerParameters = no_backup_scheduling,bf_window=5760,bf_resolution=120,bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=1000000,bf_continue,nohold_on_prolog_fail,kill_invalid_depend,sched_min_interval=2,bf_interval=120,bf_min_age_reserve=600,bf_max_job_user=30,bf_min_prio_reserve=69120
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cray
SelectTypeParameters = CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_ONE_TASK_PER_CORE,CR_CORE_DEFAULT_DIST_BLOCK
SlurmUser = root(0)
SlurmctldDebug = debug
SlurmctldLogFile = /var/tmp/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = /var/spool/slurmd/%h.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 16.05.3
SrunEpilog = (null)
SrunPortRange = 60001-63000
SrunProlog = (null)
StateSaveLocation = /global/syscom/gerty/sc/nsg/var/gerty-slurm-state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = /usr/sbin/capmc_suspend
SuspendRate = 60 nodes/min
SuspendTime = 30000000 sec
SuspendTimeout = 30 sec
SwitchType = switch/cray
TaskEpilog = (null)
TaskPlugin = task/affinity,task/cgroup,task/cray
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = NoInAddrAny
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
Slurmctld(primary/backup) at ctlnet1/gertque01 are UP/DOWN
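As an aside, scontrol show config output in `Key = Value` form can be turned into a lookup table with a few lines. This is a generic sketch, not a Slurm-provided tool:

```python
def parse_config(text):
    # Split "Key = Value" pairs, one per line, into a dict.
    conf = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            conf[key.strip()] = value.strip()
    return conf

sample = """SelectType = select/cray
SLURM_VERSION = 16.05.3
TaskPlugin = task/affinity,task/cgroup,task/cray"""

cfg = parse_config(sample)
print(cfg["SLURM_VERSION"])  # 16.05.3
```

This makes it easy to diff the running configuration against the expected slurm.conf settings.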
Dear Moe,

I noticed from the srun man page (quoted below) that --ntasks-per-core applies only to the job allocation (not to job step allocations), so I tried #SBATCH --ntasks-per-core=2, and that seems to allow me to use the hyperthreads (see the output attached below)! I still need to do further testing to check whether other use cases work as expected, though. I have a few questions regarding this:

1) We prefer to request a number of nodes with #SBATCH -N <# of nodes> and then use an srun command line option to indicate whether to use hyperthreads, and if so, how many logical cores (CPUs) to use per physical core. This means I would prefer an srun command line option, something like --ncpus-per-core, to indicate how many CPUs to use per core. Could you please take a look at this option?

2) If we have to use --ntasks-per-core=2 as an SBATCH flag, is there a way we can set it as the default for all jobs, so that users do not have to set it in each of their job scripts? I tested that #SBATCH --ntasks-per-core=2 and the Slurm configuration CR_ONE_TASK_PER_CORE work together fine in my limited tests.

3) When I read the srun/sbatch man pages, regarding the --ntasks-per-core option I saw the following note: "NOTE: This option is not supported unless SelectTypeParameters=CR_Core or SelectTypeParameters=CR_Core_Memory is configured. This option applies to job allocations." But we do not have CR_Core or CR_Core_Memory configured. Could you please let me know if this has changed? What we have is SelectTypeParameters=CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE, as you suggested, and the option --ntasks-per-core seems to work at job allocation time.

Thanks, Zhengji

--ntasks-per-core=<ntasks>
Request the maximum ntasks be invoked on each core. This option applies to the job allocation, but not to step allocations. Meant to be used with the --ntasks option.
Related to −−ntasks−per−node except at the core level instead of the node level. Masks will automatically be generated to bind the tasks to specific core unless −−cpu_bind=none is specified. NOTE: This option is not supported unless SelectTypeParameters=CR_Core or SelectTypeParame- ters=CR_Core_Memory is configured. This option applies to job allocations. export OMP_NUM_THREADS=1 srun --cpu_bind=verbose -n64 ./xthi.intel 2>&1 |sort -nk4,6 Hello from rank 0 thread 0 on nid00021 (core affinity = 0) cpu_bind=MASK - nid00021, task 0 0 [23010]: mask 0x1 set cpu_bind=MASK - nid00021, task 1 1 [23011]: mask 0x2 set cpu_bind=MASK - nid00021, task 2 2 [23012]: mask 0x4 set cpu_bind=MASK - nid00021, task 3 3 [23013]: mask 0x8 set cpu_bind=MASK - nid00021, task 4 4 [23014]: mask 0x10 set cpu_bind=MASK - nid00021, task 5 5 [23015]: mask 0x20 set cpu_bind=MASK - nid00021, task 6 6 [23016]: mask 0x40 set cpu_bind=MASK - nid00021, task 7 7 [23017]: mask 0x80 set cpu_bind=MASK - nid00021, task 8 8 [23018]: mask 0x100 set cpu_bind=MASK - nid00021, task 9 9 [23019]: mask 0x200 set cpu_bind=MASK - nid00021, task 10 10 [23020]: mask 0x400 set cpu_bind=MASK - nid00021, task 11 11 [23021]: mask 0x800 set cpu_bind=MASK - nid00021, task 12 12 [23022]: mask 0x1000 set cpu_bind=MASK - nid00021, task 13 13 [23023]: mask 0x2000 set cpu_bind=MASK - nid00021, task 14 14 [23024]: mask 0x4000 set cpu_bind=MASK - nid00021, task 15 15 [23025]: mask 0x8000 set cpu_bind=MASK - nid00021, task 16 16 [23026]: mask 0x10000 set cpu_bind=MASK - nid00021, task 17 17 [23027]: mask 0x20000 set cpu_bind=MASK - nid00021, task 18 18 [23028]: mask 0x40000 set cpu_bind=MASK - nid00021, task 19 19 [23029]: mask 0x80000 set cpu_bind=MASK - nid00021, task 20 20 [23030]: mask 0x100000 set cpu_bind=MASK - nid00021, task 21 21 [23031]: mask 0x200000 set cpu_bind=MASK - nid00021, task 22 22 [23032]: mask 0x400000 set cpu_bind=MASK - nid00021, task 23 23 [23033]: mask 0x800000 set cpu_bind=MASK - nid00021, task 24 24 
[23034]: mask 0x1000000 set cpu_bind=MASK - nid00021, task 25 25 [23035]: mask 0x2000000 set cpu_bind=MASK - nid00021, task 26 26 [23036]: mask 0x4000000 set cpu_bind=MASK - nid00021, task 27 27 [23037]: mask 0x8000000 set cpu_bind=MASK - nid00021, task 28 28 [23038]: mask 0x10000000 set cpu_bind=MASK - nid00021, task 29 29 [23039]: mask 0x20000000 set cpu_bind=MASK - nid00021, task 30 30 [23040]: mask 0x40000000 set cpu_bind=MASK - nid00021, task 31 31 [23041]: mask 0x80000000 set cpu_bind=MASK - nid00021, task 32 32 [23042]: mask 0x100000000 set cpu_bind=MASK - nid00021, task 33 33 [23043]: mask 0x200000000 set cpu_bind=MASK - nid00021, task 34 34 [23044]: mask 0x400000000 set cpu_bind=MASK - nid00021, task 35 35 [23045]: mask 0x800000000 set cpu_bind=MASK - nid00021, task 36 36 [23046]: mask 0x1000000000 set cpu_bind=MASK - nid00021, task 37 37 [23047]: mask 0x2000000000 set cpu_bind=MASK - nid00021, task 38 38 [23048]: mask 0x4000000000 set cpu_bind=MASK - nid00021, task 39 39 [23049]: mask 0x8000000000 set cpu_bind=MASK - nid00021, task 40 40 [23050]: mask 0x10000000000 set cpu_bind=MASK - nid00021, task 41 41 [23051]: mask 0x20000000000 set cpu_bind=MASK - nid00021, task 42 42 [23052]: mask 0x40000000000 set cpu_bind=MASK - nid00021, task 43 43 [23053]: mask 0x80000000000 set cpu_bind=MASK - nid00021, task 44 44 [23054]: mask 0x100000000000 set cpu_bind=MASK - nid00021, task 45 45 [23055]: mask 0x200000000000 set cpu_bind=MASK - nid00021, task 46 46 [23056]: mask 0x400000000000 set cpu_bind=MASK - nid00021, task 47 47 [23057]: mask 0x800000000000 set cpu_bind=MASK - nid00021, task 48 48 [23058]: mask 0x1000000000000 set cpu_bind=MASK - nid00021, task 49 49 [23059]: mask 0x2000000000000 set cpu_bind=MASK - nid00021, task 50 50 [23060]: mask 0x4000000000000 set cpu_bind=MASK - nid00021, task 51 51 [23061]: mask 0x8000000000000 set cpu_bind=MASK - nid00021, task 52 52 [23062]: mask 0x10000000000000 set cpu_bind=MASK - nid00021, task 53 53 [23063]: mask 
0x20000000000000 set cpu_bind=MASK - nid00021, task 54 54 [23064]: mask 0x40000000000000 set cpu_bind=MASK - nid00021, task 55 55 [23065]: mask 0x80000000000000 set cpu_bind=MASK - nid00021, task 56 56 [23066]: mask 0x100000000000000 set cpu_bind=MASK - nid00021, task 57 57 [23067]: mask 0x200000000000000 set cpu_bind=MASK - nid00021, task 58 58 [23068]: mask 0x400000000000000 set cpu_bind=MASK - nid00021, task 59 59 [23069]: mask 0x800000000000000 set cpu_bind=MASK - nid00021, task 60 60 [23070]: mask 0x1000000000000000 set cpu_bind=MASK - nid00021, task 61 61 [23071]: mask 0x2000000000000000 set cpu_bind=MASK - nid00021, task 62 62 [23072]: mask 0x4000000000000000 set cpu_bind=MASK - nid00021, task 63 63 [23073]: mask 0x8000000000000000 set Hello from rank 1 thread 0 on nid00021 (core affinity = 1) Hello from rank 2 thread 0 on nid00021 (core affinity = 2) Hello from rank 3 thread 0 on nid00021 (core affinity = 3) Hello from rank 4 thread 0 on nid00021 (core affinity = 4) Hello from rank 5 thread 0 on nid00021 (core affinity = 5) Hello from rank 6 thread 0 on nid00021 (core affinity = 6) Hello from rank 7 thread 0 on nid00021 (core affinity = 7) Hello from rank 8 thread 0 on nid00021 (core affinity = 8) Hello from rank 9 thread 0 on nid00021 (core affinity = 9) Hello from rank 10 thread 0 on nid00021 (core affinity = 10) Hello from rank 11 thread 0 on nid00021 (core affinity = 11) Hello from rank 12 thread 0 on nid00021 (core affinity = 12) Hello from rank 13 thread 0 on nid00021 (core affinity = 13) Hello from rank 14 thread 0 on nid00021 (core affinity = 14) Hello from rank 15 thread 0 on nid00021 (core affinity = 15) Hello from rank 16 thread 0 on nid00021 (core affinity = 16) Hello from rank 17 thread 0 on nid00021 (core affinity = 17) Hello from rank 18 thread 0 on nid00021 (core affinity = 18) Hello from rank 19 thread 0 on nid00021 (core affinity = 19) Hello from rank 20 thread 0 on nid00021 (core affinity = 20) Hello from rank 21 thread 0 on nid00021 
(core affinity = 21) Hello from rank 22 thread 0 on nid00021 (core affinity = 22) Hello from rank 23 thread 0 on nid00021 (core affinity = 23) Hello from rank 24 thread 0 on nid00021 (core affinity = 24) Hello from rank 25 thread 0 on nid00021 (core affinity = 25) Hello from rank 26 thread 0 on nid00021 (core affinity = 26) Hello from rank 27 thread 0 on nid00021 (core affinity = 27) Hello from rank 28 thread 0 on nid00021 (core affinity = 28) Hello from rank 29 thread 0 on nid00021 (core affinity = 29) Hello from rank 30 thread 0 on nid00021 (core affinity = 30) Hello from rank 31 thread 0 on nid00021 (core affinity = 31) Hello from rank 32 thread 0 on nid00021 (core affinity = 32) Hello from rank 33 thread 0 on nid00021 (core affinity = 33) Hello from rank 34 thread 0 on nid00021 (core affinity = 34) Hello from rank 35 thread 0 on nid00021 (core affinity = 35) Hello from rank 36 thread 0 on nid00021 (core affinity = 36) Hello from rank 37 thread 0 on nid00021 (core affinity = 37) Hello from rank 38 thread 0 on nid00021 (core affinity = 38) Hello from rank 39 thread 0 on nid00021 (core affinity = 39) Hello from rank 40 thread 0 on nid00021 (core affinity = 40) Hello from rank 41 thread 0 on nid00021 (core affinity = 41) Hello from rank 42 thread 0 on nid00021 (core affinity = 42) Hello from rank 43 thread 0 on nid00021 (core affinity = 43) Hello from rank 44 thread 0 on nid00021 (core affinity = 44) Hello from rank 45 thread 0 on nid00021 (core affinity = 45) Hello from rank 46 thread 0 on nid00021 (core affinity = 46) Hello from rank 47 thread 0 on nid00021 (core affinity = 47) Hello from rank 48 thread 0 on nid00021 (core affinity = 48) Hello from rank 49 thread 0 on nid00021 (core affinity = 49) Hello from rank 50 thread 0 on nid00021 (core affinity = 50) Hello from rank 51 thread 0 on nid00021 (core affinity = 51) Hello from rank 52 thread 0 on nid00021 (core affinity = 52) Hello from rank 53 thread 0 on nid00021 (core affinity = 53) Hello from rank 54 thread 
0 on nid00021 (core affinity = 54) Hello from rank 55 thread 0 on nid00021 (core affinity = 55) Hello from rank 56 thread 0 on nid00021 (core affinity = 56) Hello from rank 57 thread 0 on nid00021 (core affinity = 57) Hello from rank 58 thread 0 on nid00021 (core affinity = 58) Hello from rank 59 thread 0 on nid00021 (core affinity = 59) Hello from rank 60 thread 0 on nid00021 (core affinity = 60) Hello from rank 61 thread 0 on nid00021 (core affinity = 61) Hello from rank 62 thread 0 on nid00021 (core affinity = 62) Hello from rank 63 thread 0 on nid00021 (core affinity = 63) export OMP_NUM_THREADS=8 srun --cpu_bind=verbose -n8 -c8 ./xthi.intel 2>&1 |sort -nk4,6 Hello from rank 0 thread 0 on nid00021 (core affinity = 0-7) Hello from rank 0 thread 1 on nid00021 (core affinity = 0-7) Hello from rank 0 thread 2 on nid00021 (core affinity = 0-7) Hello from rank 0 thread 3 on nid00021 (core affinity = 0-7) Hello from rank 0 thread 4 on nid00021 (core affinity = 0-7) Hello from rank 0 thread 5 on nid00021 (core affinity = 0-7) Hello from rank 0 thread 6 on nid00021 (core affinity = 0-7) Hello from rank 0 thread 7 on nid00021 (core affinity = 0-7) cpu_bind=MASK - nid00021, task 0 0 [25632]: mask 0xff set cpu_bind=MASK - nid00021, task 1 1 [25633]: mask 0xff00 set cpu_bind=MASK - nid00021, task 2 2 [25634]: mask 0xff0000 set cpu_bind=MASK - nid00021, task 3 3 [25635]: mask 0xff000000 set cpu_bind=MASK - nid00021, task 4 4 [25636]: mask 0xff00000000 set cpu_bind=MASK - nid00021, task 5 5 [25637]: mask 0xff0000000000 set cpu_bind=MASK - nid00021, task 6 6 [25638]: mask 0xff000000000000 set cpu_bind=MASK - nid00021, task 7 7 [25639]: mask 0xff00000000000000 set Hello from rank 1 thread 0 on nid00021 (core affinity = 8-15) Hello from rank 1 thread 1 on nid00021 (core affinity = 8-15) Hello from rank 1 thread 2 on nid00021 (core affinity = 8-15) Hello from rank 1 thread 3 on nid00021 (core affinity = 8-15) Hello from rank 1 thread 4 on nid00021 (core affinity = 8-15) Hello 
from rank 1 thread 5 on nid00021 (core affinity = 8-15)
Hello from rank 1 thread 6 on nid00021 (core affinity = 8-15)
Hello from rank 1 thread 7 on nid00021 (core affinity = 8-15)
Hello from rank 2 thread 0 on nid00021 (core affinity = 16-23)
Hello from rank 2 thread 1 on nid00021 (core affinity = 16-23)
Hello from rank 2 thread 2 on nid00021 (core affinity = 16-23)
Hello from rank 2 thread 3 on nid00021 (core affinity = 16-23)
Hello from rank 2 thread 4 on nid00021 (core affinity = 16-23)
Hello from rank 2 thread 5 on nid00021 (core affinity = 16-23)
Hello from rank 2 thread 6 on nid00021 (core affinity = 16-23)
Hello from rank 2 thread 7 on nid00021 (core affinity = 16-23)
Hello from rank 3 thread 0 on nid00021 (core affinity = 24-31)
Hello from rank 3 thread 1 on nid00021 (core affinity = 24-31)
Hello from rank 3 thread 2 on nid00021 (core affinity = 24-31)
Hello from rank 3 thread 3 on nid00021 (core affinity = 24-31)
Hello from rank 3 thread 4 on nid00021 (core affinity = 24-31)
Hello from rank 3 thread 5 on nid00021 (core affinity = 24-31)
Hello from rank 3 thread 6 on nid00021 (core affinity = 24-31)
Hello from rank 3 thread 7 on nid00021 (core affinity = 24-31)
Hello from rank 4 thread 0 on nid00021 (core affinity = 32-39)
Hello from rank 4 thread 1 on nid00021 (core affinity = 32-39)
Hello from rank 4 thread 2 on nid00021 (core affinity = 32-39)
Hello from rank 4 thread 3 on nid00021 (core affinity = 32-39)
Hello from rank 4 thread 4 on nid00021 (core affinity = 32-39)
Hello from rank 4 thread 5 on nid00021 (core affinity = 32-39)
Hello from rank 4 thread 6 on nid00021 (core affinity = 32-39)
Hello from rank 4 thread 7 on nid00021 (core affinity = 32-39)
Hello from rank 5 thread 0 on nid00021 (core affinity = 40-47)
Hello from rank 5 thread 1 on nid00021 (core affinity = 40-47)
Hello from rank 5 thread 2 on nid00021 (core affinity = 40-47)
Hello from rank 5 thread 3 on nid00021 (core affinity = 40-47)
Hello from rank 5 thread 4 on nid00021 (core affinity = 40-47)
Hello from rank 5 thread 5 on nid00021 (core affinity = 40-47)
Hello from rank 5 thread 6 on nid00021 (core affinity = 40-47)
Hello from rank 5 thread 7 on nid00021 (core affinity = 40-47)
Hello from rank 6 thread 0 on nid00021 (core affinity = 48-55)
Hello from rank 6 thread 1 on nid00021 (core affinity = 48-55)
Hello from rank 6 thread 2 on nid00021 (core affinity = 48-55)
Hello from rank 6 thread 3 on nid00021 (core affinity = 48-55)
Hello from rank 6 thread 4 on nid00021 (core affinity = 48-55)
Hello from rank 6 thread 5 on nid00021 (core affinity = 48-55)
Hello from rank 6 thread 6 on nid00021 (core affinity = 48-55)
Hello from rank 6 thread 7 on nid00021 (core affinity = 48-55)
Hello from rank 7 thread 0 on nid00021 (core affinity = 56-63)
Hello from rank 7 thread 1 on nid00021 (core affinity = 56-63)
Hello from rank 7 thread 2 on nid00021 (core affinity = 56-63)
Hello from rank 7 thread 3 on nid00021 (core affinity = 56-63)
Hello from rank 7 thread 4 on nid00021 (core affinity = 56-63)
Hello from rank 7 thread 5 on nid00021 (core affinity = 56-63)
Hello from rank 7 thread 6 on nid00021 (core affinity = 56-63)
Hello from rank 7 thread 7 on nid00021 (core affinity = 56-63)
(In reply to Zhengji Zhao from comment #34)
> 2) It seems now I can not run with hyperthreads anymore, the following srun
> commands all return error, "More processors requested than permitted",
>
> srun --cpu_bind=verbose -n64 --ntasks-per-core=2 ./xthi.intel
> srun: error: Unable to create job step: More processors requested than permitted
>
> srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 ./xthi.intel
> srun: error: Unable to create job step: More processors requested than permitted
>
> srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 --hint=multithread ./xthi.intel
> srun: error: Unable to create job step: More processors requested than permitted

Dear Zhengji,

I am not able to reproduce this second problem. An execute line of this sort:

> srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 ./xthi.intel

binds one task to each thread for me. Are you running this srun command within an existing job allocation (under an salloc or sbatch shell)? If so, could you send me the salloc/sbatch execute line you use? I would guess that your salloc/sbatch command is setting some environment variables. srun merges that environment with the options on your execute line, which may generate a step request that cannot be satisfied.
I used the following job script to get that "More processors requested than permitted" error in my comment 34.

zz217@gert01:~/affinity/hsw> cat run.slurm
#!/bin/bash -l

#SBATCH -N 1
#SBATCH -p debug

set -x

#srun --cpu_bind=verbose --mem_bind=verbose,local -n32 ./xthi.intel 2>&1 |sort -nk4,6

srun --cpu_bind=verbose -n64 --ntasks-per-core=2 ./xthi.intel 2>&1 |sort -nk4,6
srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 ./xthi.intel 2>&1 |sort -nk4,6
srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 --hint=multithread ./xthi.intel 2>&1 |sort -nk4,6

zz217@gert01:~/affinity/hsw> sbatch run.slurm
Submitted batch job 210
zz217@gert01:~/affinity/hsw> cat slurm-210.out
+ srun --cpu_bind=verbose -n64 --ntasks-per-core=2 ./xthi.intel
+ sort -nk4,6
srun: error: Unable to create job step: More processors requested than permitted
+ sort -nk4,6
+ srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 ./xthi.intel
srun: error: Unable to create job step: More processors requested than permitted
+ srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 --hint=multithread ./xthi.intel
+ sort -nk4,6
srun: error: Unable to create job step: More processors requested than permitted

Please see my comment 37 (for more questions I have), where I tried #SBATCH --ntasks-per-core=2 and it seems to allow me to use hyperthreads.

Thanks,
Zhengji
The "-N1" option allocates the job resources on at least one node, but the minimum allocation size (based upon "CR_Socket_Memory" in slurm.conf) is one socket. Your job allocation only includes one socket, which is 16 cores or 32 threads, while your srun commands are trying to launch a job step using 64 threads.

What you presumably want is to modify the job request so that the allocation includes both sockets (64 threads) and the job step can run 64 tasks (one per thread). I suggest doing this with a line like the following in your script:

#SBATCH -N 1 -n64 --ntasks-per-core=1

I'll respond to your other questions in a separate comment.

(In reply to Zhengji Zhao from comment #39)
> I used the following job script to get that "More processors requested than
> permitted" error in my comment 34.
>
> zz217@gert01:~/affinity/hsw> cat run.slurm
> #!/bin/bash -l
>
> #SBATCH -N 1
> #SBATCH -p debug
>
> set -x
>
> #srun --cpu_bind=verbose --mem_bind=verbose,local -n32 ./xthi.intel 2>&1 |sort -nk4,6
>
> srun --cpu_bind=verbose -n64 --ntasks-per-core=2 ./xthi.intel 2>&1 |sort -nk4,6
> srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 ./xthi.intel 2>&1 |sort -nk4,6
> srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 --hint=multithread ./xthi.intel 2>&1 |sort -nk4,6
>
> zz217@gert01:~/affinity/hsw> sbatch run.slurm
> Submitted batch job 210
> zz217@gert01:~/affinity/hsw> cat slurm-210.out
> + srun --cpu_bind=verbose -n64 --ntasks-per-core=2 ./xthi.intel
> + sort -nk4,6
> srun: error: Unable to create job step: More processors requested than permitted
> + sort -nk4,6
> + srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 ./xthi.intel
> srun: error: Unable to create job step: More processors requested than permitted
> + srun --cpu_bind=verbose,threads -n64 --ntasks-per-core=2 --hint=multithread ./xthi.intel
> + sort -nk4,6
> srun: error: Unable to create job step: More processors requested than permitted
>
> Please see my comment 37 (for more questions I have), where I tried
> #SBATCH --ntasks-per-core=2 and it seems to allow me to use hyperthreads.
>
> Thanks,
> Zhengji
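Putting the suggestion above into the script from comment 39 might look like the following sketch. It writes the revised script out and syntax-checks it here, since actual submission requires the Slurm system; the srun options are kept as used earlier in this ticket.

```shell
# Sketch of the revised job script per the suggestion above. The #SBATCH
# line requests 64 tasks so the allocation spans both sockets, rather than
# the single-socket minimum implied by CR_Socket_Memory with "-N 1" alone.
cat > run.slurm <<'EOF'
#!/bin/bash -l
#SBATCH -N 1 -n64 --ntasks-per-core=1
#SBATCH -p debug
set -x
srun --cpu_bind=verbose -n64 ./xthi.intel 2>&1 | sort -nk4,6
EOF

# Syntax-check only; on the Slurm system this would be: sbatch run.slurm
bash -n run.slurm && echo "run.slurm syntax OK"
```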
Dear Zhengji,

My responses are in-line below.

(In reply to Zhengji Zhao from comment #37)
> Dear Moe,
>
> I noticed from the srun man page (quoted below) that the --ntasks-per-core
> option is valid only for the job allocation (it does not apply to the job
> step allocation), so I tried #SBATCH --ntasks-per-core=2 and that seems to
> allow me to use the hyperthreads (see the output attached below)! I still
> need to do further testing to check if other use cases work as expected,
> though.
>
> I have two questions regarding this:
>
> 1) It is preferred that we request a number of nodes using
> #SBATCH -N <# of nodes> and then use an srun command line option to indicate
> whether we want to use hyperthreads, and if yes, how many logical cores
> (CPUs) to use per physical core. This means I would prefer an srun command
> line option, something like --ncpus-per-core, to indicate how many CPUs to
> use per core. Could you please take a look at this option?

Your system is currently configured to allocate resources to jobs at the socket level rather than the node level (Slurm can allocate at the level of nodes, sockets, cores, or threads, depending upon its configuration). The advantage of this is that more than one job can run at a time on a compute node, which works very well for smaller jobs. The "-N#" option only tells Slurm to allocate the job resources on the specified node count. If you want to ensure individual jobs are allocated more than a single socket on the node, say the entire node, the job request should specify this using something like a task count plus --ntasks-per-core. Note that setting options in sbatch results in environment variables being set for the job step creation; for example, "sbatch -n64 ..." eliminates the need for the "-n64" option in srun.

> 2) If we have to use --ntasks-per-core=2 as an SBATCH flag, is there a way
> we can set it as the default for all jobs, so that users do not have to set
> it in each of their job scripts? I tested that #SBATCH --ntasks-per-core=2
> and the Slurm configuration option CR_ONE_TASK_PER_CORE work together fine
> in my limited tests.

Perhaps you want to eliminate the "CR_ONE_TASK_PER_CORE" option in slurm.conf and require users who want to run one task per core to explicitly specify "--ntasks-per-core=1". Alternately, global environment variables or a job_submit plugin can be used to set various default options for jobs. See: http://slurm.schedmd.com/job_submit_plugins.html

> 3) When I read the srun/sbatch man pages, regarding the --ntasks-per-core
> option I saw the following note,
>
> "NOTE: This option is not supported unless SelectTypeParameters=CR_Core or
> SelectTypeParameters=CR_Core_Memory is configured. This option applies to
> job allocations."
>
> but we do not have CR_Core or CR_Core_Memory configured. Could you please
> let me know if this has been changed? What we have is

That documentation is no longer correct. I will update it shortly.

> SelectTypeParameters=CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
>
> as you suggested, and the option --ntasks-per-core seems to work at job
> allocation time.
>
> Thanks,
> Zhengji
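The global-environment-variable route mentioned above could be sketched as follows. This is a hypothetical site-wide profile fragment (the filename is an assumption, not something from this ticket); it uses SLURM_HINT, which was suggested earlier in this ticket as a way to give srun a default --hint value that users can still override on the command line.

```shell
# Hypothetical site-wide profile fragment, e.g. /etc/profile.d/slurm_defaults.sh
# (filename is illustrative). SLURM_HINT supplies a default for srun's --hint
# option; a user who wants hyperthreading can still pass --hint=multithread
# explicitly, which takes precedence over the environment.
export SLURM_HINT=nomultithread

# Sanity check: confirm the default is visible to shells that source this file.
echo "default hint: ${SLURM_HINT}"
```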
Dear Moe,

Thanks for getting back to me.

Regarding how many sockets (one or both) #SBATCH -N 1 allocates with CR_Socket_Memory configured, I still have a question.

We have several partitions configured on our systems, and not all of them are configured to share nodes between jobs. For example, the debug partition that I used has the following configuration:

zz217@gert01:~/affinity/hsw> scontrol show partition debug
PartitionName=debug
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=00:10:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=00:30:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=nid000[21-23,28-30,52-54,56-63]
   PriorityJobFactor=1000 PriorityTier=1000 RootOnly=NO ReqResv=NO
   OverSubscribe=EXCLUSIVE PreemptMode=REQUEUE
   State=UP TotalCPUs=1088 TotalNodes=17 SelectTypeParameters=NONE
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Please note we have OverSubscribe=EXCLUSIVE set for this partition. I wonder if this should be sufficient to avoid the node being shared with other jobs, in which case I think the node should be fully (both sockets) allocated to the job. On our production systems (both the Cray XC30 called Edison and the XC40 called Cori Phase I), we have been observing this behavior: if I use #SBATCH -N 1, we get the full node (both sockets), and we did not have to use other SBATCH flags to allocate the full node to the job. On Edison and Cori we have the following configuration:

zz217@cori06:~> scontrol show config |grep -E "TaskPlugin|SelectTypeParameters"
SelectTypeParameters    = CR_SOCKET_MEMORY,OTHER_CONS_RES
TaskPlugin              = task/cgroup,task/cray
TaskPluginParam         = (null type)

while on the test system (Gerty), with the following configuration and the same debug partition configuration (OverSubscribe=EXCLUSIVE), I am unable to get the full node with #SBATCH -N 1 alone.

zz217@gert01:~> scontrol show config |grep -E "TaskPlugin|SelectTypeParameters"
SelectTypeParameters    = CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_ONE_TASK_PER_CORE,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin              = task/affinity,task/cgroup,task/cray
TaskPluginParam         = (null type)
zz217@gert01:~>

So I think there may still be room for improvement in Slurm: even with CR_Socket_Memory configured, if the partition does not allow node sharing, the full node could be allocated to a job with #SBATCH -N 1 alone. I understand this may or may not be a bug, and we can get around it with either #SBATCH -N 1 --ntasks-per-core=2 (with CR_ONE_TASK_PER_CORE set) or #SBATCH -N 1 --ntasks=64 --ntasks-per-core=1, as you suggested. So this is not a pressing issue for me now, but it would be great if someday this could be fixed. I will decide whether we settle on #SBATCH -N 1 --ntasks-per-core=2 for hyperthreading when the default is CR_ONE_TASK_PER_CORE. Please let me know whether you consider this something that needs to be fixed.

Regarding your following comment:

"Perhaps you want to eliminate the "CR_ONE_TASK_PER_CORE" option in slurm.conf then and require users who want to run one task per core to explicitly specify "--ntasks-per-core=1". Alternately global environment variables or a job_submit plugin can be used to set various default options for jobs. See: http://slurm.schedmd.com/job_submit_plugins.html"

I would like to let you know that we wanted to do the opposite: we wanted the default to be no hyperthreading, with anyone who wants hyperthreading indicating that explicitly with an extra flag (e.g., --ntasks-per-core=2). The reason is that most of our workloads do not benefit from hyperthreading on Edison and Cori Phase I. However, the situation may change on KNL (Cori Phase II), so it is possible that in the future we will want hyperthreading to be the default there and require users who do not want it to explicitly specify "--ntasks-per-core=1".

I will do further testing with Slurm 16.05.4 and will let you know if it meets our task/memory/thread affinity needs. I will open another bug shortly for the affinity issues on KNL, as you suggested in this bug.

Thanks very much for your timely help.

Zhengji
Dear Moe,

Is Slurm 16.05.4 available now? Could you please let me know where I can find this version? I see it has already been listed on the bugs.schedmd.com site, but I did not see it on your download site.

Thanks,
Zhengji
The release of version 16.05.4 was delayed until this morning due to a family emergency. You can download Slurm from here: http://www.schedmd.com/#repos
(In reply to Zhengji Zhao from comment #42)
> Dear Moe,
>
> Thanks for getting back to me.
>
> Regarding how many sockets (one socket or both sockets) the #SBATCH -N 1
> allocates with the CR_Socket_Memory configured, I still have a question.
>
> We have several partitions configured on our systems, and not all partitions
> are configured to share the nodes between jobs. For example, for the debug
> partition that I used, we have the following configuration,
>
> zz217@gert01:~/affinity/hsw> scontrol show partition debug
> PartitionName=debug
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=YES QoS=N/A
>    DefaultTime=00:10:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=UNLIMITED MaxTime=00:30:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=nid000[21-23,28-30,52-54,56-63]
>    PriorityJobFactor=1000 PriorityTier=1000 RootOnly=NO ReqResv=NO
>    OverSubscribe=EXCLUSIVE PreemptMode=REQUEUE
>    State=UP TotalCPUs=1088 TotalNodes=17 SelectTypeParameters=NONE
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
> Please note we have the OverSubscribe=EXCLUSIVE set for this partition. I
> wonder if this should be sufficient to avoid the node being shared with
> other jobs, in which case, then I think the node should be fully (both
> sockets) allocated to the job.

I overlooked the "OverSubscribe=EXCLUSIVE" in the partition specification. That is sufficient to allocate your job all cores on all sockets of the allocated node with the "-N1" option.

> In our production systems (both Cray XC30 called Edison, and XC40 called
> Cori Phase I), we have been observing this behavior, i.e., if I use
> #SBATCH -N 1, we are getting the full node (both sockets), and we did not
> have to use other SBATCH flags to help allocate the full node to the job.
> On Edison and Cori we have the following configuration
>
> zz217@cori06:~> scontrol show config |grep -E "TaskPlugin|SelectTypeParameters"
> SelectTypeParameters    = CR_SOCKET_MEMORY,OTHER_CONS_RES
> TaskPlugin              = task/cgroup,task/cray
> TaskPluginParam         = (null type)
>
> while now on the test system (Gerty), with the following configuration with
> the same debug partition configuration (OverSubscribe=EXCLUSIVE) I am unable
> to get the full node with #SBATCH -N 1 alone.

I believe that you were being allocated the full node, but the bug binding tasks to the wrong CPUs was making it look like that was not the case.

> zz217@gert01:~> scontrol show config |grep -E "TaskPlugin|SelectTypeParameters"
> SelectTypeParameters    = CR_SOCKET_MEMORY,OTHER_CONS_RES,CR_ONE_TASK_PER_CORE,CR_CORE_DEFAULT_DIST_BLOCK
> TaskPlugin              = task/affinity,task/cgroup,task/cray
> TaskPluginParam         = (null type)
> zz217@gert01:~>
>
> So I think perhaps there is still some room to improve in Slurm, so that
> even with CR_Socket_Memory configured, if the partition does not allow
> node sharing, then the full node can be allocated to a job with
> #SBATCH -N 1 alone. I understand this may or may not be a bug, and we can
> get around this with either #SBATCH -N 1 --ntasks-per-core=2 with
> CR_ONE_TASK_PER_CORE being set or with #SBATCH -N 1 --ntasks=64
> --ntasks-per-core=1 as you suggested. So this is not a pressing issue for
> me now. However, it would be great if someday this could be fixed. I will
> decide if we will settle with the #SBATCH -N 1 --ntasks-per-core=2 to use
> hyperthreading when the default is CR_ONE_TASK_PER_CORE. Please let me know
> if you consider this something that needs to be fixed or not.
>
> Regarding your following comment,
>
> "Perhaps you want to eliminate the "CR_ONE_TASK_PER_CORE" option in
> slurm.conf then and require users who want to run one task per core to
> explicitly specify "--ntasks-per-core=1".
>
> Alternately global environment variables or a job_submit plugin can be used
> to set various default options for jobs. See:
> http://slurm.schedmd.com/job_submit_plugins.html"
>
> I would like to let you know that we wanted to do the opposite. We wanted
> the default to be not using hyperthreading, and whoever wants to use
> hyperthreading to indicate that explicitly with an extra flag (e.g.,
> --ntasks-per-core=2). The reason we wanted the default not to bother with
> hyperthreading was that most of our workloads do not get benefits from
> using hyperthreading on Edison and Cori Phase I. However, the situation may
> change on KNL (Cori Phase II), so it is possible that in the future we want
> hyperthreading to be the default on Cori Phase II, and require users who
> do not want to use hyperthreading to explicitly specify
> "--ntasks-per-core=1".
>
> I will do further testing with SLURM 16.05.4, and will let you know if it
> meets our task/memory/thread affinity need.
>
> I will open another bug shortly for the affinity issues on KNL as you
> suggested in this bug.
>
> Thanks very much for your timely help.
>
> Zhengji
Please open a new ticket or re-open this one if necessary.
Dear Moe,

I would like to confirm with you whether the last bug was fixed. In comment 45, you said:

"I overlooked the "OverSubscribe=EXCLUSIVE" in the partition specification. That is sufficient to allocate your job all cores on all sockets of the allocated node with the "-N1" option."

Could you please let us know? We are currently running 16.05.5, and we still cannot get all the CPUs on the node (the full node) with #SBATCH -N 1 alone with the current select parameters:

swowner@cori08:~> scontrol show config |grep -i select
SelectType              = select/cray
SelectTypeParameters    = CR_SOCKET_MEMORY,OTHER_CONS_RES,NHC_NO,CR_ONE_TASK_PER_CORE,CR_CORE_DEFAULT_DIST_BLOCK

Thanks,
Zhengji
Sorry, I should have added that we still cannot get all the CPUs on the node with #SBATCH -N 1 under a partition configured with OverSubscribe=EXCLUSIVE.

Thanks,
Zhengji
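One way to see exactly what the allocation contains is to pull the NumCPUs field out of `scontrol show job`; for a full Haswell node here that should report 64 CPUs. The following is only a sketch: the sample line stands in for real scontrol output, since the real command needs the Slurm system.

```shell
# Hypothetical check for how many CPUs a job was actually allocated.
# On the real system this would be run inside the job as:
#   scontrol show job "$SLURM_JOB_ID" | grep -Eo 'NumCPUs=[0-9]+'
# The sample line below is illustrative only, not output from this ticket.
sample='NumNodes=1 NumCPUs=64 NumTasks=64 CPUs/Task=1'
echo "$sample" | grep -Eo 'NumCPUs=[0-9]+'
```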
There is information about quite a few different hardware and software configurations in this ticket. There were definitely some task binding issues in Slurm version 16.05.4 with some KNL NUMA modes. For example, SNC2/flat in Slurm version 16.05.4 would produce the following task binding:

$ sbatch -N1 tmp
cpu_bind=MASK - knl, task 0 0 [91243]: mask 0xffffffffffffffffffffffffffffffffff000000003ffffffff000000003ffffffff set

That is corrected in version 16.05.5:

cpu_bind=MASK - knl, task 0 0 [49658]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff set

Can you provide more details about exactly what you are seeing in this ticket, or open a new ticket, which would probably be less confusing?

More complete logs below:

$ cat tmp
#!/bin/bash
./srun --cpu_bind=v sleep 100
exit 0

$ sbatch -N1 tmp
Submitted batch job 38
$ cat sl*out
cpu_bind=MASK - knl, task 0 0 [49658]: mask 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff set

SlurmctldLogFile (with "DebugFlags=CPU_Bind" configured):

slurmctld: _slurm_rpc_submit_batch_job JobId=38 usec=1301
slurmctld: ====================
slurmctld: job_id:38 nhosts:1 ncpus:1 node_req:64000 nodes=knl0
slurmctld: Node[0]:
slurmctld: Mem(MB):6800:0 Sockets:2 Cores:34 CPUs:68:0
slurmctld: Socket[0] Core[0] is allocated
slurmctld: Socket[0] Core[1] is allocated
slurmctld: Socket[0] Core[2] is allocated
slurmctld: Socket[0] Core[3] is allocated
slurmctld: Socket[0] Core[4] is allocated
slurmctld: Socket[0] Core[5] is allocated
slurmctld: Socket[0] Core[6] is allocated
slurmctld: Socket[0] Core[7] is allocated
slurmctld: Socket[0] Core[8] is allocated
slurmctld: Socket[0] Core[9] is allocated
slurmctld: Socket[0] Core[10] is allocated
slurmctld: Socket[0] Core[11] is allocated
slurmctld: Socket[0] Core[12] is allocated
slurmctld: Socket[0] Core[13] is allocated
slurmctld: Socket[0] Core[14] is allocated
slurmctld: Socket[0] Core[15] is allocated
slurmctld: Socket[0] Core[16] is allocated
slurmctld: Socket[0] Core[17] is allocated
slurmctld: Socket[0] Core[18] is allocated
slurmctld: Socket[0] Core[19] is allocated
slurmctld: Socket[0] Core[20] is allocated
slurmctld: Socket[0] Core[21] is allocated
slurmctld: Socket[0] Core[22] is allocated
slurmctld: Socket[0] Core[23] is allocated
slurmctld: Socket[0] Core[24] is allocated
slurmctld: Socket[0] Core[25] is allocated
slurmctld: Socket[0] Core[26] is allocated
slurmctld: Socket[0] Core[27] is allocated
slurmctld: Socket[0] Core[28] is allocated
slurmctld: Socket[0] Core[29] is allocated
slurmctld: Socket[0] Core[30] is allocated
slurmctld: Socket[0] Core[31] is allocated
slurmctld: Socket[0] Core[32] is allocated
slurmctld: Socket[0] Core[33] is allocated
slurmctld: Socket[1] Core[0] is allocated
slurmctld: Socket[1] Core[1] is allocated
slurmctld: Socket[1] Core[2] is allocated
slurmctld: Socket[1] Core[3] is allocated
slurmctld: Socket[1] Core[4] is allocated
slurmctld: Socket[1] Core[5] is allocated
slurmctld: Socket[1] Core[6] is allocated
slurmctld: Socket[1] Core[7] is allocated
slurmctld: Socket[1] Core[8] is allocated
slurmctld: Socket[1] Core[9] is allocated
slurmctld: Socket[1] Core[10] is allocated
slurmctld: Socket[1] Core[11] is allocated
slurmctld: Socket[1] Core[12] is allocated
slurmctld: Socket[1] Core[13] is allocated
slurmctld: Socket[1] Core[14] is allocated
slurmctld: Socket[1] Core[15] is allocated
slurmctld: Socket[1] Core[16] is allocated
slurmctld: Socket[1] Core[17] is allocated
slurmctld: Socket[1] Core[18] is allocated
slurmctld: Socket[1] Core[19] is allocated
slurmctld: Socket[1] Core[20] is allocated
slurmctld: Socket[1] Core[21] is allocated
slurmctld: Socket[1] Core[22] is allocated
slurmctld: Socket[1] Core[23] is allocated
slurmctld: Socket[1] Core[24] is allocated
slurmctld: Socket[1] Core[25] is allocated
slurmctld: Socket[1] Core[26] is allocated
slurmctld: Socket[1] Core[27] is allocated
slurmctld: Socket[1] Core[28] is allocated
slurmctld: Socket[1] Core[29] is allocated
slurmctld: Socket[1] Core[30] is allocated
slurmctld: Socket[1] Core[31] is allocated
slurmctld: Socket[1] Core[32] is allocated
slurmctld: Socket[1] Core[33] is allocated
slurmctld: --------------------
slurmctld: cpu_array_value[0]:68 reps:1
slurmctld: ====================
slurmctld: sched: Allocate JobID=38 NodeList=knl0 #CPUs=272 Partition=debug
slurmctld: _pick_step_nodes: Configuration for job 38 is complete
slurmctld: ====================
slurmctld: step_id:38.0
slurmctld: JobNode[0] Socket[0] Core[0] is allocated
slurmctld: JobNode[0] Socket[0] Core[1] is allocated
slurmctld: JobNode[0] Socket[0] Core[2] is allocated
slurmctld: JobNode[0] Socket[0] Core[3] is allocated
slurmctld: JobNode[0] Socket[0] Core[4] is allocated
slurmctld: JobNode[0] Socket[0] Core[5] is allocated
slurmctld: JobNode[0] Socket[0] Core[6] is allocated
slurmctld: JobNode[0] Socket[0] Core[7] is allocated
slurmctld: JobNode[0] Socket[0] Core[8] is allocated
slurmctld: JobNode[0] Socket[0] Core[9] is allocated
slurmctld: JobNode[0] Socket[0] Core[10] is allocated
slurmctld: JobNode[0] Socket[0] Core[11] is allocated
slurmctld: JobNode[0] Socket[0] Core[12] is allocated
slurmctld: JobNode[0] Socket[0] Core[13] is allocated
slurmctld: JobNode[0] Socket[0] Core[14] is allocated
slurmctld: JobNode[0] Socket[0] Core[15] is allocated
slurmctld: JobNode[0] Socket[0] Core[16] is allocated
slurmctld: JobNode[0] Socket[0] Core[17] is allocated
slurmctld: JobNode[0] Socket[0] Core[18] is allocated
slurmctld: JobNode[0] Socket[0] Core[19] is allocated
slurmctld: JobNode[0] Socket[0] Core[20] is allocated
slurmctld: JobNode[0] Socket[0] Core[21] is allocated
slurmctld: JobNode[0] Socket[0] Core[22] is allocated
slurmctld: JobNode[0] Socket[0] Core[23] is allocated
slurmctld: JobNode[0] Socket[0] Core[24] is allocated
slurmctld: JobNode[0] Socket[0] Core[25] is allocated
slurmctld: JobNode[0] Socket[0] Core[26] is allocated
slurmctld: JobNode[0] Socket[0] Core[27] is allocated
slurmctld: JobNode[0] Socket[0] Core[28] is allocated
slurmctld: JobNode[0] Socket[0] Core[29] is allocated
slurmctld: JobNode[0] Socket[0] Core[30] is allocated
slurmctld: JobNode[0] Socket[0] Core[31] is allocated
slurmctld: JobNode[0] Socket[0] Core[32] is allocated
slurmctld: JobNode[0] Socket[0] Core[33] is allocated
slurmctld: JobNode[0] Socket[1] Core[0] is allocated
slurmctld: JobNode[0] Socket[1] Core[1] is allocated
slurmctld: JobNode[0] Socket[1] Core[2] is allocated
slurmctld: JobNode[0] Socket[1] Core[3] is allocated
slurmctld: JobNode[0] Socket[1] Core[4] is allocated
slurmctld: JobNode[0] Socket[1] Core[5] is allocated
slurmctld: JobNode[0] Socket[1] Core[6] is allocated
slurmctld: JobNode[0] Socket[1] Core[7] is allocated
slurmctld: JobNode[0] Socket[1] Core[8] is allocated
slurmctld: JobNode[0] Socket[1] Core[9] is allocated
slurmctld: JobNode[0] Socket[1] Core[10] is allocated
slurmctld: JobNode[0] Socket[1] Core[11] is allocated
slurmctld: JobNode[0] Socket[1] Core[12] is allocated
slurmctld: JobNode[0] Socket[1] Core[13] is allocated
slurmctld: JobNode[0] Socket[1] Core[14] is allocated
slurmctld: JobNode[0] Socket[1] Core[15] is allocated
slurmctld: JobNode[0] Socket[1] Core[16] is allocated
slurmctld: JobNode[0] Socket[1] Core[17] is allocated
slurmctld: JobNode[0] Socket[1] Core[18] is allocated
slurmctld: JobNode[0] Socket[1] Core[19] is allocated
slurmctld: JobNode[0] Socket[1] Core[20] is allocated
slurmctld: JobNode[0] Socket[1] Core[21] is allocated
slurmctld: JobNode[0] Socket[1] Core[22] is allocated
slurmctld: JobNode[0] Socket[1] Core[23] is allocated
slurmctld: JobNode[0] Socket[1] Core[24] is allocated
slurmctld: JobNode[0] Socket[1] Core[25] is allocated
slurmctld: JobNode[0] Socket[1] Core[26] is allocated
slurmctld: JobNode[0] Socket[1] Core[27] is allocated
slurmctld: JobNode[0] Socket[1] Core[28] is allocated
slurmctld: JobNode[0] Socket[1] Core[29] is allocated
slurmctld: JobNode[0] Socket[1] Core[30] is allocated
slurmctld: JobNode[0] Socket[1] Core[31] is allocated
slurmctld: JobNode[0] Socket[1] Core[32] is allocated
slurmctld: JobNode[0] Socket[1] Core[33] is allocated
slurmctld: ====================
slurmctld: job_complete: invalid JobId=37

SlurmdLogFile (with "DebugFlags=CPU_Bind" configured):

slurmd: task_p_slurmd_batch_request: 38
slurmd: task/affinity: job 38 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
slurmd: task/affinity: job 38 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
slurmd: _run_prolog: run job script took usec=1402
slurmd: _run_prolog: prolog with lock for job 38 ran for 0 seconds
slurmd: ====================
slurmd: batch_job:38 job_mem:100MB_per_CPU
slurmd: JobNode[0] CPU[0] Job alloc
slurmd: JobNode[0] CPU[1] Job alloc
slurmd: JobNode[0] CPU[2] Job alloc
slurmd: JobNode[0] CPU[3] Job alloc
slurmd: JobNode[0] CPU[4] Job alloc
slurmd: JobNode[0] CPU[5] Job alloc
slurmd: JobNode[0] CPU[6] Job alloc
slurmd: JobNode[0] CPU[7] Job alloc
slurmd: JobNode[0] CPU[8] Job alloc
slurmd: JobNode[0] CPU[9] Job alloc
slurmd: JobNode[0] CPU[10] Job alloc
slurmd: JobNode[0] CPU[11] Job alloc
slurmd: JobNode[0] CPU[12] Job alloc
slurmd: JobNode[0] CPU[13] Job alloc
slurmd: JobNode[0] CPU[14] Job alloc
slurmd: JobNode[0] CPU[15] Job alloc
slurmd: JobNode[0] CPU[16] Job alloc
slurmd: JobNode[0] CPU[17] Job alloc
slurmd: JobNode[0] CPU[18] Job alloc
slurmd: JobNode[0] CPU[19] Job alloc
slurmd: JobNode[0] CPU[20] Job alloc
slurmd: JobNode[0] CPU[21] Job alloc
slurmd: JobNode[0] CPU[22] Job alloc
slurmd: JobNode[0] CPU[23] Job alloc
slurmd: JobNode[0] CPU[24] Job alloc
slurmd: JobNode[0] CPU[25] Job alloc
slurmd: JobNode[0] CPU[26] Job alloc
slurmd: JobNode[0] CPU[27] Job alloc
slurmd: JobNode[0] CPU[28] Job alloc
slurmd: JobNode[0] CPU[29] Job alloc
slurmd: JobNode[0] CPU[30] Job alloc
slurmd: JobNode[0] CPU[31] Job alloc
slurmd: JobNode[0] CPU[32] Job alloc
slurmd: JobNode[0] CPU[33] Job alloc
slurmd: JobNode[0] CPU[34] Job alloc
slurmd: JobNode[0] CPU[35] Job alloc
slurmd: JobNode[0] CPU[36] Job alloc
slurmd: JobNode[0] CPU[37] Job alloc
slurmd: JobNode[0] CPU[38] Job alloc
slurmd: JobNode[0] CPU[39] Job alloc
slurmd: JobNode[0] CPU[40] Job alloc
slurmd: JobNode[0] CPU[41] Job alloc
slurmd: JobNode[0] CPU[42] Job alloc
slurmd: JobNode[0] CPU[43] Job alloc
slurmd: JobNode[0] CPU[44] Job alloc
slurmd: JobNode[0] CPU[45] Job alloc
slurmd: JobNode[0] CPU[46] Job alloc
slurmd: JobNode[0] CPU[47] Job alloc
slurmd: JobNode[0] CPU[48] Job alloc
slurmd: JobNode[0] CPU[49] Job alloc
slurmd: JobNode[0] CPU[50] Job alloc
slurmd: JobNode[0] CPU[51] Job alloc
slurmd: JobNode[0] CPU[52] Job alloc
slurmd: JobNode[0] CPU[53] Job alloc
slurmd: JobNode[0] CPU[54] Job alloc
slurmd: JobNode[0] CPU[55] Job alloc
slurmd: JobNode[0] CPU[56] Job alloc
slurmd: JobNode[0] CPU[57] Job alloc
slurmd: JobNode[0] CPU[58] Job alloc
slurmd: JobNode[0] CPU[59] Job alloc
slurmd: JobNode[0] CPU[60] Job alloc
slurmd: JobNode[0] CPU[61] Job alloc
slurmd: JobNode[0] CPU[62] Job alloc
slurmd: JobNode[0] CPU[63] Job alloc
slurmd: JobNode[0] CPU[64] Job alloc
slurmd: JobNode[0] CPU[65] Job alloc
slurmd: JobNode[0] CPU[66] Job alloc
slurmd: JobNode[0] CPU[67] Job alloc
slurmd: ====================
slurmd: Launching batch job 38 for UID 1001
slurmd: launch task 38.0 request from 1001.1001@127.0.0.1 (port 4249)
slurmd: ====================
slurmd: step_id:38.0 job_mem:100MB_per_CPU step_mem:100MB_per_CPU
slurmd: JobNode[0] CPU[0] Step alloc
slurmd: JobNode[0] CPU[1] Step alloc
slurmd: JobNode[0] CPU[2] Step alloc
slurmd: JobNode[0] CPU[3] Step alloc
slurmd: JobNode[0] CPU[4] Step alloc
slurmd: JobNode[0] CPU[5] Step alloc
slurmd: JobNode[0] CPU[6] Step alloc
slurmd: JobNode[0] CPU[7] Step alloc
slurmd: JobNode[0] CPU[8] Step alloc
slurmd: JobNode[0] CPU[9] Step alloc
slurmd: JobNode[0] CPU[10] Step alloc
slurmd: JobNode[0] CPU[11] Step alloc
slurmd: JobNode[0] CPU[12] Step alloc
slurmd: JobNode[0] CPU[13] Step alloc
slurmd: JobNode[0] CPU[14] Step alloc
slurmd: JobNode[0] CPU[15] Step alloc
slurmd: JobNode[0] CPU[16] Step alloc
slurmd: JobNode[0] CPU[17] Step alloc
slurmd: JobNode[0] CPU[18] Step alloc
slurmd: JobNode[0] CPU[19] Step alloc
slurmd: JobNode[0] CPU[20] Step alloc
slurmd: JobNode[0] CPU[21] Step alloc
slurmd: JobNode[0] CPU[22] Step alloc
slurmd: JobNode[0] CPU[23] Step alloc
slurmd: JobNode[0] CPU[24] Step alloc
slurmd: JobNode[0] CPU[25] Step alloc
slurmd: JobNode[0] CPU[26] Step alloc
slurmd: JobNode[0] CPU[27] Step alloc
slurmd: JobNode[0] CPU[28] Step alloc
slurmd: JobNode[0] CPU[29] Step alloc
slurmd: JobNode[0] CPU[30] Step alloc
slurmd: JobNode[0] CPU[31] Step alloc
slurmd: JobNode[0] CPU[32] Step alloc
slurmd: JobNode[0] CPU[33] Step alloc
slurmd: JobNode[0] CPU[34] Step alloc
slurmd: JobNode[0] CPU[35] Step alloc
slurmd: JobNode[0] CPU[36] Step alloc
slurmd: JobNode[0] CPU[37] Step alloc
slurmd: JobNode[0] CPU[38] Step alloc
slurmd: JobNode[0] CPU[39] Step alloc
slurmd: JobNode[0] CPU[40] Step alloc
slurmd: JobNode[0] CPU[41] Step alloc
slurmd: JobNode[0] CPU[42] Step alloc
slurmd: JobNode[0] CPU[43] Step alloc
slurmd: JobNode[0] CPU[44] Step alloc
slurmd: JobNode[0] CPU[45] Step alloc
slurmd: JobNode[0] CPU[46] Step alloc
slurmd: JobNode[0] CPU[47] Step alloc
slurmd: JobNode[0] CPU[48] Step alloc
slurmd: JobNode[0] CPU[49] Step alloc
slurmd: JobNode[0] CPU[50] Step alloc
slurmd: JobNode[0] CPU[51] Step alloc
slurmd: JobNode[0] CPU[52] Step alloc
slurmd: JobNode[0] CPU[53] Step alloc
slurmd: JobNode[0] CPU[54] Step alloc
slurmd: JobNode[0] CPU[55] Step alloc
slurmd: JobNode[0] CPU[56] Step alloc
slurmd: JobNode[0] CPU[57] Step alloc
slurmd: JobNode[0] CPU[58] Step alloc
slurmd: JobNode[0] CPU[59] Step alloc
slurmd: JobNode[0] CPU[60] Step alloc
slurmd: JobNode[0] CPU[61] Step alloc
slurmd: JobNode[0] CPU[62] Step alloc
slurmd: JobNode[0] CPU[63] Step alloc
slurmd: JobNode[0] CPU[64] Step alloc
slurmd: JobNode[0] CPU[65] Step alloc
slurmd: JobNode[0] CPU[66] Step alloc
slurmd: JobNode[0] CPU[67] Step alloc
slurmd: ====================
slurmd: Scaling CPU count by factor of 4 (272/(68-0))
slurmd: lllp_distribution jobid [38] auto binding off: verbose,mask_cpu
^Cslurmd: got shutdown request
slurmd: all threads complete
slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
slurmd: Munge cryptographic signature plugin unloaded
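As a quick sanity check on the corrected 16.05.5 mask above (a sketch using plain shell arithmetic, nothing Slurm-specific): each hex digit of the affinity mask covers four logical CPUs, so an all-"f" mask of 68 digits corresponds to 4 x 68 = 272 CPUs, consistent with "#CPUs=272" and "Scaling CPU count by factor of 4 (272/(68-0))" in the logs.

```shell
# Build the 68-digit all-'f' mask from the 16.05.5 log line and count the
# logical CPUs it covers (4 bits, i.e. 4 CPUs, per hex digit).
mask=""
for i in $(seq 68); do mask="${mask}f"; done
echo "mask digits: ${#mask}"
echo "CPUs covered: $(( ${#mask} * 4 ))"
```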
Remaining problem moved to new bug 3168.