| Summary: | -m ...:fcyclic doesn't distribute cores across sockets in round-robin for single task | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sergey Meirovich <sergey_meirovich> |
| Component: | User Commands | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | brian, da |
| Version: | 15.08.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AMAT | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf | ||
|
Description
Sergey Meirovich
2015-11-22 05:01:36 MST
I'm looking through this now, and trying to see if there may be a disagreement between the NUMA layout numbering and Slurm's internal representation. The mask of 0x555555555 implies that the 18-core constraint is being applied, but it looks like you'd expect 0xffffc0000 or 0x00003ffff given your hardware? Can you send in the output from "lstopo -c" for comparison? - Tim

Hi Tim,

Yes, exactly. I expect 0xffffc0000 or 0x00003ffff.
==========================================================================
[root@dcalph001 ~]# lstopo -c
Machine (128GB) cpuset=0x0000000f,0xffffffff
NUMANode L#0 (P#0 64GB) cpuset=0x00000005,0x55555555
Socket L#0 cpuset=0x00000005,0x55555555
L3 L#0 (45MB) cpuset=0x00000005,0x55555555
L2 L#0 (256KB) cpuset=0x00000001
L1d L#0 (32KB) cpuset=0x00000001
L1i L#0 (32KB) cpuset=0x00000001
Core L#0 cpuset=0x00000001
PU L#0 (P#0) cpuset=0x00000001
L2 L#1 (256KB) cpuset=0x00000004
L1d L#1 (32KB) cpuset=0x00000004
L1i L#1 (32KB) cpuset=0x00000004
Core L#1 cpuset=0x00000004
PU L#1 (P#2) cpuset=0x00000004
L2 L#2 (256KB) cpuset=0x00000010
L1d L#2 (32KB) cpuset=0x00000010
L1i L#2 (32KB) cpuset=0x00000010
Core L#2 cpuset=0x00000010
PU L#2 (P#4) cpuset=0x00000010
L2 L#3 (256KB) cpuset=0x00000040
L1d L#3 (32KB) cpuset=0x00000040
L1i L#3 (32KB) cpuset=0x00000040
Core L#3 cpuset=0x00000040
PU L#3 (P#6) cpuset=0x00000040
L2 L#4 (256KB) cpuset=0x00000100
L1d L#4 (32KB) cpuset=0x00000100
L1i L#4 (32KB) cpuset=0x00000100
Core L#4 cpuset=0x00000100
PU L#4 (P#8) cpuset=0x00000100
L2 L#5 (256KB) cpuset=0x00000400
L1d L#5 (32KB) cpuset=0x00000400
L1i L#5 (32KB) cpuset=0x00000400
Core L#5 cpuset=0x00000400
PU L#5 (P#10) cpuset=0x00000400
L2 L#6 (256KB) cpuset=0x00001000
L1d L#6 (32KB) cpuset=0x00001000
L1i L#6 (32KB) cpuset=0x00001000
Core L#6 cpuset=0x00001000
PU L#6 (P#12) cpuset=0x00001000
L2 L#7 (256KB) cpuset=0x00004000
L1d L#7 (32KB) cpuset=0x00004000
L1i L#7 (32KB) cpuset=0x00004000
Core L#7 cpuset=0x00004000
PU L#7 (P#14) cpuset=0x00004000
L2 L#8 (256KB) cpuset=0x00010000
L1d L#8 (32KB) cpuset=0x00010000
L1i L#8 (32KB) cpuset=0x00010000
Core L#8 cpuset=0x00010000
PU L#8 (P#16) cpuset=0x00010000
L2 L#9 (256KB) cpuset=0x00040000
L1d L#9 (32KB) cpuset=0x00040000
L1i L#9 (32KB) cpuset=0x00040000
Core L#9 cpuset=0x00040000
PU L#9 (P#18) cpuset=0x00040000
L2 L#10 (256KB) cpuset=0x00100000
L1d L#10 (32KB) cpuset=0x00100000
L1i L#10 (32KB) cpuset=0x00100000
Core L#10 cpuset=0x00100000
PU L#10 (P#20) cpuset=0x00100000
L2 L#11 (256KB) cpuset=0x00400000
L1d L#11 (32KB) cpuset=0x00400000
L1i L#11 (32KB) cpuset=0x00400000
Core L#11 cpuset=0x00400000
PU L#11 (P#22) cpuset=0x00400000
L2 L#12 (256KB) cpuset=0x01000000
L1d L#12 (32KB) cpuset=0x01000000
L1i L#12 (32KB) cpuset=0x01000000
Core L#12 cpuset=0x01000000
PU L#12 (P#24) cpuset=0x01000000
L2 L#13 (256KB) cpuset=0x04000000
L1d L#13 (32KB) cpuset=0x04000000
L1i L#13 (32KB) cpuset=0x04000000
Core L#13 cpuset=0x04000000
PU L#13 (P#26) cpuset=0x04000000
L2 L#14 (256KB) cpuset=0x10000000
L1d L#14 (32KB) cpuset=0x10000000
L1i L#14 (32KB) cpuset=0x10000000
Core L#14 cpuset=0x10000000
PU L#14 (P#28) cpuset=0x10000000
L2 L#15 (256KB) cpuset=0x40000000
L1d L#15 (32KB) cpuset=0x40000000
L1i L#15 (32KB) cpuset=0x40000000
Core L#15 cpuset=0x40000000
PU L#15 (P#30) cpuset=0x40000000
L2 L#16 (256KB) cpuset=0x00000001,0x0
L1d L#16 (32KB) cpuset=0x00000001,0x0
L1i L#16 (32KB) cpuset=0x00000001,0x0
Core L#16 cpuset=0x00000001,0x0
PU L#16 (P#32) cpuset=0x00000001,0x0
L2 L#17 (256KB) cpuset=0x00000004,0x0
L1d L#17 (32KB) cpuset=0x00000004,0x0
L1i L#17 (32KB) cpuset=0x00000004,0x0
Core L#17 cpuset=0x00000004,0x0
PU L#17 (P#34) cpuset=0x00000004,0x0
HostBridge L#0
PCIBridge
PCI 14e4:168e
Net L#0 "eth0"
PCI 14e4:168e
Net L#1 "eth1"
PCIBridge
PCI 1000:005d
Block L#2 "sda"
PCIBridge
PCI 15b3:1003
Net L#3 "ib0"
Net L#4 "ib1"
OpenFabrics L#5 "mlx4_0"
PCI 8086:8d62
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 102b:0534
PCI 8086:8d02
NUMANode L#1 (P#1 64GB) cpuset=0x0000000a,0xaaaaaaaa
Socket L#1 cpuset=0x0000000a,0xaaaaaaaa
L3 L#1 (45MB) cpuset=0x0000000a,0xaaaaaaaa
L2 L#18 (256KB) cpuset=0x00000002
L1d L#18 (32KB) cpuset=0x00000002
L1i L#18 (32KB) cpuset=0x00000002
Core L#18 cpuset=0x00000002
PU L#18 (P#1) cpuset=0x00000002
L2 L#19 (256KB) cpuset=0x00000008
L1d L#19 (32KB) cpuset=0x00000008
L1i L#19 (32KB) cpuset=0x00000008
Core L#19 cpuset=0x00000008
PU L#19 (P#3) cpuset=0x00000008
L2 L#20 (256KB) cpuset=0x00000020
L1d L#20 (32KB) cpuset=0x00000020
L1i L#20 (32KB) cpuset=0x00000020
Core L#20 cpuset=0x00000020
PU L#20 (P#5) cpuset=0x00000020
L2 L#21 (256KB) cpuset=0x00000080
L1d L#21 (32KB) cpuset=0x00000080
L1i L#21 (32KB) cpuset=0x00000080
Core L#21 cpuset=0x00000080
PU L#21 (P#7) cpuset=0x00000080
L2 L#22 (256KB) cpuset=0x00000200
L1d L#22 (32KB) cpuset=0x00000200
L1i L#22 (32KB) cpuset=0x00000200
Core L#22 cpuset=0x00000200
PU L#22 (P#9) cpuset=0x00000200
L2 L#23 (256KB) cpuset=0x00000800
L1d L#23 (32KB) cpuset=0x00000800
L1i L#23 (32KB) cpuset=0x00000800
Core L#23 cpuset=0x00000800
PU L#23 (P#11) cpuset=0x00000800
L2 L#24 (256KB) cpuset=0x00002000
L1d L#24 (32KB) cpuset=0x00002000
L1i L#24 (32KB) cpuset=0x00002000
Core L#24 cpuset=0x00002000
PU L#24 (P#13) cpuset=0x00002000
L2 L#25 (256KB) cpuset=0x00008000
L1d L#25 (32KB) cpuset=0x00008000
L1i L#25 (32KB) cpuset=0x00008000
Core L#25 cpuset=0x00008000
PU L#25 (P#15) cpuset=0x00008000
L2 L#26 (256KB) cpuset=0x00020000
L1d L#26 (32KB) cpuset=0x00020000
L1i L#26 (32KB) cpuset=0x00020000
Core L#26 cpuset=0x00020000
PU L#26 (P#17) cpuset=0x00020000
L2 L#27 (256KB) cpuset=0x00080000
L1d L#27 (32KB) cpuset=0x00080000
L1i L#27 (32KB) cpuset=0x00080000
Core L#27 cpuset=0x00080000
PU L#27 (P#19) cpuset=0x00080000
L2 L#28 (256KB) cpuset=0x00200000
L1d L#28 (32KB) cpuset=0x00200000
L1i L#28 (32KB) cpuset=0x00200000
Core L#28 cpuset=0x00200000
PU L#28 (P#21) cpuset=0x00200000
L2 L#29 (256KB) cpuset=0x00800000
L1d L#29 (32KB) cpuset=0x00800000
L1i L#29 (32KB) cpuset=0x00800000
Core L#29 cpuset=0x00800000
PU L#29 (P#23) cpuset=0x00800000
L2 L#30 (256KB) cpuset=0x02000000
L1d L#30 (32KB) cpuset=0x02000000
L1i L#30 (32KB) cpuset=0x02000000
Core L#30 cpuset=0x02000000
PU L#30 (P#25) cpuset=0x02000000
L2 L#31 (256KB) cpuset=0x08000000
L1d L#31 (32KB) cpuset=0x08000000
L1i L#31 (32KB) cpuset=0x08000000
Core L#31 cpuset=0x08000000
PU L#31 (P#27) cpuset=0x08000000
L2 L#32 (256KB) cpuset=0x20000000
L1d L#32 (32KB) cpuset=0x20000000
L1i L#32 (32KB) cpuset=0x20000000
Core L#32 cpuset=0x20000000
PU L#32 (P#29) cpuset=0x20000000
L2 L#33 (256KB) cpuset=0x80000000
L1d L#33 (32KB) cpuset=0x80000000
L1i L#33 (32KB) cpuset=0x80000000
Core L#33 cpuset=0x80000000
PU L#33 (P#31) cpuset=0x80000000
L2 L#34 (256KB) cpuset=0x00000002,0x0
L1d L#34 (32KB) cpuset=0x00000002,0x0
L1i L#34 (32KB) cpuset=0x00000002,0x0
Core L#34 cpuset=0x00000002,0x0
PU L#34 (P#33) cpuset=0x00000002,0x0
L2 L#35 (256KB) cpuset=0x00000008,0x0
L1d L#35 (32KB) cpuset=0x00000008,0x0
L1i L#35 (32KB) cpuset=0x00000008,0x0
Core L#35 cpuset=0x00000008,0x0
PU L#35 (P#35) cpuset=0x00000008,0x0
[root@dcalph001 ~]#
===========================================================================
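For reference, the lstopo output above shows interleaved PU numbering: socket 0 owns the even P#s (0, 2, ..., 34) and socket 1 the odd ones. A short sketch (plain arithmetic on the masks quoted in this report, not Slurm code) confirms that 0x555555555 is exactly the socket-0 mask the job received, while the expected 0x00003ffff would take 9 cores from each socket:

```python
# Decode the CPU masks using the PU numbering from the lstopo output above:
# socket 0 holds PUs P#0,2,...,34; socket 1 holds P#1,3,...,35.
socket0 = sum(1 << p for p in range(0, 36, 2))
socket1 = sum(1 << p for p in range(1, 36, 2))

assert socket0 == 0x555555555   # the mask the task was actually bound to
assert socket1 == 0xAAAAAAAAA

# One of the expected masks: PUs P#0-17, which alternate between sockets.
expected = 0x00003FFFF
print(bin(expected & socket0).count("1"))  # 9 cores on socket 0
print(bin(expected & socket1).count("1"))  # 9 cores on socket 1
```

So the bound mask covers one whole socket, not a 9-and-9 split across both.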
BTW, we are not building slurm with hwloc.
Can you re-test using the --exclusive flag with your sbatch command? I'm curious whether the placement will change; it may be that the allocation itself isn't considering the -m request, and srun is just using whichever cpus were allocated to the process. Also - without hwloc none of this should be working properly - can you verify whether you're running with or without it?
================================================================================================================
-sh-4.1$ sbatch --exclusive -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 844
-sh-4.1$ cat slurm-844.out
cpu_bind=MASK - dcalph004, task 0 0 [132959]: mask 0xfffffffff set
-sh-4.1$
================================================================================================================
I've just installed hwloc to accommodate your request to provide "lstopo -c". It was not present during the slurm build. Here is a snippet from our config.log:
...
configure:20747: checking for hwloc installation
configure:20810: result:
configure:20814: WARNING: unable to locate hwloc installation
...
Shall we rebuild with hwloc?

JFYI, I have just rebuilt slurm with hwloc - the results are pretty much the same:
-sh-4.1$ sbatch -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 845
-sh-4.1$ cat slurm-845.out
cpu_bind=MASK - dcalph004, task 0 0 [135114]: mask 0x555555555 set
-sh-4.1$
And with hwloc and --exclusive:
-sh-4.1$ sbatch --exclusive -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 846
-sh-4.1$ cat slurm-846.out
cpu_bind=MASK - dcalph004, task 0 0 [136719]: mask 0xfffffffff set
-sh-4.1$
Will you try this?
sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun cat /proc/self/status | grep -i cpus_ | sort -n"

-sh-4.1$ sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun cat /proc/self/status | grep -i cpus_ | sort -n"
Submitted batch job 910
-sh-4.1$ cat slurm-910.out
Cpus_allowed:       0000,00000000,00000000,0000000f,ffffffff
Cpus_allowed_list:  0-35
-sh-4.1$

What about?

sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun --cpu_bind=cores cat /proc/self/status | grep -i cpus_ | sort -n"

-sh-4.1$ sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun --cpu_bind=cores cat /proc/self/status | grep -i cpus_ | sort -n"
Submitted batch job 912
-sh-4.1$ cat slurm-912.out
Cpus_allowed:       0000,00000000,00000000,00000000,0000003f
Cpus_allowed_list:  0-5
-sh-4.1$

Hi,
Is that expected, and does "-m block:fcyclic" have to be accompanied by "srun --cpu_bind=cores ..." to achieve what we need?

Yes. You need --exclusive to get an allocation on both sockets and --cpu_bind=cores to bind to just cores; otherwise it binds to the whole node. Slurm tries to figure out the best binding by matching the number of requested cpus to the number of available resources and binding to the appropriate resource. This is called auto-binding - see the srun man page. When you request 18 cpus - with exclusive node access - that doesn't match the total count of any resource on the node (e.g. 18 cores != 36 cores), so it just binds to the whole node. You can see this in the slurmd logs. In my case I have 2 sockets with 6 cores each.
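That matching rule can be paraphrased as a small decision function. This is only a hypothetical sketch of the auto-binding heuristic described above (nothing here is Slurm's actual code; the function name and shape are invented for illustration):

```python
# Hypothetical sketch of the auto-binding heuristic described above:
# bind at the resource level whose total count matches the number of
# requested cpus; if nothing matches, fall back to the whole node.
def auto_bind(ncpus, sockets, cores_per_socket, threads_per_core=1):
    cores = sockets * cores_per_socket
    threads = cores * threads_per_core
    if ncpus == sockets:
        return "sockets"
    if ncpus == cores:
        return "cores"
    if ncpus == threads:
        return "threads"
    return "node"  # auto binding off: mask covers the whole node

# A 2-socket, 6-cores-per-socket box:
print(auto_bind(2, 2, 6))    # -> sockets (2 matches the socket count)
print(auto_bind(6, 2, 6))    # -> node    (6 matches no resource total)
# Sergey's 2 x 18-core node, 18 cpus with exclusive access:
print(auto_bind(18, 2, 18))  # -> node    (hence the 0xfffffffff mask)
```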
If I request 6 cpus and exclusive access, the job binds to the whole node:

debug: binding tasks:6 to nodes:1 sockets:2:0 cores:12:0 threads:12
lllp_distribution jobid [2545] auto binding off: mask_cpu,one_thread
debug: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0xFFF)

If I request 2 cpus and exclusive access, the job is bound to the sockets:

debug: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
debug: binding tasks:2 to nodes:1 sockets:2:0 cores:12:0 threads:12
lllp_distribution jobid [2548] implicit auto binding: sockets,one_thread, dist 50

And as in your first case, where you are only allocated the first socket - not exclusive node access - the job is bound to cores on the first socket:

debug: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
debug: binding tasks:6 to nodes:0 sockets:1:0 cores:6:0 threads:6
lllp_distribution jobid [2554] implicit auto binding: cores,one_thread, dist 50

So in your case it's good to be specific in the binding. Does this help?

Hi,
The --exclusive requirement clashes with the goal I described: "The goal of round-robin across sockets for g09 is our recent discovery that for some applications (e.g. vasp) memory bandwidth per socket becomes a clear bottleneck on high-core-count Intel SKUs (we are using http://ark.intel.com/products/81061/Intel-Xeon-Processor-E5-2699-v3-45M-Cache-2_30-GHz). So, for example, it doesn't make sense to run 18 cores of g09 on a single socket, as it is preferable to leave two sockets each with 9 free cores for vasp, instead of one socket with 18 free cores." We are clearly facing that memory bandwidth per socket bottleneck. That is not theoretical.
We spent a lot of time figuring that out in collaboration with Dell and Intel.

To me this is a real deficiency, because it does not let us leverage real hardware. It might be much easier to program by making all of the sophisticated --cpu_bind modes depend on --exclusive, but that clearly does not match the reality of massively multi-core processors.

Try this:
sbatch -N1 -n18 --ntasks-per-socket=9 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=18 --cpu_bind=core cat /proc/self/status | grep -i cpus_ | sort -n"
Doing it this way, I'm able to use half of each socket and pack two jobs on the node:
brian@knc:/localhome/brian/slurm/15.08/knc$ sbatch -N1 -n6 --ntasks-per-socket=3 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=6 --cpu_bind=core ~/whereami 10"
Submitted batch job 2592
brian@knc:/localhome/brian/slurm/15.08/knc$ sbatch -N1 -n6 --ntasks-per-socket=3 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=6 --cpu_bind=core ~/whereami 10"
Submitted batch job 2593
brian@knc:/localhome/brian/slurm/15.08/knc$ sbatch -N1 -n6 --ntasks-per-socket=3 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=6 --cpu_bind=core ~/whereami 10"
Submitted batch job 2594
brian@knc:/localhome/brian/slurm/15.08/knc$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2594 debug wrap brian R 0:00 1 compy2
2592 debug wrap brian R 0:03 1 compy1
2593 debug wrap brian R 0:03 1 compy1
brian@knc:/localhome/brian/slurm/15.08/knc$ cat slurm-2592.out
0 compy1 - MASK:0x3f
sleeping 10 seconds
brian@knc:/localhome/brian/slurm/15.08/knc$ cat slurm-2593.out
0 compy1 - MASK:0xfc0
sleeping 10 seconds
And in this case --cpu_bind=cores is not needed.

Thanks a lot Brian! That is exactly what we need. I am closing this.

Glad that will work for you. Thanks, Brian
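The placement requested in this report - cores dealt out round-robin across sockets for a single task - can be illustrated with a toy sketch (plain Python, not Slurm internals; the interleaved PU numbering is taken from the lstopo output earlier in this report, and the function is invented for illustration):

```python
# Toy round-robin ("fcyclic"-style) core selection across sockets.
def round_robin_mask(ncores, sockets):
    """Pick ncores PUs, alternating between the sockets' PU lists."""
    mask = 0
    for i in range(ncores):
        socket = sockets[i % len(sockets)]
        mask |= 1 << socket[i // len(sockets)]
    return mask

# On Sergey's node, socket 0 owns even P#s and socket 1 odd P#s.
socket0_pus = list(range(0, 36, 2))   # P#0,2,...,34
socket1_pus = list(range(1, 36, 2))   # P#1,3,...,35

# 18 cores spread 9-and-9 across both sockets yields the mask
# Sergey expected for his 18-cpu g09 task:
print(hex(round_robin_mask(18, [socket0_pus, socket1_pus])))  # 0x3ffff
```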