Created attachment 2437 [details]
slurm.conf

Hi,

Maybe I am misunderstanding how fcyclic works. My goal is to distribute Gaussian (g09) threads across sockets; g09 runs in multi-threaded mode.

=====================================================================================================
-sh-4.1$ sbatch -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 822
-sh-4.1$ cat slurm-822.out
cpu_bind=MASK - dcalph055, task 0 0 [58201]: mask 0x555555555 set
-sh-4.1$ head -n1 r.in
%nprocshared=18
-sh-4.1$
=====================================================================================================

However:

=====================================================================================================
[root@dcalph055 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
node 0 size: 65439 MB
node 0 free: 62808 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35
node 1 size: 65536 MB
node 1 free: 63892 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
[root@dcalph055 ~]#
=====================================================================================================

The goal of round-robin across sockets for g09 comes from our recent discovery that for some applications (e.g. vasp) memory bandwidth per socket becomes a clear bottleneck on high-core-count Intel SKUs (we are using http://ark.intel.com/products/81061/Intel-Xeon-Processor-E5-2699-v3-45M-Cache-2_30-GHz ). So, for example, it doesn't make sense to run 18 cores of g09 on a single socket: it is preferable to leave two sockets, each with 9 free cores, for vasp, rather than one socket with 18 free cores.

So, contrary to my expectations, `-m block:fcyclic' still leads to all threads being put on the same socket. Please advise.
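(For reference, the reported mask can be decoded against the numactl layout above. This is a standalone illustration, not anything Slurm runs: it shows that 0x555555555 selects exactly the 18 even-numbered CPUs, which on this box are all of NUMA node 0, i.e. one socket.)

```python
# Decode the cpu_bind mask reported for job 822 against the numactl
# layout above: 2 NUMA nodes, node 0 = even CPU IDs, node 1 = odd ones.
MASK = 0x555555555   # mask Slurm reported
NCPUS = 36           # 2 sockets x 18 cores, hyperthreading off

cpus = [i for i in range(NCPUS) if (MASK >> i) & 1]
node0 = set(range(0, NCPUS, 2))   # node 0 cpus per numactl --hardware

assert len(cpus) == 18            # the 18-cpu constraint was applied...
assert set(cpus) <= node0         # ...but every bound CPU sits on socket 0
```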
I'm looking through this now, trying to see whether there is a disagreement between the NUMA layout numbering and Slurm's internal representation. The mask of 0x555555555 implies that the 18-core constraint is being applied, but given your hardware you would expect 0xffffc0000 or 0x00003ffff, correct? Can you send in the output from "lstopo -c" for comparison? - Tim
Hi Tim,

Yes, exactly. I expect 0xffffc0000 or 0x00003ffff.

==========================================================================
[root@dcalph001 ~]# lstopo -c
Machine (128GB) cpuset=0x0000000f,0xffffffff
  NUMANode L#0 (P#0 64GB) cpuset=0x00000005,0x55555555
    Socket L#0 cpuset=0x00000005,0x55555555
      L3 L#0 (45MB) cpuset=0x00000005,0x55555555
        L2 L#0 (256KB) cpuset=0x00000001
          L1d L#0 (32KB) cpuset=0x00000001
            L1i L#0 (32KB) cpuset=0x00000001
              Core L#0 cpuset=0x00000001
                PU L#0 (P#0) cpuset=0x00000001
        L2 L#1 (256KB) cpuset=0x00000004
          L1d L#1 (32KB) cpuset=0x00000004
            L1i L#1 (32KB) cpuset=0x00000004
              Core L#1 cpuset=0x00000004
                PU L#1 (P#2) cpuset=0x00000004
        L2 L#2 (256KB) cpuset=0x00000010
          L1d L#2 (32KB) cpuset=0x00000010
            L1i L#2 (32KB) cpuset=0x00000010
              Core L#2 cpuset=0x00000010
                PU L#2 (P#4) cpuset=0x00000010
        L2 L#3 (256KB) cpuset=0x00000040
          L1d L#3 (32KB) cpuset=0x00000040
            L1i L#3 (32KB) cpuset=0x00000040
              Core L#3 cpuset=0x00000040
                PU L#3 (P#6) cpuset=0x00000040
        L2 L#4 (256KB) cpuset=0x00000100
          L1d L#4 (32KB) cpuset=0x00000100
            L1i L#4 (32KB) cpuset=0x00000100
              Core L#4 cpuset=0x00000100
                PU L#4 (P#8) cpuset=0x00000100
        L2 L#5 (256KB) cpuset=0x00000400
          L1d L#5 (32KB) cpuset=0x00000400
            L1i L#5 (32KB) cpuset=0x00000400
              Core L#5 cpuset=0x00000400
                PU L#5 (P#10) cpuset=0x00000400
        L2 L#6 (256KB) cpuset=0x00001000
          L1d L#6 (32KB) cpuset=0x00001000
            L1i L#6 (32KB) cpuset=0x00001000
              Core L#6 cpuset=0x00001000
                PU L#6 (P#12) cpuset=0x00001000
        L2 L#7 (256KB) cpuset=0x00004000
          L1d L#7 (32KB) cpuset=0x00004000
            L1i L#7 (32KB) cpuset=0x00004000
              Core L#7 cpuset=0x00004000
                PU L#7 (P#14) cpuset=0x00004000
        L2 L#8 (256KB) cpuset=0x00010000
          L1d L#8 (32KB) cpuset=0x00010000
            L1i L#8 (32KB) cpuset=0x00010000
              Core L#8 cpuset=0x00010000
                PU L#8 (P#16) cpuset=0x00010000
        L2 L#9 (256KB) cpuset=0x00040000
          L1d L#9 (32KB) cpuset=0x00040000
            L1i L#9 (32KB) cpuset=0x00040000
              Core L#9 cpuset=0x00040000
                PU L#9 (P#18) cpuset=0x00040000
        L2 L#10 (256KB) cpuset=0x00100000
          L1d L#10 (32KB) cpuset=0x00100000
            L1i L#10 (32KB) cpuset=0x00100000
              Core L#10 cpuset=0x00100000
                PU L#10 (P#20) cpuset=0x00100000
        L2 L#11 (256KB) cpuset=0x00400000
          L1d L#11 (32KB) cpuset=0x00400000
            L1i L#11 (32KB) cpuset=0x00400000
              Core L#11 cpuset=0x00400000
                PU L#11 (P#22) cpuset=0x00400000
        L2 L#12 (256KB) cpuset=0x01000000
          L1d L#12 (32KB) cpuset=0x01000000
            L1i L#12 (32KB) cpuset=0x01000000
              Core L#12 cpuset=0x01000000
                PU L#12 (P#24) cpuset=0x01000000
        L2 L#13 (256KB) cpuset=0x04000000
          L1d L#13 (32KB) cpuset=0x04000000
            L1i L#13 (32KB) cpuset=0x04000000
              Core L#13 cpuset=0x04000000
                PU L#13 (P#26) cpuset=0x04000000
        L2 L#14 (256KB) cpuset=0x10000000
          L1d L#14 (32KB) cpuset=0x10000000
            L1i L#14 (32KB) cpuset=0x10000000
              Core L#14 cpuset=0x10000000
                PU L#14 (P#28) cpuset=0x10000000
        L2 L#15 (256KB) cpuset=0x40000000
          L1d L#15 (32KB) cpuset=0x40000000
            L1i L#15 (32KB) cpuset=0x40000000
              Core L#15 cpuset=0x40000000
                PU L#15 (P#30) cpuset=0x40000000
        L2 L#16 (256KB) cpuset=0x00000001,0x0
          L1d L#16 (32KB) cpuset=0x00000001,0x0
            L1i L#16 (32KB) cpuset=0x00000001,0x0
              Core L#16 cpuset=0x00000001,0x0
                PU L#16 (P#32) cpuset=0x00000001,0x0
        L2 L#17 (256KB) cpuset=0x00000004,0x0
          L1d L#17 (32KB) cpuset=0x00000004,0x0
            L1i L#17 (32KB) cpuset=0x00000004,0x0
              Core L#17 cpuset=0x00000004,0x0
                PU L#17 (P#34) cpuset=0x00000004,0x0
    HostBridge L#0
      PCIBridge
        PCI 14e4:168e
          Net L#0 "eth0"
        PCI 14e4:168e
          Net L#1 "eth1"
      PCIBridge
        PCI 1000:005d
          Block L#2 "sda"
      PCIBridge
        PCI 15b3:1003
          Net L#3 "ib0"
          Net L#4 "ib1"
          OpenFabrics L#5 "mlx4_0"
      PCI 8086:8d62
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 102b:0534
      PCI 8086:8d02
  NUMANode L#1 (P#1 64GB) cpuset=0x0000000a,0xaaaaaaaa
    Socket L#1 cpuset=0x0000000a,0xaaaaaaaa
      L3 L#1 (45MB) cpuset=0x0000000a,0xaaaaaaaa
        L2 L#18 (256KB) cpuset=0x00000002
          L1d L#18 (32KB) cpuset=0x00000002
            L1i L#18 (32KB) cpuset=0x00000002
              Core L#18 cpuset=0x00000002
                PU L#18 (P#1) cpuset=0x00000002
        L2 L#19 (256KB) cpuset=0x00000008
          L1d L#19 (32KB) cpuset=0x00000008
            L1i L#19 (32KB) cpuset=0x00000008
              Core L#19 cpuset=0x00000008
                PU L#19 (P#3) cpuset=0x00000008
        L2 L#20 (256KB) cpuset=0x00000020
          L1d L#20 (32KB) cpuset=0x00000020
            L1i L#20 (32KB) cpuset=0x00000020
              Core L#20 cpuset=0x00000020
                PU L#20 (P#5) cpuset=0x00000020
        L2 L#21 (256KB) cpuset=0x00000080
          L1d L#21 (32KB) cpuset=0x00000080
            L1i L#21 (32KB) cpuset=0x00000080
              Core L#21 cpuset=0x00000080
                PU L#21 (P#7) cpuset=0x00000080
        L2 L#22 (256KB) cpuset=0x00000200
          L1d L#22 (32KB) cpuset=0x00000200
            L1i L#22 (32KB) cpuset=0x00000200
              Core L#22 cpuset=0x00000200
                PU L#22 (P#9) cpuset=0x00000200
        L2 L#23 (256KB) cpuset=0x00000800
          L1d L#23 (32KB) cpuset=0x00000800
            L1i L#23 (32KB) cpuset=0x00000800
              Core L#23 cpuset=0x00000800
                PU L#23 (P#11) cpuset=0x00000800
        L2 L#24 (256KB) cpuset=0x00002000
          L1d L#24 (32KB) cpuset=0x00002000
            L1i L#24 (32KB) cpuset=0x00002000
              Core L#24 cpuset=0x00002000
                PU L#24 (P#13) cpuset=0x00002000
        L2 L#25 (256KB) cpuset=0x00008000
          L1d L#25 (32KB) cpuset=0x00008000
            L1i L#25 (32KB) cpuset=0x00008000
              Core L#25 cpuset=0x00008000
                PU L#25 (P#15) cpuset=0x00008000
        L2 L#26 (256KB) cpuset=0x00020000
          L1d L#26 (32KB) cpuset=0x00020000
            L1i L#26 (32KB) cpuset=0x00020000
              Core L#26 cpuset=0x00020000
                PU L#26 (P#17) cpuset=0x00020000
        L2 L#27 (256KB) cpuset=0x00080000
          L1d L#27 (32KB) cpuset=0x00080000
            L1i L#27 (32KB) cpuset=0x00080000
              Core L#27 cpuset=0x00080000
                PU L#27 (P#19) cpuset=0x00080000
        L2 L#28 (256KB) cpuset=0x00200000
          L1d L#28 (32KB) cpuset=0x00200000
            L1i L#28 (32KB) cpuset=0x00200000
              Core L#28 cpuset=0x00200000
                PU L#28 (P#21) cpuset=0x00200000
        L2 L#29 (256KB) cpuset=0x00800000
          L1d L#29 (32KB) cpuset=0x00800000
            L1i L#29 (32KB) cpuset=0x00800000
              Core L#29 cpuset=0x00800000
                PU L#29 (P#23) cpuset=0x00800000
        L2 L#30 (256KB) cpuset=0x02000000
          L1d L#30 (32KB) cpuset=0x02000000
            L1i L#30 (32KB) cpuset=0x02000000
              Core L#30 cpuset=0x02000000
                PU L#30 (P#25) cpuset=0x02000000
        L2 L#31 (256KB) cpuset=0x08000000
          L1d L#31 (32KB) cpuset=0x08000000
            L1i L#31 (32KB) cpuset=0x08000000
              Core L#31 cpuset=0x08000000
                PU L#31 (P#27) cpuset=0x08000000
        L2 L#32 (256KB) cpuset=0x20000000
          L1d L#32 (32KB) cpuset=0x20000000
            L1i L#32 (32KB) cpuset=0x20000000
              Core L#32 cpuset=0x20000000
                PU L#32 (P#29) cpuset=0x20000000
        L2 L#33 (256KB) cpuset=0x80000000
          L1d L#33 (32KB) cpuset=0x80000000
            L1i L#33 (32KB) cpuset=0x80000000
              Core L#33 cpuset=0x80000000
                PU L#33 (P#31) cpuset=0x80000000
        L2 L#34 (256KB) cpuset=0x00000002,0x0
          L1d L#34 (32KB) cpuset=0x00000002,0x0
            L1i L#34 (32KB) cpuset=0x00000002,0x0
              Core L#34 cpuset=0x00000002,0x0
                PU L#34 (P#33) cpuset=0x00000002,0x0
        L2 L#35 (256KB) cpuset=0x00000008,0x0
          L1d L#35 (32KB) cpuset=0x00000008,0x0
            L1i L#35 (32KB) cpuset=0x00000008,0x0
              Core L#35 cpuset=0x00000008,0x0
                PU L#35 (P#35) cpuset=0x00000008,0x0
[root@dcalph001 ~]#
===========================================================================

BTW, we are not building slurm with hwloc.
Can you re-test using the --exclusive flag with your sbatch command? I'm curious whether the placement will change; it may be that the allocation itself isn't considering the -m request, and srun is just using whichever cpus were allocated to the process. Also - without hwloc none of this should be working properly - can you verify whether you're running with or without it?
================================================================================================================
-sh-4.1$ sbatch --exclusive -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 844
-sh-4.1$ cat slurm-844.out
cpu_bind=MASK - dcalph004, task 0 0 [132959]: mask 0xfffffffff set
-sh-4.1$
================================================================================================================

I've just installed hwloc to accommodate your request to provide the "lstopo -c" output; it was not present during the slurm build. Here is a snippet from our config.log:

...
configure:20747: checking for hwloc installation
configure:20810: result:
configure:20814: WARNING: unable to locate hwloc installation
...

Shall we rebuild with hwloc?
JFYI, I have just rebuilt slurm with hwloc - the results are pretty much the same:

-sh-4.1$ sbatch -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 845
-sh-4.1$ cat slurm-845.out
cpu_bind=MASK - dcalph004, task 0 0 [135114]: mask 0x555555555 set
-sh-4.1$
And with hwloc and --exclusive:

-sh-4.1$ sbatch --exclusive -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 846
-sh-4.1$ cat slurm-846.out
cpu_bind=MASK - dcalph004, task 0 0 [136719]: mask 0xfffffffff set
-sh-4.1$
Will you try this?

sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun cat /proc/self/status | grep -i cpus_ | sort -n"
-sh-4.1$ sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun cat /proc/self/status | grep -i cpus_ | sort -n"
Submitted batch job 910
-sh-4.1$ cat slurm-910.out
Cpus_allowed:       0000,00000000,00000000,0000000f,ffffffff
Cpus_allowed_list:  0-35
-sh-4.1$
What about?

sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun --cpu_bind=cores cat /proc/self/status | grep -i cpus_ | sort -n"
-sh-4.1$ sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun --cpu_bind=cores cat /proc/self/status | grep -i cpus_ | sort -n"
Submitted batch job 912
-sh-4.1$ cat slurm-912.out
Cpus_allowed:       0000,00000000,00000000,00000000,0000003f
Cpus_allowed_list:  0-5
-sh-4.1$
Hi,

Is that expected, i.e. does "-m block:fcyclic" have to be accompanied by "srun --cpu_bind=cores ..." to achieve what we need?
Yes. You need --exclusive to get an allocation on both sockets, and --cpu_bind=cores to bind to just cores; otherwise it is binding to the whole node.

Slurm tries to figure out the best binding by matching the number of requested cpus to the number of available resources and binding to the appropriate resource. This is called auto-binding -- see the srun man page. When you request 18 cpus with exclusive node access, that doesn't match any total number of resources on the node (e.g. 18 cores != 36 cores), so it just binds to the whole node. You can see this in the slurmd logs.

In my case I have 2 sockets with 6 cores each. If I request 6 cpus and exclusive access, the job binds to the whole node:

debug: binding tasks:6 to nodes:1 sockets:2:0 cores:12:0 threads:12 lllp_distribution jobid [2545] auto binding off: mask_cpu,one_thread
debug: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0xFFF)

If I request 2 cpus and exclusive access, the job is bound to the sockets:

debug: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
debug: binding tasks:2 to nodes:1 sockets:2:0 cores:12:0 threads:12 lllp_distribution jobid [2548] implicit auto binding: sockets,one_thread, dist 50

And, as in your first case, where you are only being allocated the first socket (not exclusive node access), the job is bound to cores on the first socket:

debug: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
debug: binding tasks:6 to nodes:0 sockets:1:0 cores:6:0 threads:6 lllp_distribution jobid [2554] implicit auto binding: cores,one_thread, dist 50

So in your case it's good to be specific in the binding. Does this help?
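(The decision rule described above can be caricatured in a few lines. This is a toy model for illustration only, not Slurm's actual code: compare the task's requested CPU count to the size of each allocated resource level, and fall back to binding the whole allocation when nothing matches. It reproduces the three log excerpts quoted above.)

```python
def auto_bind(ncpus, sockets, cores, threads):
    """Toy model of Slurm's auto-binding choice (illustration only).

    Bind to the first resource level whose size exactly matches the
    requested CPU count; otherwise bind to the whole allocation.
    """
    if ncpus == cores:
        return "cores"
    if ncpus == threads:
        return "threads"
    if ncpus == sockets:
        return "sockets"
    return "whole allocation"

# Brian's 2-socket x 6-core node, one thread per core:
assert auto_bind(6, sockets=2, cores=12, threads=12) == "whole allocation"
assert auto_bind(2, sockets=2, cores=12, threads=12) == "sockets"
# Non-exclusive case: only one socket (6 cores) was allocated:
assert auto_bind(6, sockets=1, cores=6, threads=6) == "cores"
```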
Hi,

The --exclusive requirement clashes with the goal I described:

"The goal of round-robin across sockets for g09 is our recent discovery that for some applications (e.g. vasp) memory bandwidth per socket becomes a clear bottleneck on high-core-count Intel SKUs (we are using http://ark.intel.com/products/81061/Intel-Xeon-Processor-E5-2699-v3-45M-Cache-2_30-GHz ). So, for example, it doesn't make sense to run 18 cores of g09 on a single socket: it is preferable to leave two sockets, each with 9 free cores, for vasp, rather than one socket with 18 free cores."

We are clearly facing that memory per socket bottleneck. That is not theoretical. We spent a lot of time figuring that out in collaboration with Dell and Intel.
Sorry for the typo: ...We are clearly facing that memory _bandwidth_ per socket bottleneck. That is not theoretical. We spent a lot of time to figure that out in collaboration with Dell and Intel.
To me this is a real deficiency, because it prevents us from leveraging real hardware. It might be much easier to program if all sophisticated --cpu_bind modes depend on --exclusive, but that clearly does not match the reality of massively multi-core processors.
Try this:

sbatch -N1 -n18 --ntasks-per-socket=9 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=18 --cpu_bind=core cat /proc/self/status | grep -i cpus_ | sort -n"

Doing it this way, I'm able to use half of each socket and pack two jobs on the node:

brian@knc:/localhome/brian/slurm/15.08/knc$ sbatch -N1 -n6 --ntasks-per-socket=3 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=6 --cpu_bind=core ~/whereami 10"
Submitted batch job 2592
brian@knc:/localhome/brian/slurm/15.08/knc$ sbatch -N1 -n6 --ntasks-per-socket=3 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=6 --cpu_bind=core ~/whereami 10"
Submitted batch job 2593
brian@knc:/localhome/brian/slurm/15.08/knc$ sbatch -N1 -n6 --ntasks-per-socket=3 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=6 --cpu_bind=core ~/whereami 10"
Submitted batch job 2594
brian@knc:/localhome/brian/slurm/15.08/knc$ squeue
  JOBID PARTITION  NAME   USER ST  TIME NODES NODELIST(REASON)
   2594     debug  wrap  brian  R  0:00     1 compy2
   2592     debug  wrap  brian  R  0:03     1 compy1
   2593     debug  wrap  brian  R  0:03     1 compy1
brian@knc:/localhome/brian/slurm/15.08/knc$ cat slurm-2592.out
0 compy1 - MASK:0x3f sleeping 10 seconds
brian@knc:/localhome/brian/slurm/15.08/knc$ cat slurm-2593.out
0 compy1 - MASK:0xfc0 sleeping 10 seconds
And in this case --cpu_bind=cores is not needed.
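(As a quick sanity check of the masks in that transcript -- assuming compy1 is the 2-socket x 6-core node described earlier, so 12 CPUs total -- the two jobs' bindings are disjoint and together cover the whole node:)

```python
job_2592 = 0x03F   # mask reported in slurm-2592.out
job_2593 = 0xFC0   # mask reported in slurm-2593.out

assert job_2592 & job_2593 == 0        # the two jobs share no CPUs
assert job_2592 | job_2593 == 0xFFF    # together: all 12 CPUs on the node
assert bin(job_2592).count("1") == 6   # each job got its 6 CPUs
assert bin(job_2593).count("1") == 6
```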
Thanks a lot Brian! That is exactly what we need. I am closing this.
Glad that will work for you. Thanks, Brian