Ticket 2179

Summary: -m ...:fcyclic doesn't distribute cores across sockets round-robin for a single task
Product: Slurm Reporter: Sergey Meirovich <sergey_meirovich>
Component: User Commands    Assignee: Brian Christiansen <brian>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: brian, da
Version: 15.08.4   
Hardware: Linux   
OS: Linux   
Site: AMAT Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf

Description Sergey Meirovich 2015-11-22 05:01:36 MST
Created attachment 2437 [details]
slurm.conf

Hi,

Maybe I am misunderstanding how fcyclic works. My goal is to distribute Gaussian (g09) threads across sockets. g09 runs in multi-threaded mode.
=====================================================================================================
-sh-4.1$ sbatch -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 822
-sh-4.1$ cat slurm-822.out 
cpu_bind=MASK - dcalph055, task  0  0 [58201]: mask 0x555555555 set
-sh-4.1$ head -n1 r.in 
%nprocshared=18
-sh-4.1$ 
=====================================================================================================

However:
=====================================================================================================
[root@dcalph055 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
node 0 size: 65439 MB
node 0 free: 62808 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35
node 1 size: 65536 MB
node 1 free: 63892 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
[root@dcalph055 ~]# 
=====================================================================================================

The goal of round-robin across sockets for g09 comes from our recent discovery that for some applications (e.g. vasp) memory bandwidth per socket becomes a clear bottleneck on high-core-count Intel SKUs (we are using http://ark.intel.com/products/81061/Intel-Xeon-Processor-E5-2699-v3-45M-Cache-2_30-GHz )

So, for example, it doesn't make sense to run 18 g09 threads on a single socket: it is preferable to spread them across both sockets, leaving 9 free cores on each for vasp, instead of fully occupying one socket with all 18.

So, contrary to my expectations, `-m block:fcyclic` still leads to all threads being put on the same socket.

Please advise.
Comment 1 Tim Wickberg 2015-11-23 08:56:22 MST
I'm looking through this now, and trying to see if there may be a disagreement between the NUMA layout numbering and Slurm's internal representation. 

The mask of 0x555555555 implies that the 18-core constraint is being applied, but it looks like you'd expect 0xffffc0000 or 0x00003ffff given your hardware?

Can you send in the output from "lstopo -c" for comparison?

- Tim
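For reference, affinity masks like these can be decoded into CPU lists with a short Python sketch (an illustration only, not part of the original exchange; the interleaved CPU numbering matches the numactl output in the description):

```python
def mask_to_cpus(mask):
    """Return the sorted list of CPU ids whose bits are set in an affinity mask."""
    cpus = []
    bit = 0
    while mask:
        if mask & 1:
            cpus.append(bit)
        mask >>= 1
        bit += 1
    return cpus

# The observed mask sets every even bit 0..34: with this node's interleaved
# numbering (NUMA node 0 = even CPUs), that is all 18 cores of socket 0.
print(mask_to_cpus(0x555555555))

# The two masks Tim mentions each cover 18 consecutive CPU ids instead.
print(mask_to_cpus(0x00003ffff))  # CPUs 0-17
print(mask_to_cpus(0xffffc0000))  # CPUs 18-35
```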
Comment 2 Sergey Meirovich 2015-11-23 09:12:24 MST
Hi Tim,

Yes, exactly. I expect 0xffffc0000 or 0x00003ffff

==========================================================================
[root@dcalph001 ~]# lstopo -c
Machine (128GB) cpuset=0x0000000f,0xffffffff
  NUMANode L#0 (P#0 64GB) cpuset=0x00000005,0x55555555
    Socket L#0 cpuset=0x00000005,0x55555555
      L3 L#0 (45MB) cpuset=0x00000005,0x55555555
        L2 L#0 (256KB) cpuset=0x00000001
          L1d L#0 (32KB) cpuset=0x00000001
            L1i L#0 (32KB) cpuset=0x00000001
              Core L#0 cpuset=0x00000001
                PU L#0 (P#0) cpuset=0x00000001
        L2 L#1 (256KB) cpuset=0x00000004
          L1d L#1 (32KB) cpuset=0x00000004
            L1i L#1 (32KB) cpuset=0x00000004
              Core L#1 cpuset=0x00000004
                PU L#1 (P#2) cpuset=0x00000004
        L2 L#2 (256KB) cpuset=0x00000010
          L1d L#2 (32KB) cpuset=0x00000010
            L1i L#2 (32KB) cpuset=0x00000010
              Core L#2 cpuset=0x00000010
                PU L#2 (P#4) cpuset=0x00000010
        L2 L#3 (256KB) cpuset=0x00000040
          L1d L#3 (32KB) cpuset=0x00000040
            L1i L#3 (32KB) cpuset=0x00000040
              Core L#3 cpuset=0x00000040
                PU L#3 (P#6) cpuset=0x00000040
        L2 L#4 (256KB) cpuset=0x00000100
          L1d L#4 (32KB) cpuset=0x00000100
            L1i L#4 (32KB) cpuset=0x00000100
              Core L#4 cpuset=0x00000100
                PU L#4 (P#8) cpuset=0x00000100
        L2 L#5 (256KB) cpuset=0x00000400
          L1d L#5 (32KB) cpuset=0x00000400
            L1i L#5 (32KB) cpuset=0x00000400
              Core L#5 cpuset=0x00000400
                PU L#5 (P#10) cpuset=0x00000400
        L2 L#6 (256KB) cpuset=0x00001000
          L1d L#6 (32KB) cpuset=0x00001000
            L1i L#6 (32KB) cpuset=0x00001000
              Core L#6 cpuset=0x00001000
                PU L#6 (P#12) cpuset=0x00001000
        L2 L#7 (256KB) cpuset=0x00004000
          L1d L#7 (32KB) cpuset=0x00004000
            L1i L#7 (32KB) cpuset=0x00004000
              Core L#7 cpuset=0x00004000
                PU L#7 (P#14) cpuset=0x00004000
        L2 L#8 (256KB) cpuset=0x00010000
          L1d L#8 (32KB) cpuset=0x00010000
            L1i L#8 (32KB) cpuset=0x00010000
              Core L#8 cpuset=0x00010000
                PU L#8 (P#16) cpuset=0x00010000
        L2 L#9 (256KB) cpuset=0x00040000
          L1d L#9 (32KB) cpuset=0x00040000
            L1i L#9 (32KB) cpuset=0x00040000
              Core L#9 cpuset=0x00040000
                PU L#9 (P#18) cpuset=0x00040000
        L2 L#10 (256KB) cpuset=0x00100000
          L1d L#10 (32KB) cpuset=0x00100000
            L1i L#10 (32KB) cpuset=0x00100000
              Core L#10 cpuset=0x00100000
                PU L#10 (P#20) cpuset=0x00100000
        L2 L#11 (256KB) cpuset=0x00400000
          L1d L#11 (32KB) cpuset=0x00400000
            L1i L#11 (32KB) cpuset=0x00400000
              Core L#11 cpuset=0x00400000
                PU L#11 (P#22) cpuset=0x00400000
        L2 L#12 (256KB) cpuset=0x01000000
          L1d L#12 (32KB) cpuset=0x01000000
            L1i L#12 (32KB) cpuset=0x01000000
              Core L#12 cpuset=0x01000000
                PU L#12 (P#24) cpuset=0x01000000
        L2 L#13 (256KB) cpuset=0x04000000
          L1d L#13 (32KB) cpuset=0x04000000
            L1i L#13 (32KB) cpuset=0x04000000
              Core L#13 cpuset=0x04000000
                PU L#13 (P#26) cpuset=0x04000000
        L2 L#14 (256KB) cpuset=0x10000000
          L1d L#14 (32KB) cpuset=0x10000000
            L1i L#14 (32KB) cpuset=0x10000000
              Core L#14 cpuset=0x10000000
                PU L#14 (P#28) cpuset=0x10000000
        L2 L#15 (256KB) cpuset=0x40000000
          L1d L#15 (32KB) cpuset=0x40000000
            L1i L#15 (32KB) cpuset=0x40000000
              Core L#15 cpuset=0x40000000
                PU L#15 (P#30) cpuset=0x40000000
        L2 L#16 (256KB) cpuset=0x00000001,0x0
          L1d L#16 (32KB) cpuset=0x00000001,0x0
            L1i L#16 (32KB) cpuset=0x00000001,0x0
              Core L#16 cpuset=0x00000001,0x0
                PU L#16 (P#32) cpuset=0x00000001,0x0
        L2 L#17 (256KB) cpuset=0x00000004,0x0
          L1d L#17 (32KB) cpuset=0x00000004,0x0
            L1i L#17 (32KB) cpuset=0x00000004,0x0
              Core L#17 cpuset=0x00000004,0x0
                PU L#17 (P#34) cpuset=0x00000004,0x0
    HostBridge L#0
      PCIBridge
        PCI 14e4:168e
          Net L#0 "eth0"
        PCI 14e4:168e
          Net L#1 "eth1"
      PCIBridge
        PCI 1000:005d
          Block L#2 "sda"
      PCIBridge
        PCI 15b3:1003
          Net L#3 "ib0"
          Net L#4 "ib1"
          OpenFabrics L#5 "mlx4_0"
      PCI 8086:8d62
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 102b:0534
      PCI 8086:8d02
  NUMANode L#1 (P#1 64GB) cpuset=0x0000000a,0xaaaaaaaa
    Socket L#1 cpuset=0x0000000a,0xaaaaaaaa
      L3 L#1 (45MB) cpuset=0x0000000a,0xaaaaaaaa
        L2 L#18 (256KB) cpuset=0x00000002
          L1d L#18 (32KB) cpuset=0x00000002
            L1i L#18 (32KB) cpuset=0x00000002
              Core L#18 cpuset=0x00000002
                PU L#18 (P#1) cpuset=0x00000002
        L2 L#19 (256KB) cpuset=0x00000008
          L1d L#19 (32KB) cpuset=0x00000008
            L1i L#19 (32KB) cpuset=0x00000008
              Core L#19 cpuset=0x00000008
                PU L#19 (P#3) cpuset=0x00000008
        L2 L#20 (256KB) cpuset=0x00000020
          L1d L#20 (32KB) cpuset=0x00000020
            L1i L#20 (32KB) cpuset=0x00000020
              Core L#20 cpuset=0x00000020
                PU L#20 (P#5) cpuset=0x00000020
        L2 L#21 (256KB) cpuset=0x00000080
          L1d L#21 (32KB) cpuset=0x00000080
            L1i L#21 (32KB) cpuset=0x00000080
              Core L#21 cpuset=0x00000080
                PU L#21 (P#7) cpuset=0x00000080
        L2 L#22 (256KB) cpuset=0x00000200
          L1d L#22 (32KB) cpuset=0x00000200
            L1i L#22 (32KB) cpuset=0x00000200
              Core L#22 cpuset=0x00000200
                PU L#22 (P#9) cpuset=0x00000200
        L2 L#23 (256KB) cpuset=0x00000800
          L1d L#23 (32KB) cpuset=0x00000800
            L1i L#23 (32KB) cpuset=0x00000800
              Core L#23 cpuset=0x00000800
                PU L#23 (P#11) cpuset=0x00000800
        L2 L#24 (256KB) cpuset=0x00002000
          L1d L#24 (32KB) cpuset=0x00002000
            L1i L#24 (32KB) cpuset=0x00002000
              Core L#24 cpuset=0x00002000
                PU L#24 (P#13) cpuset=0x00002000
        L2 L#25 (256KB) cpuset=0x00008000
          L1d L#25 (32KB) cpuset=0x00008000
            L1i L#25 (32KB) cpuset=0x00008000
              Core L#25 cpuset=0x00008000
                PU L#25 (P#15) cpuset=0x00008000
        L2 L#26 (256KB) cpuset=0x00020000
          L1d L#26 (32KB) cpuset=0x00020000
            L1i L#26 (32KB) cpuset=0x00020000
              Core L#26 cpuset=0x00020000
                PU L#26 (P#17) cpuset=0x00020000
        L2 L#27 (256KB) cpuset=0x00080000
          L1d L#27 (32KB) cpuset=0x00080000
            L1i L#27 (32KB) cpuset=0x00080000
              Core L#27 cpuset=0x00080000
                PU L#27 (P#19) cpuset=0x00080000
        L2 L#28 (256KB) cpuset=0x00200000
          L1d L#28 (32KB) cpuset=0x00200000
            L1i L#28 (32KB) cpuset=0x00200000
              Core L#28 cpuset=0x00200000
                PU L#28 (P#21) cpuset=0x00200000
        L2 L#29 (256KB) cpuset=0x00800000
          L1d L#29 (32KB) cpuset=0x00800000
            L1i L#29 (32KB) cpuset=0x00800000
              Core L#29 cpuset=0x00800000
                PU L#29 (P#23) cpuset=0x00800000
        L2 L#30 (256KB) cpuset=0x02000000
          L1d L#30 (32KB) cpuset=0x02000000
            L1i L#30 (32KB) cpuset=0x02000000
              Core L#30 cpuset=0x02000000
                PU L#30 (P#25) cpuset=0x02000000
        L2 L#31 (256KB) cpuset=0x08000000
          L1d L#31 (32KB) cpuset=0x08000000
            L1i L#31 (32KB) cpuset=0x08000000
              Core L#31 cpuset=0x08000000
                PU L#31 (P#27) cpuset=0x08000000
        L2 L#32 (256KB) cpuset=0x20000000
          L1d L#32 (32KB) cpuset=0x20000000
            L1i L#32 (32KB) cpuset=0x20000000
              Core L#32 cpuset=0x20000000
                PU L#32 (P#29) cpuset=0x20000000
        L2 L#33 (256KB) cpuset=0x80000000
          L1d L#33 (32KB) cpuset=0x80000000
            L1i L#33 (32KB) cpuset=0x80000000
              Core L#33 cpuset=0x80000000
                PU L#33 (P#31) cpuset=0x80000000
        L2 L#34 (256KB) cpuset=0x00000002,0x0
          L1d L#34 (32KB) cpuset=0x00000002,0x0
            L1i L#34 (32KB) cpuset=0x00000002,0x0
              Core L#34 cpuset=0x00000002,0x0
                PU L#34 (P#33) cpuset=0x00000002,0x0
        L2 L#35 (256KB) cpuset=0x00000008,0x0
          L1d L#35 (32KB) cpuset=0x00000008,0x0
            L1i L#35 (32KB) cpuset=0x00000008,0x0
              Core L#35 cpuset=0x00000008,0x0
                PU L#35 (P#35) cpuset=0x00000008,0x0
[root@dcalph001 ~]# 
===========================================================================

BTW, we are not building slurm with hwloc.
Comment 3 Tim Wickberg 2015-11-23 09:39:44 MST
Can you re-test using the --exclusive flag with your sbatch command? I'm 
curious if the placement will change; it may be that the allocation 
itself isn't considering the -m request, and the srun is just using 
whichever cpus were allocated to the process.

Also - without hwloc none of this should be working properly - can you 
verify whether you're running with or without it?
Comment 4 Sergey Meirovich 2015-11-23 10:08:33 MST
================================================================================================================
-sh-4.1$ sbatch --exclusive  -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 844
-sh-4.1$ cat slurm-844.out 
cpu_bind=MASK - dcalph004, task  0  0 [132959]: mask 0xfffffffff set
-sh-4.1$ 
================================================================================================================

I've just installed hwloc to accommodate your request to provide "lstopo -c" output.
It was not present during the slurm build. Here is a snippet from our config.log:
...
configure:20747: checking for hwloc installation
configure:20810: result: 
configure:20814: WARNING: unable to locate hwloc installation
...

Shall we rebuild with hwloc?
Comment 5 Sergey Meirovich 2015-11-23 10:31:11 MST
JFYI,

I have just rebuilt slurm with hwloc - the results are pretty much the same:

-sh-4.1$ sbatch -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 845
-sh-4.1$ cat slurm-845.out 
cpu_bind=MASK - dcalph004, task  0  0 [135114]: mask 0x555555555 set
-sh-4.1$
Comment 6 Sergey Meirovich 2015-11-23 11:01:14 MST
And with hwloc and --exclusive:

-sh-4.1$ sbatch --exclusive -N1 -n 1 --cpus-per-task=18 -m block:fcyclic --wrap="srun --cpu_bind=v g09 r.in"
Submitted batch job 846
-sh-4.1$ cat slurm-846.out
cpu_bind=MASK - dcalph004, task  0  0 [136719]: mask 0xfffffffff set
-sh-4.1$
Comment 7 Brian Christiansen 2015-11-24 10:40:24 MST
Will you try this?

sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun cat /proc/self/status | grep -i cpus_ | sort -n"
Comment 8 Sergey Meirovich 2015-11-24 10:50:26 MST
-sh-4.1$ sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun cat /proc/self/status | grep -i cpus_ | sort -n"
Submitted batch job 910
-sh-4.1$ cat slurm-910.out 
Cpus_allowed:	0000,00000000,00000000,0000000f,ffffffff
Cpus_allowed_list:	0-35
-sh-4.1$
Comment 9 Danny Auble 2015-11-24 10:51:49 MST
What about?

sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun --cpu_bind=cores cat /proc/self/status | grep -i cpus_ | sort -n"
Comment 10 Sergey Meirovich 2015-11-24 11:12:16 MST
-sh-4.1$ sbatch -N1 -n1 --cpus-per-task=6 --exclusive -mblock:fcyclic --wrap="srun --cpu_bind=cores cat /proc/self/status | grep -i cpus_ | sort -n"
Submitted batch job 912
-sh-4.1$ cat slurm-912.out
Cpus_allowed:	0000,00000000,00000000,00000000,0000003f
Cpus_allowed_list:	0-5
-sh-4.1$
Comment 11 Sergey Meirovich 2015-11-24 21:53:33 MST
Hi,

Is that expected, i.e. does "-m block:fcyclic" have to be accompanied by "srun --cpu_bind=cores..." to achieve what we need?
Comment 12 Brian Christiansen 2015-11-25 02:53:26 MST
Yes. You need to have --exclusive to get an allocation on both sockets and --cpu_bind=cores to bind to just cores, otherwise it is binding to the whole node. Slurm tries to figure out the best binding by matching the number of requested cpus to the number of available resources and binding to the appropriate resource. This is called auto-binding -- see srun man page. 

When you request 18 cpus -- with exclusive node access -- that doesn't match the total count of any resource on the node (e.g. 18 != 36 cores), so it just binds to the whole node. 

You can see this in the slurmd logs. In my case I have 2 sockets with 6 cores each. If I request 6 cpus and exclusive access, the job binds to the whole node:

debug:  binding tasks:6 to nodes:1 sockets:2:0 cores:12:0 threads:12
lllp_distribution jobid [2545] auto binding off: mask_cpu,one_thread
debug:  task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0xFFF)


If I request 2 cpus and exclusive access, the job is bound to the sockets:

debug:  task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
debug:  binding tasks:2 to nodes:1 sockets:2:0 cores:12:0 threads:12
lllp_distribution jobid [2548] implicit auto binding: sockets,one_thread, dist 50


And, as in your first case -- without exclusive node access -- where you are allocated only the first socket, the job is bound to the cores of the first socket:

debug:  task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
debug:  binding tasks:6 to nodes:0 sockets:1:0 cores:6:0 threads:6
lllp_distribution jobid [2554] implicit auto binding: cores,one_thread, dist 50


So in your case it's good to be specific in the binding. Does this help?
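Brian's auto-binding description can be sketched roughly like this (a simplified illustration of the matching heuristic, not Slurm's actual code; `auto_bind_level` is a hypothetical helper):

```python
def auto_bind_level(requested_cpus, sockets, cores_per_socket, threads_per_core=1):
    """Pick a binding level by matching the request against whole resource counts.

    Rough sketch of the heuristic described in this ticket: if the requested
    CPU count matches the total number of sockets, cores, or threads in the
    allocation, bind at that level; otherwise fall back to binding to
    everything allocated (the whole node).
    """
    cores = sockets * cores_per_socket
    threads = cores * threads_per_core
    if requested_cpus == sockets:
        return "sockets"
    if requested_cpus == cores:
        return "cores"
    if requested_cpus == threads:
        return "threads"
    return "none (whole node)"

# Brian's 2-socket, 6-cores-per-socket examples:
print(auto_bind_level(2, 2, 6))   # matches socket count -> socket binding
print(auto_bind_level(6, 2, 6))   # matches nothing -> whole node
print(auto_bind_level(6, 1, 6))   # allocation is one socket -> core binding
```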
Comment 13 Sergey Meirovich 2015-11-25 03:06:49 MST
Hi,

The --exclusive requirement clashes with the goal I described:

"The goal of round-robin across sockets for g09 comes from our recent discovery that for some applications (e.g. vasp) memory bandwidth per socket becomes a clear bottleneck on high-core-count Intel SKUs (we are using http://ark.intel.com/products/81061/Intel-Xeon-Processor-E5-2699-v3-45M-Cache-2_30-GHz )

So, for example, it doesn't make sense to run 18 g09 threads on a single socket: it is preferable to spread them across both sockets, leaving 9 free cores on each for vasp, instead of fully occupying one socket with all 18."

We are clearly facing that memory per socket bottleneck. That is not theoretical. We spent a lot of time to figure that out in collaboration with Dell and Intel.
Comment 14 Sergey Meirovich 2015-11-25 03:10:00 MST
Sorry for the typo:
...We are clearly facing that memory _bandwidth_ per socket bottleneck. That is not theoretical. We spent a lot of time to figure that out in collaboration with Dell and Intel.
Comment 15 Sergey Meirovich 2015-11-25 03:43:31 MST
To me this is a real deficiency, because it prevents us from leveraging real hardware.

It might be much easier to program if all the sophisticated --cpu_bind modes depend on --exclusive, but that clearly does not match the reality of massively multi-core processors.
Comment 16 Brian Christiansen 2015-11-25 03:54:45 MST
Try this:
sbatch -N1 -n18 --ntasks-per-socket=9 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=18 --cpu_bind=core cat /proc/self/status | grep -i cpus_ | sort -n"

Doing it this way, I'm able to use half of each socket and pack two jobs on the node:

brian@knc:/localhome/brian/slurm/15.08/knc$ sbatch -N1 -n6 --ntasks-per-socket=3 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=6 --cpu_bind=core ~/whereami 10"
Submitted batch job 2592
brian@knc:/localhome/brian/slurm/15.08/knc$ sbatch -N1 -n6 --ntasks-per-socket=3 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=6 --cpu_bind=core ~/whereami 10"
Submitted batch job 2593
brian@knc:/localhome/brian/slurm/15.08/knc$ sbatch -N1 -n6 --ntasks-per-socket=3 -mblock:fcyclic --wrap="srun -n1 --cpus-per-task=6 --cpu_bind=core ~/whereami 10"
Submitted batch job 2594
brian@knc:/localhome/brian/slurm/15.08/knc$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2594     debug     wrap    brian  R       0:00      1 compy2
              2592     debug     wrap    brian  R       0:03      1 compy1
              2593     debug     wrap    brian  R       0:03      1 compy1
brian@knc:/localhome/brian/slurm/15.08/knc$ cat slurm-2592.out 
   0 compy1     - MASK:0x3f
sleeping 10 seconds
brian@knc:/localhome/brian/slurm/15.08/knc$ cat slurm-2593.out 
   0 compy1     - MASK:0xfc0
sleeping 10 seconds
Comment 17 Brian Christiansen 2015-11-25 03:58:05 MST
And in this case --cpu_bind=cores is not needed.
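A quick sanity check on the two masks above (an illustration only, not part of the original exchange): they are the same size, disjoint, and together cover all 12 CPUs of Brian's test node:

```python
job_a, job_b = 0x3f, 0xfc0  # masks from slurm-2592.out and slurm-2593.out

assert bin(job_a).count("1") == 6   # each job is bound to six CPUs
assert bin(job_b).count("1") == 6
assert job_a & job_b == 0           # the bindings do not overlap
assert job_a | job_b == 0xfff       # together they cover CPUs 0-11
print("the two jobs split the 12 CPUs evenly")
```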
Comment 18 Sergey Meirovich 2015-11-25 04:21:27 MST
Thanks a lot Brian! That is exactly what we need. I am closing this.
Comment 19 Brian Christiansen 2015-11-25 04:25:41 MST
Glad that will work for you. 

Thanks,
Brian