Ticket 11819

Summary: Requesting particular CPU or GPU ids for jobs.
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: Other    Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN --- QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: bart, cinek, ezellma, tim
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: Stanford

Description Kilian Cavalotti 2021-06-11 17:59:48 MDT
Hi!

[This is a sort of spin-off of #11226 and #11247, to summarize the discussions there].

## Request

It would be beneficial to be able to request particular CPU and GPU ids when submitting a job, to provide a way to manually set affinity between the different devices on a node when it can't be detected automatically.


## Use case

The use case is to control resource allocation on nodes with complex topologies, to ensure proper process pinning, and optimal performance when accessing GPUs and NICs. 

The initial idea was to use --cpu-bind and --gpu-bind, as discussed in #11226 and #11247, but those options don't allow allocating particular CPU/GPU ids: they only bind tasks within an existing allocation.


## Demonstration

Considering the following node topology:

-- 8< ------------------------------------------------------------------
[root@sh03-14n15 ~]# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     SYS     32-63           1
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     SYS     32-63           1
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     0-31            0
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     0-31            0
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     PXB     96-127          3
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     PXB     96-127          3
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     64-95           2
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     64-95           2
mlx5_0  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
mlx5_1  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
-- 8< ------------------------------------------------------------------

Although the 8 GPUs on that node are identical, their CPU affinity varies, and each pair of GPUs has a distinct, privileged set of CPUs it works best with. More importantly, on multi-rail nodes like this one, each pair of GPUs has a strong affinity with a specific IB interface.

So here, for multi-node GPU-to-GPU communication, you'll want to use GPU[0-1] and CPU[32-63] together with mlx5_0, for instance. Using, say, mlx5_1 with GPU0 will result in disastrous performance, as the data will have to take unnecessary trips through the SMP interconnect to get from the GPU to the IB interface, instead of going straight from a GPU to the IB HCA it's directly connected to.

And the performance hit could easily reach 80%.

Here's a concrete example running osu_bw, the OSU MPI-CUDA bandwidth test (http://mvapich.cse.ohio-state.edu/benchmarks), between two machines with the topology mentioned above. All the sruns below are run within a 2-node exclusive allocation (full nodes):

- without particular consideration for pinning, we get 5GB/s:

$ srun -N 2 --ntasks-per-node=1 --gpus-per-node=1 get_local_rank osu_bw -m 4194304:8388608 D D
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304              4984.95
8388608              4987.90

- with explicit pinning, making sure to use GPU0 and mlx5_0, we get close to line rate: over 24GB/s (on an IB HDR link):

$ UCX_NET_DEVICES=mlx5_0:1 srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 bash -c 'CUDA_VISIBLE_DEVICES=0 get_local_rank osu_bw -m 4194304:8388608 D D'
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304             24373.33
8388608             24412.20

(CPU pinning doesn't really matter here, as it's a pure GPU-to-GPU RDMA transfer, so there's no data transfer to the CPUs)

The actual GPU pinning above is done by requesting all the GPUs on each node, and explicitly choosing the right one with CUDA_VISIBLE_DEVICES. 
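That pattern can be wrapped in a small launcher script that picks the GPU and its closest HCA from the task's local rank. A sketch only: pick_affine.sh is a hypothetical name, the rank-to-device map is hard-coded from the topology table above (GPU[0-1] with mlx5_0, GPU[4-5] with mlx5_1), and get_local_rank/osu_bw are as in the examples:

```shell
#!/bin/bash
# pick_affine.sh -- sketch: export an affine GPU/HCA pair per task,
# based on SLURM_LOCALID (set by srun) and the nvidia-smi topo table above.
case "${SLURM_LOCALID:-0}" in
  0) export CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 ;;
  1) export CUDA_VISIBLE_DEVICES=4 UCX_NET_DEVICES=mlx5_1:1 ;;
  *) echo "no affine GPU/HCA mapping for local rank $SLURM_LOCALID" >&2
     exit 1 ;;
esac
exec "$@"
```

Which would be used as: srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 ./pick_affine.sh get_local_rank osu_bw -m 4194304:8388608 D D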


If we request GPU2 instead (which is what happens by default, as in the first example), or if we use the 2nd IB interface with GPU0, we get the same huge performance hit:

$ UCX_NET_DEVICES=mlx5_0:1 srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 bash -c 'CUDA_VISIBLE_DEVICES=2 get_local_rank osu_bw -m 4194304:8388608 D D'
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304              4989.13
8388608              4990.53

$ UCX_NET_DEVICES=mlx5_1:1 srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 bash -c 'CUDA_VISIBLE_DEVICES=0 get_local_rank osu_bw -m 4194304:8388608 D D'
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304              5407.02
8388608              5408.89
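These numbers line up with the ~80% figure above, comparing the mis-pinned GPU2 run against the affine GPU0+mlx5_0 run at 8 MiB:

```shell
# relative bandwidth loss: GPU2 + mlx5_0 (4990.53 MB/s)
# vs. GPU0 + mlx5_0 (24412.20 MB/s), from the runs above
awk 'BEGIN { printf "%.1f%%\n", 100 * (1 - 4990.53 / 24412.20) }'
# -> 79.6%
```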


## Proposal

Automatic CPU-to-GPU binding is already somewhat possible via gres.conf. But users can't request specific "domains" (groups of close CPUs and GPUs), which may be necessary to achieve optimal performance.
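For reference, the gres.conf binding mentioned above would look something like this on such a node (a sketch only: it assumes the standard /dev/nvidia* device files, with Cores= taken from the CPU Affinity column of the nvidia-smi table; the matching Gres= entries in slurm.conf are omitted):

```
# gres.conf (sketch) -- tie each GPU to its local CPU cores
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia[0-1] Cores=32-63
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia[2-3] Cores=0-31
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia[4-5] Cores=96-127
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia[6-7] Cores=64-95
```

This gets the CPU-to-GPU pairing right automatically, but it still doesn't let a user ask for a particular GPU id, or for the HCA that goes with it.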

I guess the missing piece here would be to either:

1. provide a way to schedule interconnect interfaces (IB HCAs or others), taking their affinity to particular CPUs and GPUs into account so that the closest one(s) are allocated automatically. I guess that would require a way to describe and/or discover each node's full topology... That may be a huge undertaking, but it could be a worthwhile long-term investment, as new device classes, topologies, and hierarchy levels will keep appearing.

or

2. provide a way to manually control device affinity, by letting the user specify the particular devices she wants to use, like:
- one CPU among CPU[0-15]
- one GPU among GPU[0-3]
or even just CPU0 and GPU0, for instance.


I'm aware that over-specifying resource requests down to the CPU/GPU id will lead to increased pending times. But maybe a mechanism analogous to --switches=count@max-time could be provided, to strike some sort of balance?
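Purely as an illustration, such a request could borrow the shape of the existing map-based binding flags, but apply at allocation time. None of the options below exist in Slurm today; this is hypothetical syntax:

```
# hypothetical: allocate one GPU among ids 0-1 and one CPU among 32-63
srun --gpu-ids=0-1 --cpu-ids=32-63 ...

# hypothetical, --switches-style: prefer these ids for up to 30 minutes
# of pending time, then fall back to whatever is available
srun --gpu-ids=0-1@30 ...
```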


Of course, a possibility that exists today is to allocate full nodes and do the pinning manually on a subset of the allocated resources. But it's a bit of a waste of resources, and one that doesn't really require a scheduler in the first place. :)


Anyway, thanks for considering this, I hope it makes sense!

Cheers,
--
Kilian
Comment 2 Marcin Stolarek 2021-07-22 03:55:51 MDT
Kilian,

  We had an internal discussion on that where we didn't conclude with an idea that could be implemented in 21.08. I'll keep the ticket open as Severity 5 - Enhancement, so we can get back to it after the busy time of 21.08 release.

cheers,
Marcin
Comment 3 Kilian Cavalotti 2021-07-22 04:36:00 MDT
(In reply to Marcin Stolarek from comment #2)
>   We had an internal discussion on that where we didn't conclude with an
> idea that could be implemented in 21.08. I'll keep the ticket open as
> Severity 5 - Enhancement, so we can get back to it after the busy time of
> 21.08 release.

Noted, thanks!

Cheers,
--
Kilian
Comment 6 Marcin Stolarek 2021-10-12 11:41:57 MDT
Kilian,

We took a look at this one more time, and while we see an increasing need for specific binding of different node resources - driven by the increasing complexity of compute nodes' internal topology - we don't think that creating a way to skip the resource selection logic would be an overall solution to the issue.

We have some ideas on how to better handle those types of workloads, but it's nothing we can commit to yet.

cheers,
Marcin