Hi!

[This is a sort of spin-off of #11226 and #11247, to summarize the discussions there.]

## Request

It would be beneficial to be able to request particular CPU and GPU ids when submitting a job, to provide a way to manually set the affinity between the different devices on a node when it can't be detected automatically.

## Use case

The use case is to control resource allocation on nodes with complex topologies, to ensure proper process pinning and optimal performance when accessing GPUs and NICs. The initial idea was to use --cpu-bind and --gpu-bind, as discussed in #11226 and #11247, but those options don't allow allocating particular CPU/GPU ids.

## Demonstration

Consider the following node topology:

-- 8< ------------------------------------------------------------------
[root@sh03-14n15 ~]# nvidia-smi topo -m
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   mlx5_0  mlx5_1  CPU Affinity  NUMA Affinity
GPU0     X     NV12   NV12   NV12   NV12   NV12   NV12   NV12   PXB     SYS     32-63         1
GPU1    NV12    X     NV12   NV12   NV12   NV12   NV12   NV12   PXB     SYS     32-63         1
GPU2    NV12   NV12    X     NV12   NV12   NV12   NV12   NV12   SYS     SYS     0-31          0
GPU3    NV12   NV12   NV12    X     NV12   NV12   NV12   NV12   SYS     SYS     0-31          0
GPU4    NV12   NV12   NV12   NV12    X     NV12   NV12   NV12   SYS     PXB     96-127        3
GPU5    NV12   NV12   NV12   NV12   NV12    X     NV12   NV12   SYS     PXB     96-127        3
GPU6    NV12   NV12   NV12   NV12   NV12   NV12    X     NV12   SYS     SYS     64-95         2
GPU7    NV12   NV12   NV12   NV12   NV12   NV12   NV12    X     SYS     SYS     64-95         2
mlx5_0  PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS     X      SYS
mlx5_1  SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
-- 8< ------------------------------------------------------------------

Although the 8 GPUs on that node are identical, their CPU affinity varies, and each pair of GPUs has a differently privileged set of CPUs they can work with. More importantly, on multi-rail nodes like this one, each pair of GPUs has a strong affinity with a specific IB interface. So here, for multi-node GPU-to-GPU communication, you'll want to use GPU[0-1] and CPU[32-63] together with mlx5_0, for instance. Using, say, mlx5_1 with GPU0 will result in disastrous performance, as the data will need to take unnecessary trips through an SMP interconnect to go from the GPU to the IB interface, instead of going straight from the IB HCA to a GPU that's directly connected to it. The performance hit can easily reach 80%.

Here's a concrete example running osu_bw, the OSU MPI-CUDA bandwidth test (http://mvapich.cse.ohio-state.edu/benchmarks), between two machines with the topology shown above. All the sruns below are done within a 2-node exclusive allocation (full nodes):

- without particular consideration for pinning, we get 5 GB/s:

$ srun -N 2 --ntasks-per-node=1 --gpus-per-node=1 get_local_rank osu_bw -m 4194304:8388608 D D
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304              4984.95
8388608              4987.90

- with explicit pinning, making sure to use GPU0 and mlx5_0, we get close to line speed: over 24 GB/s (on an IB HDR link):

$ UCX_NET_DEVICES=mlx5_0:1 srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 bash -c 'CUDA_VISIBLE_DEVICES=0 get_local_rank osu_bw -m 4194304:8388608 D D'
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304             24373.33
8388608             24412.20

(CPU pinning doesn't really matter here, as it's a pure GPU-to-GPU RDMA transfer, so there's no data transfer to the CPUs.)

The actual GPU pinning above is done by requesting all the GPUs on each node, and
explicitly choosing the right one with CUDA_VISIBLE_DEVICES.

If we request GPU2 (which is what happens by default, as in the first example), or if we use the second IB interface with GPU0, we get the same huge performance hit:

$ UCX_NET_DEVICES=mlx5_0:1 srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 bash -c 'CUDA_VISIBLE_DEVICES=2 get_local_rank osu_bw -m 4194304:8388608 D D'
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304              4989.13
8388608              4990.53

$ UCX_NET_DEVICES=mlx5_1:1 srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 bash -c 'CUDA_VISIBLE_DEVICES=0 get_local_rank osu_bw -m 4194304:8388608 D D'
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304              5407.02
8388608              5408.89

## Proposal

Automatic CPU-to-GPU binding is already somewhat possible via gres.conf. But users can't request specific "domains" (groups of close CPUs and GPUs), which may be necessary to achieve optimal performance. I guess the missing piece here would be to either:

1. provide a way to schedule interconnect interfaces (IB HCAs or other), and take their affinity to particular CPUs and GPUs into account to automatically allocate the one(s) that are the closest. I guess that would require a way to describe and/or discover the nodes' full topology... That may be a huge undertaking, but could be a worthwhile investment in the longer term, as new device classes, new topologies and new hierarchical levels will keep appearing.

or

2. provide a way to manually control device affinity, by letting the user specify the particular devices she wants to use, like:
   - one CPU among CPU[0-15]
   - one GPU among GPU[0-3]
   or even just CPU0 and GPU0, for instance.

I'm aware that over-specifying resource requests down to the CPU/GPU id will lead to increased pending times.
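As an aside, here is what the gres.conf-based CPU-to-GPU affinity mentioned at the top of this section might look like for the node shown earlier. This is only a sketch: the node name and device files are assumptions, with the Cores values taken from the `nvidia-smi topo -m` CPU affinities:

```
# gres.conf -- illustrative sketch: tie each pair of GPUs to the cores
# of its local NUMA node, per the topology output above.
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia0 Cores=32-63
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia1 Cores=32-63
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia2 Cores=0-31
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia3 Cores=0-31
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia4 Cores=96-127
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia5 Cores=96-127
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia6 Cores=64-95
NodeName=sh03-14n15 Name=gpu File=/dev/nvidia7 Cores=64-95
```

This lets Slurm prefer cores close to whichever GPU it allocates, but it gives the user no way to request a specific GPU id or CPU/GPU domain, which is the gap described here.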
But maybe a mechanism analogous to --switches=count@max-time could be provided, to achieve some sort of balance?

Of course, a possibility that exists today is to allocate full nodes and do the pinning manually on a subset of the allocated resources. But that's a bit of a waste of resources, and one that doesn't really require a scheduler in the first place. :)

Anyway, thanks for considering this, I hope it makes sense!

Cheers,
--
Kilian
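P.S. For completeness, here's a minimal sketch of the full-node workaround mentioned above: request all GPUs on each node, and have a small wrapper map each local rank to a (GPU, HCA) pair that share a PCIe bridge. The wrapper name and the rank-to-device mapping are illustrative, based on the topology shown earlier (GPU0/1 are PXB-local to mlx5_0, GPU4/5 to mlx5_1):

```shell
#!/bin/bash
# pin_and_run.sh -- hypothetical wrapper for the full-node workaround.
# Maps a Slurm local rank to a (GPU, HCA) pair that share a PCIe bridge,
# per the `nvidia-smi topo -m` output above.
set_affinity() {
  case "$1" in
    0) export CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 ;;  # GPU0 <-> mlx5_0 (PXB)
    1) export CUDA_VISIBLE_DEVICES=4 UCX_NET_DEVICES=mlx5_1:1 ;;  # GPU4 <-> mlx5_1 (PXB)
    *) echo "no HCA-local GPU mapping for local rank $1" >&2; return 1 ;;
  esac
}

set_affinity "${SLURM_LOCALID:-0}" && exec "$@"
```

It would be used like the srun lines above, e.g. `srun -N 2 --ntasks-per-node=2 --gpus-per-node=8 ./pin_and_run.sh get_local_rank osu_bw -m 4194304:8388608 D D`. It works, but it still requires allocating the full nodes.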
Kilian,

We had an internal discussion on this, but we didn't come up with an idea that could be implemented in 21.08. I'll keep the ticket open as Severity 5 - Enhancement, so we can get back to it after the busy time of the 21.08 release.

cheers,
Marcin
(In reply to Marcin Stolarek from comment #2)
> We had an internal discussion on that where we didn't conclude with an
> idea that could be implemented in 21.08. I'll keep the ticket open as
> Severity 5 - Enhancement, so we can get back to it after the busy time of
> 21.08 release.

Noted, thanks!

Cheers,
--
Kilian
Kilian,

We took a look at this one more time. While we see the increasing need for specific binding of different node resources, driven by the growing complexity of the internal topology of compute nodes, we don't think that creating a way to skip the resource selection logic would be an overall solution to the issue. We have some ideas on how to provide better handling for these types of workloads, but it's nothing we can commit to yet.

cheers,
Marcin