Hi Slurm experts: After we corrected our gres.conf, the option '--gres-flags=enforce-binding' works nicely, i.e. CPU affinity with the GPU card is respected. Now we are looking into the possibility of allocating two GPU cards on the same socket for a job requesting two GPU cards. For example, on a node with the topology below, either (GPU0 & GPU1) or (GPU2 & GPU3) would be assigned to the job, so that communication between the cards traverses only a single PCIe switch (PIX). Thank you!

$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  mlx5_0  CPU Affinity
GPU0     X    PIX   SOC   SOC   PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26
GPU1    PIX    X    SOC   SOC   PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26
GPU2    SOC   SOC    X    PIX   SOC     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27
GPU3    SOC   SOC   PIX    X    SOC     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27
mlx5_0  PHB   PHB   SOC   SOC    X

Legend:
  X   = Self
  SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
  PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX = Connection traversing a single PCIe switch
  NV# = Connection traversing a bonded set of # NVLinks
Unfortunately there's no easy way to do this right now. One potential way to simulate it is to set different type values for the GPUs in different NUMA domains. E.g., setting the type of two cards to numa1 and the other two to numa2 would let a job request --gres=gpu:numa1:2 and know which cards it would receive. But that would restrict jobs to requesting either the numa1 or the numa2 GPUs; there's no way to say that either type would be sufficient as long as the type matches. This is something we've considered adding, but it isn't on our roadmap currently (most new functionality is underpinned by sponsored development). I'm trying to chase down a Sev5 enhancement bug discussing this but haven't found it yet; if you'd like, I may reclassify this bug as such rather than open a new one if we don't have something covering this already. - Tim
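As a sketch of that workaround, a gres.conf along these lines would split the cards into two named types. The File= paths and Cores= lists here are illustrative assumptions for a generic two-socket, four-GPU node, not a tested configuration; they must be adjusted to match the actual device files and CPU affinity of the node.

```
# Hypothetical gres.conf sketch of the Type= workaround described above.
# File= paths and Cores= lists are assumptions; adjust to your hardware.
Name=gpu Type=numa1 File=/dev/nvidia0 Cores=0-13
Name=gpu Type=numa1 File=/dev/nvidia1 Cores=0-13
Name=gpu Type=numa2 File=/dev/nvidia2 Cores=14-27
Name=gpu Type=numa2 File=/dev/nvidia3 Cores=14-27
```

A job would then pin itself to one PCIe switch with, e.g., --gres=gpu:numa1:2, at the cost of having to name a specific type in the request.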
I have to retract my previous statement. I couldn't find the Sev5 bug because it was already marked as resolved - bug 1725. This should work as you expect with --gres-flags=enforce-binding as long as the socket mappings are set up appropriately. If the number of GPUs requested equals the number available on each socket, the scheduler should assign that pair to the job. Please let me know if that's not what you're seeing and I'll look into it further; otherwise I'll close this as a duplicate of that already-resolved enhancement request. - Tim
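Concretely, on the four-GPU node described above, a request like the following should then receive a PIX-connected pair on one socket (the script name is a placeholder):

```
sbatch --gres=gpu:2 --gres-flags=enforce-binding job.sh
```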
Marking as a duplicate of 1725. Please let me know if you have any further questions, or if it doesn't appear to be working properly. - Tim *** This ticket has been marked as a duplicate of ticket 1725 ***
Hi, I'm trying to figure out whether this bug was actually resolved in the general case, where there are non-PIX GPU pairs that share the same CPU affinity. Here is our topology:

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0     X    PIX   PHB   PHB   SOC   SOC   SOC   SOC   0-11,24-35
GPU1    PIX    X    PHB   PHB   SOC   SOC   SOC   SOC   0-11,24-35
GPU2    PHB   PHB    X    PIX   SOC   SOC   SOC   SOC   0-11,24-35
GPU3    PHB   PHB   PIX    X    SOC   SOC   SOC   SOC   0-11,24-35
GPU4    SOC   SOC   SOC   SOC    X    PIX   PHB   PHB   12-23,36-47
GPU5    SOC   SOC   SOC   SOC   PIX    X    PHB   PHB   12-23,36-47
GPU6    SOC   SOC   SOC   SOC   PHB   PHB    X    PIX   12-23,36-47
GPU7    SOC   SOC   SOC   SOC   PHB   PHB   PIX    X    12-23,36-47

When launching a 2-GPU job, is there a way to ensure that the allocated GPU cards will be, e.g., GPU0-GPU1 and not GPU1-GPU2? If I understand correctly, the option '--gres-flags=enforce-binding' would only ensure that the two cards are picked from either 0-3 or 4-7.
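For reference, one way to check which pair a job actually received is to print the visible devices and their interconnect from inside the allocation (a sketch; CUDA_VISIBLE_DEVICES is the variable the NVIDIA stack uses for device visibility):

```
srun --gres=gpu:2 --gres-flags=enforce-binding \
     bash -c 'echo "$CUDA_VISIBLE_DEVICES"; nvidia-smi topo -m'
```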
Hi, I imagine running something similar to the following would tell slurmctld and slurmd to allocate GPUs and CPU cores as desired:

sbatch --ntasks=1 --ntasks-per-node=1 --ntasks-per-socket=1 --cpus-per-task=2 --gres-flags=enforce-binding ...

We did not look into this further, since we have no urgent use cases for it.
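A fuller sketch along those lines, adding the GPU count to the request (untested here; the --wrap payload is just one way to print the resulting binding):

```
sbatch --ntasks=1 --ntasks-per-node=1 --ntasks-per-socket=1 \
       --cpus-per-task=2 --gres=gpu:2 --gres-flags=enforce-binding \
       --wrap='echo "$CUDA_VISIBLE_DEVICES"; nvidia-smi topo -m'
```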