| Summary: | How to request two GPU cards on the same socket? | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | NYU HPC Team <hpc-staff> |
| Component: | Other | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | leroux |
| Version: | 17.02.1 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NYU | | |
Description
NYU HPC Team
2017-04-17 12:39:04 MDT
Unfortunately there's no easy way to do this right now. There are some potential ways to simulate it by setting different type values for the GPUs in different NUMA domains. For example, setting the type of two cards to numa1 and the other two to numa2 would let a job request --gres=gpu:numa1:2 and know which cards it would receive. But that would restrict jobs to requesting either the numa1 or the numa2 GPUs; there's no way to say that either type would be sufficient as long as the type matches within the allocation. This is something we've considered adding, but it isn't on our roadmap currently (most new functionality is underpinned by sponsored development). I'm trying to chase down a Sev5 enhancement bug discussing this but haven't found it yet; if you'd like, I can reclassify this bug as such rather than opening a new one, if we don't have something covering this already.

- Tim

I have to retract my previous statement. I couldn't find the Sev5 bug because it was already marked as resolved - bug 1725. This should work as you expect with --gres-flags=enforce-binding, as long as the socket mappings are set up appropriately. If the number of GPUs requested equals the number available on each socket, the scheduler should assign that pair to the job. Please let me know if that's not what you're seeing and I'll look into it further; otherwise I'll close this as a duplicate of that already-resolved enhancement request.

- Tim

Marking as a duplicate of 1725. Please let me know if you have any further questions, or if it doesn't appear to be working properly.

- Tim

*** This ticket has been marked as a duplicate of ticket 1725 ***

Hi,
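The typed-GPU workaround described above could be sketched in gres.conf roughly as follows. This is only a sketch: the device paths, core ranges, and type labels are hypothetical, and the exact gres.conf parameter names should be checked against the man page for the Slurm release in use.

```
# gres.conf sketch (hypothetical device paths and core ranges):
# tag the two GPUs in each NUMA domain with a distinct type.
Name=gpu Type=numa1 File=/dev/nvidia0 Cores=0-11
Name=gpu Type=numa1 File=/dev/nvidia1 Cores=0-11
Name=gpu Type=numa2 File=/dev/nvidia2 Cores=12-23
Name=gpu Type=numa2 File=/dev/nvidia3 Cores=12-23
```

A job could then pin itself to one domain's pair explicitly with, e.g., `--gres=gpu:numa1:2` — at the cost, as noted above, of having to name one specific type.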
I'm trying to figure out whether this bug was actually resolved in the general case where there are non-PIX GPU pairs that have the same CPU affinity.
Here is our topology:
```
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0   X    PIX   PHB   PHB   SOC   SOC   SOC   SOC   0-11,24-35
GPU1  PIX    X    PHB   PHB   SOC   SOC   SOC   SOC   0-11,24-35
GPU2  PHB   PHB    X    PIX   SOC   SOC   SOC   SOC   0-11,24-35
GPU3  PHB   PHB   PIX    X    SOC   SOC   SOC   SOC   0-11,24-35
GPU4  SOC   SOC   SOC   SOC    X    PIX   PHB   PHB   12-23,36-47
GPU5  SOC   SOC   SOC   SOC   PIX    X    PHB   PHB   12-23,36-47
GPU6  SOC   SOC   SOC   SOC   PHB   PHB    X    PIX   12-23,36-47
GPU7  SOC   SOC   SOC   SOC   PHB   PHB   PIX    X    12-23,36-47
```
When launching a 2-GPU job, is there a way to ensure that the allocated GPU cards will be, e.g., GPU0-GPU1, and not GPU1-GPU2?
If I understand correctly, the option '--gres-flags=enforce-binding' would only ensure that two cards out of either 0-3 or 4-7 are picked.
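As an aside, the PIX pairs the question is after can be read off that matrix mechanically. A small sketch (the file name topo.txt is arbitrary; the heredoc simply reproduces the matrix rows from above, without the header line):

```shell
# List GPU pairs that share a PCIe switch (PIX) in an
# nvidia-smi-style topology matrix.
cat > topo.txt <<'EOF'
GPU0 X PIX PHB PHB SOC SOC SOC SOC 0-11,24-35
GPU1 PIX X PHB PHB SOC SOC SOC SOC 0-11,24-35
GPU2 PHB PHB X PIX SOC SOC SOC SOC 0-11,24-35
GPU3 PHB PHB PIX X SOC SOC SOC SOC 0-11,24-35
GPU4 SOC SOC SOC SOC X PIX PHB PHB 12-23,36-47
GPU5 SOC SOC SOC SOC PIX X PHB PHB 12-23,36-47
GPU6 SOC SOC SOC SOC PHB PHB X PIX 12-23,36-47
GPU7 SOC SOC SOC SOC PHB PHB PIX X 12-23,36-47
EOF
# Field i (i = 2..9) of row NR holds the link between GPU NR-1 and
# GPU i-2; the NR < i-1 check reports each pair only once.
awk '{ for (i = 2; i <= 9; i++)
         if ($i == "PIX" && NR < i - 1)
           printf "%s-GPU%d\n", $1, i - 2 }' topo.txt
# prints: GPU0-GPU1, GPU2-GPU3, GPU4-GPU5, GPU6-GPU7
```

On this topology the closely-coupled pairs are exactly (0,1), (2,3), (4,5), (6,7) — which is why an allocation of GPU1+GPU2, though on the same socket, would cross a host bridge.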
Hi,

I imagine running something similar to the following would tell slurmctld and slurmd to allocate GPUs and CPU cores as desired:

sbatch --ntasks=1 --ntasks-per-node=1 --ntasks-per-socket=1 --cpus-per-task=2 --gres-flags=enforce-binding ...

We did not look into this further, since we have no urgent use cases for it.
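Put together as a batch script, that suggestion might look like the sketch below. The script body and application name are placeholders, and the GPU count and CPU sizing assume the 8-GPU topology shown earlier:

```
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding

# enforce-binding restricts the allocation to GPUs whose gres.conf
# core bindings match the CPUs assigned to the job, i.e. both GPUs
# come from the same socket (0-3 or 4-7 in the topology above).
srun ./my_gpu_app   # placeholder application
```

Note that, per the discussion above, this guarantees same-socket GPUs but not necessarily a same-PCIe-switch (PIX) pair such as GPU0+GPU1.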