Ticket 12655 - Using AMD chiplets in heterogeneous cluster
Summary: Using AMD chiplets in heterogeneous cluster
Status: RESOLVED DUPLICATE of ticket 10679
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 21.08.2
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-10-12 13:10 MDT by Richard Lefebvre
Modified: 2021-11-01 01:35 MDT
CC List: 1 user

See Also:
Site: Calcul Quebec McGill
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Output of lscpu, lstopo-no-graphics, and version of hwloc (9.22 KB, text/plain)
2021-10-15 11:09 MDT, Richard Lefebvre

Description Richard Lefebvre 2021-10-12 13:10:54 MDT
Hi,

We are looking to optimize our use of the new AMD EPYC 7532 CPUs on our new cluster, and we would like to have affinity to the chiplets inside the EPYC CPU. We run non-homogeneous jobs on the cluster that can vary wildly, from 1 core to 1000s of cores. We would like jobs with small numbers of cores to have an affinity that keeps them on the same chiplet, to maximize performance.

Do you have any suggestions? A preferred configuration? This is with the latest version of Slurm (21.08.2).

Richard
Comment 4 Marcin Stolarek 2021-10-14 08:53:32 MDT
Richard,

  From Slurm's perspective, chiplets are not a visible entity, since the hardware topology is mapped to hwloc objects[1] by the operating system.
  Could you please elaborate on what you mean with reference to the results of `lstopo-no-graphics`? Please share the output of the command from a compute node of interest.
  What is your hwloc version?
cheers,
Marcin

[1] https://www.open-mpi.org/projects/hwloc/doc/v2.3.0/a00165.php
Comment 5 Richard Lefebvre 2021-10-15 11:09:24 MDT
Created attachment 21779 [details]
Output of lscpu, lstopo-no-graphics, and version of hwloc
Comment 6 Richard Lefebvre 2021-10-15 12:39:12 MDT
Attached is the output of the commands you asked for. The hwloc version is: hwloc-2.4.1-3.el8.x86_64

Since the new AMD CPUs with chiplets are going to become more popular, some support for chiplets will be requested in the future. The question can be rephrased like this:

Can we replicate the Slurm socket affinity in chiplets?
or
Can there be NUMA-node awareness in the scheduler-level affinity?
or
How does SchedMD suggest handling the new level of chunking that's introduced with these AMD chiplets?


Currently test jobs are being allocated across chiplets.

Does SchedMD have evidence whether the performance implications of CPU distribution within/across chiplets are significant or minimal?

We normally run a large mix of different jobs: lots of single-core jobs, 2 cores, 4 cores, and so on. Say I submit a 4-core job and it gets scheduled on a node that has lots of single-core jobs; we would like the scheduler to put the job on a single chiplet rather than have it run across chiplets, or even to select another node (if available) that has 4 free cores on the same chiplet. Note that the chiplets seem to follow the structure of the NUMA nodes (see the output of lscpu).
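For illustration, here is a minimal sketch of the packing criterion described above. The topology values are assumptions for a dual-socket node exposed as 8 NUMA domains of 8 cores each, with block-numbered core ids; the real layout should be taken from the attached lscpu output.

```python
# Sketch only: hypothetical topology with 8 NUMA domains of 8 cores each,
# cores numbered in blocks (0-7 in domain 0, 8-15 in domain 1, ...).
CORES_PER_DOMAIN = 8

def domain_of(core: int) -> int:
    """NUMA domain (chiplet group) that a physical core id belongs to."""
    return core // CORES_PER_DOMAIN

def is_packed(cores) -> bool:
    """True if every allocated core lands in a single NUMA domain."""
    return len({domain_of(c) for c in cores}) == 1

print(is_packed([8, 9, 10, 11]))  # all in domain 1 -> True
print(is_packed([6, 7, 8, 9]))    # spans domains 0 and 1 -> False
```

A scheduler with the desired behavior would prefer allocations for which `is_packed` holds, or pick another node where such an allocation is possible.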

Richard
Comment 7 Richard Lefebvre 2021-10-15 14:46:46 MDT
I have read bug 10679. If we use l3cache_as_socket, does the number of sockets need to be changed in the definition of the nodes to reflect that increase in sockets?

Richard
Comment 8 Marcin Stolarek 2021-10-18 02:01:49 MDT
 Richard,

>I have read bug 10679. If we use l3cache_as_socket

That's the place I was going to point to as a starting point. It's good that you're on hwloc 2, so you can use `l3cache_as_socket`, and you can also try the patch attached there - the patch introduces a similar option called "numa_node_as_socket", which makes the meaning of the binding depend on the platform configuration. It would be great if you could give the patch (attachment 21486 [details]) a try and share your feedback with us.

>in the definition of the nodes to reflect that increase if sockets?

Yep - it still has to be adjusted. This is part of bigger changes we're considering for the future to make the nodes' data structure more dynamic.
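For illustration, the adjustment might look like the following slurm.conf fragment. The node name, memory, and thread count are hypothetical, assuming a dual-socket EPYC 7532 node exposed as 8 NUMA domains; the real values must match the lscpu/lstopo output.

```
# Hypothetical slurm.conf fragment (values are assumptions, not a tested config):
# with the numa_node_as_socket patch, each of the 8 NUMA domains of a
# dual-socket EPYC 7532 node is presented to Slurm as a socket.
SlurmdParameters=numa_node_as_socket
NodeName=cnode[001-010] Sockets=8 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=256000
```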

cheers,
Marcin
Comment 9 Richard Lefebvre 2021-10-18 08:18:32 MDT
Are all the changes in a specific branch in Git?

Also, I realized that using l3cache_as_socket would make the system look like it has 32 sockets, which would be too granular, I think. With NUMA nodes it would be 8 sockets, so we will probably try that first.
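For reference, the arithmetic behind those two counts, assuming a dual-socket node with 16 L3 (CCX) domains and 4 NUMA nodes per socket (an NPS4 BIOS setting) - assumptions consistent with the numbers above:

```python
# Hypothetical topology counts for a dual-socket EPYC 7532 node (assumptions):
physical_sockets = 2
l3_domains_per_socket = 16   # CCXs, each with its own L3 slice
numa_nodes_per_socket = 4    # NPS4 BIOS setting

print(physical_sockets * l3_domains_per_socket)   # l3cache_as_socket -> 32
print(physical_sockets * numa_nodes_per_socket)   # numa_node_as_socket -> 8
```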

Richard
Comment 10 Richard Lefebvre 2021-10-18 12:50:26 MDT
While compiling the patch under 21.08.2, we get the following compile error:


xcpuinfo.c: In function 'slurmd_parameter_as_socket':
xcpuinfo.c:271:3: error: 'obj' undeclared (first use in this function)
   obj = hwloc_get_next_obj_by_type(topology, HWLOC_OBJ_NODE,
   ^~~
xcpuinfo.c:271:3: note: each undeclared identifier is reported only once for each function it appears in
xcpuinfo.c:271:36: error: 'topology' undeclared (first use in this function); did you mean 'openlog'?
   obj = hwloc_get_next_obj_by_type(topology, HWLOC_OBJ_NODE,
                                    ^~~~~~~~
                                    openlog
make[4]: *** [Makefile:614: xcpuinfo.lo] Error 1


Richard
Comment 11 Marcin Stolarek 2021-10-19 04:10:36 MDT
Richard,

Sorry for that. Please try attachment 21810 [details] where the issue should be fixed.

cheers,
Marcin
Comment 12 Richard Lefebvre 2021-10-19 07:54:49 MDT
Hi,

The new patch compiles, thank you. We will try it later today. Will this patch be part of future versions of 21.08.x?

Richard
Comment 13 Marcin Stolarek 2021-10-19 09:05:37 MDT
Our standard approach for new features is to include them only in major releases; however, because of the importance of both new architectures and hwloc 2 support, we agreed that we'll do our best to include at least basic support in 21.08.

We're now looking forward to feedback on the approach from the sites testing it.

cheers,
Marcin
Comment 14 Richard Lefebvre 2021-10-22 09:59:46 MDT
Question about the patch

In what order should the servers and clients be restarted with the patch? Does the DB server need to be patched too?

Richard
Comment 15 Marcin Stolarek 2021-10-25 03:19:52 MDT
The patch has an impact only on slurmd - there is no need to restart slurmctld or slurmdbd.
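In practice, a rollout under that constraint might look like the following command sketch (assuming systemd-managed daemons; adapt to your deployment):

```
# Install the patched build on the compute nodes, then, per node:
systemctl restart slurmd
# slurmctld and slurmdbd keep running unchanged
```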
Comment 16 Marcin Stolarek 2021-11-01 01:35:21 MDT
Richard,

I'll go ahead and mark this ticket as a duplicate. You'll be added to the CC list of the original ticket, so you'll get notifications if anything changes there.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 10679 ***