Ticket 10679

Summary: Using hwloc 2.x instead of 1.x requires changes
Product: Slurm Reporter: peter.georg
Component: Other Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: cinek, csc-slurm-tickets, ezellma, felip.moll, kaizaad, richard.lefebvre, stornetn, tim
Version: 21.08.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=12655
https://bugs.schedmd.com/show_bug.cgi?id=12903
https://bugs.schedmd.com/show_bug.cgi?id=13931
https://bugs.schedmd.com/show_bug.cgi?id=18047
Site: ORNL-OLCF Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 22.05.0 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: lstopo-no-graphics for Spock
hwloc_numa_check.c
xcpuinfo.patch
poc.patch
v3
v4
v5

Description peter.georg 2021-01-22 05:45:37 MST
This is sort of a bug report, but also a heads-up to notify you in advance of a possible future issue.
I have only tested 20.02.6, but the corresponding lines of code haven't been touched for quite a while, so I assume it affects all versions (including master).

RHEL 8.4 will very likely rebase hwloc from 1.11 to 2.2 as a requirement to rebase MPICH to the 3.3 upstream series. The corresponding bug report (and the bugs it depends on) can be found here: https://bugzilla.redhat.com/show_bug.cgi?id=1732982

Hence it is very likely that there'll be Slurm users using hwloc 2.x soon.


Compiling Slurm against hwloc 2.x works fine; however, there is one issue:
In hwloc 2.x the organization of NUMA nodes has been changed, see https://www.open-mpi.org/projects/hwloc/doc/v2.2.0/a00339.php

Hence, when slurmd is compiled against hwloc 2.x, the code block found here
https://github.com/SchedMD/slurm/blob/f89648c05dd27e49906bb81d6f5dd19bbb2b89b1/src/slurmd/common/xcpuinfo.c#L309
no longer works as expected. It always behaves as if "Ignore_NUMA" were set for "SchedulerParameters" in slurm.conf.

This is probably not what users want, especially on modern CPUs (both AMD and Intel) that can be configured to expose multiple NUMA nodes within a socket to improve total memory bandwidth.

I have not had time yet to work on a patch myself, but wanted to notify you early on anyway.
Comment 1 Felip Moll 2021-07-14 10:18:25 MDT
Closing as a duplicate.

*** This ticket has been marked as a duplicate of ticket 11787 ***
Comment 2 peter.georg 2021-07-14 17:13:51 MDT
Felip Moll,

Let me reopen this issue here as it is indeed not a duplicate of bug 11787 and has also not been fixed.

In bug 11787 it has been asked whether hwloc 2.x imposes any issues running on a system with no NUMA domains within a socket. This is indeed working and does not require a fix.

This bug report here is about running on a system with NUMA domains within a socket. E.g. AMD Epyc with NPS > 1 or Intel Xeon (Phi) with SNC > 1.

In bug 11787 you correctly understood that I claim the following code to be wrong:

	if (hwloc_get_type_depth(topology, HWLOC_OBJ_NODE) >
	    hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET)) {


You only tested it on a machine with no NUMA domains within a socket and it returned (quote from bug 11787):
(gdb) p	hwloc_get_type_depth(topology, HWLOC_OBJ_NODE)
$8 = -3
(gdb) p	hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET)
$9 = 2



Ok, that's fine. On such a system we do not want the expression to be true.
However, the issue is that the result is the same regardless of which system you run on.

E.g. running on a system with 1 Socket and no NUMA domains within a socket (NPS=1 or SNC=1):
hwloc_get_type_depth(topology, HWLOC_OBJ_NODE): -3
hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET): 1

Running the same code on a system with 1 Socket and 4 NUMA domains within this socket (NPS=4 or SNC=4):
hwloc_get_type_depth(topology, HWLOC_OBJ_NODE): -3
hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET): 1


This looks bad. The code is supposed to detect whether there are NUMA domains within a socket. If there are, it should set objtype[SOCKET] = HWLOC_OBJ_NODE, but only if Ignore_NUMA is not set.

It obviously fails to do so, i.e. your code in xcpuinfo *is* doing an incorrect check. I actually think that the expression never returns true on any system topology using hwloc 2.x. That is, everything within the if clause is never going to be called and could just as well be removed.

So as written in the original bug report, the code "always works as if "Ignore_NUMA" is set for "SchedulerParameters" in slurm.conf."

The original bug report also contains a link to hwloc documentation. Let me quote what is written there about using the NUMA depth:

"The NUMA depth should not be compared with others. An unmodified code that still compares NUMA and Package depths (to find out whether Packages contain NUMA or the contrary) would now always assume Packages contain NUMA (because the NUMA depth is negative)."

So even the hwloc documentation clearly states that your code needs to be updated to work with hwloc >=2.2.
Comment 3 Felip Moll 2021-07-15 03:14:46 MDT
> Ok, that's fine. On such a system we do not want the expression to be true.
> However, the issue is that the result is the same regardless of which system you run on.

I get it. Thanks for your report.

> 
> This looks bad. The code is supposed to detect whether there are NUMA
> domains within a socket. 

Do you have an example of such topology, i.e. an lstopo output?


> The original bug report also contains a link to hwloc documentation. Let me
> quote what is written there about using the NUMA depth:
> 
> "The NUMA depth should not be compared with others. An unmodified code that
> still compares NUMA and Package depths (to find out whether Packages contain
> NUMA or the contrary) would now always assume Packages contain NUMA (because
> the NUMA depth is negative)."

My fault for having ignored this chunk; you're right.


We clearly need to take another look at this.



Thanks for your input.
Comment 5 peter.georg 2021-07-15 04:26:48 MDT
(In reply to Felip Moll from comment #3)
> Do you have an example of such topology, i.e. an lstopo output?

Sure. Output of lstopo running on an AMD EPYC 7502P:

```
Machine (252GB total)
  Package L#0
    L3 L#0 (16MB)
      NUMANode L#0 (P#0 31GB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#32)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#33)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#34)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#35)
      HostBridge
        PCIBridge
          PCIBridge
            PCI c4:00.0 (VGA)
        PCIBridge
          PCI c5:00.0 (SATA)
        PCIBridge
          PCI c6:00.0 (Ethernet)
            Net "eth0"
          PCI c6:00.1 (Ethernet)
            Net "eth1"
    L3 L#1 (16MB)
      NUMANode L#1 (P#1 31GB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#36)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#37)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#38)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#39)
    L3 L#2 (16MB)
      NUMANode L#2 (P#2 31GB)
      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#40)
      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#41)
      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#42)
      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#43)
      HostBridge
        PCIBridge
          PCI 81:00.0 (NVMExp)
            Block(Disk) "nvme0n1"
    L3 L#3 (16MB)
      NUMANode L#3 (P#3 31GB)
      L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#44)
      L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#45)
      L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#46)
      L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#47)
    L3 L#4 (16MB)
      NUMANode L#4 (P#4 31GB)
      L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#48)
      L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#49)
      L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#50)
      L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#51)
      HostBridge
        PCIBridge
          PCI 43:00.0 (SATA)
        PCIBridge
          PCI 44:00.0 (SATA)
    L3 L#5 (16MB)
      NUMANode L#5 (P#5 31GB)
      L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#52)
      L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#53)
      L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#54)
      L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#55)
    L3 L#6 (16MB)
      NUMANode L#6 (P#6 31GB)
      L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#56)
      L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#57)
      L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#58)
      L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#59)
      HostBridge
        PCIBridge
          PCI 01:00.0 (InfiniBand)
            Net "ib0"
            Net "ib1"
            OpenFabrics "mlx5_0"
    L3 L#7 (16MB)
      NUMANode L#7 (P#7 31GB)
      L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#60)
      L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#61)
      L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#62)
      L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#63)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
```

Please let me know if you need further information.
Comment 6 Matt Ezell 2021-08-03 18:21:04 MDT
This is biting us on our Bones cluster where we are testing 21.08.0rc1. We need the 4 NUMA nodes for our GPU scheduling to work. For various reasons we were previously building against SLES15sp1's hwloc (version 1.x), but now we are building against SP2's 2.1.
Comment 12 Marcin Stolarek 2021-08-18 06:17:01 MDT
Matt,

>[...]We need the 4 NUMA nodes for our GPU scheduling to work.[...]
Could you please attach lstopo from the node configured with 4 NUMA nodes? (Am I reading you correctly that the reason is the use of options like --gpu-bind=closest / --gpu-bind=single?)

Peter,
Slurm 21.08 comes with an alternative option of SlurmdParameters=l3cache_as_socket that is supported only on systems with hwloc 2.0+. It allows using L3Cache as a socket level in the discussed context, could you please give it a try and let us know if that works for you? (Slurm 21.08rc1 is available for download).

>E.g. running on a system with 1 Socket and no NUMA domains within a socket (NPS=1 or SNC=1):
This case is probably not an issue since just using a hwloc package as socket should work as expected, but.. do you see a need for a switch like "SlurmdParameters=machine_as_socket" to use one domain even if there are more sockets visible over hwloc?

>I actually think that the expression never returns true on any system topology using hwloc 2.x.[...]
Correct. However, the subsequent code that makes use of the value changed in the conditional is only correct for hwloc prior to changes in cf70b9c56c9cd00[1].

I'm not sure how to approach that yet and need some help from both of you since I don't have access to such a system at the moment.
We have to make sure that in changing the Slurm code here we stay consistent with the change of approach on the hwloc side, which as far as I understand (long story short) is to take the NUMA node object out of the main (CPU) tree. This has some important consequences for the way the object can be used.

cheers,
Marcin
[1]https://github.com/open-mpi/hwloc/commit/cf70b9c56c9cd00df891763936ac1d2d9385fb7e
Comment 13 Matt Ezell 2021-08-18 08:03:36 MDT
Created attachment 20884 [details]
lstopo-no-graphics for Spock

> Could you please attach lstopo from the node configured with 4 NUMA nodes? (Am I reading you correctly that the reason is the use of options like --gpu-bind=closest / --gpu-bind=single?)

See attached. Yes, as you see there is one GPU per NUMA and we use binding based on that (closest and single).
Comment 15 Marcin Stolarek 2021-08-18 08:50:48 MDT
Matt,

Thanks for sharing.

Peter,
>It allows using L3Cache as a socket level in the discussed context, could you please give it a try and let us know if that works for you? (Slurm 21.08rc1 is available for download).

Actually using the current state of master branch is probably a better idea, since we already have many fixes done after rc1.

cheers,
Marcin
Comment 16 peter.georg 2021-08-19 10:03:30 MDT
(In reply to Marcin Stolarek from comment #12)
Marcin,

> Slurm 21.08 comes with an alternative option of
> SlurmdParameters=l3cache_as_socket that is supported only on systems with
> hwloc2.0+. It allows using L3Cache as a socket level in the discussed
> context, could you please give it a try and let us know if that works for
> you? (Slurm 21.08rc1 is available for download).

I haven't been able to test it yet, but had a look at the code. Looks good and should fix the issue for me as I have all systems configured to NUMA == L3Cache.
 
> >E.g. running on a system with 1 Socket and no NUMA domains within a socket (NPS=1 or SNC=1):
> This case is probably not an issue since just using a hwloc package as
> socket should work as expected, but.. do you see a need for a switch like
> "SlurmdParameters=machine_as_socket" to use one domain even if there are
> more sockets visible over hwloc?

This is indeed no issue, the example was merely given to show that the existing code always returns the same values independent of the hardware.

> >I actually think that the expression never returns true on any system topology using hwloc 2.x.[...]
> Correct. However, the subsequent code that makes use of the value changed in
> the conditional is only correct for hwloc prior to changes in
> cf70b9c56c9cd00[1].
> 
> I'm not sure how to approach that yet and need some help from both of you
> since I don't have access to such a system at the moment.
> We have to make sure that changing Slurm code here we're doing this intact
> with the approach change on hwloc side, which as far as I understand (long
> story short) is to take the Numa Node object outside of the main (CPUs)
> tree. This has some important consequences on the way the object can be used.

That's how I understand it as well. Following [1] I created a small test application to verify correct SOCKET, L3CACHE and NUMA detection. On my local systems the output looks good. I attached the source code (hwloc_numa_check.c); maybe Matt can test it as well?
In addition I attached a patch (xcpuinfo.patch) showing how one might implement this in Slurm by introducing an option "numa_as_socket" similar to the new "l3cache_as_socket". The patch is incomplete and completely untested.

Another possibility is to try fixing the "NUMA > SOCKET" check. An attempt to fix this has been attached as well ("poc.patch"). The idea is to get the parent of the first obj of type NODE and compare its depth to SOCKET. The code looks ugly, but you get the idea.

[1] https://www.open-mpi.org/projects/hwloc/doc/v2.5.0/a00374.php
Comment 17 peter.georg 2021-08-19 10:04:10 MDT
Created attachment 20926 [details]
hwloc_numa_check.c
Comment 18 peter.georg 2021-08-19 10:04:55 MDT
Created attachment 20927 [details]
xcpuinfo.patch
Comment 19 peter.georg 2021-08-19 10:05:15 MDT
Created attachment 20928 [details]
poc.patch
Comment 24 Marcin Stolarek 2021-09-16 06:33:18 MDT
Matt,
Peter,

I have a patch that's under our QA. Would you like to give it a try on the test systems you have? Your feedback will be much appreciated. Keep in mind it's not yet merged.

The solution follows the approach from bug 12239 and adds a SlurmdParameters=numa_node_as_socket flag to slurm.conf.

cheers,
Marcin
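For reference, enabling the new flag would be a one-line slurm.conf change. This is just a sketch based on the parameter named in this ticket:

```
# slurm.conf (sketch): treat each NUMA node as a socket (hwloc 2.x builds)
SlurmdParameters=numa_node_as_socket
```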
Comment 25 peter.georg 2021-09-23 03:00:35 MDT
(In reply to Marcin Stolarek from comment #24)
Marcin,

I can give it a try. Is the patch already available somewhere?

Best,

Peter
Comment 26 Marcin Stolarek 2021-09-23 03:40:24 MDT
Comment on attachment 21312 [details]
v3

Switching the patch visibility to public now. 

Please keep in mind that it hasn't passed SchedMD QA yet and, in its current state, is not yet scheduled for release. However, feedback is much appreciated.

cheers,
Marcin
Comment 28 Marcin Stolarek 2021-09-28 05:40:42 MDT
Peter,
Matt,

We're holding off on the review while waiting for your feedback. If for any reason you can't or don't want to test it, please let me know.

cheers,
Marcin
Comment 29 peter.georg 2021-09-28 05:54:23 MDT
(In reply to Marcin Stolarek from comment #28)
Marcin,

Sorry for the late reply. I have been busy with other work lately. I've only had time to do some basic testing with the v3 patch you provided; these tests all look good. The code looks good on review as well. I think the proposed v3 patch is a very good solution for this issue.
Minor remark: in line 222 of the v3 patch there is a typo (numbe -> number).

In addition, the new flags numa_node_as_socket and l3cache_as_socket provide new flexibility that was not possible previously.
Comment 30 Marcin Stolarek 2021-09-28 06:00:47 MDT
Created attachment 21486 [details]
v4

Peter,

Thanks for the feedback and pointing out the docs typo.

cheers,
Marcin
Comment 32 Marcin Stolarek 2021-10-19 04:08:06 MDT
Created attachment 21810 [details]
v5

Fix
Comment 33 Marcin Stolarek 2021-11-01 01:35:21 MDT
*** Ticket 12655 has been marked as a duplicate of this ticket. ***
Comment 34 Marcin Stolarek 2021-11-01 01:36:42 MDT
Matt,

Did you have a chance to verify the patch?

cheers,
Marcin
Comment 35 Marcin Stolarek 2021-11-10 01:35:11 MST
Matt,

Did you have a chance to give v5 (attachment 21810 [details]) a try?

cheers,
Marcin
Comment 36 Matt Ezell 2021-11-10 07:28:11 MST
(In reply to Marcin Stolarek from comment #35)
> Matt,
> 
> Did you have any chance to give v5(attachment 21810 [details]) a try?
> 
> cheers,
> Marcin

Sorry for the delay - 100% of my time has been dedicated to deploying a new system that does not need this patch. It's on my list to test, and I will try to get to it soon.  Thanks.
Comment 38 Marcin Stolarek 2022-01-18 09:15:16 MST
Matt,

Any update from your side?

cheers,
Marcin
Comment 41 Marcin Stolarek 2022-02-28 02:18:48 MST
Matt,

Just a kind reminder - we're waiting for your test results, since the only way we can test this ourselves is a VM with an artificial node topology.

cheers,
Marcin
Comment 42 Marcin Stolarek 2022-03-21 02:43:10 MDT
Matt,

Should we reduce ticket severity to 4?

cheers,
Marcin
Comment 43 Matt Ezell 2022-03-21 11:02:11 MDT
(In reply to Marcin Stolarek from comment #42)
> Matt,
> 
> Should we reduce ticket severity to 4?
> 
> cheers,
> Marcin

Sorry, we've been slammed with other priorities. Yes, let's move it to 4, and I will try to schedule some time to test this on our Bones system.
Comment 44 Marcin Stolarek 2022-04-20 02:29:37 MDT
Matt - just a monthly follow-up/ping.
Comment 45 Matt Ezell 2022-05-03 17:01:16 MDT
I built 21.08.7 with v5 of the patch. I don't think it worked:

[2022-05-03T18:56:45.304] Node reconfigured socket/core boundaries SocketsPerBoard=4:1(hw) CoresPerSocket=16:64(hw)
[2022-05-03T18:56:45.306] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2022-05-03T18:56:45.306] CPU frequency setting not configured for this node
[2022-05-03T18:56:45.309] slurmd version 21.08.7 started
[2022-05-03T18:56:45.310] slurmd started on Tue, 03 May 2022 18:56:45 -0400
[2022-05-03T18:56:45.310] CPUs=128 Boards=1 Sockets=1 Cores=64 Threads=2 Memory=257263 TmpDisk=128631 Uptime=6581365 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Comment 46 Matt Ezell 2022-05-03 17:43:16 MDT
(In reply to Matt Ezell from comment #45)
> I built 21.08.7 with v5 of the patch. I don't think it worked:

I overlooked the requirement to set SlurmdParameters=numa_node_as_socket - testing again
Comment 47 Matt Ezell 2022-05-06 22:05:13 MDT
We have confirmed that this is working. I think ticket 13931 confirmed it as well. Is it too late to get this in for 22.05?
Comment 55 Marcin Stolarek 2022-05-17 06:37:00 MDT
I'm closing the bug as fixed now. The changes discussed were merged into our public repository as commits 716be51fc0..e99e539449.

cheers,
Marcin