Description
peter.georg
2021-01-22 05:45:37 MST
Closing as dup question.

*** This ticket has been marked as a duplicate of ticket 11787 ***

Felip Moll,

Let me reopen this issue here, as it is indeed not a duplicate of bug 11787 and has also not been fixed. In bug 11787 it was asked whether hwloc 2.x imposes any issues running on a system with no NUMA domains within a socket. That is indeed working and does not require a fix. This bug report is about running on a system with NUMA domains within a socket, e.g. AMD Epyc with NPS > 1 or Intel Xeon (Phi) with SNC > 1.

In bug 11787 you correctly understood that I claim the following code to be wrong:

```
if (hwloc_get_type_depth(topology, HWLOC_OBJ_NODE) >
    hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET)) {
```

You only tested it on a machine with no NUMA domains within a socket and it returned (quote from bug 11787):

```
(gdb) p hwloc_get_type_depth(topology, HWLOC_OBJ_NODE)
$8 = -3
(gdb) p hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET)
$9 = 2
```

Ok, that's fine. On such a system we do not want the expression to be true. However, the issue is that it makes no difference on which system you run. E.g. running on a system with 1 socket and no NUMA domains within a socket (NPS=1 or SNC=1):

```
hwloc_get_type_depth(topology, HWLOC_OBJ_NODE): -3
hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET): 1
```

Running the same code on a system with 1 socket and 4 NUMA domains within this socket (NPS=4 or SNC=4):

```
hwloc_get_type_depth(topology, HWLOC_OBJ_NODE): -3
hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET): 1
```

This looks bad. The code is supposed to detect whether there are NUMA domains within a socket. In case there are, it should set objtype[SOCKET] = HWLOC_OBJ_NODE, but only in case Ignore_NUMA is not set. It obviously fails to do so, i.e. your code in xcpuinfo *is* doing an incorrect check. I actually think that the expression never returns true on any system topology using hwloc 2.x, i.e. everything within the if clause is never going to be called and could just as well be removed.
So, as written in the original bug report, the code "always works as if "Ignore_NUMA" is set for "SchedulerParameters" in slurm.conf."

The original bug report also contains a link to the hwloc documentation. Let me quote what is written there about using the NUMA depth:

"The NUMA depth should not be compared with others. An unmodified code that still compares NUMA and Package depths (to find out whether Packages contain NUMA or the contrary) would now always assume Packages contain NUMA (because the NUMA depth is negative)."

So even the hwloc documentation clearly states that your code needs to be updated to work with hwloc >= 2.2.

> Ok, that's fine. On such a system we do not want the expression to be true.
> However the issue is that it makes no difference on which system you run.

I get it. Thanks for your report.

> This looks bad. The code is supposed to detect whether there are NUMA
> domains within a socket.

Do you have an example of such topology, i.e. an lstopo output?

> The original bug report also contains a link to hwloc documentation. Let me
> quote what is written there about using the NUMA depth:
>
> "The NUMA depth should not be compared with others. An unmodified code that
> still compares NUMA and Package depths (to find out whether Packages contain
> NUMA or the contrary) would now always assume Packages contain NUMA (because
> the NUMA depth is negative)."

My fault to have ignored this chunk, you're right. We clearly need to take another round on this. Thanks for your input.

(In reply to Felip Moll from comment #3)
> Do you have an example of such topology, i.e. an lstopo output?

Sure.
Output of lstopo running on an AMD EPYC 7502P:

```
Machine (252GB total)
  Package L#0
    L3 L#0 (16MB)
      NUMANode L#0 (P#0 31GB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#32)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#33)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#34)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#35)
      HostBridge
        PCIBridge
          PCIBridge
            PCI c4:00.0 (VGA)
          PCIBridge
            PCI c5:00.0 (SATA)
          PCIBridge
            PCI c6:00.0 (Ethernet)
              Net "eth0"
            PCI c6:00.1 (Ethernet)
              Net "eth1"
    L3 L#1 (16MB)
      NUMANode L#1 (P#1 31GB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#36)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#37)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#38)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#39)
    L3 L#2 (16MB)
      NUMANode L#2 (P#2 31GB)
      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#40)
      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#41)
      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#42)
      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#43)
      HostBridge
        PCIBridge
          PCI 81:00.0 (NVMExp)
            Block(Disk) "nvme0n1"
    L3 L#3 (16MB)
      NUMANode L#3 (P#3 31GB)
      L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#44)
      L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#45)
      L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#46)
      L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#47)
    L3 L#4 (16MB)
      NUMANode L#4 (P#4 31GB)
      L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#48)
      L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#49)
      L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#50)
      L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#51)
      HostBridge
        PCIBridge
          PCI 43:00.0 (SATA)
        PCIBridge
          PCI 44:00.0 (SATA)
    L3 L#5 (16MB)
      NUMANode L#5 (P#5 31GB)
      L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#52)
      L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#53)
      L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#54)
      L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#55)
    L3 L#6 (16MB)
      NUMANode L#6 (P#6 31GB)
      L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#56)
      L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#57)
      L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#58)
      L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#59)
      HostBridge
        PCIBridge
          PCI 01:00.0 (InfiniBand)
            Net "ib0"
            Net "ib1"
            OpenFabrics "mlx5_0"
    L3 L#7 (16MB)
      NUMANode L#7 (P#7 31GB)
      L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#60)
      L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#61)
      L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#62)
      L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#63)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
```

Please let me know if you need further information.

This is biting us on our Bones cluster where we are testing 21.08.0rc1.
We need the 4 NUMA nodes for our GPU scheduling to work. For various reasons we were previously building against SLES15 SP1's hwloc (version 1.x), but now we are building against SP2's 2.1.

Matt,

> [...] We need the 4 NUMA nodes for our GPU scheduling to work. [...]

Could you please attach lstopo output from the node configured with 4 NUMA nodes? (Am I reading you correctly that the reason is the use of options like --gpu-bind=closest / --gpu-bind=single?)

Peter,

Slurm 21.08 comes with an alternative option, SlurmdParameters=l3cache_as_socket, that is supported only on systems with hwloc 2.0+. It allows using L3Cache as the socket level in the discussed context. Could you please give it a try and let us know if that works for you? (Slurm 21.08rc1 is available for download.)

> E.g. running on a system with 1 Socket and no NUMA domains within a socket (NPS=1 or SNC=1):

This case is probably not an issue, since just using a hwloc package as socket should work as expected, but... do you see a need for a switch like "SlurmdParameters=machine_as_socket" to use one domain even if there are more sockets visible over hwloc?

> I actually think that the expression never returns true on any system topology using hwloc 2.x. [...]

Correct. However, the subsequent code that makes use of the value changed in the conditional is only correct for hwloc prior to the changes in cf70b9c56c9cd00 [1].

I'm not sure how to approach that yet and need some help from both of you, since I don't have access to such a system at the moment. We have to make sure that, in changing Slurm code here, we stay consistent with the approach change on the hwloc side, which as far as I understand (long story short) is to take the NUMA node object outside of the main (CPUs) tree. This has some important consequences for the way the object can be used.
cheers,
Marcin

[1] https://github.com/open-mpi/hwloc/commit/cf70b9c56c9cd00df891763936ac1d2d9385fb7e

Created attachment 20884 [details]
lstopo-no-graphics for Spock

> Could you please attach lstopo from the node configured with 4 NUMA nodes? (Am I reading you correctly that the reason is the use of options like --gpu-bind=closest / --gpu-bind=single?)

See attached. Yes, as you can see there is one GPU per NUMA node and we use binding based on that (closest and single).

Matt,
Thanks for sharing.
Peter,
>It allows using L3Cache as a socket level in the discussed context, could you please give it a try and let us know if that works for you? (Slurm 21.08rc1 is available for download).
Actually using the current state of master branch is probably a better idea, since we already have many fixes done after rc1.
cheers,
Marcin
(In reply to Marcin Stolarek from comment #12)

Marcin,

> Slurm 21.08 comes with an alternative option of
> SlurmdParameters=l3cache_as_socket that is supported only on systems with
> hwloc2.0+. It allows using L3Cache as a socket level in the discussed
> context, could you please give it a try and let us know if that works for
> you? (Slurm 21.08rc1 is available for download).

I haven't been able to test it yet, but I have had a look at the code. It looks good and should fix the issue for me, as I have all systems configured with NUMA == L3Cache.

> > E.g. running on a system with 1 Socket and no NUMA domains within a socket (NPS=1 or SNC=1):
> This case is probably not an issue since just using a hwloc package as
> socket should work as expected, but.. do you see a need for a switch like
> "SlurmdParameters=machine_as_socket" to use one domain even if there are
> more sockets visible over hwloc?

This is indeed not an issue; the example was merely given to show that the existing code always returns the same values independent of the hardware.

> > I actually think that the expression never returns true on any system topology using hwloc 2.x. [...]
> Correct. However, the subsequent code that makes use of the value changed in
> the conditional is only correct for hwloc prior to changes in
> cf70b9c56c9cd00 [1].
>
> I'm not sure how to approach that yet and need some help from both of you
> since I don't have access to such a system at the moment.
> We have to make sure that changing Slurm code here we're doing this intact
> with the approach change on hwloc side, which as far as I understand (long
> story short) is to take the Numa Node object outside of the main (CPUs)
> tree. This has some important consequences on the way the object can be used.

That's how I understand it as well. Following [1], I created a small test application to verify correct SOCKET, L3CACHE and NUMA detection. On my local systems the output looks good.
I attached the source code (hwloc_numa.c); maybe Matt can test it as well?

In addition, I attached a patch (numa_as_socket.patch) showing how one might implement this in Slurm by introducing an option "numa_as_socket" similar to the new "l3cache_as_socket". The patch is incomplete and completely untested.

Another possibility is to try fixing the "NUMA > SOCKET" check. An attempt at this has been attached as well ("poc.patch"). The idea is to get the parent of the first object of type NODE and compare its depth to the SOCKET depth. The code looks ugly, but you get the idea.

[1] https://www.open-mpi.org/projects/hwloc/doc/v2.5.0/a00374.php

Created attachment 20926 [details]
hwloc_numa_check.c
Created attachment 20927 [details]
xcpuinfo.patch
Created attachment 20928 [details]
poc.patch
Matt, Peter,

I have a patch that's under our QA. Would you like to give it a try on the test systems you have? Your feedback will be much appreciated. Keep in mind it's not yet merged.

The solution follows the approach from bug 12239 and adds a SlurmdParameters=numa_node_as_socket flag to slurm.conf.

cheers,
Marcin

(In reply to Marcin Stolarek from comment #24)

Marcin,

I can give it a try. Is the patch already available somewhere?

Best,
Peter

Comment on attachment 21312 [details]
v3
Switching the patch visibility to public now.
Please keep in mind that it hasn't passed SchedMD QA yet and is not yet scheduled for release in its current state. However, feedback is much appreciated.
cheers,
Marcin
Peter, Matt,

We're holding off on the review while waiting for your feedback. If for any reason you can't or just don't want to test it, please let me know.

cheers,
Marcin

(In reply to Marcin Stolarek from comment #28)

Marcin,

Sorry for the late reply; I have been busy working on other things lately. I've only had time to do some basic testing with the patch v3 you provided. These tests all look good. Reviewing the code, it looks good as well. I think the proposed patch v3 is a very good solution for this issue.

Minor remark: in line 222 of the patch v3 there is a typo (numbe -> number).

In addition, the new flags numa_node_as_socket and l3cache_as_socket give new flexibility not possible previously.

Created attachment 21486 [details]
v4
Peter,
Thanks for the feedback and pointing out the docs typo.
cheers,
Marcin
Created attachment 21810 [details]
v5
Fix
*** Ticket 12655 has been marked as a duplicate of this ticket. ***

Matt,

Did you have a chance to verify the patch?

cheers,
Marcin

Matt,
Did you have any chance to give v5 (attachment 21810 [details]) a try?
cheers,
Marcin
(In reply to Marcin Stolarek from comment #35)
> Matt,
>
> Did you have any chance to give v5(attachment 21810 [details]) a try?
>
> cheers,
> Marcin

Sorry for the delay - 100% of my time has been dedicated to deploying a new system that does not need this patch. It's on my list to test, and I will try to get to it soon. Thanks.

Matt,

Any update from your side?

cheers,
Marcin

Matt,

Just a kind reminder - we're waiting for your test results, since the only way we can test it ourselves is a VM with an artificial node topology.

cheers,
Marcin

Matt,

Should we reduce the ticket severity to 4?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #42)
> Matt,
>
> Should we reduce ticket severity to 4?
>
> cheers,
> Marcin

Sorry, we've been slammed with other priorities. Yes, let's move it to 4 and I will try to schedule some time to test this on our Bones system.

Matt - just a monthly follow-up/ping.

I built 21.08.7 with v5 of the patch. I don't think it worked:

```
[2022-05-03T18:56:45.304] Node reconfigured socket/core boundaries SocketsPerBoard=4:1(hw) CoresPerSocket=16:64(hw)
[2022-05-03T18:56:45.306] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2022-05-03T18:56:45.306] CPU frequency setting not configured for this node
[2022-05-03T18:56:45.309] slurmd version 21.08.7 started
[2022-05-03T18:56:45.310] slurmd started on Tue, 03 May 2022 18:56:45 -0400
[2022-05-03T18:56:45.310] CPUs=128 Boards=1 Sockets=1 Cores=64 Threads=2 Memory=257263 TmpDisk=128631 Uptime=6581365 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
```

(In reply to Matt Ezell from comment #45)
> I built 21.08.7 with v5 of the patch. I don't think it worked:

I overlooked the requirement to set SlurmdParameters=numa_node_as_socket - testing again.

We have confirmed that this is working. I think 13931 confirmed it as well. Is it too late to get it in for 22.05?

I'm closing the bug as fixed now.
The changes discussed were merged to our public repository - commits 716be51fc0..e99e539449.

cheers,
Marcin
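For reference, enabling the behavior discussed in this ticket boils down to a one-line slurm.conf setting (option names as they appear in this thread; l3cache_as_socket from bug 12239 is the alternative for systems where NUMA == L3Cache):

```
# slurm.conf -- treat each NUMA node as a socket for task/affinity purposes
SlurmdParameters=numa_node_as_socket
```

As Matt's test above showed, building a patched slurmd is not enough on its own: without this parameter set, slurmd still reports Sockets=1 on an NPS=4 node.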