This is more of a question than a specific bug report.

Hi, recently Red Hat released CentOS 8.4, which includes a breaking change in the hwloc package: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/8.4_release_notes/index (related to https://bugzilla.redhat.com/show_bug.cgi?id=1841354). Can you please provide guidance on the impact, if any, on Slurm installations on CentOS 8?

I will also note that I saw a similar question in the more narrowly scoped bug report https://bugs.schedmd.com/show_bug.cgi?id=10679; however, I did not see any follow-up discussion there, or anywhere else, from SchedMD.
(In reply to Nathan Stornetta from comment #0)
> Hi, recently RedHat released a new CentOS 8.4 which includes a breaking
> change on hwloc package:
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/8.4_release_notes/index
> Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1841354
> Can you please provide guidance on the impact, if any, on Slurm
> installations on CentOS 8?

Thanks for your question. Let me investigate the impact a bit and I will come back to you. I am off next week, but will do my best to respond as soon as possible.
Hi Nathan,

First of all, I want to apologize for not having given you a response sooner. I've been studying the matter, and indeed the hierarchy has changed in hwloc 2.2. My machine is on this version and everything works, but I only have 1 NUMA node, so I had to set up a testing environment with several NUMA nodes to see the differences better. My next step is to test Slurm and figure out what needs to be changed.

As you may have seen, the difference is where the NUMANode is placed, so it seems we'll need to change the code a bit to adapt to this. Task placement using task/cgroup may not be accurate with hwloc 2.2 and multiple NUMA nodes, so the suggestion is to use task/affinity for placement, which is what we really recommend: stack task/cgroup,task/affinity and let the first one enforce constraints while the second one handles affinity.

Just for the record, this is on CentOS 8 (hwloc 2.2):

[root@moll11 ~]# lstopo-no-graphics
Machine (7765MB total)
  Group0 L#0
    NUMANode L#0 (P#0 1758MB)
    Package L#0 + L3 L#0 (16MB) + L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
    Package L#1 + L3 L#1 (16MB) + L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
    Package L#2 + L3 L#2 (16MB) + L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
    Package L#3 + L3 L#3 (16MB) + L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
  Group0 L#1
    NUMANode L#1 (P#1 2015MB)
    Package L#4 + L3 L#4 (16MB) + L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
    Package L#5 + L3 L#5 (16MB) + L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
    Package L#6 + L3 L#6 (16MB) + L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
    Package L#7 + L3 L#7 (16MB) + L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
  Group0 L#2
    NUMANode L#2 (P#2 1978MB)
    Package L#8 + L3 L#8 (16MB) + L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
    Package L#9 + L3 L#9 (16MB) + L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
    Package L#10 + L3 L#10 (16MB) + L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
    Package L#11 + L3 L#11 (16MB) + L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
  Group0 L#3
    NUMANode L#3 (P#3 2015MB)
    Package L#12 + L3 L#12 (16MB) + L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
    Package L#13 + L3 L#13 (16MB) + L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
    Package L#14 + L3 L#14 (16MB) + L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
    Package L#15 + L3 L#15 (16MB) + L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
  HostBridge
    PCI 00:01.0 (VGA)
    PCI 00:03.0 (Ethernet)
      Net "enp0s3"
    PCI 00:04.0 (Ethernet)
      Net "enp0s4"
    PCI 00:08.0 (SCSI)
      Block "vda"
    PCI 00:1f.2 (SATA)
  Misc(MemoryModule)

This is the same config on CentOS 7 (hwloc 1.11):

[root@moll1 lipi]# lstopo-no-graphics
Machine (8191MB total)
  NUMANode L#0 (P#0 2047MB)
    Package L#0 + L3 L#0 (16MB) + L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
    Package L#1 + L3 L#1 (16MB) + L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
    Package L#2 + L3 L#2 (16MB) + L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
    Package L#3 + L3 L#3 (16MB) + L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
  NUMANode L#1 (P#1 2048MB)
    Package L#4 + L3 L#4 (16MB) + L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
    Package L#5 + L3 L#5 (16MB) + L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
    Package L#6 + L3 L#6 (16MB) + L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
    Package L#7 + L3 L#7 (16MB) + L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
  NUMANode L#2 (P#2 2048MB)
    Package L#8 + L3 L#8 (16MB) + L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
    Package L#9 + L3 L#9 (16MB) + L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
    Package L#10 + L3 L#10 (16MB) + L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
    Package L#11 + L3 L#11 (16MB) + L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
  NUMANode L#3 (P#3 2048MB)
    Package L#12 + L3 L#12 (16MB) + L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
    Package L#13 + L3 L#13 (16MB) + L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
    Package L#14 + L3 L#14 (16MB) + L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
    Package L#15 + L3 L#15 (16MB) + L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
  Misc(MemoryModule)
  HostBridge L#0
    PCI 1b36:0100
      GPU L#0 "card0"
      GPU L#1 "controlD64"
    2 x { PCI 1af4:1000 }
    PCI 1af4:1001
    PCI 8086:2922

I will keep you informed.
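For reference, the stacked task-plugin setup recommended above would look like the following in the node configuration files. This is a minimal sketch only; the surrounding settings are assumed from a typical setup, and the ordering shown (task/affinity listed first) is the one used by the reporter later in this thread:

```
# slurm.conf (fragment, hypothetical example):
# task/cgroup enforces the cgroup constraints, while task/affinity
# performs the actual CPU binding, so task placement does not depend
# on where hwloc puts the NUMANode in its object tree.
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf (fragment, hypothetical example):
ConstrainCores=yes
ConstrainRAMSpace=yes
```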
Hi Felip,

we currently set `TaskPlugin=task/affinity,task/cgroup`. Are you saying that with this setting Slurm already works properly with hwloc 2.2?
(In reply to Francesco De Martino from comment #5)
> we currently do set `TaskPlugin=task/affinity,task/cgroup`. Are you saying
> that with this setting Slurm already works properly with hwloc 2.2?

Probably yes. Can you share your current slurm.conf and cgroup.conf?
(In reply to Felip Moll from comment #6)
> Probably yes, can you share your current slurm.conf and cgroup.conf?

slurm.conf: https://github.com/aws/aws-parallelcluster-cookbook/blob/v2.11.0/templates/default/slurm/slurm.conf.erb
cgroup.conf: https://github.com/aws/aws-parallelcluster-cookbook/blob/v2.11.0/templates/default/slurm/cgroup.conf.erb
Francesco,

Sorry for not having responded sooner. I am finding an unexpected issue that could effectively make Slurm installations with hwloc not work correctly, beyond just causing issues with affinity.

Do you have any node with CentOS 8.4? If so, the output of 'lstopo-no-graphics' and 'lscpu' would be useful to me.

I'll keep you informed.
(In reply to Felip Moll from comment #8)
> Do you have any node with CentOS 8.4? If so, a 'lstopo-no-graphics' and a
> 'lscpu' would be useful to me.

[centos@ip-192-168-40-41 ~]$ cat /etc/redhat-release
CentOS Linux release 8.4.2105
[centos@ip-192-168-40-41 ~]$ uname -a
Linux ip-192-168-40-41 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[centos@ip-192-168-40-41 ~]$ sudo yum list --installed | grep hwloc
hwloc-devel.x86_64    2.2.0-1.el8    @powertools
hwloc-libs.x86_64     2.2.0-1.el8    @baseos
[centos@ip-192-168-40-41 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:            7
CPU MHz:             3599.957
BogoMIPS:            5999.96
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
[centos@ip-192-168-40-41 ~]$ sudo yum install -y hwloc
...
[centos@ip-192-168-40-41 ~]$ lstopo-no-graphics
Machine (15GB total)
  Package L#0
    NUMANode L#0 (P#0 15GB)
    L3 L#0 (36MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#4)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#5)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#6)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#7)
  HostBridge
    PCI 00:03.0 (VGA)
    PCI 00:04.0 (NVMExp)
      Block(Disk) "nvme0n1"
    PCI 00:05.0 (Ethernet)
      Net "eth0"
    PCI 00:1f.0 (NVMExp)
      Block(Disk) "nvme1n1"
(In reply to Francesco De Martino from comment #11)
> [centos@ip-192-168-40-41 ~]$ lstopo-no-graphics
> Machine (15GB total)
> ...

Thank you. This topology is not affected by the issue I've seen.

Also, bug 10679 is inaccurate. The xcpuinfo code it refers to is fine, because the depths of the NUMA nodes are given as negative numbers, so the if-else condition is still correct.

Let me do some final checks and fix the issue I've found, which concerns counting the sockets per board on topologies with multiple sockets per NUMA node. For the rest, I think all is good.
Thank you for the quick reply. The topology I shared with you is just the one for a random AWS instance I picked, but the cluster can be used with any EC2 instance type. Could you elaborate a bit more on the issue, so I can understand which instance types are affected and what problems we might face?
Francesco,

After having looked into it more, I fixed the issue about the boards. It turns out it is a cosmetic issue that only affects the output of 'slurmd -C', so it shouldn't break anything.

So, my conclusion is that CentOS 8.4 should not show any issue even though it ships with the most recent version of hwloc. Slurm's code already detects everything correctly.

I think bug 10679 assumed that our code in xcpuinfo was doing an incorrect check:

    if (hwloc_get_type_depth(topology, HWLOC_OBJ_NODE) >
        hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET)) {

because in previous versions the NUMANode element was the parent of the sockets (Packages):

Machine (8191MB total)
  NUMANode L#0 (P#0 2047MB)
    Package L#0 + L3 L#0 (16MB) + L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
    Package L#1 + L3 L#1 (16MB) + L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)

but in version 2.x it is at the same level and is no longer a parent:

Machine (7765MB total)
  Group0 L#0
    NUMANode L#0 (P#0 1721MB)
    Package L#0 + L3 L#0 (16MB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
    Package L#1 + L3 L#1 (16MB)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)

The point is that even if it is displayed at the same level in the tree, the real 'depth' is given as a negative number, so the comparison is still valid and does the correct thing. From the hwloc documentation (https://www.open-mpi.org/projects/hwloc/doc/v2.4.0/a00208.php):

> depth
> int hwloc_obj::depth
> ...
> For normal objects, this is the depth of the horizontal level that contains
> this object and its cousins of the same type. If the topology is symmetric,
> this is equal to the parent depth plus one, and also equal to the number of
> parent/child links from the root object to here.
>
> For special objects (NUMA nodes, I/O and Misc) that are not in the main tree,
> this is a special negative value that corresponds to their dedicated level,
> see hwloc_get_type_depth() and hwloc_get_type_depth_e. Those special values
> can be passed to hwloc functions such as hwloc_get_nbobjs_by_depth() as usual.

A debug session with GDB confirms that:

(gdb) p hwloc_get_type_depth(topology, HWLOC_OBJ_NODE)
$8 = -3
(gdb) p hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET)
$9 = 2

I also checked some fixes done one year ago in bug 9523 about assigning the current number of boards, and they seem correct:

commit 6566c1b1c1735768fb4beff9566c9dd894ec44d0
Author:     Marcin Stolarek <cinek@schedmd.com>
AuthorDate: Mon Aug 10 14:15:05 2020 +0000
Commit:     Danny Auble <da@schedmd.com>
CommitDate: Fri Aug 21 12:55:35 2020 -0600

    Improve number of boards discovery

    Message in initial implementation mentioned this issue c157ccfc22a.
    HWLOC_OBJ_GROUP doesn't have to be a board and may happen on different
    levels. For instance, for AMD Epyc, configuring different NPS (NUMA
    nodes per socket), we're getting groups below the package object.

    Bug 9523

So, at this point, the only issue I see is when printing the configuration with 'slurmd -C', which is cosmetic only. The conclusion is that it is safe to go with CentOS 8.4 and hwloc 2.x. In any case, I am attaching a patch to fix this 'slurmd -C' case.

Thanks for your patience.
*** Ticket 10679 has been marked as a duplicate of this ticket. ***
A patch has been added to master, so the fix will be in 21.08 when it is released.

commit ac6963efbffc38e4c5c1323118514518c3c8cd4d
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Wed Jul 14 18:01:39 2021 +0200

    Show correct number of SocketsPerBoard in slurmd -C

    On systems with multiple NUMA nodes and more than one socket per
    domain, the output of slurmd -C was showing an incorrect number of
    sockets per board. (This is an issue only with hwloc 2.)

    Bug 11787.

Please mark the bug as open again if you have further questions.