(Split off from bug 10827 comment 63)

AutoDetect is not setting the Cores field properly on Stanford's Sherlock cluster (both Intel and AMD nodes with GPUs).

Here's an example of an Intel node with bad Cores (i.e. topo_core_bitmap) (from bug 10827 comment 39):

    gres/gpu: state for sh02-16n03
    gres_cnt found:8 configured:8 avail:8 alloc:2
    gres_bit_alloc:0,7 of 8
    gres_used:(null)
    links[0]:-1, 0, 0, 0, 0, 0, 0, 0
    links[1]:0, -1, 0, 0, 0, 0, 0, 0
    links[2]:0, 0, -1, 0, 0, 0, 0, 0
    links[3]:0, 0, 0, -1, 0, 0, 0, 0
    links[4]:0, 0, 0, 0, -1, 0, 0, 0
    links[5]:0, 0, 0, 0, 0, -1, 0, 0
    links[6]:0, 0, 0, 0, 0, 0, -1, 0
    links[7]:0, 0, 0, 0, 0, 0, 0, -1
    topo[0]:(null)(0)
    topo_core_bitmap[0]:0-19 of 20
    topo_gres_bitmap[0]:0 of 8
    topo_gres_cnt_alloc[0]:1
    topo_gres_cnt_avail[0]:1
    topo[1]:(null)(0)
    topo_core_bitmap[1]:0-19 of 20
    topo_gres_bitmap[1]:1 of 8
    topo_gres_cnt_alloc[1]:0
    topo_gres_cnt_avail[1]:1
    topo[2]:(null)(0)
    topo_core_bitmap[2]:0-19 of 20
    topo_gres_bitmap[2]:2 of 8
    topo_gres_cnt_alloc[2]:0
    topo_gres_cnt_avail[2]:1
    topo[3]:(null)(0)
    topo_core_bitmap[3]:0-19 of 20
    topo_gres_bitmap[3]:3 of 8
    topo_gres_cnt_alloc[3]:0
    topo_gres_cnt_avail[3]:1
    topo[4]:(null)(0)
    topo_core_bitmap[4]:0-19 of 20
    topo_gres_bitmap[4]:4 of 8
    topo_gres_cnt_alloc[4]:0
    topo_gres_cnt_avail[4]:1
    topo[5]:(null)(0)
    topo_core_bitmap[5]:0-19 of 20
    topo_gres_bitmap[5]:5 of 8
    topo_gres_cnt_alloc[5]:0
    topo_gres_cnt_avail[5]:1
    topo[6]:(null)(0)
    topo_core_bitmap[6]:0-19 of 20
    topo_gres_bitmap[6]:6 of 8
    topo_gres_cnt_alloc[6]:0
    topo_gres_cnt_avail[6]:1
    topo[7]:(null)(0)
    topo_core_bitmap[7]:0-19 of 20
    topo_gres_bitmap[7]:7 of 8
    topo_gres_cnt_alloc[7]:1
    topo_gres_cnt_avail[7]:1

Here's an example of an AMD node with bad Cores (topo_core_bitmap):

    gres/gpu: state for sh03-13n14
    gres_cnt found:4 configured:4 avail:4 alloc:0
    gres_bit_alloc: of 4
    gres_used:(null)
    links[0]:0, 0, 0, -1
    links[1]:0, 0, -1, 0
    links[2]:0, -1, 0, 0
    links[3]:-1, 0, 0, 0
    topo[0]:(null)(0)
    topo_core_bitmap[0]:0-31 of 32
    topo_gres_bitmap[0]:0 of 4
    topo_gres_cnt_alloc[0]:0
    topo_gres_cnt_avail[0]:1
    topo[1]:(null)(0)
    topo_core_bitmap[1]:0-31 of 32
    topo_gres_bitmap[1]:1 of 4
    topo_gres_cnt_alloc[1]:0
    topo_gres_cnt_avail[1]:1
    topo[2]:(null)(0)
    topo_core_bitmap[2]:0-31 of 32
    topo_gres_bitmap[2]:2 of 4
    topo_gres_cnt_alloc[2]:0
    topo_gres_cnt_avail[2]:1
    topo[3]:(null)(0)
    topo_core_bitmap[3]:0-31 of 32
    topo_gres_bitmap[3]:3 of 4
    topo_gres_cnt_alloc[3]:0
    topo_gres_cnt_avail[3]:1

And from the slurmd.log for sh03-13n14 (from bug 10827 comment 59):

    GPU index 0:
        Name: geforce_rtx_2080_ti
        UUID: GPU-1118f950-5276-c3ca-dce8-5c1a24ca732e
        PCI Domain/Bus/Device: 0:5:0
        PCI Bus ID: 00000000:05:00.0
        NVLinks: -1,0,0,0
        Device File (minor number): /dev/nvidia3
        Note: GPU index 0 is different from minor number 3
        CPU Affinity Range - Machine: 16-31
        Core Affinity Range - Abstract: 0-31
    GPU index 1:
        Name: geforce_rtx_2080_ti
        UUID: GPU-4558f0f3-a437-5acc-4501-4b066593f31a
        PCI Domain/Bus/Device: 0:68:0
        PCI Bus ID: 00000000:44:00.0
        NVLinks: 0,-1,0,0
        Device File (minor number): /dev/nvidia2
        Note: GPU index 1 is different from minor number 2
        CPU Affinity Range - Machine: 16-31
        Core Affinity Range - Abstract: 0-31
    GPU index 2:
        Name: geforce_rtx_2080_ti
        UUID: GPU-565c7761-2b61-7cbc-69bf-3c9dff3b59a1
        PCI Domain/Bus/Device: 0:137:0
        PCI Bus ID: 00000000:89:00.0
        NVLinks: 0,0,-1,0
        Device File (minor number): /dev/nvidia1
        Note: GPU index 2 is different from minor number 1
        CPU Affinity Range - Machine: 0-15
        Core Affinity Range - Abstract: 0-31
    GPU index 3:
        Name: geforce_rtx_2080_ti
        UUID: GPU-ae9eb151-45c4-4af3-3513-01b533f08770
        PCI Domain/Bus/Device: 0:196:0
        PCI Bus ID: 00000000:C4:00.0
        NVLinks: 0,0,0,-1
        Device File (minor number): /dev/nvidia0
        Note: GPU index 3 is different from minor number 0
        CPU Affinity Range - Machine: 0-15
        Core Affinity Range - Abstract: 0-31

In all cases, NVML is reporting the correct CPU affinity range, but when it's converted into a Core affinity range, it spans the whole node. I don't think this is a multiply-by-2-plus-1 error, or else CPU range 16-31 would have turned into Core range 32-63...
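For reference, here is a minimal standalone sketch of the conversion I'd expect on these nodes; the helper name and the assumption of contiguous CPU/core numbering are mine, not Slurm's xcpuinfo code:

    #include <stdio.h>

    /*
     * Illustrative only: with contiguous numbering, machine CPU id C on a
     * node with T threads per core belongs to abstract core C / T.
     */
    static int machine_cpu_to_abstract_core(int cpu_id, int threads_per_core)
    {
        return cpu_id / threads_per_core;
    }

    int main(void)
    {
        /* Assuming sh03-13n14 has 1 thread per core, machine CPUs 16-31
         * should become abstract cores 16-31, not 0-31 (the whole node). */
        printf("1 thread/core:  CPUs 16-31 -> cores %d-%d\n",
               machine_cpu_to_abstract_core(16, 1),
               machine_cpu_to_abstract_core(31, 1));

        /* The "multiply by 2 + 1" mixup would be the inverse relation: with
         * 2 threads per core, cores 16-31 own CPUs 32-63, so that kind of
         * error would have turned CPU range 16-31 into core range 32-63,
         * which is not what the dumps above show. */
        printf("2 threads/core: CPUs 32-63 -> cores %d-%d\n",
               machine_cpu_to_abstract_core(32, 2),
               machine_cpu_to_abstract_core(63, 2));
        return 0;
    }

With 1 thread per core, the machine CPU range and the abstract core range should line up one-to-one, which is why the 0-31 results above look wrong.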
Created attachment 18094 [details]
20.11.4 DEBUG v1

Kilian,

Would you mind applying the following DEBUG patch, restarting some slurmds with multiple GPUs that are using AutoDetect, and attaching the slurmd.log outputs? I'd like at least one for an Intel node and one for an AMD node. I'm hoping this will explain why the core affinities are being incorrectly converted in xcpuinfo_mac_to_abs().

Thanks,
-Michael
Hi Kilian,

Do you think you will be able to run the debug patch at some point?
Hi Michael,

(In reply to Michael Hinton from comment #3)
> Do you think you will be able to run the debug patch at some point?

Yep, sorry for the delay, I'll send you the info shortly.

Cheers,
--
Kilian
Created attachment 18163 [details]
slurmd logs w/ debug patch

Here are the slurmd logs with the debug patch, from an Intel node (sh02-14n08) and an AMD node (sh03-13n14). I provided the output of lstopo and nvidia-smi topo -m for both.

Let me know if you need anything else!

Cheers,
--
Kilian
Great! I'll investigate this some more and get back to you.

Thanks,
-Michael
Created attachment 18165 [details]
v2

I have some good news (at the expense of exposing myself as being stupid): I found the issue and have a patch for you!

Attached is a v2 that fixes the issue in commit [1/2] and adds optional debugging in [2/2]. Once v2 appears to fix everything, you can go ahead and drop commit [2/2], though it won't hurt to leave it in, since it's just extra logging info in the slurmd.log.

It turns out that in my effort to improve AutoDetect for 20.11 in the case where threads per core > 1, AutoDetect erroneously makes the detected GPUs have affinity for ALL cores. Hence, I'm stupid :) Commit [1/2] should fix that.

Just to clarify what you should now see for the GPUs on sh03-13n14 in the slurmd.log NVML debug2 output:

    CPU Affinity Range - Machine: 0-15
    Core Affinity Range - Abstract: 0-31
    ...
    CPU Affinity Range - Machine: 16-31
    Core Affinity Range - Abstract: 0-31

should now show up as:

    CPU Affinity Range - Machine: 0-15
    Core Affinity Range - Abstract: 0-15
    ...
    CPU Affinity Range - Machine: 16-31
    Core Affinity Range - Abstract: 16-31

For that node (and for a lot of your nodes), the physical/machine CPU ids happen to be the same as the logical/"Slurm abstract" CPU ids, as verified by the lstopo output. And since you have threads per core == 1, CPU ids == core ids.

After applying v2 and restarting your slurmds, make sure to check a few of your nodes to see if they have the correct core affinity, and cross-check that with `nvidia-smi topo -m` (which will show physical/machine CPU ids, I think) and `lstopo-no-graphics` (which should show both physical (P#) and logical (L#) ids).

Sorry for the regression, and thanks for helping us find this. Let me know how it goes!
-Michael
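P.S. In case it helps to picture what commit [1/2] is meant to restore conceptually: the GPU's machine CPU (thread) mask from NVML should be widened into the matching set of cores, not into every core on the node. Here is a rough standalone sketch of that widening step; the function name, bitmask layout, and numbers are illustrative only, not the actual patch or Slurm's internal API.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Illustrative only: convert a mask of machine CPU (thread) ids into a
     * mask of core ids, assuming core c owns threads [c*tpc, (c+1)*tpc).
     */
    static uint64_t cpu_mask_to_core_mask(uint64_t cpu_mask, int cores, int tpc)
    {
        uint64_t core_mask = 0;
        for (int c = 0; c < cores; c++)
            for (int t = 0; t < tpc; t++)
                if (cpu_mask & (1ULL << (c * tpc + t)))
                    core_mask |= 1ULL << c;
        return core_mask;
    }

    int main(void)
    {
        /* sh03-13n14: 32 cores, 1 thread per core, GPU affinity = CPUs 16-31 */
        uint64_t cpus_16_31 = 0xFFFF0000ULL;
        printf("core mask: 0x%016llx\n",
               (unsigned long long)cpu_mask_to_core_mask(cpus_16_31, 32, 1));
        /* prints 0x00000000ffff0000, i.e. cores 16-31 -- not all 32 cores */
        return 0;
    }

The regression's effect was as if every core bit got set regardless of the NVML mask, which is exactly the 0-31 Cores value you were seeing.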
(In reply to Michael Hinton from comment #8)
> Just to clarify what you should now see for the GPUs on sh03-13n14 in the
> slurmd.log NVML debug2 output:

Via commit [2/2], you should also see something like:

    BUG_10932: xcpuinfo_mac_to_abs: INPUT: machine CPU range: 0-11
    ...
    BUG_10932: xcpuinfo_mac_to_abs: OUTPUT: abstract core range: 0-5

which proves that the conversion from the machine CPU range reported by NVML to the abstract core range used by Slurm occurred correctly.
Created attachment 18179 [details]
slurmd logs w/ topo info after patch v2

Hi Michael,

Thank you very much for the patch. I think it fixes the issue in all the topologies I was able to check, including another one where the physical CPU ids are spread around the sockets in a round-robin fashion (see the sketch at the end of this comment). I'm attaching lstopo, nvidia-smi topo and slurmd debug logs with your patch applied for those 4 topologies:

* sh03-13n15: AMD node with the highest CPU ids on NUMA node 0 (and GPU ids not matching the device minor numbers). This is the same hardware as sh03-13n14.
* sh02-14n08: Intel node with contiguous physical CPU ids
* sh02-16n05: Intel node with physical CPU ids alternating over sockets, GPUs attached to a single socket
* sh01-27n21: Intel node with physical CPU ids alternating over sockets, GPUs attached to both sockets

I think all those cases are covered correctly and the CPU affinities generated by the NVML gres plugin match the actual topology, so that looks great!

And I also believe that this may be fixing bug 10827, correct? On sh03-13n15, I now see a links matrix that seems to be correctly oriented:

    sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,0,0,0
    sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,-1,0,0
    sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,-1,0
    sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,0,-1

Thanks!
--
Kilian
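P.S. For anyone else following this bug, the "physical CPU ids alternating over sockets" nodes are the case where the machine-to-abstract conversion really matters. Below is a rough sketch of the mapping I believe applies there; the even-ids-on-socket-0 layout, the 10-cores-per-socket figure, and the helper function are all illustrative assumptions, not Slurm code or our exact hardware.

    #include <stdio.h>

    /*
     * Illustrative only: if physical CPU ids alternate across two sockets
     * (assumed here: even ids on socket 0, odd ids on socket 1) and the
     * abstract ids number each socket's cores contiguously, then physical
     * id P maps to abstract id (P % 2) * cores_per_socket + P / 2.
     * Assumes 1 thread per core, as on these nodes.
     */
    static int physical_to_abstract(int phys_id, int cores_per_socket)
    {
        return (phys_id % 2) * cores_per_socket + phys_id / 2;
    }

    int main(void)
    {
        for (int p = 0; p < 6; p++)
            printf("physical %d -> abstract %d\n",
                   p, physical_to_abstract(p, 10));
        /* 0->0, 1->10, 2->1, 3->11, 4->2, 5->12: a GPU attached to socket 0
         * (even physical ids) covers abstract cores 0-9, not 0-19. */
        return 0;
    }

So a GPU whose machine CPU affinity covers only one socket's physical ids should end up with an abstract Cores range covering just that socket, which matches what the patched slurmd now reports on those nodes.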
(In reply to Kilian Cavalotti from comment #10)
> Hi Michael,
>
> Thank you very much for the patch. I think it fixes the issue in all the
> topologies I was able to check, including another one where the physical
> CPU ids are spread around the sockets in a round-robin fashion. I'm
> attaching lstopo, nvidia-smi topo and slurmd debug logs with your patch
> applied for those 4 topologies:
>
> * sh03-13n15: AMD node with the highest CPU ids on NUMA node 0 (and GPU
>   ids not matching the device minor numbers). This is the same hardware as
>   sh03-13n14.
> * sh02-14n08: Intel node with contiguous physical CPU ids
> * sh02-16n05: Intel node with physical CPU ids alternating over sockets,
>   GPUs attached to a single socket
> * sh01-27n21: Intel node with physical CPU ids alternating over sockets,
>   GPUs attached to both sockets
>
> I think all those cases are covered correctly and the CPU affinities
> generated by the NVML gres plugin match the actual topology, so that looks
> great!

Excellent! I'll review your logs just to make sure everything looks correct, but barring anything strange, I'll go ahead and get the patch on the review queue so it can make it into 20.11.5.

> And I also believe that this may be fixing bug 10827, correct? On
> sh03-13n15, I now see a links matrix that seems to be correctly oriented:
>     sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,0,0,0
>     sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,-1,0,0
>     sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,-1,0
>     sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,0,-1

I don't see how this patch could possibly fix the nvlinks issue... Maybe this is just proof of the intermittent nature of the issue. I'll continue my thoughts on this in bug 10827.

Thanks!
-Michael
(In reply to Michael Hinton from comment #11)
> Excellent! I'll review your logs just to make sure everything looks
> correct, but barring anything strange, I'll go ahead and get the patch on
> the review queue so it can make it into 20.11.5.

Great, thank you!

> I don't see how this patch could possibly fix the nvlinks issue... Maybe
> this is just proof of the intermittent nature of the issue. I'll continue
> my thoughts on this in bug 10827.

Ah, you're right: on sh03-13n14, even though the matrix looks correct, job submission with --gres gpu:4 still fails with the "Invalid generic resource (gres) specification" error. I'll follow up in bug 10827.

Cheers,
--
Kilian
Kilian,

This has now been fixed in 20.11.5 with commit https://github.com/SchedMD/slurm/commit/d98b3538ea. I'll go ahead and close this out.

Thanks for helping us find and fix this bug!
-Michael
Hi Michael,

On Mon, Mar 8, 2021 at 2:33 PM <bugs@schedmd.com> wrote:
> This has now been fixed in 20.11.5 with commit
> https://github.com/SchedMD/slurm/commit/d98b3538ea.

Awesome, thanks for the update!

Cheers,
--
Kilian