Description
Michael Hinton
2021-02-23 12:09:09 MST
Created attachment 18094 [details]
20.11.4 DEBUG v1
Kilian,
Would you mind applying the attached DEBUG patch, restarting slurmd on a few nodes with multiple GPUs that are using AutoDetect, and attaching the slurmd.log outputs? I'd like at least one for an Intel node and one for an AMD node. I'm hoping this will explain why the core affinities are being incorrectly converted in xcpuinfo_mac_to_abs().
Thanks,
-Michael
Hi Kilian,

Do you think you will be able to run the debug patch at some point?

Hi Michael,

(In reply to Michael Hinton from comment #3)
> Do you think you will be able to run the debug patch at some point?

Yep, sorry for the delay, I'll send you the info shortly.

Cheers,
--
Kilian

Created attachment 18163 [details]
slurmd logs w/ debug patch
Here are the slurmd logs with the debug patch, from an Intel node (sh02-14n08) and an AMD node (sh03-13n14). I provided the output of lstopo and nvidia-smi topo -m for both.
Let me know if you need anything else!
Cheers,
--
Kilian
Great! I'll investigate this some more and get back to you.

Thanks,
-Michael

Created attachment 18165 [details]
v2
I have some good news (at the expense of exposing myself as being stupid): I found the issue and have a patch for you!
Attached is a v2 that fixes the issue in commit [1/2] and then adds optional debugging in [2/2]. Once v2 appears to fix everything, you can go ahead and drop commit [2/2], though it won't hurt to leave it in, since it's just extra logging info in the slurmd.log.
It turns out that my effort to improve AutoDetect for 20.11 in the case where threads per core > 1 made AutoDetect erroneously give the detected GPUs affinity for ALL cores. Hence, I'm stupid :) Commit [1/2] should fix that.
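To illustrate what the fix is supposed to do, here is a simplified sketch of the idea only, not the actual Slurm code: the function name, the single-word mask, and the assumption that sibling hardware threads get consecutive CPU ids are all made up for the example. The per-thread CPU mask reported for a GPU should be folded down to one bit per core, rather than ending up covering every core on the node.

#include <stdint.h>
#include <stdio.h>

/*
 * Simplified sketch, not Slurm's implementation: fold a CPU (hardware
 * thread) mask into a core mask, assuming sibling threads of a core have
 * consecutive CPU ids. With threads_per_core == 2, CPUs 0-11 fold down to
 * cores 0-5; the regression instead ended up marking every core.
 */
static uint64_t cpu_mask_to_core_mask(uint64_t cpu_mask, int threads_per_core)
{
	uint64_t core_mask = 0;

	for (int cpu = 0; cpu < 64; cpu++)
		if (cpu_mask & (1ULL << cpu))
			core_mask |= 1ULL << (cpu / threads_per_core);
	return core_mask;
}

int main(void)
{
	uint64_t cpus_0_to_11 = (1ULL << 12) - 1;	/* bits 0-11 set */

	/* Prints 0x3f, i.e. cores 0-5 */
	printf("core mask: 0x%llx\n",
	       (unsigned long long)cpu_mask_to_core_mask(cpus_0_to_11, 2));
	return 0;
}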
Just to clarify what you should now see for the GPUs on sh03-13n14 in the slurmd.log NVML debug2 output. The old output:
CPU Affinity Range - Machine: 0-15
Core Affinity Range - Abstract: 0-31
...
CPU Affinity Range - Machine: 16-31
Core Affinity Range - Abstract: 0-31
Should now show up as:
CPU Affinity Range - Machine: 0-15
Core Affinity Range - Abstract: 0-15
...
CPU Affinity Range - Machine: 16-31
Core Affinity Range - Abstract: 16-31
For that node (and for a lot of your nodes), the physical/machine CPU ids happen to be the same as the logical/"Slurm abstract" CPU ids, as verified by the lstopo output. And since you have threads per core == 1, CPU ids == core ids.
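To make the two id spaces concrete, here is a hypothetical sketch (the lookup table and names are invented for illustration, not taken from Slurm): the machine/physical CPU id is first remapped to the Slurm abstract CPU id, then divided by threads per core to get the abstract core id. On nodes like these the remap is the identity and threads per core is 1, so all three numbers coincide.

#include <stdio.h>

#define NCPUS 32

/*
 * Hypothetical illustration, not Slurm code: converting a machine (physical)
 * CPU id to an abstract core id is a table lookup followed by a division by
 * threads per core. Here the physical->abstract map happens to be the
 * identity and threads_per_core == 1, so machine CPU 17 -> abstract CPU 17
 * -> abstract core 17. On a node whose physical ids alternate over sockets,
 * phys_to_abstract[] would instead interleave (0, 16, 1, 17, ...).
 */
static const int phys_to_abstract[NCPUS] = {
	 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
	16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
};

static int machine_cpu_to_abstract_core(int machine_cpu, int threads_per_core)
{
	return phys_to_abstract[machine_cpu] / threads_per_core;
}

int main(void)
{
	/* Prints 17 on this identity-mapped, 1 thread/core layout */
	printf("machine CPU 17 -> abstract core %d\n",
	       machine_cpu_to_abstract_core(17, 1));
	return 0;
}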
After applying v2 and restarting your slurmds, make sure to check a few of your nodes to see if they have the correct core affinity, and cross-check that with `nvidia-smi topo -m` (which will show physical/machine CPU ids, I think) and `lstopo-no-graphics` (which should show both physical ids (P#) and logical ids (L#)).
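If you'd rather not eyeball the lstopo output by hand, a few lines of hwloc (the library lstopo is built on) can print both ids per core; this is just a convenience sketch for cross-checking, not something that ships with Slurm:

/* Build with: gcc coreids.c -o coreids -lhwloc */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
	hwloc_topology_t topo;

	hwloc_topology_init(&topo);
	hwloc_topology_load(topo);

	int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
	for (int i = 0; i < ncores; i++) {
		hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);

		/* logical_index is the L# that lstopo prints; os_index is the P# */
		printf("Core L#%u  P#%u\n", core->logical_index, core->os_index);
	}

	hwloc_topology_destroy(topo);
	return 0;
}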
Sorry for the regression, and thanks for helping us find this. Let me know how it goes!
-Michael
(In reply to Michael Hinton from comment #8)
> Just to clarify what you should now see for the GPUs on sh03-13n14 in the
> slurmd.log NVML debug2 output.

Via commit [2/2], you should also see something like:

BUG_10932: xcpuinfo_mac_to_abs: INPUT: machine CPU range: 0-11
...
BUG_10932: xcpuinfo_mac_to_abs: OUTPUT: abstract core range: 0-5

proving that the conversion from the machine CPU range reported by NVML to the abstract core range used by Slurm occurred correctly.

Created attachment 18179 [details]
slurmd logs w/ topo info after patch v2

Hi Michael,

Thank you very much for the patch. I think it fixes the issue in all the topologies I was able to check (including another one where the physical CPU ids are spread around the sockets in a round-robin fashion). I'm attaching lstopo, nvidia-smi topo and slurmd debug logs with your patch applied for those 4 topologies:

* sh03-13n15: AMD node with the highest CPU ids on NUMA node 0 (and GPU ids not matching the device minor numbers). This is the same hardware as sh03-13n14.
* sh02-14n08: Intel node with contiguous physical CPU ids
* sh02-16n05: Intel node with physical CPU ids alternating over sockets, GPUs attached to a single socket
* sh01-27n21: Intel node with physical CPU ids alternating over sockets, GPUs attached to both sockets

I think that all those cases are covered correctly and the CPU affinities generated by the NVML gres plugin match the actual topology, so that looks great!

And I also believe that this may be fixing bug 10827, correct? On sh03-13n15, I now see a links matrix that seems to be correctly oriented:

sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,0,0,0
sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,-1,0,0
sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,-1,0
sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,0,-1

Thanks!
--
Kilian

(In reply to Kilian Cavalotti from comment #10)
> Hi Michael,
>
> Thank you very much for the patch. I think it fixes the issue in all the
> topologies I was able to check (including another one where the physical
> CPU ids are spread around the sockets in a round-robin fashion). I'm
> attaching lstopo, nvidia-smi topo and slurmd debug logs with your patch
> applied for those 4 topologies:
>
> * sh03-13n15: AMD node with the highest CPU ids on NUMA node 0 (and GPU
> ids not matching the device minor numbers). This is the same hardware as
> sh03-13n14.
> * sh02-14n08: Intel node with contiguous physical CPU ids
> * sh02-16n05: Intel node with physical CPU ids alternating over sockets,
> GPUs attached to a single socket
> * sh01-27n21: Intel node with physical CPU ids alternating over sockets,
> GPUs attached to both sockets
>
> I think that all those cases are covered correctly and the CPU affinities
> generated by the NVML gres plugin match the actual topology, so that
> looks great!

Excellent! I'll review your logs just to make sure everything looks correct, but barring anything strange, I'll go ahead and get the patch on the review queue so it can make it into 20.11.5.

> And I also believe that this may be fixing bug 10827, correct? On
> sh03-13n15, I now see a links matrix that seems to be correctly oriented:
>
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:
> NVLinks: -1,0,0,0
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:
> NVLinks: 0,-1,0,0
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:
> NVLinks: 0,0,-1,0
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:
> NVLinks: 0,0,0,-1

I don't see how this patch could possibly fix the nvlinks issue... Maybe this is just proof of the intermittent nature of the issue. I'll continue my thoughts on this in bug 10827.

Thanks!
-Michael

(In reply to Michael Hinton from comment #11)
> Excellent! I'll review your logs just to make sure everything looks
> correct, but barring anything strange, I'll go ahead and get the patch on
> the review queue so it can make it into 20.11.5.

Great, thank you!

> I don't see how this patch could possibly fix the nvlinks issue... Maybe
> this is just proof of the intermittent nature of the issue. I'll continue
> my thoughts on this in bug 10827.

Ah, you're right: on sh03-13n14, even though the matrix looks correct, job submission with --gres gpu:4 still fails with the "Invalid generic resource (gres) specification" error. I'll follow up in bug 10827.

Cheers,
--
Kilian

Kilian,

This has now been fixed in 20.11.5 with commit https://github.com/SchedMD/slurm/commit/d98b3538ea. I'll go ahead and close this out.

Thanks for helping us find and fix this bug!
-Michael

Hi Michael,

On Mon, Mar 8, 2021 at 2:33 PM <bugs@schedmd.com> wrote:
> This has now been fixed in 20.11.5 with commit
> https://github.com/SchedMD/slurm/commit/d98b3538ea.

Awesome, thanks for the update!

Cheers,
--
Kilian