Ticket 16547

Summary: CPU affinity is wrong for Intel GPU with oneapi plugin
Product: Slurm
Reporter: huanxing <huanxing.shen>
Component: GPU
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
QA Contact:
Severity: 2 - High Impact
Priority: ---
CC: cinek
Version: 22.05.8
Hardware: Linux
OS: Linux
Site: Intel CRT
Slinky Site: ---
CLE Version:
Version Fixed: 23.02.2
Target Release: ---
DevPrio: ---

Description huanxing 2023-04-18 23:24:41 MDT
Summary:
The CPU affinity detected by the oneapi GPU plugin is wrong. The GPU's CPU affinity is "0-23,48-71", but it is reported as "0-23,4-71".

Details:
We configured Slurm to use an Intel GPU in gres.conf and observed in slurmd.log that the GPU's CPU affinity is parsed incorrectly. The local_cpulist file contains "0-23,48-71", but the plugin reports the second range as starting at CPU 4, giving "0-23,4-71".

slurmd: debug2: gpu/oneapi: _oneapi_get_device_name: Device name is: card1
slurmd: debug2: gpu/oneapi: _oneapi_read_cpu_affinity_list: Read file: /sys/class/drm/card1/device/local_cpulist
slurmd: debug2: gpu/oneapi: _oneapi_read_cpu_affinity_list: line is: 0-23,48-71
slurmd: debug2: gpu/oneapi: _oneapi_read_cpu_affinity_list: tok is :0-23
slurmd: debug2: gpu/oneapi: _oneapi_read_cpu_affinity_list: cpu range is: 0~23
slurmd: debug2: gpu/oneapi: _oneapi_read_cpu_affinity_list: tok is :48-71
slurmd: debug2: gpu/oneapi: _oneapi_read_cpu_affinity_list: cpu range is: 4~71

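The bug is caused by the strlcpy change in https://github.com/SchedMD/slurm/blob/f504d215ab9324e1dfd04879a3be286dc1afd7bb/src/plugins/gpu/oneapi/gpu_oneapi.c#L811
Because strlcpy() copies at most size - 1 characters and then NUL-terminates, passing pos as the size drops the last digit of the range start: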

strlcpy(buf, tok, pos); //should be strlcpy(buf, tok, pos + 1);
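
The truncation can be reproduced outside Slurm. The following is a minimal sketch, not the plugin's code: copy_bounded() is a hypothetical stand-in for strlcpy() (which is not available in every libc), range_start() is a hypothetical helper, and it assumes pos is the offset of '-' within the token.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strlcpy(): copies at most size - 1 bytes and
 * always NUL-terminates when size > 0. Returns strlen(src). */
static size_t copy_bounded(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size > 0) {
		size_t n = (len < size - 1) ? len : size - 1;
		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* Parse the start of a "start-end" token by copying the text before '-'.
 * extra is 0 for the buggy size (pos) and 1 for the corrected size (pos + 1).
 * Assumption: pos is the offset of '-' in tok. */
static int range_start(const char *tok, size_t extra)
{
	char buf[32];
	const char *dash = strchr(tok, '-');
	size_t pos;

	assert(dash != NULL);
	pos = (size_t)(dash - tok);
	assert(pos + extra <= sizeof(buf));
	copy_bounded(buf, tok, pos + extra);
	return atoi(buf);
}
```

Under these assumptions, range_start("48-71", 0) yields 4 and range_start("48-71", 1) yields 48. The first range in the log still looks correct because "0-23" with a size of pos copies an empty string, and atoi("") is 0.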
Comment 6 Jason Booth 2023-04-25 04:09:47 MDT
commit 2d8014d976404e453bc127d362cffe5c2e73289d
Author:     Marcin Stolarek <cinek@schedmd.com>
AuthorDate: Wed Apr 19 10:28:58 2023 +0000

    gpu_oneapi - Fix CPU range parsing
    
    A regression from ceae922dc3 caused the string to be truncated by one
    character. Rather than correcting the strlcpy() math, just pivot to
    using atoi() directly on the existing strings.
    
    Bug 16547
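
The approach described in the commit message, calling atoi() directly on the existing strings, might look roughly like the sketch below. This is an illustration only and not the code from the commit; parse_cpu_range() is a hypothetical helper name. It relies on atoi() stopping at the first non-digit character, so no intermediate copy is needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse a "start-end" or single "N" token in place.
 * atoi() stops at the first non-digit, so atoi("48-71") yields 48. */
static void parse_cpu_range(const char *tok, int *start, int *end)
{
	const char *dash;

	assert(tok && start && end);
	*start = atoi(tok);
	dash = strchr(tok, '-');
	/* A token without '-' names a single CPU. */
	*end = dash ? atoi(dash + 1) : *start;
}
```

With this sketch, the token "48-71" from the log parses to start 48 and end 71 without any buffer size arithmetic.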
Comment 7 Marcin Stolarek 2023-04-25 09:16:04 MDT
The reported issue is fixed by the commit above. It will be part of the Slurm 23.02.2 release.

cheers,
Marcin