Ticket 3986 - Updated hwloc behavior breaks slurm cpu bind
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: KNL
Version: 17.02.5
Hardware: Linux Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2017-07-12 04:53 MDT by peter.georg
Modified: 2019-01-21 00:14 MST
Site: -Other-


Attachments
Proposed patch (2.47 KB, patch)
2017-07-12 04:53 MDT, peter.georg

Description peter.georg 2017-07-12 04:53:01 MDT
Created attachment 4905
Proposed patch

Hi,

An updated version of hwloc (provided by Intel in xppsl >= 1.5.1) changes the behavior of some functions. As a result, Slurm binds processes to the wrong cores/hwthreads, at least on KNL configured as SNC4 + Flat.
See https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/737918#comment-1908677 for more details.

We are currently running Slurm 17.02.5, but 16.05.x (and most likely 17.11.x) is affected as well.

Attached is a proposed patch that *should* work with both the old and the new hwloc behavior. The patch adds a slight overhead (executed only once) because it re-checks the number of CPUs per socket.
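
For illustration only, here is a rough sketch of what a one-time re-check of the CPUs-per-socket count could look like. This is not the attached patch and does not call hwloc. The function names and the input format (an array giving the socket id of each logical CPU) are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the attached patch): count how many
 * logical CPUs map to each socket.
 * pu_socket[i] is the socket id of logical CPU i (assumed input format).
 * counts must have room for n_sockets entries.
 * Returns 0 on success, -1 if a socket id is out of range.
 */
int recount_cpus_per_socket(const int *pu_socket, size_t n_pus,
                            int n_sockets, int *counts)
{
    assert(pu_socket != NULL && counts != NULL && n_sockets > 0);

    for (int s = 0; s < n_sockets; s++)
        counts[s] = 0;

    for (size_t i = 0; i < n_pus; i++) {
        int s = pu_socket[i];
        if (s < 0 || s >= n_sockets)
            return -1;          /* topology and socket count disagree */
        counts[s]++;
    }
    return 0;
}

/*
 * Hypothetical helper: return the common per-socket CPU count,
 * or -1 if the sockets do not all have the same number of CPUs.
 */
int uniform_cpus_per_socket(const int *counts, int n_sockets)
{
    assert(counts != NULL && n_sockets > 0);

    for (int s = 1; s < n_sockets; s++)
        if (counts[s] != counts[0])
            return -1;
    return counts[0];
}
```

The idea is to recompute the value from the per-CPU mapping rather than trust a single reported number, so the result does not depend on which hwloc behavior is in effect. Since it runs once, the cost is a single pass over the logical CPUs.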

Best,

peter