Ticket 12239 - Remap hwloc's l3cache as a Slurm socket
Summary: Remap hwloc's l3cache as a Slurm socket
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.x
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Tim Wickberg
Reported: 2021-08-09 12:34 MDT by Tim Wickberg
Modified: 2021-08-09 16:17 MDT (History)
Site: ORNL-OLCF
Version Fixed: 21.08.0rc2
Target Release: 21.08


Description Tim Wickberg 2021-08-09 12:34:49 MDT

    
Comment 1 Tim Wickberg 2021-08-09 12:38:25 MDT
Matt - 

As has been discussed extensively elsewhere, we're adding support to map hwloc's l3cache as a Slurm socket in 21.08.

Support for this has just been pushed and will be in 21.08.0rc2, which should be out this week.

To enable it, set SlurmdParameters=l3cache_as_socket.

I'll note that the config file is explicitly ignored by 'slurmd -C', so you'll need to work out the appropriate NodeName definitions by hand and get them into the config.
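
As a rough illustration of working out such a definition by hand, the Python sketch below multiplies out the topology counts and prints a NodeName line. It assumes every L3 cache domain is reported as one socket when l3cache_as_socket is set. The helper name, the node name node001, and the example counts are hypothetical. RealMemory and other node attributes are left out.

```python
# Hypothetical helper: build a slurm.conf NodeName line for a node where
# each L3 cache domain is treated as one Slurm socket.
def nodename_line(name, packages, l3_per_package, cores_per_l3, threads_per_core):
    # Assumption: with l3cache_as_socket, sockets == total L3 cache domains.
    sockets = packages * l3_per_package
    cpus = sockets * cores_per_l3 * threads_per_core
    return (
        f"NodeName={name} CPUs={cpus} Sockets={sockets} "
        f"CoresPerSocket={cores_per_l3} ThreadsPerCore={threads_per_core}"
    )


# Example counts are made up: 1 package, 8 L3 caches, 8 cores per L3, SMT2.
print(nodename_line("node001", packages=1, l3_per_package=8,
                    cores_per_l3=8, threads_per_core=2))
```

The counts themselves would have to come from inspecting the node's hwloc topology, since 'slurmd -C' does not take the config file into account.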

If you note any problems with this please let me know and I can look at making some changes. Unfortunately I don't have access to a test system at the moment with l3cache != socket, but I believe this should work as intended, and is functionally equivalent to patches we'd provided in the past. (Albeit those lacked a configuration option to change the behavior.)

- Tim
Comment 2 Matt Ezell 2021-08-09 16:17:37 MDT
Thanks.

It might make sense to coalesce the 3 options into a single parameter to avoid confusion, something like:
SlurmdParameters=socket=package (to match hwloc 2.x nomenclature; functionally equivalent to, and replaces, ignore_numa)
SlurmdParameters=socket=numa (current default on systems with multiple NUMA nodes per package)
SlurmdParameters=socket=l3cache (the new mode introduced in this ticket)
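
To make the proposed syntax concrete, here is a minimal Python sketch of how a comma-separated SlurmdParameters value could be scanned for a single socket=<mode> entry. The socket= key is only the proposal above, not an existing Slurm option. The function name and the fallback to numa are assumptions for illustration.

```python
# Modes taken from the proposal above; "socket=" is NOT an existing option.
VALID_MODES = {"package", "numa", "l3cache"}


def socket_mode(slurmd_parameters, default="numa"):
    """Return the proposed socket mode found in a SlurmdParameters string."""
    for token in slurmd_parameters.split(","):
        key, sep, value = token.strip().partition("=")
        if sep and key.lower() == "socket":
            mode = value.strip().lower()
            if mode not in VALID_MODES:
                raise ValueError(f"unknown socket mode: {value!r}")
            return mode
    # Assumption: fall back to NUMA-as-socket when nothing is specified.
    return default


print(socket_mode("socket=l3cache"))
```

A single key with three mutually exclusive values rules out contradictory combinations that separate boolean flags would allow.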

I could also imagine heterogeneous clusters where this needs to be set per node type instead of globally, but that is certainly outside the scope of this ticket.