Ticket 12239

Summary: Remap hwloc's l3cache as a Slurm socket
Product: Slurm
Reporter: Tim Wickberg <tim>
Component: slurmd
Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
Severity: 5 - Enhancement
Priority: ---
CC: ezellma
Version: 21.08.x
Hardware: Linux
OS: Linux
Site: ORNL-OLCF
Version Fixed: 21.08.0rc2
Target Release: 21.08

Description Tim Wickberg 2021-08-09 12:34:49 MDT

    
Comment 1 Tim Wickberg 2021-08-09 12:38:25 MDT
Matt - 

As has been discussed extensively elsewhere, we're adding support to map hwloc's l3cache as a Slurm socket in 21.08.

Support for this has just been pushed and will be in 21.08.0rc2, which should be out this week.

To enable it, set SlurmdParameters=l3cache_as_socket in slurm.conf.

I'll note that 'slurmd -C' explicitly ignores the config file, so its output will not reflect this setting. You'll need to work out the appropriate NodeName definitions by hand and get them into the config.
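
Working out the NodeName definition by hand comes down to simple arithmetic once the L3 cache layout is known. The following is a minimal sketch in Python, not part of Slurm. The NodeTopology class, the node_definition helper, the node name node001, and the counts used are all hypothetical and would need to be replaced with values read from the actual hardware (for example from hwloc's lstopo output).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeTopology:
    """Hypothetical per-node counts, e.g. read off hwloc's lstopo output."""
    packages: int          # physical CPU packages
    l3_per_package: int    # L3 cache domains in each package
    cores_per_l3: int      # cores sharing one L3 cache
    threads_per_core: int  # hardware threads per core


def node_definition(name: str, topo: NodeTopology) -> str:
    """Build a NodeName line that treats each L3 cache domain as a socket."""
    sockets = topo.packages * topo.l3_per_package
    cpus = sockets * topo.cores_per_l3 * topo.threads_per_core
    return (
        f"NodeName={name} CPUs={cpus} Sockets={sockets} "
        f"CoresPerSocket={topo.cores_per_l3} "
        f"ThreadsPerCore={topo.threads_per_core}"
    )


# Illustrative values only: 2 packages, 8 L3 domains each,
# 8 cores per L3 domain, 2 hardware threads per core.
example = NodeTopology(packages=2, l3_per_package=8,
                       cores_per_l3=8, threads_per_core=2)
print("SlurmdParameters=l3cache_as_socket")
print(node_definition("node001", example))
```

With these assumed counts the node is described as 16 sockets of 8 cores each, rather than 2 sockets of 64 cores.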

If you note any problems with this, please let me know and I can look at making some changes. Unfortunately I don't have access to a test system at the moment with l3cache != socket, but I believe this should work as intended, and is functionally equivalent to patches we'd provided in the past, although those lacked a configuration option to change the behavior.

- Tim
Comment 2 Matt Ezell 2021-08-09 16:17:37 MDT
Thanks.

It might make sense to coalesce the 3 options into a single parameter to avoid confusion, something like:
SlurmdParameters=socket=package (to match hwloc 2.x nomenclature, functionally equivalent to and replaces ignore_numa)
SlurmdParameters=socket=numa (current default on systems with multiple numa per package)
SlurmdParameters=socket=l3cache (the new mode introduced in this ticket)
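
To make the proposal concrete, here is a rough Python sketch of how a single socket=<value> setting might be parsed into one of the three modes. This syntax is only the suggestion above and is not something Slurm accepts. SocketSource and parse_socket_source are hypothetical names, and the fallback to the numa mode is an assumption based on the "current default" note in the list.

```python
from enum import Enum


class SocketSource(Enum):
    """Which topology object would be reported as a Slurm socket (hypothetical)."""
    PACKAGE = "package"   # described above as equivalent to ignore_numa
    NUMA = "numa"         # described above as the current default
    L3CACHE = "l3cache"   # the mode added in this ticket


def parse_socket_source(slurmd_parameters: str) -> SocketSource:
    """Pick the socket mode out of a comma-separated parameter string.

    Assumption: when no socket=<value> item is present, fall back to NUMA.
    An unrecognized value raises ValueError via the Enum lookup.
    """
    source = SocketSource.NUMA
    for item in slurmd_parameters.split(","):
        key, sep, value = item.strip().partition("=")
        if sep and key.lower() == "socket":
            source = SocketSource(value.lower())
    return source


print(parse_socket_source("socket=l3cache"))
```

A single enumerated setting like this makes the three modes mutually exclusive by construction, which is the confusion the suggestion is trying to avoid.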

I could also imagine heterogeneous clusters where this needs to be set per node type instead of globally, but that is certainly outside the scope of this ticket.