Ticket 16264 - Extend NodeAddr to support ranges
Summary: Extend NodeAddr to support ranges
Status: RESOLVED FIXED
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 22.05.8
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
Duplicates: 16927 17109 17590
Reported: 2023-03-13 23:04 MDT by Chris Samuel (NERSC)
Modified: 2023-09-04 09:09 MDT
Site: NERSC
Version Fixed: 23.11.0rc1

Description Chris Samuel (NERSC) 2023-03-13 23:04:40 MDT
Hi there,

Currently NodeAddr has to be specified one host at a time, which for us results in very long lines that look like this (taken from a test system):

> NodeName=nid[001000-001001,001004-001005,001008-001009,001012-001013,001016-001017,001020-001021,001024-001025,001028-001029,001032-001033,001036-001037,001040-001041,001044-001045,001048-001049,001052-001053,001056-001057,001060-001061] CPUs=128 Boards=1 SocketsPerBoard=4 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257290 MemSpecLimit=27298 Feature=gpu,ss11,a100,hbm40g Gres=gpu:a100:4 NodeAddr=nid001000-hsn0,nid001001-hsn0,nid001004-hsn0,nid001005-hsn0,nid001008-hsn0,nid001009-hsn0,nid001012-hsn0,nid001013-hsn0,nid001016-hsn0,nid001017-hsn0,nid001020-hsn0,nid001021-hsn0,nid001024-hsn0,nid001025-hsn0,nid001028-hsn0,nid001029-hsn0,nid001032-hsn0,nid001033-hsn0,nid001036-hsn0,nid001037-hsn0,nid001040-hsn0,nid001041-hsn0,nid001044-hsn0,nid001045-hsn0,nid001048-hsn0,nid001049-hsn0,nid001052-hsn0,nid001053-hsn0,nid001056-hsn0,nid001057-hsn0,nid001060-hsn0,nid001061-hsn0 Weight=1000

This came up as a digression in bug#15322, where Tim M channelled Tim W to suggest that, instead of a prefix, extending NodeAddr to support a range would result in something nicer.

So for us that would turn the above into:

NodeName=nid[001000-001001,001004-001005,001008-001009,001012-001013,001016-001017,001020-001021,001024-001025,001028-001029,001032-001033,001036-001037,001040-001041,001044-001045,001048-001049,001052-001053,001056-001057,001060-001061] CPUs=128 Boards=1 SocketsPerBoard=4 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257290 MemSpecLimit=27298 Feature=gpu,ss11,a100,hbm40g Gres=gpu:a100:4 NodeAddr=nid[001000-001001,001004-001005,001008-001009,001012-001013,001016-001017,001020-001021,001024-001025,001028-001029,001032-001033,001036-001037,001040-001041,001044-001045,001048-001049,001052-001053,001056-001057,001060-001061]-hsn0 Weight=1000

This is more manageable, especially for Perlmutter, where our CPU nodes are defined as nid[004174-007143], which at present results in 2970 separate entries on one line for NodeAddr.

All the best,
Chris
Comment 3 Dominik Bartkiewicz 2023-04-19 08:16:55 MDT
Hi

Sorry for the late response. We have a patch that implements this feature, but we still need to internally discuss some details. I will let you know when we push this to the repo.

Dominik
Comment 5 Jason Booth 2023-06-08 12:48:15 MDT
*** Ticket 16927 has been marked as a duplicate of this ticket. ***
Comment 8 Marshall Garey 2023-06-15 15:06:52 MDT
*** Ticket 16927 has been marked as a duplicate of this ticket. ***
Comment 12 Marshall Garey 2023-07-05 09:25:06 MDT
*** Ticket 17109 has been marked as a duplicate of this ticket. ***
Comment 17 Dominik Bartkiewicz 2023-07-20 03:20:34 MDT
Hi

In 23.11 we added support for arbitrary suffixes in hostlists (e.g. aaa[1-100]-bbb, aaa[1-100]-b0).

Internally, this behaves like a comma-separated list of such hosts, or like a two-range hostlist with a single entry in the last range (aaa[1-100]-b[0]).
This means such a hostlist will not be printed in reduced format after parsing.

Dominik
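
Editor's illustration: the equivalence described in comment 17 (a suffixed range standing in for a comma-separated list of hosts) can be sketched in a few lines of Python. This is not Slurm's hostlist code; the function name expand_hostlist is hypothetical, and the sketch assumes a single bracketed range with an optional literal prefix and suffix.

```python
import re

def expand_hostlist(expr):
    """Expand one 'prefix[ranges]suffix' expression into a list of host names.

    Hypothetical helper for illustration only; handles a single bracket pair.
    """
    m = re.fullmatch(r"([^\[\],]*)\[([^\[\]]+)\]([^\[\],]*)", expr)
    if m is None:
        return [expr]  # no single bracketed range found, return unchanged
    prefix, ranges, suffix = m.groups()
    hosts = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo          # a bare number such as "7" is a range of one
        width = len(lo)        # preserve zero padding, e.g. 001000
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts

# The suffixed range stands for the same hosts as the long comma-separated form.
print(",".join(expand_hostlist("nid[001000-001001,001004-001005]-hsn0")))
# nid001000-hsn0,nid001001-hsn0,nid001004-hsn0,nid001005-hsn0

# The Perlmutter CPU node range from the description.
print(len(expand_hostlist("nid[004174-007143]-hsn0")))
# 2970
```

On a real system, scontrol show hostnames is the tool to use for expanding hostlists; the sketch above only shows what the suffixed form is shorthand for.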
Comment 18 Chris Samuel (NERSC) 2023-08-01 23:45:51 MDT
Hi Dominik,

Thanks so much for that, sounds great!

All the best,
Chris
Comment 19 Dominik Bartkiewicz 2023-08-04 04:49:27 MDT
Hi

I'll go ahead and close this ticket.
If anything else comes up feel free to reopen the ticket.

Dominik
Comment 20 Jason Booth 2023-09-04 09:09:01 MDT
*** Ticket 17590 has been marked as a duplicate of this ticket. ***