Ticket 8969 - Cosmetic change to Cray-contributed Slurm config generation file
Summary: Cosmetic change to Cray-contributed Slurm config generation file
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 20.02.2
Hardware: Cray XC Linux
Severity: 5 - Enhancement
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-04-30 20:55 MDT by Kevin Buckley
Modified: 2020-05-01 15:13 MDT

See Also:
Site: Pawsey
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version: 6 UP07
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Ignore empty slots on XC service blades (706 bytes, patch)
2020-04-30 20:55 MDT, Kevin Buckley

Description Kevin Buckley 2020-04-30 20:55:21 MDT
Created attachment 14063
Ignore empty slots on XC service blades

Given this output 

smw# xtcli status s0 | grep service
       c0-0c0s0n0: service           |       empty      [noflags|]
       c0-0c0s0n1: service  SB08  X86|       ready      [noflags|]
       c0-0c0s0n2: service  SB08  X86|       ready      [noflags|]
       c0-0c0s0n3: service           |       empty      [noflags|]
       c0-0c0s1n0: service           |       empty      [noflags|]
       c0-0c0s1n1: service  SB08  X86|       ready      [noflags|]
       c0-0c0s1n2: service  SB08  X86|       ready      [noflags|]
       c0-0c0s1n3: service           |       empty      [noflags|]
       c0-0c0s2n0: service           |       empty      [noflags|]
       c0-0c0s2n1: service  SB08  X86|       ready      [noflags|]
       c0-0c0s2n2: service  SB08  X86|       ready      [noflags|]
       c0-0c0s2n3: service           |       empty      [noflags|]
       c0-0c0s3n0: service  IV20  X86|       ready      [noflags|]

it has always "bugged me" that 

<slurm_source>/contribs/cray/csm/slurmconfgen_smw.py \
       -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
       sdb p0

returns, for example

Getting list of service nodes...
Found 13 service nodes.
Gathering hardware inventory...
Found 19 compute nodes.
Compacting node configuration...
Compacted into 5 group(s).
Writing Slurm configuration to /root/20200501/slurm-20.02.1/slurm.conf...
Writing gres configuration to /root/20200501/slurm-20.02.1/gres.conf...
Done

when there are clearly only SEVEN service nodes, not 13.


The attached patch 

             if cname:
-                service.append(cname.group(1))
+                empty = re.search(
+                    r'\|\s+empty\s+\[',
+                    line)
+                if empty is None:
+                    service.append(cname.group(1))


ignores empty slots on the service blades and so returns the
correct number of service nodes, viz.:

<slurm_source>/contribs/cray/csm/slurmconfgen_smw_kmb.py \
       -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
       sdb p0
Getting list of service nodes...
Found 7 service nodes.
Gathering hardware inventory...
Found 19 compute nodes.
Compacting node configuration...
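
For clarity, here is a minimal standalone sketch of the check the patch
adds. The cname regex, the sample data and the variable names below are
illustrative only (they are not copied from slurmconfgen_smw.py); the only
part that mirrors the attached patch is the extra "empty" test:

import re

# Illustrative sample of `xtcli status s0` output (two empty slots and
# one populated slot); the real script gathers this on the SMW.
sample_output = """\
       c0-0c0s0n0: service           |       empty      [noflags|]
       c0-0c0s0n1: service  SB08  X86|       ready      [noflags|]
       c0-0c0s0n3: service           |       empty      [noflags|]
"""

service = []
for line in sample_output.splitlines():
    # Illustrative cname match; the real script's pattern may differ.
    cname = re.search(r'(c\d+-\d+c\d+s\d+n\d+):\s+service', line)
    if cname:
        # The patch's addition: skip slots whose status column reads "empty".
        empty = re.search(r'\|\s+empty\s+\[', line)
        if empty is None:
            service.append(cname.group(1))

print("Found %d service nodes." % len(service))   # prints 1 for this sample

Run against the full xtcli output above, the same filter drops the six
empty slots and leaves the seven populated service nodes.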


You might want to run this one past someone at Cray, just
in case they really do want to "configure" those empty slots.


BTW, this was run on our Test&Dev system, hence the small-number
statistics.

Kevin M. Buckley
-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
Comment 1 Tim Wickberg 2020-05-01 15:13:55 MDT
Hey Kevin -

Thanks for the submission, but unfortunately since the XC series is EOL I don't have a good contact at Cray/HPE to validate this type of change - they've all shifted their focus to Shasta - and I'm quite hesitant to roll it out globally on the chance that it could cause issues.

You're certainly welcome to keep using this as a site local patch though.

cheers,
- Tim