Ticket 8969

Summary: Cosmetic change to Cray-contributed Slurm config generation file
Product: Slurm
Reporter: Kevin Buckley <kevin.buckley>
Component: Configuration
Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 5 - Enhancement
Version: 20.02.2
Hardware: Cray XC
OS: Linux
Site: Pawsey
CLE Version: 6 UP07
Attachments: Ignore empty slots on XC service blades

Description Kevin Buckley 2020-04-30 20:55:21 MDT
Created attachment 14063 [details]
Ignore empty slots on XC service blades

Given this output 

smw# xtcli status s0 | grep service
       c0-0c0s0n0: service           |       empty      [noflags|]
       c0-0c0s0n1: service  SB08  X86|       ready      [noflags|]
       c0-0c0s0n2: service  SB08  X86|       ready      [noflags|]
       c0-0c0s0n3: service           |       empty      [noflags|]
       c0-0c0s1n0: service           |       empty      [noflags|]
       c0-0c0s1n1: service  SB08  X86|       ready      [noflags|]
       c0-0c0s1n2: service  SB08  X86|       ready      [noflags|]
       c0-0c0s1n3: service           |       empty      [noflags|]
       c0-0c0s2n0: service           |       empty      [noflags|]
       c0-0c0s2n1: service  SB08  X86|       ready      [noflags|]
       c0-0c0s2n2: service  SB08  X86|       ready      [noflags|]
       c0-0c0s2n3: service           |       empty      [noflags|]
       c0-0c0s3n0: service  IV20  X86|       ready      [noflags|]

it has always "bugged me" that 

<slurm_source>/contribs/cray/csm/slurmconfgen_smw.py \
       -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
       sdb p0

returns, for example

Getting list of service nodes...
Found 13 service nodes.
Gathering hardware inventory...
Found 19 compute nodes.
Compacting node configuration...
Compacted into 5 group(s).
Writing Slurm configuration to /root/20200501/slurm-20.02.1/slurm.conf...
Writing gres configuration to /root/20200501/slurm-20.02.1/gres.conf...
Done

when there are clearly only SEVEN service nodes, not 13.


The attached patch 

             if cname:
-                service.append(cname.group(1))
+                empty = re.search(
+                    r'\|\s+empty\s+\[',
+                    line)
+                if empty is None:
+                    service.append(cname.group(1))


ignores empty slots on the service blades and so returns the
correct number of service nodes, viz.:

<slurm_source>/contribs/cray/csm/slurmconfgen_smw_kmb.py \
       -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
       sdb p0
Getting list of service nodes...
Found 7 service nodes.
Gathering hardware inventory...
Found 19 compute nodes.
Compacting node configuration...

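For context, the effect of the patch can be sketched as a standalone
snippet. The sample lines are copied from the xtcli output above; the
cname-matching pattern here is a guess for illustration, since the
actual regex in slurmconfgen_smw.py is not shown in the diff — only
the `empty`-slot pattern is taken verbatim from the patch:

```python
import re

# Sample `xtcli status` lines, as in the output above
lines = [
    "c0-0c0s0n0: service           |       empty      [noflags|]",
    "c0-0c0s0n1: service  SB08  X86|       ready      [noflags|]",
    "c0-0c0s0n2: service  SB08  X86|       ready      [noflags|]",
    "c0-0c0s0n3: service           |       empty      [noflags|]",
]

# Hypothetical cname pattern; the real one in slurmconfgen_smw.py may differ
CNAME = re.compile(r'(c\d+-\d+c\d+s\d+n\d+):\s+service')
# Same pattern as in the attached patch
EMPTY = re.compile(r'\|\s+empty\s+\[')

service = []
for line in lines:
    cname = CNAME.search(line)
    if cname and EMPTY.search(line) is None:  # skip empty slots
        service.append(cname.group(1))

print(service)  # only the two "ready" nodes survive
```

With the extra `EMPTY` check, the two empty slots on the blade are
skipped and only the populated nodes are counted, which is exactly
the 13-vs-7 discrepancy the patch addresses.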

You might want to run this one past someone at Cray, just
in case they really do want to "configure" those empty slots.


BTW, this was run on our Test&Dev system, hence the small-number
statistics.

Kevin M. Buckley
-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
Comment 1 Tim Wickberg 2020-05-01 15:13:55 MDT
Hey Kevin -

Thanks for the submission, but unfortunately since the XC series is EOL I don't have a good contact at Cray/HPE to validate this type of change - they've all shifted their focus to Shasta - and I'm quite hesitant to roll it out globally on the chance that it could cause issues.

You're certainly welcome to keep using this as a site local patch though.

cheers,
- Tim