Ticket 178

Summary: HTC Broken
Product: Slurm
Reporter: Don Lipari <lipari1>
Component: Bluegene select plugin
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
QA Contact:
Severity: 2 - High Impact    
Priority: ---    
Version: 2.4.x   
Hardware: IBM BlueGene   
OS: Linux   
Site: LLNL

Description Don Lipari 2012-11-27 10:10:37 MST
I'm not sure whether this applies to BG/Q, but it certainly applies to our BG/P machines.  When rzdawndev was updated to v2.4, --conn-type=HTC_? stopped working.  The block that was provided reported "ConnType=Small" no matter which of the HTC_ options was requested.

I have located what I believe is the cause of the problem in bg_job_place.c, lines 2026-2032:

	if (jobinfo->conn_type[0] != SELECT_NAV) {
		for (dim=0; dim<SYSTEM_DIMENSIONS;
		     dim++)
			jobinfo->conn_type[dim] =
				bg_record->conn_type[
					dim];
	}

Prior to this section jobinfo->conn_type[0] == 4 (SELECT_HTC_S).  After it, it gets overwritten to 3 (SELECT_SMALL).

If I create a build with the above lines commented out, the problem goes away!  But that can't be the right solution.  Your thoughts?
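
For anyone who wants to see the overwrite in isolation, here is a minimal standalone sketch of the copy loop. It is not Slurm code: the struct names, the helper copy_conn_type(), SYSTEM_DIMENSIONS = 3 and the value used for SELECT_NAV are assumptions made for illustration. Only SELECT_SMALL == 3 and SELECT_HTC_S == 4 come from the values reported above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in definitions for illustration only, not the Slurm headers. */
#define SYSTEM_DIMENSIONS 3	/* assumed: X, Y, Z on BG/P */

enum conn_type_sketch {
	SELECT_NAV = 2,		/* assumed value, not stated in the ticket */
	SELECT_SMALL = 3,	/* value reported in the ticket */
	SELECT_HTC_S = 4	/* value reported in the ticket */
};

/* Hypothetical, stripped-down versions of the two records involved. */
struct jobinfo_sketch {
	int conn_type[SYSTEM_DIMENSIONS];
};

struct bg_record_sketch {
	int conn_type[SYSTEM_DIMENSIONS];
};

/*
 * Same shape as the quoted block: whenever the job asked for an explicit
 * connection type (anything other than SELECT_NAV), every dimension is
 * replaced by the connection type of the block that was selected.
 */
static void copy_conn_type(struct jobinfo_sketch *jobinfo,
			   const struct bg_record_sketch *bg_record)
{
	int dim;

	assert(jobinfo != NULL && bg_record != NULL);

	if (jobinfo->conn_type[0] != SELECT_NAV) {
		for (dim = 0; dim < SYSTEM_DIMENSIONS; dim++)
			jobinfo->conn_type[dim] = bg_record->conn_type[dim];
	}
}

/*
 * Request HTC_S, get handed a block whose record says SMALL, and return
 * what the job ends up with in dimension 0 after the copy.
 */
static int demo_overwrite(void)
{
	struct jobinfo_sketch job = {
		{ SELECT_HTC_S, SELECT_HTC_S, SELECT_HTC_S }
	};
	struct bg_record_sketch block = {
		{ SELECT_SMALL, SELECT_SMALL, SELECT_SMALL }
	};

	copy_conn_type(&job, &block);
	return job.conn_type[0];
}
```

In this sketch demo_overwrite() returns SELECT_SMALL, which matches the 4 -> 3 change described above: the requested HTC type is lost as soon as the block's conn_type is copied over it.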
Comment 1 Danny Auble 2012-11-27 10:52:12 MST
This only applies to BG/P.  Your fix was in the right area.  A safer patch has been added to 2.4 (https://github.com/SchedMD/slurm/commit/27e7b048baefe02d1a72eba03faab1d2e25a43a9).  Thanks for reporting.