| Summary: | HTC Broken | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | ||
| Version: | 2.4.x | ||
| Hardware: | IBM BlueGene | ||
| OS: | Linux | ||
| Site: | LLNL | ||
This only applies to BGP. Your fix was in the right area. A safer patch has been added to 2.4 (https://github.com/SchedMD/slurm/commit/27e7b048baefe02d1a72eba03faab1d2e25a43a9). Thanks for reporting.
I'm not sure whether this applies to BG/Q, but it certainly applies to our BG/P machines. When rzdawndev was updated to v2.4, --conn-type=HTC_? no longer worked. The block that was provided indicated "ConnType=Small" no matter which of the HTC_ options was requested.

I have located what I believe is the cause of the problem in bg_job_place.c, lines 2026-2032:

    if (jobinfo->conn_type[0] != SELECT_NAV) {
        for (dim=0; dim<SYSTEM_DIMENSIONS; dim++)
            jobinfo->conn_type[dim] = bg_record->conn_type[dim];
    }

Prior to this section, jobinfo->conn_type[0] == 4 (SELECT_HTC_S). After it, the value has been overwritten with 3 (SELECT_SMALL).

If I create a build with the above lines commented out, the problem goes away! But that can't be the right solution. Your thoughts?
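For anyone reading this ticket later, here is a minimal standalone sketch of the overwrite described in the report. It is not Slurm code and not the patch from the commit referenced in this ticket. The enum is a stand-in: the report confirms only that SELECT_SMALL is 3 and SELECT_HTC_S is 4, and the remaining values, SYSTEM_DIMENSIONS of 3, and the helper names `copy_conn_type_reported`, `copy_conn_type_guarded`, and `is_htc` are assumptions made for illustration. The guarded variant shows one possible way to keep an explicit HTC request from being replaced by the block's connection type.

```c
#include <stdint.h>

/* Assumption: three dimensions, as on BG/P. The real value comes from the Slurm build. */
#define SYSTEM_DIMENSIONS 3

/* Stand-in for Slurm's connection-type enum. The report confirms
 * SELECT_SMALL == 3 and SELECT_HTC_S == 4; the other values are assumed. */
enum conn_type {
    SELECT_MESH, SELECT_TORUS, SELECT_NAV, SELECT_SMALL,
    SELECT_HTC_S, SELECT_HTC_D, SELECT_HTC_V, SELECT_HTC_L
};

/* Mirrors the logic quoted in the report: whenever the job specified any
 * connection type, every dimension is replaced by the block's type. */
static void copy_conn_type_reported(uint16_t *job, const uint16_t *block)
{
    int dim;

    if (job[0] != SELECT_NAV) {
        for (dim = 0; dim < SYSTEM_DIMENSIONS; dim++)
            job[dim] = block[dim];
    }
}

/* Hypothetical helper: true for any of the HTC connection types. */
static int is_htc(uint16_t type)
{
    return type >= SELECT_HTC_S && type <= SELECT_HTC_L;
}

/* Hypothetical guard (not the actual 2.4 patch): leave an explicit HTC
 * request untouched, otherwise behave like the reported code. */
static void copy_conn_type_guarded(uint16_t *job, const uint16_t *block)
{
    int dim;

    if (job[0] == SELECT_NAV || is_htc(job[0]))
        return;
    for (dim = 0; dim < SYSTEM_DIMENSIONS; dim++)
        job[dim] = block[dim];
}
```

With a job requesting SELECT_HTC_S and a block whose type is SELECT_SMALL, the first function reproduces the symptom (the job ends up as SELECT_SMALL), while the guarded one keeps SELECT_HTC_S. Whether such a guard is appropriate for BG/Q or for non-HTC requests is not addressed here.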