| Summary: | HTC Broken | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | ||
| Version: | 2.4.x | ||
| Hardware: | IBM BlueGene | ||
| OS: | Linux | ||
| Site: | LLNL | Slinky Site: | --- |
This applies only to BG/P. Your fix was in the right area. A safer patch has been added to 2.4 (https://github.com/SchedMD/slurm/commit/27e7b048baefe02d1a72eba03faab1d2e25a43a9). Thanks for reporting.
I'm not sure whether this applies to BG/Q, but it certainly applies to our BG/P machines. When rzdawndev was updated to v2.4, --conn-type=HTC_? no longer worked. The block that was provided indicated "ConnType=Small" no matter which of the HTC_ options was requested.

I have located what I believe is the cause of the problem in bg_job_place.c, lines 2026-2032:

    if (jobinfo->conn_type[0] != SELECT_NAV) {
        for (dim=0; dim<SYSTEM_DIMENSIONS; dim++)
            jobinfo->conn_type[dim] = bg_record->conn_type[dim];
    }

Prior to this section, jobinfo->conn_type[0] == 4 (SELECT_HTC_S). After it, the value has been overwritten with 3 (SELECT_SMALL).

If I create a build with the above lines commented out, the problem goes away! But that can't be the right solution. Your thoughts?
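
The overwrite described above can be reproduced outside Slurm with a small stand-alone sketch. Everything here other than the quoted condition and the values 3 (SELECT_SMALL) and 4 (SELECT_HTC_S) is an assumption made for illustration: the `conn_holder_t` type, the value of `SELECT_NAV`, the value of `SYSTEM_DIMENSIONS`, and both function names are hypothetical. The guarded variant shows only one possible way to preserve an HTC request, and it assumes the HTC modes are numbered from `SELECT_HTC_S` upward. It is not a description of what the referenced commit does.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the Slurm definitions. Only the values 3 and 4 come from
 * the report; the rest are assumptions for this sketch. */
#define SYSTEM_DIMENSIONS 3        /* assumed value for BG/P */

enum {
    SELECT_NAV   = 2,              /* assumed value */
    SELECT_SMALL = 3,              /* value stated in the report */
    SELECT_HTC_S = 4               /* value stated in the report */
};

/* Hypothetical type standing in for both jobinfo and bg_record. */
typedef struct {
    uint16_t conn_type[SYSTEM_DIMENSIONS];
} conn_holder_t;

/* Mirrors the quoted lines: any request other than SELECT_NAV is
 * replaced by the block's connection type, including HTC requests. */
void copy_conn_type_unguarded(conn_holder_t *job, const conn_holder_t *rec)
{
    int dim;

    assert(job && rec);
    if (job->conn_type[0] != SELECT_NAV) {
        for (dim = 0; dim < SYSTEM_DIMENSIONS; dim++)
            job->conn_type[dim] = rec->conn_type[dim];
    }
}

/* Hypothetical guard: skip the copy when the job asked for an HTC mode,
 * assuming every HTC mode has a value >= SELECT_HTC_S. */
void copy_conn_type_guarded(conn_holder_t *job, const conn_holder_t *rec)
{
    int dim;

    assert(job && rec);
    if (job->conn_type[0] != SELECT_NAV &&
        job->conn_type[0] < SELECT_HTC_S) {
        for (dim = 0; dim < SYSTEM_DIMENSIONS; dim++)
            job->conn_type[dim] = rec->conn_type[dim];
    }
}
```

With a block whose conn_type is SELECT_SMALL in every dimension and a job that requested SELECT_HTC_S, the unguarded copy leaves the job with SELECT_SMALL, which matches the "ConnType=Small" symptom, while the guarded copy leaves the request untouched.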