| Summary: | Intranode core allocation method issue | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Martin Perry <martin.perry> |
| Component: | Other | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | da |
| Version: | 14.03.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Atos/Eviden Sites | Alineos Sites: | --- |
| Attachments: | Fix for bug introduced in v2.5.5 | ||
I would guess this is a regression. There was no intentional change in this area.

Moe, I also just found a regression in the task/affinity plugin. The option --cpu_bind=mask_cpu is not working in 13.12 (it works in 2.5). I was testing these features of cons_res and task affinity because we're planning to add support for mask_cpu to the task/cgroup plugin. I hope to have this done in time for the 13.12 deadline. Since I'm working on this stuff anyway, I can investigate these regressions if you like.
Martin

Funny you should mention mask_cpu, I've been working on this for the past couple of hours. It works in v2.6.1 but is broken in v2.6.2. I hope to have that resolved today.

Found and fixed the cpu_bind problem: https://github.com/SchedMD/slurm/commit/1537c161a4db6e0c2b77d8f62ab62e6e727a5ba6
The CLANG tool reported this code as "redundant" and I removed it. Unfortunately, the cpu/mem_bind tests in our regression suite frequently fail due to apparent bugs in Expect, so the breakage went unnoticed. I've also modified the expect tests for this to be more robust. I'll work on the original bug report now.

Ok, Moe. Thanks for the update.

What version of Slurm are you seeing the proper behaviour with? I am seeing the block layout with the version 2.5 head, so it probably changed sometime during the lifetime of v2.5.

I'm seeing the proper layout (default of cyclic unless CR_CORE_DEFAULT_DIST_BLOCK is specified) with 2.5.0.

The problem was introduced in v2.5.5. I should have a fix shortly.

Created attachment 463: Fix for bug introduced in v2.5.5

This will be fixed in v2.6.4 when released: https://github.com/SchedMD/slurm/commit/0cbcba1a78cfbaec6c1ad0c931593ca2536a7b37

Thanks.
In 2.5 and prior releases, the default method for intranode core allocation was CYCLIC (cycle across sockets):

    [sulu] (slurm) mnp> scontrol show config | grep Select
    SelectType = select/cons_res
    SelectTypeParameters = CR_CORE
    [sulu] (slurm) mnp> srun -n 1 -c 4 scontrol -d show job $SLURM_JOB_ID | grep CPU_ID
    Nodes=n8 CPU_IDs=0-1,4-5 Mem=0

In order to get block allocation by default, you needed to add CR_CORE_DEFAULT_DIST_BLOCK to SelectTypeParameters. But in 13.12, the default now seems to be BLOCK. Note the difference in CPU_IDs.

    [sulu] (slurm) mnp> scontrol show config | grep Select
    SelectType = select/cons_res
    SelectTypeParameters = CR_CORE
    [sulu] (slurm) mnp> srun -n 1 -c 4 scontrol -d show job $SLURM_JOB_ID | grep CPU_ID
    Nodes=n8 CPU_IDs=0-3 Mem=0

And I can’t see any way to override this default and produce CYCLIC intranode allocation. I know there have been some other changes to cons_res. Is this a regression caused by some other change? I didn’t see anything about a change to the core allocation method in the NEWS file, and the SelectTypeParameters documentation still says the default is CYCLIC.
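
For readers unfamiliar with the two layouts, the difference between the CPU_IDs lines above can be illustrated with a toy model. This is only a sketch, not the cons_res algorithm: the function names `allocate_cores` and `format_cpu_ids` are hypothetical, and the node topology (2 sockets with 4 cores each, cores numbered consecutively within a socket, node otherwise idle) is an assumption chosen because it is consistent with the output shown for n8.

```python
def allocate_cores(n_cores, sockets=2, cores_per_socket=4, method="cyclic"):
    """Pick core IDs on one idle node (toy model, not Slurm's cons_res code).

    Assumed topology: cores are numbered consecutively within each socket,
    so socket s owns IDs s*cores_per_socket .. (s+1)*cores_per_socket - 1.
    """
    total = sockets * cores_per_socket
    if not 0 <= n_cores <= total:
        raise ValueError("request does not fit on the node")
    if method == "block":
        # Fill the first socket before touching the next one.
        picked = list(range(n_cores))
    elif method == "cyclic":
        # Round-robin across sockets: one core from each socket in turn.
        picked = [
            (i % sockets) * cores_per_socket + i // sockets
            for i in range(n_cores)
        ]
    else:
        raise ValueError(f"unknown method: {method!r}")
    return sorted(picked)


def format_cpu_ids(ids):
    """Compress a sorted list of IDs into a range string such as '0-1,4-5'."""
    ranges = []
    for cid in ids:
        if ranges and cid == ranges[-1][1] + 1:
            ranges[-1][1] = cid          # extend the current run
        else:
            ranges.append([cid, cid])    # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    for method in ("cyclic", "block"):
        ids = allocate_cores(4, method=method)
        print(method, "CPU_IDs=" + format_cpu_ids(ids))
```

Under these assumptions, a 4-core request yields 0-1,4-5 for the cyclic method and 0-3 for the block method, matching the two `scontrol -d show job` outputs in the report.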