Ticket 466

Summary: Intranode core allocation method issue
Product: Slurm
Reporter: Martin Perry <martin.perry>
Component: Other
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: da
Version: 14.03.x
Hardware: Linux
OS: Linux
Site: Atos/Eviden Sites
Attachments: Fix for bug introduced in v2.5.5

Description Martin Perry 2013-10-17 10:11:24 MDT
In 2.5 and prior releases, the default method for intranode core allocation was CYCLIC (cycle across sockets):

[sulu] (slurm) mnp> scontrol show config | grep Select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE

[sulu] (slurm) mnp> srun -n 1 -c 4 scontrol -d show job $SLURM_JOB_ID | grep CPU_ID
     Nodes=n8 CPU_IDs=0-1,4-5 Mem=0

In order to get block allocation by default, you needed to add CR_CORE_DEFAULT_DIST_BLOCK to SelectTypeParameters.

But in 13.12, the default now seems to be BLOCK. Note the difference in CPU_IDs.

[sulu] (slurm) mnp> scontrol show config | grep Select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE

[sulu] (slurm) mnp> srun -n 1 -c 4 scontrol -d show job $SLURM_JOB_ID | grep CPU_ID
     Nodes=n8 CPU_IDs=0-3 Mem=0
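
For readers unfamiliar with the two layouts, the following is a minimal Python sketch of how cyclic and block intranode allocation could produce the two CPU_IDs lists shown above. It is not Slurm's cons_res code. It assumes node n8 has 2 sockets with 4 cores each, numbered consecutively per socket (0-3 on socket 0, 4-7 on socket 1), which the ticket does not state, and the function names `allocate_cores` and `format_ids` are hypothetical.

```python
from typing import List


def allocate_cores(n_cores: int, sockets: int, cores_per_socket: int,
                   cyclic: bool) -> List[int]:
    """Pick n_cores core IDs on one node (illustrative only, not Slurm code).

    Assumes cores are numbered consecutively within each socket.
    """
    total = sockets * cores_per_socket
    if n_cores > total:
        raise ValueError("not enough cores on the node")
    if cyclic:
        # Round-robin across sockets: first core of every socket,
        # then the second core of every socket, and so on.
        order = [s * cores_per_socket + c
                 for c in range(cores_per_socket)
                 for s in range(sockets)]
    else:
        # Block: fill socket 0 completely before moving to socket 1.
        order = list(range(total))
    return sorted(order[:n_cores])


def format_ids(ids: List[int]) -> str:
    """Compress a sorted list of IDs into a range string such as '0-1,4-5'."""
    if not ids:
        return ""
    parts = []
    start = prev = ids[0]
    for i in ids[1:]:
        if i == prev + 1:
            prev = i
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = i
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# Assumed topology: 2 sockets x 4 cores; request 4 cores as in "srun -n 1 -c 4".
print("cyclic:", format_ids(allocate_cores(4, 2, 4, cyclic=True)))
print("block: ", format_ids(allocate_cores(4, 2, 4, cyclic=False)))
```

Under that assumed topology, the cyclic case spreads the four cores over both sockets and the block case keeps them all on the first socket, which matches the difference between the two outputs above.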

And I can’t see any way to override this default and produce CYCLIC intranode allocation. I know there have been some other changes to cons_res.  Is this a regression caused by some other change?  I didn’t see anything about a change to the core allocation method in the NEWS file, and the SelectTypeParameters documentation still says the default is CYCLIC.
Comment 1 Moe Jette 2013-10-18 09:37:21 MDT
I would guess this is a regression. There was no intentional change in this area.
Comment 2 Martin Perry 2013-10-18 09:58:43 MDT
Moe,

I also just found a regression in the task/affinity plugin.  The option --cpu_bind=mask_cpu is not working in 13.12 (it works in 2.5).

I was testing these features of cons_res and task affinity because we're planning to add support for mask_cpu to the task/cgroup plugin.  I hope to have this done in time for the 13.12 deadline.

Since I'm working on this stuff anyway, I can investigate these regressions if you like.

Martin
Comment 3 Moe Jette 2013-10-18 10:26:19 MDT
Funny you should mention mask_cpu, I've been working on this for the past couple of hours. It works in v2.6.1 but is broken in v2.6.2. I hope to have that resolved today.
Comment 4 Moe Jette 2013-10-18 11:08:27 MDT
Found and fixed the cpu_bind problem:

https://github.com/SchedMD/slurm/commit/1537c161a4db6e0c2b77d8f62ab62e6e727a5ba6

The CLANG tool reported this code as "redundant" and I removed it, which broke cpu_bind. Unfortunately the cpu/mem_bind tests in our regression suite frequently fail due to apparent bugs in Expect, so the breakage went unnoticed. I've also modified the Expect tests for this to be more robust. I'll work on the original bug report now.
Comment 5 Martin Perry 2013-10-18 11:17:41 MDT
Ok, Moe. Thanks for the update.
Comment 6 Moe Jette 2013-10-21 05:38:42 MDT
What version of Slurm are you seeing the proper behaviour with?

I am seeing the block layout with the version 2.5 head, so it probably changed sometime during the lifetime of v2.5.
Comment 7 Martin Perry 2013-10-21 05:50:34 MDT
I'm seeing the proper layout (default of cyclic unless CR_CORE_DEFAULT_DIST_BLOCK is specified) with 2.5.0.
Comment 8 Moe Jette 2013-10-21 06:21:00 MDT
The problem was introduced in v2.5.5. I should have a fix shortly.
Comment 9 Moe Jette 2013-10-21 06:33:29 MDT
Created attachment 463
Fix for bug introduced in v2.5.5
Comment 10 Moe Jette 2013-10-21 06:34:15 MDT
This will be fixed in v2.6.4 when released:

https://github.com/SchedMD/slurm/commit/0cbcba1a78cfbaec6c1ad0c931593ca2536a7b37
Comment 11 Martin Perry 2013-10-21 08:32:41 MDT
Thanks.