Ticket 466 - Intranode core allocation method issue
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.03.x
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Moe Jette
Reported: 2013-10-17 10:11 MDT by Martin Perry
Modified: 2013-10-21 08:32 MDT
Site: Atos/Eviden Sites

Attachments
Fix for bug introduced in v2.5.5 (3.96 KB, patch)
2013-10-21 06:33 MDT, Moe Jette

Description Martin Perry 2013-10-17 10:11:24 MDT
In 2.5 and prior releases, the default method for intranode core allocation was CYCLIC (cycle across sockets):

[sulu] (slurm) mnp> scontrol show config | grep Select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE

[sulu] (slurm) mnp> srun -n 1 -c 4 scontrol -d show job $SLURM_JOB_ID | grep CPU_ID
     Nodes=n8 CPU_IDs=0-1,4-5 Mem=0

In order to get block allocation by default, you needed to add CR_CORE_DEFAULT_DIST_BLOCK to SelectTypeParameters.

But in 13.12, the default now seems to be BLOCK. Note the difference in CPU_IDs.

[sulu] (slurm) mnp> scontrol show config | grep Select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE

[sulu] (slurm) mnp> srun -n 1 -c 4 scontrol -d show job $SLURM_JOB_ID | grep CPU_ID
     Nodes=n8 CPU_IDs=0-3 Mem=0

And I can’t see any way to override this default and produce CYCLIC intranode allocation. I know there have been some other changes to cons_res.  Is this a regression caused by some other change?  I didn’t see anything about a change to the core allocation method in the NEWS file, and the SelectTypeParameters documentation still says the default is CYCLIC.
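The difference between the two CPU_IDs results above can be illustrated with a minimal sketch. It is not Slurm's cons_res code; it assumes node n8 has 2 sockets with 4 cores each and that core IDs are numbered consecutively within a socket (neither is stated in the ticket), and the function names `allocate_cores` and `format_ids` are hypothetical.

```python
def allocate_cores(sockets, cores_per_socket, count, method="cyclic"):
    """Pick `count` core IDs on one node (simplified illustration only).

    Assumption: core IDs are consecutive within each socket, so with
    2 sockets x 4 cores, socket 0 holds 0-3 and socket 1 holds 4-7.
    """
    if count < 0 or count > sockets * cores_per_socket:
        raise ValueError("requested core count does not fit on the node")

    # Core IDs grouped by socket, e.g. [[0, 1, 2, 3], [4, 5, 6, 7]].
    by_socket = [
        list(range(s * cores_per_socket, (s + 1) * cores_per_socket))
        for s in range(sockets)
    ]

    if method == "block":
        # Fill one socket completely before moving to the next.
        order = [core for sock in by_socket for core in sock]
    elif method == "cyclic":
        # Take one core from each socket in turn (round-robin).
        order = [
            by_socket[s][i]
            for i in range(cores_per_socket)
            for s in range(sockets)
        ]
    else:
        raise ValueError(f"unknown method: {method}")

    return sorted(order[:count])


def format_ids(ids):
    """Render sorted core IDs as a range list such as '0-1,4-5'."""
    if not ids:
        return ""
    ranges = []
    start = prev = ids[0]
    for cid in ids[1:]:
        if cid == prev + 1:
            prev = cid
            continue
        ranges.append((start, prev))
        start = prev = cid
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    for method in ("cyclic", "block"):
        ids = allocate_cores(2, 4, 4, method)
        print(method, "CPU_IDs=" + format_ids(ids))
```

Under the assumed topology, a 4-core request gives CPU_IDs=0-1,4-5 with the cyclic method and CPU_IDs=0-3 with the block method, matching the two outputs reported above.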
Comment 1 Moe Jette 2013-10-18 09:37:21 MDT
I would guess this is a regression. There was no intentional change in this area.
Comment 2 Martin Perry 2013-10-18 09:58:43 MDT
Moe,

I also just found a regression in the task/affinity plugin.  The option --cpu_bind=mask_cpu is not working in 13.12 (it works in 2.5).

I was testing these features of cons_res and task affinity because we're planning to add support for mask_cpu to the task/cgroup plugin.  I hope to have this done in time for the 13.12 deadline.

Since I'm working on this stuff anyway, I can investigate these regressions if you like.

Martin
Comment 3 Moe Jette 2013-10-18 10:26:19 MDT
Funny you should mention mask_cpu, I've been working on this for the past couple of hours. It is good in v2.6.1 but broken in v2.6.2. I hope to have that resolved today.
Comment 4 Moe Jette 2013-10-18 11:08:27 MDT
Found and fixed the cpu_bind problem:

https://github.com/SchedMD/slurm/commit/1537c161a4db6e0c2b77d8f62ab62e6e727a5ba6

The Clang tool reported this code as "redundant" and I removed it. Unfortunately, the cpu/mem_bind tests in our regression suite frequently fail due to apparent bugs in Expect. I've also modified the Expect tests for this to be more robust. I'll work on the original bug report now.
Comment 5 Martin Perry 2013-10-18 11:17:41 MDT
Ok, Moe. Thanks for the update.
Comment 6 Moe Jette 2013-10-21 05:38:42 MDT
What version of Slurm are you seeing the proper behaviour with?

I am seeing the block layout with the version 2.5 head, so it probably changed sometime during the lifetime of v2.5.
Comment 7 Martin Perry 2013-10-21 05:50:34 MDT
I'm seeing the proper layout (default of cyclic unless CR_CORE_DEFAULT_DIST_BLOCK is specified) with 2.5.0.
Comment 8 Moe Jette 2013-10-21 06:21:00 MDT
The problem was introduced in v2.5.5. I should have a fix shortly.
Comment 9 Moe Jette 2013-10-21 06:33:29 MDT
Created attachment 463
Fix for bug introduced in v2.5.5
Comment 10 Moe Jette 2013-10-21 06:34:15 MDT
This will be fixed in v2.6.4 when released:

https://github.com/SchedMD/slurm/commit/0cbcba1a78cfbaec6c1ad0c931593ca2536a7b37
Comment 11 Martin Perry 2013-10-21 08:32:41 MDT
Thanks.