Slurm is binding two tasks to the same CPU when launching with -n 4 --ntasks-per-node=4 on a 32-core, two-socket node. Note that tasks 0 and 2 are both bound to CPU 0:

dgloe@opal-p2:~> srun -n 4 --ntasks-per-node=4 --cpu_bind=v grep Cpus_allowed /proc/self/status
cpu_bind=MASK - nid00051, task  0  0 [10760]: mask 0x1 set
Cpus_allowed:   00000001
Cpus_allowed_list:      0
cpu_bind=MASK - nid00051, task  1  1 [10761]: mask 0x10000 set
Cpus_allowed:   00010000
Cpus_allowed_list:      16
cpu_bind=MASK - nid00051, task  2  2 [10762]: mask 0x1 set
Cpus_allowed:   00000001
Cpus_allowed_list:      0
cpu_bind=MASK - nid00051, task  3  3 [10763]: mask 0x2 set
Cpus_allowed:   00000002
Cpus_allowed_list:      1

From the slurmd log:

[2014-10-03T13:33:06.845] debug: task affinity : before lllp distribution cpu bind method is 'verbose' ((null))
[2014-10-03T13:33:06.845] debug: binding tasks:4 to nodes:0 sockets:0:1 cores:2:0 threads:4
[2014-10-03T13:33:06.845] lllp_distribution jobid [44113] implicit auto binding: verbose,threads, dist 2
[2014-10-03T13:33:06.845] _task_layout_lllp_cyclic
[2014-10-03T13:33:06.845] _lllp_generate_cpu_bind jobid [44113]: verbose,mask_cpu, 0x00000001,0x00010000,0x00000001,0x00000002
[2014-10-03T13:33:06.845] debug: task affinity : after lllp distribution cpu bind method is 'verbose,mask_cpu' (0x00000001,0x00010000,0x00000001,0x00000002)

Some relevant configuration values:

SelectTypeParameters=CR_CORE_Memory,other_cons_res
NodeName=nid000[48-51] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4
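For reference, each Cpus_allowed value above is a hex bitmask over logical CPU ids. A small helper (just an illustration for reading this report, not part of Slurm) decodes the masks and makes the double binding obvious:

```python
def decode_cpu_mask(mask_hex):
    """Decode a Cpus_allowed-style hex mask into a sorted list of CPU ids."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]

# Masks reported for the four tasks above:
for task, mask in enumerate(["00000001", "00010000", "00000001", "00000002"]):
    print("task", task, "->", decode_cpu_mask(mask))
# Tasks 0 and 2 both decode to CPU 0, confirming the double binding.
```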
David, I haven't been able to reproduce this yet on a normal cluster (it just works there), but I will try next week on a Cray.

srun -n 4 --ntasks-per-node=4 --cpu_bind=v whereami
cpu_bind=MASK - snowflake, task  2  2 [16569]: mask 0x2 set
cpu_bind=MASK - snowflake, task  0  0 [16567]: mask 0x1 set
cpu_bind=MASK - snowflake, task  1  1 [16568]: mask 0x10 set
cpu_bind=MASK - snowflake, task  3  3 [16570]: mask 0x20 set
2 snowflake0 - MASK:0x2
1 snowflake0 - MASK:0x10
3 snowflake0 - MASK:0x20
0 snowflake0 - MASK:0x1
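For comparison, the intended cyclic distribution places consecutive tasks round-robin across sockets. A rough Python sketch (a simplification of the real lllp code, assuming one CPU per task and simple socket-major CPU numbering) reproduces the correct masks seen on snowflake:

```python
def cyclic_layout(ntasks, nsockets, cpus_per_socket):
    """Assign task i to socket i % nsockets, taking the next free CPU on
    that socket (round-robin across sockets, as a cyclic layout intends)."""
    next_cpu = [0] * nsockets  # next free CPU index on each socket
    cpus = []
    for task in range(ntasks):
        socket = task % nsockets
        cpus.append(socket * cpus_per_socket + next_cpu[socket])
        next_cpu[socket] += 1
    return cpus

# snowflake (2 sockets x 4 CPUs): tasks land on CPUs 0, 4, 1, 5,
# matching masks 0x1, 0x10, 0x2, 0x20 above.
print(cyclic_layout(4, 2, 4))   # [0, 4, 1, 5]

# The Cray node (2 sockets x 16 logical CPUs) should similarly give
# 0, 16, 1, 17 -- not the observed 0, 16, 0, 1.
print(cyclic_layout(4, 2, 16))  # [0, 16, 1, 17]
```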
Hi David, we can reproduce it on our two-socket machine as well and are investigating. David
I see what is happening here. I'll see if I can get a fix for it tomorrow. It appears the --ntasks-per-node option lays tasks out differently than expected (or requested).
This is fixed in commit 03dc6ea7800. Please reopen if you still see issues.