We have been trying to troubleshoot job scaling issues on our cluster, and the OSU benchmark that we are running shows a big difference in performance when running interactively versus in batch mode. Can you provide us with some insight into what could be contributing to this behavior?
Hi Surendra,

I am aware of your problem and I also remember our conversation at SC. I need some information first, and maybe we can talk on a call later on; does that sound good? Here is my request:

- I would like to know the details of the tests you have done and how you ran them, namely the exact commands you used and the related slurmctld and slurmd logs. I need the times of the executions to correlate them with the logs.
- I understand all the benchmarks you mention have succeeded, but that when running "interactively" (what does that mean exactly? outside Slurm? salloc? srun?) they are faster than when running via sbatch.
- I would need a quick picture of the infrastructure: IB switches, network, nodes, and so on.
- From the OSU benchmark I'd like to know exactly which tests you have run and which of them are slow. Do you have the results? Also, is it slow in general or only in the start phase?
- Do you use pmi2 or pmix?

When I have this information and after I have analyzed it, it will be a better moment for the call. Does it make sense?
Created attachment 14210 [details] salloc rank ordering
I am from HPE and can help explain.

We are running the OSU benchmark named 'osu_mbw_mr'. This job tests the bandwidth of the InfiniBand fabric.

Normally we run with sbatch to submit a job to the scheduler, but we wanted to run an interactive test from a node, so we submitted the job using salloc. The job runs on 512 nodes; each node has 36 cores. When running with salloc, the job finishes in about 36 seconds, but it should take about 30 minutes.

Analysis reveals that when using salloc, Slurm changes the order of the cores in each node:

sbatch: Slurm identifies and orders the cores of the first node as 0-35, the cores of the second node as 36-71, and so on.
salloc: Slurm identifies and orders all the even cores of the nodes first, then all the odd cores.

Because of this ordering with salloc, the MPI ranks are paired within a node (intra-node), but they should be paired between nodes (inter-node) for the job to run correctly. So the bug here is: why does salloc order the cores in the nodes with all even cores first and then all odd cores? I attached a file showing the ordering when using salloc.
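As a hypothetical illustration of why the rank ordering matters (this sketch is not from the ticket's actual runs): osu_mbw_mr typically pairs the first half of the MPI ranks with the second half, so the node-level distribution chosen by srun's -m/--distribution flag decides whether each pair is intra-node or inter-node. A minimal 2-node, 4-tasks-per-node sketch:

  srun --nodes=2 --ntasks-per-node=4 -m block ./osu_mbw_mr
  # block over nodes: ranks 0-3 on the first node, ranks 4-7 on the second,
  # so pairs (0,4), (1,5), ... cross the fabric (inter-node)

  srun --nodes=2 --ntasks-per-node=4 -m cyclic ./osu_mbw_mr
  # cyclic over nodes: even ranks on the first node, odd ranks on the second,
  # so pairs (0,4), (1,5), ... stay on the same node (intra-node)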
Created attachment 14218 [details] whereami.c

(In reply to GregD from comment #9)
> I am from HPE and can help explain.
>
> We are running the OSU benchmark named 'osu_mbw_mr'. This job tests the
> bandwidth of the InfiniBand fabric.
>
> Normally we run with sbatch to submit a job to the scheduler, but we wanted
> to run an interactive test from a node, so we submitted the job using salloc.
> The job runs on 512 nodes; each node has 36 cores. When running with salloc,
> the job finishes in about 36 seconds, but it should take about 30 minutes.

I don't see a direct connection between the binding of tasks to cores and salloc taking 36 seconds instead of 30 minutes. Can you explain? Do you see any error? Can you show me exactly which commands you run and what output you get?

> Analysis reveals that when using salloc, Slurm changes the order of the
> cores in each node:
>
> sbatch: Slurm identifies and orders the cores of the first node as 0-35, the
> cores of the second node as 36-71, and so on.
> salloc: Slurm identifies and orders all the even cores of the nodes first,
> then all the odd cores.

Affinity of a task to a core is done at the node level by the task/affinity plugin. The plugin is the same in both situations because it is managed by slurmd/slurmstepd when the task is launched, so there shouldn't be any difference.

You have some control over how task binding is performed with the '-m, --distribution=' flag of srun/salloc/sbatch. You can also use --hint= to instruct how to bind tasks to cores/threads. TaskPluginParam in slurm.conf also modifies how distribution is done; I suggest leaving it commented out unless you specifically want different behavior. Other options that affect how tasks are distributed to nodes can be found in the slurm.conf man page: CR_Core, CR_CORE_DEFAULT_DIST_BLOCK, CR_Pack_Nodes and others.

> Because of this ordering with salloc, the MPI ranks are paired within a node
> (intra-node), but they should be paired between nodes (inter-node) for the
> job to run correctly.

Do you mean you just want one task per node? Then --ntasks=X --ntasks-per-node=1 may help.

> So the bug here is: why does salloc order the cores in the nodes with all
> even cores first and then all odd cores? I attached a file showing the
> ordering when using salloc.

I can't reproduce it on my system, so I suspect it is something related to configuration. Are we talking about the "eagle" cluster, or is it another test cluster? If it is another cluster, make sure you have the following settings:

slurm.conf:
TaskPlugin = task/affinity,task/cgroup
SallocDefaultCommand = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL   <-- may not be needed, depends on your use case

cgroup.conf:
ConstrainCores = yes
TaskAffinity = no

Also, can you run the following test? I have uploaded a whereami.c program; I want you to run it with salloc and sbatch and show me the results. You can spawn as many tasks as needed, e.g. 36.

1. Compile with: gcc -o whereami whereami.c
2. Run:
   srun --nodes=2 --ntasks-per-node=36 whereami
   salloc --nodes=2 --ntasks-per-node=36 -> srun whereami
   sbatch --nodes=2 --ntasks-per-node=36 --wrap "./whereami"
3. You can try other combinations to check the binding.
4. Let me know the results.
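The content of attachment 14218 is not reproduced in this thread; a minimal sketch of what such a whereami program could look like (assuming it only reports the Slurm rank, the hostname and the kernel's affinity mask, which matches the gcc-only build step above) is:

/* whereami.c -- minimal sketch; the actual attachment 14218 may differ.
 * Prints the Slurm task rank, the host it runs on, and the CPU affinity
 * reported by the kernel, so outputs from srun/salloc/sbatch can be compared.
 * Compile with: gcc -o whereami whereami.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char host[256] = "unknown";
    gethostname(host, sizeof(host));

    /* SLURM_PROCID is set by slurmstepd for every task launched by srun */
    const char *rank = getenv("SLURM_PROCID");

    /* Cpus_allowed_list in /proc/self/status shows which cores this task may use */
    char line[512], cpus[512] = "?";
    FILE *fp = fopen("/proc/self/status", "r");
    if (fp) {
        while (fgets(line, sizeof(line), fp)) {
            if (!strncmp(line, "Cpus_allowed_list:", 18)) {
                strncpy(cpus, line, sizeof(cpus) - 1);
                cpus[strcspn(cpus, "\n")] = '\0';  /* trim trailing newline */
                break;
            }
        }
        fclose(fp);
    }

    printf("%s @ %s |%s\n", rank ? rank : "?", host, cpus);
    return 0;
}

Each task would print a line like "4 @ r4i7n35 |Cpus_allowed_list: 2", which matches the format of the outputs quoted later in the ticket.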
Created attachment 14232 [details] sbatch_slurm-3174930.out
Created attachment 14233 [details] srun_out
Greg, can you attach one of your OSU submit scripts to the ticket? We are running these jobs on the eagle cluster.
From the srun output I can say affinity is working so far.

1. I don't see the salloc output attached:
   salloc --nodes=2 --ntasks-per-node=36 -> srun whereami
2. I also want to see:
   sbatch --nodes=2 --ntasks-per-node=36 --wrap "srun ./whereami"
3. Please also send me the output of lstopo or lstopo-no-graphics on a node:
   lstopo or lstopo-no-graphics

Thanks
Created attachment 14262 [details] lstopo
Created attachment 14263 [details] sbatch_9009
Created attachment 14264 [details] salloc_9009
I have attached the salloc, sbatch and lstopo outputs to the ticket.
lstopo shows how your cores are numbered by hwloc:

socket 0: cores 0 to 17
socket 1: cores 18 to 35

The following excerpt from your attached salloc, srun and sbatch logs shows that each task is bound to a different core, alternating between both sockets:

task 0 is bound to the first core of socket 0
task 1 is bound to the first core of socket 1
task 2 is bound to the second core of socket 0
task 3 is bound to the second core of socket 1
...

This is the parsed excerpt which shows that the binding is done correctly:

srun:
0 @ r4i7n35 |Cpus_allowed_list: 0
1 @ r4i7n35 |Cpus_allowed_list: 18
2 @ r4i7n35 |Cpus_allowed_list: 1
3 @ r4i7n35 |Cpus_allowed_list: 19
4 @ r4i7n35 |Cpus_allowed_list: 2
5 @ r4i7n35 |Cpus_allowed_list: 20
6 @ r4i7n35 |Cpus_allowed_list: 3
7 @ r4i7n35 |Cpus_allowed_list: 21
8 @ r4i7n35 |Cpus_allowed_list: 4
9 @ r4i7n35 |Cpus_allowed_list: 22

sbatch:
0 @ r3i7n35 |Cpus_allowed_list: 0
1 @ r3i7n35 |Cpus_allowed_list: 18
2 @ r3i7n35 |Cpus_allowed_list: 1
3 @ r3i7n35 |Cpus_allowed_list: 19
4 @ r3i7n35 |Cpus_allowed_list: 2
5 @ r3i7n35 |Cpus_allowed_list: 20
6 @ r3i7n35 |Cpus_allowed_list: 3
7 @ r3i7n35 |Cpus_allowed_list: 21
8 @ r3i7n35 |Cpus_allowed_list: 4
9 @ r3i7n35 |Cpus_allowed_list: 22

salloc:
0 @ r3i7n35 |Cpus_allowed_list: 0
1 @ r3i7n35 |Cpus_allowed_list: 18
2 @ r3i7n35 |Cpus_allowed_list: 1
3 @ r3i7n35 |Cpus_allowed_list: 19
4 @ r3i7n35 |Cpus_allowed_list: 2
5 @ r3i7n35 |Cpus_allowed_list: 20
6 @ r3i7n35 |Cpus_allowed_list: 3
7 @ r3i7n35 |Cpus_allowed_list: 21
8 @ r3i7n35 |Cpus_allowed_list: 4
9 @ r3i7n35 |Cpus_allowed_list: 22

If this is not what happens in your application, there is a chance that the application itself is changing the affinity of its processes: any process running in a set of allowed cores can change its own binding, and I don't know what your test does internally.

Maybe you can explain in more detail which binding issue you are seeing and how it affects the performance? From here everything seems good at the moment.
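A side note on the pattern above (a sketch, not something requested in this ticket): the alternation between sockets is consistent with a cyclic distribution over sockets, which is commonly the default for the second component of the -m/--distribution flag. If consecutive ranks bound to consecutive cores of the same socket were desired instead, the distribution could be changed, for example:

  srun --nodes=2 --ntasks-per-node=36 -m block:block ./whereami
  # expected: ranks 0-17 on cores 0-17 (socket 0), ranks 18-35 on cores 18-35 (socket 1)

  srun --nodes=2 --ntasks-per-node=36 -m block:cyclic ./whereami
  # expected: ranks alternate between the two sockets, matching the excerpt above

The exact behavior also depends on TaskPluginParam and the rest of the site configuration.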
> Maybe you can explain in more detail which binding issue you are seeing and
> how it affects the performance?
> From here everything seems good at the moment.

Hi Surendra,

Do you have any feedback for me? I am interested in knowing whether you still see affinity issues.

Thanks!
This can be closed. Thanks!
(In reply to surendra from comment #22)
> This can be closed. Thanks!

OK Surendra.

Please reopen if questions arise. I was curious about what the issue turned out to be.

Regards