| Summary: | Determining core ids for gres.conf | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | Configuration | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | felip.moll |
| Version: | 17.11.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Stanford | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Kilian Cavalotti
2018-07-05 13:26:04 MDT
Picking up from #5189
> 1. How to configure gres.conf, see: comment 17
>
> The short answer is to run 'lstopo-no-graphics' command, and configure
> gres.conf accordingly to this output.
> Use the logical number L#i putting it in gres.conf. This logical number is
> related to the Physical number P#i, the same seen by the OS and shown by
> taskset.
Ok, that's useful and makes more sense now. But there's still one major issue: nvidia-smi uses the *physical* core ids (P#i) in "nvidia-smi topo -m", not the logical ones (L#i).
And since the COREs= option in gres.conf is predominantly used for GPUs, I still strongly believe that it would be much better to use the same numbering scheme in both places.
From the admin's perspective, using a numbering scheme in a config file that doesn't match the output of the main tool used to look up information about GPUs doesn't feel intuitive at all.
And if you ask what Slurm admins use right now, I'm pretty sure that a vast majority of us have been using the core ids displayed by "nvidia-smi topo" (the physical core ids) in our gres.conf.
I understand that in many cases, the logical numbering and the physical numbering are the same. But some vendors (hi Dell) decide to implement core ids differently at the ACPI level, which results in that interleaved numbering scheme and makes P#i and L#i different.
I think that to minimize confusion, it would make more sense if the ids for COREs= in gres.conf would match the core ids from nvidia-smi.
Cheers,
--
Kilian
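
For concreteness, the kind of gres.conf stanza under discussion looks like this (a hypothetical 2-GPU node; the device files and core ranges are made up for illustration, and per the discussion above the Cores= values are interpreted as logical core ids, L#i):

```
# Hypothetical node: 2 GPUs, 16 cores. Cores= takes the logical ids (L#i
# from lstopo), which may differ from the physical ids (P#i) shown by
# "nvidia-smi topo -m" on machines with interleaved physical numbering.
Name=gpu Type=tesla File=/dev/nvidia0 Cores=0-7
Name=gpu Type=tesla File=/dev/nvidia1 Cores=8-15
```

On a machine where P#i and L#i coincide, either numbering gives the same result; the difference only bites on systems with interleaved physical numbering, like the Dell example mentioned above.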
> * did you get more information about why the Slurm's core numbering scheme
> formula was removed from the documentation?

It was a mistake. This information shouldn't have been removed.

> * when the new doc from commit 3ee3795 says "Basically Slurm starts
> numbering from 0 to n [...] and then continuing sequentially to the next
> thread [...]", what does "next" mean? How is the "next" thing defined,
> according to which reference?

I can clarify it more, but the idea is this: on a server with, say, 2 sockets, 2 cores per socket and 2 threads per core, one socket is physically labeled as 0 and the other as 1, and this is how the OS sees it. Given that, "next" means i+1 in this order: socket0->core0->ht0->ht1->core1->ht0->ht1->socket1->core0->ht0->ht1->core1->ht0->ht1.

> I guess I don't really understand the idea of "sequential" here. For me, the
> sequence order is the one that the kernel shows me

That's correct as you see it, and it means the same thing to me.

> That's all very confusing, and if Slurm links to hwloc at build time, why
> not rely on the topology information that library provides rather than
> building another numbering scheme on top of what the kernel exposes?

I agree.

> I guess I don't understand the need for Slurm to build its own index of
> compute units (whether they're threads, cores or sockets) when the kernel
> already provides one, and which is already abstracted and exposed via a
> standardized interface by hwloc.

I don't know the exact reason, but I think it's more historical than anything else. I can try to find out more precisely if you need.

> To go maybe even one step further, why not automatically use the information
> provided by hwloc to determine the CPU/GPU binding?

We expect that in Slurm version 19.05 a Slurm plugin will gather the cores/socket/GPU topology information.
This is at least in part driven by the potentially considerable quantity of information involved, such as how GPUs are interconnected to each other using NVLink (an NVIDIA high-speed interconnect).

> 1. I think what I'm expecting (and that's likely the case for most
> sysadmins) is that the core ids specified in gres.conf would match the core
> ids shown in the output of `nvidia-smi topo -m` or `lstopo -p`. This
> actually is what matters for GPU/CPU affinity and performance
> considerations, and what people would get back to in case of doubt.

I agree.

> 2. Bonus points would be granted if we could stop having to manually specify
> core ids in gres.conf, and rely on topology information from the system
> instead. :)

I agree, and this is the idea for 19.05!

From your other comment 2:

> And since the COREs= option in gres.conf is predominantly used for GPUs, I
> still strongly believe that it would be much better to use the same
> numbering scheme in both places.

We also agree here. Using the physical cores in gres.conf would be the way to go, and as you say, doing it differently can confuse Slurm admins and sysadmins. This is why removing the documentation was a mistake; at least for now, this has to be explicitly explained.

> I understand that in many cases, the logical numbering and the physical
> numbering are the same. But some vendors (hi Dell) decide to implement core
> ids differently at the ACPI level, which results in that interleaved
> numbering scheme and makes P#i and L#i different.

Mostly not. Except for some systems and virtual ones, I *always* see P#i and L#i interleaved, which is in fact a concern because, as you say, many admins who haven't carefully read the gres.conf manpage could have configured it wrong.
You can just try it on your laptop:

    Machine (7866MB)
      Package L#0 + L3 L#0 (4096KB)
        L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
          PU L#0 (P#0)
          PU L#1 (P#2)
        L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
          PU L#2 (P#1)
          PU L#3 (P#3)

Which makes it even more important at this point to have this well documented.

After that, I have to say we agree on all the points. We are aware of the situation and, as I've said, the plan is to work on a better plugin for 19.05. It won't be possible to have it for 18.08 since there's not enough time. I will ask the team for the reasoning behind this design, but I can say it was done a long time ago and is this way now for historical reasons. Nevertheless, despite not being intuitive, it works if configured the way I described before.

Of course, if you are very interested in changing this now, there may be possibilities of sponsoring the work, but I guess this should be marked as an enhancement.

Please keep asking if it is still not clear to you; I can discuss internally with the team if there are more concerns.

Hi Felip.

(In reply to Felip Moll from comment #3)

> We expect that in Slurm version 19.05 a Slurm plugin will gather the
> cores/socket/GPU topology information.
> This is at least in part driven by the potentially considerable quantity
> of information involved, such as how GPUs are interconnected to each other
> using NVLink (an NVIDIA high-speed interconnect).

That is excellent news!
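
The interleaving shown in the lstopo output above boils down to a small mapping. A toy sketch (purely illustrative, not Slurm code) of translating the physical core ids reported by nvidia-smi into the logical ids that machine's gres.conf would need:

```python
# From the lstopo output above: PU L#0 -> P#0, L#1 -> P#2,
# L#2 -> P#1, L#3 -> P#3 (logical ids sequential, physical interleaved).
logical_to_physical = {0: 0, 1: 2, 2: 1, 3: 3}

# Invert it to translate the physical core ids reported by
# "nvidia-smi topo -m" into the logical ids gres.conf expects.
physical_to_logical = {p: l for l, p in logical_to_physical.items()}

print(physical_to_logical[2])  # physical core 2 is logical core 1
```

On a machine with sequential physical numbering, both dictionaries are the identity map and the distinction disappears, which is exactly why misconfigurations go unnoticed there.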
I can see that now, indeed. The distinction between physical ids and logical ids wasn't very clear to me, but it appears that the logical numbering scheme is always sequential, whatever the physical core numbering (sequential or interleaved by socket) may be.

That's good, because now I can make my gres.conf match the kernel's logical numbering scheme. It will still differ from what nvidia-smi reports in some cases, but I guess I can live with that, as long as it's stated in the documentation. I think that referring to the kernel numbering schemes, physical and logical, in the documentation could be helpful. That's what is exposed in /sys/devices/system/cpu/cpu*/topology/{core,physical_package}_id.

> After that, I have to say we agree on all the points. We are aware of the
> situation and, as I've said, the plan is to work on a better plugin for
> 19.05. It won't be possible to have it for 18.08 since there's not enough
> time. I will ask the team for the reasoning behind this design, but I can
> say it was done a long time ago and is this way now for historical reasons.
> Nevertheless, despite not being intuitive, it works if configured the way I
> described before.

That is very good to know, thanks for all the clarifications! And we're looking forward to 19.05. :)

Cheers,
--
Kilian

(In reply to Kilian Cavalotti from comment #4)

> Hi Felip.
>
> (In reply to Felip Moll from comment #3)
>
> > We expect that in Slurm version 19.05 a Slurm plugin will gather the
> > cores/socket/GPU topology information.
> > This is at least in part driven by the potentially considerable quantity
> > of information involved, such as how GPUs are interconnected to each other
> > using NVLink (an NVIDIA high-speed interconnect).
>
> That is excellent news!
This is just to inform you that we tried to include the auto-detection of the topology in 18.08, but we found that the NVML library doesn't identify the GPUs unequivocally, meaning that on each invocation a GPU's id can be different. This complicates the task of correctly setting the topology information.

We are trying to address this situation through a discussion with NVIDIA, or maybe by using UUIDs instead of ids. Most probably it will go into 19.05.

Regarding the other points, and aside from having to make clarifications in the documentation, is everything fine for you?

(In reply to Felip Moll from comment #5)

> This is just to inform you that we tried to include the auto-detection of
> the topology in 18.08, but we found that the NVML library doesn't identify
> the GPUs unequivocally, meaning that on each invocation a GPU's id can be
> different. This complicates the task of correctly setting the topology
> information.
> We are trying to address this situation through a discussion with NVIDIA,
> or maybe by using UUIDs instead of ids.

Ah yes, I'm well aware, unfortunately. That has been an ongoing issue, and we already discussed it in bug #1421 back in 2015. I tried to explain to NVIDIA multiple times how brittle that scheme was, and I'd be happy to do it again, so if you need customer support to help make things change, please don't hesitate to ask. :)

> Most probably it will go into 19.05.

Good.

> Regarding the other points, and aside from having to make clarifications in
> the documentation, is everything fine for you?

Yes, good to close, thanks.

Cheers,
--
Kilian
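
As a practical footnote, the kernel topology files mentioned earlier in the thread can be read directly. A minimal sketch, assuming the standard Linux sysfs layout (the helper function is hypothetical, not part of Slurm or hwloc):

```python
import glob
import os

def cpu_topology(sysfs="/sys/devices/system/cpu"):
    """Map each logical CPU to its (physical_package_id, core_id), i.e. the
    kernel-exposed ids that hwloc/lstopo and Slurm's numbering build on."""
    topo = {}
    for tdir in glob.glob(os.path.join(sysfs, "cpu[0-9]*", "topology")):
        # Directory names look like .../cpu17/topology; strip the "cpu" prefix.
        cpu = int(os.path.basename(os.path.dirname(tdir))[3:])
        with open(os.path.join(tdir, "physical_package_id")) as f:
            pkg = int(f.read())
        with open(os.path.join(tdir, "core_id")) as f:
            core = int(f.read())
        topo[cpu] = (pkg, core)
    return topo

if __name__ == "__main__":
    for cpu, (pkg, core) in sorted(cpu_topology().items()):
        print(f"cpu{cpu}: package {pkg}, core {core}")
```

On an interleaved machine like the lstopo example above, this is where the difference between the kernel's logical CPU numbers and the physical package/core ids becomes visible.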