| Summary: | SelectTypeParameters Investigation | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kevin Buckley <kevin.buckley> |
| Component: | Scheduling | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | Andrew.Elwell, darran.carey, david.schibeci, mohsin.shaikh, nate |
| Version: | 18.08.5 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6552 | ||
| Site: | Pawsey | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | SUSE | Machine Name: | |
| CLE Version: | CLE6UP05 | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
|
Description
Kevin Buckley
2019-03-17 23:25:14 MDT
Given Nate's
> Slurm should only be configured to schedule by cores with
> CR_Core_Memory
> or
> CR_ONE_TASK_PER_CORE
and the fact that a perusal of the slurm.conf man page suggests
that if one uses
SelectType=select/cray
then
"By default SelectType=select/cons_res, SelectType=select/cray, and
SelectType=select/serial use CR_CPU"
the first thing I tried was to take the existing Chaos config, viz.
SelectTypeParameters=CR_CORE_Memory,other_cons_res
and drop the CR_CORE_Memory, so
SelectTypeParameters=other_cons_res
to see if we got the default.
No.
You get told
slurmctld[17811]: fatal: Invalid SelectTypeParameters: OTHER_CONS_RES (32), You need at least CR_(CPU|CORE|SOCKET)*
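For context, the check in that error message appears satisfiable only when OTHER_CONS_RES is paired with a CR_* base flag. A minimal sketch, with the flag choice treated as an illustrative assumption rather than a recommendation:

```
# slurm.conf sketch (flags are illustrative, not a vetted recommendation):
# OTHER_CONS_RES on its own trips the "You need at least
# CR_(CPU|CORE|SOCKET)*" check, so a CR_* base flag must accompany it.
SelectType=select/cray
SelectTypeParameters=CR_Core,OTHER_CONS_RES
```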
The second thing I tried was the other half of Nate's either/or, which saw the existing Chaos config, viz.
SelectTypeParameters=CR_CORE_Memory,other_cons_res
changed to
SelectTypeParameters=OTHER_CONS_RES,CR_ONE_TASK_PER_CORE
but this too fails, with
slurmctld[18589]: fatal: Invalid SelectTypeParameters: OTHER_CONS_RES,CR_ONE_TASK_PER_CORE (288), You need at least CR_(CPU|CORE|SOCKET)*
so the suggestion seems to be that not only are CR_CORE_Memory and CR_ONE_TASK_PER_CORE orthogonal, in that you can't (SHOULDN'T) have both (as detailed in the man page), you can't have the latter on its own.
Given the two observations above, and that the slurm.conf man page says:
CR_ONE_TASK_PER_CORE
Allocate one task per core by default. Without this
option, by default one task will be allocated per thread
on nodes with more than one ThreadsPerCore configured.
NOTE: This option cannot be used with CR_CPU*.
implying that one can't have CR_CPU* together with CR_ONE_TASK_PER_CORE,
what would SchedMD suggest we use to match the
You need at least CR_(CPU|CORE|SOCKET)*
requirement, instead of what we currently have?
Furthermore, from my reading of the CR_ONE_TASK_PER_CORE description,
one should be able to defeat Cray's "never-to-be-turned-off-via-the-BIOS"
Hyper-threading by simply using
CR_ONE_TASK_PER_CORE
even though Slurm's interrogation of the hardware will have
returned ThreadsPerCore=2?
Is that enough info for SchedMD to make a recommendation as to what
we should have?
Kevin M. Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
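As an aside to the hyper-threading question above: independent of SelectTypeParameters, Slurm also exposes a per-job way to keep tasks off the second hardware thread, assuming the task/affinity plugin is configured. A sketch (the application name is a placeholder):

```shell
# Illustrative only: request one task per physical core at submission time,
# even on nodes where slurmd has detected ThreadsPerCore=2.
srun --hint=nomultithread ./app    # ./app is a placeholder
# equivalently, via the environment:
export SLURM_HINT=nomultithread
```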
Kevin,

Can you please provide the first page of your config.log from when you built Slurm?

Thanks,
--Nate

> Can you please provide the first page of your config.log
> from when you built Slurm?
Do you want the 17.x one, from the production machines, or the 18.x
one from the TDS?
As I am sure SchedMD will be aware, from all the work you have done
with Cray, their build isn't a typical CMMI, well, not as it appears
to the user anyway, so we'll have to go looking.
(In reply to Kevin Buckley from comment #8)
> > Can you please provide the first page of your config.log
> > from when you built Slurm?
>
> Do you want the 17.x one, from the production machines, or the 18.x
> one from the TDS?

Both.

> As I am sure SchedMD will be aware, from all the work you have done
> with Cray, their build isn't a typical CMMI, well, not as it appears
> to the user anyway, so we'll have to go looking.

I want to make sure my local test instance mirrors yours. I'm mainly looking to see if you're using ALPS or native Cray support.

On 2019/03/19 12:18, bugs@schedmd.com wrote:
>> Do you want the 17.x one, from the production machines, or the 18.x
>> one from the TDS?
> Both.

I am about to build 18.08.6 (50 bug fixes apparently!) on the TDS, so I'll send them after that.

> I want to make sure my local test instance mirrors yours. I'm mainly
> looking to see if you're using ALPS or native Cray support.

Not ALPS.

(In reply to Kevin Buckley from comment #2)
> which saw the existing Chaos config, viz.
>
> SelectTypeParameters=CR_CORE_Memory,other_cons_res
>
> changed to
>
> SelectTypeParameters=OTHER_CONS_RES,CR_ONE_TASK_PER_CORE

The wording in Bug#6552 comment #25 was a little confusing. Setting CR_ONE_TASK_PER_CORE will set the default value of ntasks-per-core in a job to 1. It is not required for CR_Core_Memory to work, but the opposite is true (and forced).

As Dominik noted, per-core allocations can be done a few ways:
> Slurm provides mechanisms for allocating jobs with only one task per core: CR_ONE_TASK_PER_CORE, or the task/affinity TaskPlugin option --hint/SLURM_HINT.

Bug#1328 comment #1 goes into the suggested way to configure scheduling by cores, by setting CPUS=$core_count.

> Bug#1328 comment #1 goes into the suggested way to configure scheduling
> by cores by setting CPUS=$core_count.

The comment in that issue ticket got us talking, so cheers for the pointer.
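The two mechanisms mentioned above can be sketched in slurm.conf terms. Whether the flags are combined this way is a site decision, so treat the values as assumptions (per Nate's clarification that CR_ONE_TASK_PER_CORE forces a CR_Core* base):

```
# slurm.conf sketch (illustrative values): the affinity plugin enables
# --hint/SLURM_HINT, while CR_ONE_TASK_PER_CORE makes one task per core
# the per-job default alongside a CR_Core* base.
TaskPlugin=task/affinity
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
```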
One thing we have noticed is that part of the Cray installation suggests doing this

# python $SLURM_DIR/contribs/cray/csm/slurmconfgen_smw.py \
    -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
    sdb p0

so as to generate a prototype config file; however, that only appears to populate the NodeName definitions with (example taken from our TDS):

NodeName=nid000[32-35] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[36-39] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[13-15] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[16-19] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[24-27] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu # RealMemory=32768

where you'll note that there are no CPUs=N values, which is why none of our NodeName= definitions have ever had such a value.

Picking a Magnus compute node, and looking at the slurmd log just as the daemon starts up:

[2019-03-21T10:52:09.761] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=64298 TmpDisk=32149 Uptime=126 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

shows that slurmd's interrogation of the hardware leads it to believe that we have 48 "CPUs". However, the suggestion in Bug#1328 comment #1 would be that we override that with CPUs=24, so as to help defeat the fact that Cray can't follow SchedMD's advice from the same comment and "turn off hyper-threads in the bios".

However, doing so might then clash with our PartitionName definitions, e.g.:

PartitionName=workq Priority=1 Default=YES MaxTime=24:00:00 MaxNodes=1366 MaxCPUsPerNode=48 DefMemPerNode=60000 MaxMemPerNode=60000 Nodes=nid0[0018-0063,...]

wherein we set MaxCPUsPerNode=48.
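Following the Bug#1328 advice quoted above, the override would look like this in a node definition (node names and counts copied from the TDS example; a sketch, not a vetted config):

```
# Sketch: pin CPUs to the physical core count (2 sockets x 12 cores = 24)
# instead of the 48 hardware threads slurmd detects, so scheduling is by core.
NodeName=nid000[36-39] CPUs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=65536
```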
Were we to follow SchedMD's advice as regards overriding of NodeName's CPUs, would we then have to explicitly change all of the other values we have put in, mainly so as to get rid of many of the "undefined" values that we used to see before we put things in explicitly?

As always, any pointers welcome,
Kevin

(In reply to Kevin Buckley from comment #14)
> > Bug#1328 comment #1 goes into the suggested way to configure scheduling
> > by cores by setting CPUS=$core_count.
>
> The comment in that issue ticket got us talking so cheers for the pointer.
>
> One thing we have noticed is that part of the Cray installation suggests
> doing this
>
> # python $SLURM_DIR/contribs/cray/csm/slurmconfgen_smw.py \
>     -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
>     sdb p0
>
> so as to generate a prototype config file, however, that only appears to
> populate the NodeName definitions with (example taken from our TDS):
>
> NodeName=nid000[32-35] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
> NodeName=nid000[36-39] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
> NodeName=nid000[13-15] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
> NodeName=nid000[16-19] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
> NodeName=nid000[24-27] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu # RealMemory=32768

You can call 'slurmd -C' to see what hardware Slurm sees on any given node.

> where you'll note that there are no CPUs=N values, which is why none of
> our NodeName= definitions have ever had such a value.

Is your cluster homogeneous? If it is, then you can specify the CPUs in the default node:

> NodeName=DEFAULT Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=65536 State=UNKNOWN

I suggest joining the 2 lines into one.
> Picking a Magnus compute node, and looking at the slurmd log, just
> as the daemon starts up:
>
> [2019-03-21T10:52:09.761] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2
> Memory=64298 TmpDisk=32149 Uptime=126 CPUSpecList=(null)
> FeaturesAvail=(null) FeaturesActive=(null)
>
> shows that slurmd's interrogation of the hardware leads it to believe
> that we have 48 "CPUs", however the suggestion in Bug#1328 comment #1
> would be that we override that with CPUs=24 so as to help defeat the
> fact that Cray can't follow SchedMD's advice from the same comment,
> and "turn off hyper-threads in the bios".
>
> However, doing so might then clash with our PartitionName definitions, e.g.:
>
> PartitionName=workq Priority=1 Default=YES MaxTime=24:00:00 MaxNodes=1366
> MaxCPUsPerNode=48 DefMemPerNode=60000 MaxMemPerNode=60000
> Nodes=nid0[0018-0063,...]
>
> wherein we set MaxCPUsPerNode=48.
>
> Were we to follow SchedMD's advice as regards overriding of NodeName's CPUs,
> would we then have to explicitly change all of the other values we have put
> in, mainly so as to get rid of many of the "undefined" values that we used
> to see before we put things in explicitly?

MaxCPUsPerNode is a maximum and not really required here, since Slurm will generally (except gang scheduling and overloading) not schedule jobs above the CPU count. You should be able to set CPUS=24 on the nodes for testing at the very least.

I'm going to close this ticket. We can continue after the training if there are still questions.

--Nate