| Summary: | SelectTypeParameters Investigation | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kevin Buckley <kevin.buckley> |
| Component: | Scheduling | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | Andrew.Elwell, darran.carey, david.schibeci, mohsin.shaikh, nate |
| Version: | 18.08.5 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6552 | ||
| Site: | Pawsey | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | SUSE | Machine Name: | |
| CLE Version: | CLE6UP05 | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
|
Description
Kevin Buckley
2019-03-17 23:25:14 MDT
Given Nate's
> Slurm should only be configured to schedule by cores with
> CR_Core_Memory
> or
> CR_ONE_TASK_PER_CORE
and the fact that a perusal of the slurm.conf man page suggests
that if one uses
SelectType=select/cray
then
"By default SelectType=select/cons_res, SelectType=select/cray, and
SelectType=select/serial use CR_CPU"
the first thing I tried was to take the existing Chaos config, viz.
SelectTypeParameters=CR_CORE_Memory,other_cons_res
and drop the CR_CORE_Memory, so
SelectTypeParameters=other_cons_res
to see if we got the default.
No.
You get told
slurmctld[17811]: fatal: Invalid SelectTypeParameters: OTHER_CONS_RES (32), You need at least CR_(CPU|CORE|SOCKET)*
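For context, the check in that error message appears satisfiable only when OTHER_CONS_RES is paired with a CR_* base flag. A minimal sketch, with the flag choice treated as an illustrative assumption rather than a recommendation:

```
# slurm.conf sketch (flags are illustrative, not a vetted recommendation):
# OTHER_CONS_RES on its own trips the "You need at least
# CR_(CPU|CORE|SOCKET)*" check, so a CR_* base flag must accompany it.
SelectType=select/cray
SelectTypeParameters=CR_Core,OTHER_CONS_RES
```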
The second thing I tried was the other half of Nate's either/or, which saw the existing Chaos config, viz.
SelectTypeParameters=CR_CORE_Memory,other_cons_res
changed to
SelectTypeParameters=OTHER_CONS_RES,CR_ONE_TASK_PER_CORE
but this too fails, with
slurmctld[18589]: fatal: Invalid SelectTypeParameters: OTHER_CONS_RES,CR_ONE_TASK_PER_CORE (288), You need at least CR_(CPU|CORE|SOCKET)*
so the suggestion seems to be that not only are CR_CORE_Memory and CR_ONE_TASK_PER_CORE orthogonal, in that you can't (SHOULDN'T) have both (as detailed in the man page), you can't have the latter on its own.
Given the two observations above, and that the slurm.conf man page says:
CR_ONE_TASK_PER_CORE
Allocate one task per core by default. Without this
option, by default one task will be allocated per thread
on nodes with more than one ThreadsPerCore configured.
NOTE: This option cannot be used with CR_CPU*.
implying that one can't have CR_CPU* together with CR_ONE_TASK_PER_CORE,
what would SchedMD suggest we use to match the
You need at least CR_(CPU|CORE|SOCKET)*
requirement, instead of what we currently have?
Furthermore, from my reading of the CR_ONE_TASK_PER_CORE description,
one should be able to defeat Cray's "never-to-be-turned-off-via-the-BIOS"
Hyper-threading by simply using
CR_ONE_TASK_PER_CORE
even though Slurm's interrogation of the hardware will have
returned ThreadsPerCore=2?
Is that enough info for SchedMD to make a recommendation as to what
we should have?
Kevin M. Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
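As an aside to the hyper-threading question above: independent of SelectTypeParameters, Slurm also exposes a per-job way to keep tasks off the second hardware thread, assuming the task/affinity plugin is configured. A sketch (the application name is a placeholder):

```shell
# Illustrative only: request one task per physical core at submission time,
# even on nodes where slurmd has detected ThreadsPerCore=2.
srun --hint=nomultithread ./app    # ./app is a placeholder
# equivalently, via the environment:
export SLURM_HINT=nomultithread
```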
Kevin,

Can you please provide the first page of your config.log from when you built Slurm?

Thanks,
--Nate

> Can you please provide the first page of your config.log
> from when you built Slurm?
Do you want the 17.x one, from the production machines, or the 18.x
one from the TDS?
As I am sure SchedMD will be aware, from all the work you have done
with Cray, their build isn't a typical CMMI, well, not as it appears
to the user anyway, so we'll have to go looking.
(In reply to Kevin Buckley from comment #8)
> > Can you please provide the first page of your config.log
> > from when you built Slurm?
>
> Do you want the 17.x one, from the production machines, or the 18.x
> one from the TDS?

Both.

> As I am sure SchedMD will be aware, from all the work you have done
> with Cray, their build isn't a typical CMMI, well, not as it appears
> to the user anyway, so we'll have to go looking.

I want to make sure my local test instance mirrors yours. I'm mainly looking to see if you're using ALPS or native Cray support.

On 2019/03/19 12:18, bugs@schedmd.com wrote:
>> Do you want the 17.x one, from the production machines, or the 18.x
>> one from the TDS?
> Both.

I am about to build 18.08.6 (50 bug fixes apparently!) on the TDS, so I'll send them after that.

> I want to make sure my local test instance mirrors yours. I'm mainly
> looking to see if you're using ALPS or native Cray support.

Not ALPS.

(In reply to Kevin Buckley from comment #2)
> which saw the existing Chaos config, viz.
>
> SelectTypeParameters=CR_CORE_Memory,other_cons_res
>
> changed to
>
> SelectTypeParameters=OTHER_CONS_RES,CR_ONE_TASK_PER_CORE

The wording in Bug#6552 comment #25 was a little confusing. Setting CR_ONE_TASK_PER_CORE will set the default value of ntasks-per-core in a job to 1. It is not required for CR_Core_Memory to work, but the opposite is true (and forced).

As Dominik noted, per-core allocations can be done a few ways:
> Slurm provides mechanisms for allocating jobs with only one task per core: CR_ONE_TASK_PER_CORE, or the task/affinity TaskPlugin option --hint/SLURM_HINT.

Bug#1328 comment #1 goes into the suggested way to configure scheduling by cores, by setting CPUS=$core_count.

> Bug#1328 comment #1 goes into the suggested way to configure scheduling
> by cores by setting CPUS=$core_count.

The comment in that issue ticket got us talking, so cheers for the pointer.
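The two mechanisms mentioned above can be sketched in slurm.conf terms. Whether the flags are combined this way is a site decision, so treat the values as assumptions (per Nate's clarification that CR_ONE_TASK_PER_CORE forces a CR_Core* base):

```
# slurm.conf sketch (illustrative values): the affinity plugin enables
# --hint/SLURM_HINT, while CR_ONE_TASK_PER_CORE makes one task per core
# the per-job default alongside a CR_Core* base.
TaskPlugin=task/affinity
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
```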
One thing we have noticed is that part of the Cray installation suggests doing this

# python $SLURM_DIR/contribs/cray/csm/slurmconfgen_smw.py \
    -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
    sdb p0

so as to generate a prototype config file; however, that only appears to populate the NodeName definitions with (example taken from our TDS):

NodeName=nid000[32-35] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[36-39] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[13-15] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[16-19] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[24-27] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu # RealMemory=32768

where you'll note that there are no CPUs=N values, which is why none of our NodeName= definitions have ever had such a value.

Picking a Magnus compute node, and looking at the slurmd log just as the daemon starts up:

[2019-03-21T10:52:09.761] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=64298 TmpDisk=32149 Uptime=126 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

shows that slurmd's interrogation of the hardware leads it to believe that we have 48 "CPUs". However, the suggestion in Bug#1328 comment #1 would be that we override that with CPUs=24, so as to help defeat the fact that Cray can't follow SchedMD's advice from the same comment and "turn off hyper-threads in the bios".

However, doing so might then clash with our PartitionName definitions, e.g.:

PartitionName=workq Priority=1 Default=YES MaxTime=24:00:00 MaxNodes=1366 MaxCPUsPerNode=48 DefMemPerNode=60000 MaxMemPerNode=60000 Nodes=nid0[0018-0063,...]

wherein we set MaxCPUsPerNode=48.
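Following the Bug#1328 advice quoted above, the override would look like this in a node definition (node names and counts copied from the TDS example; a sketch, not a vetted config):

```
# Sketch: pin CPUs to the physical core count (2 sockets x 12 cores = 24)
# instead of the 48 hardware threads slurmd detects, so scheduling is by core.
NodeName=nid000[36-39] CPUs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=65536
```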
Were we to follow SchedMD's advice as regards overriding of NodeName's CPUs, would we then have to explicitly change all of the other values we have put in, mainly so as to get rid of many of the "undefined" values that we used to see before we put things in explicitly?

As always, any pointers welcome,
Kevin

(In reply to Kevin Buckley from comment #14)
> > Bug#1328 comment #1 goes into the suggested way to configure scheduling
> > by cores by setting CPUS=$core_count.
>
> The comment in that issue ticket got us talking so cheers for the pointer.
>
> One thing we have noticed is that part of the Cray installation suggests
> doing this
>
> # python $SLURM_DIR/contribs/cray/csm/slurmconfgen_smw.py \
>     -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
>     sdb p0
>
> so as to generate a prototype config file, however, that only appears to
> populate the NodeName definitions with (example taken from our TDS):
>
> NodeName=nid000[32-35] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
> NodeName=nid000[36-39] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
> NodeName=nid000[13-15] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
> NodeName=nid000[16-19] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
> NodeName=nid000[24-27] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu # RealMemory=32768

You can call 'slurmd -C' to see what hardware Slurm sees on any given node.

> where you'll note that there are no CPUs=N values, which is why none of
> our NodeName= definitions have ever had such a value.

Is your cluster homogeneous? If it is, then you can specify the CPUs in the default node:

> NodeName=DEFAULT Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=65536 State=UNKNOWN

I suggest joining the 2 lines into one.
> Picking a Magnus compute node, and looking at the slurmd log, just
> as the daemon starts up:
>
> [2019-03-21T10:52:09.761] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2
> Memory=64298 TmpDisk=32149 Uptime=126 CPUSpecList=(null)
> FeaturesAvail=(null) FeaturesActive=(null)
>
> shows that slurmd's interrogation of the hardware leads it to believe
> that we have 48 "CPUs", however the suggestion in Bug#1328 comment #1
> would be that we override that with CPUs=24 so as to help defeat the
> fact that Cray can't follow SchedMD's advice from the same comment,
> and "turn off hyper-threads in the bios".
>
> However, doing so might then clash with our PartitionName definitions, e.g.:
>
> PartitionName=workq Priority=1 Default=YES MaxTime=24:00:00 MaxNodes=1366
> MaxCPUsPerNode=48 DefMemPerNode=60000 MaxMemPerNode=60000
> Nodes=nid0[0018-0063,...]
>
> wherein we set MaxCPUsPerNode=48.
>
> Were we to follow SchedMD's advice as regards overriding of NodeName's CPUs,
> would we then have to explicitly change all of the other values we have put
> in, mainly so as to get rid of many of the "undefined" values that we used
> to see before we put things in explicitly?

MaxCPUsPerNode is a maximum and not really required here, since Slurm will generally (except gang scheduling and overloading) not schedule jobs above the CPU count. You should be able to set CPUS=24 on the nodes for testing at the very least.

I'm going to close this ticket. We can continue after the training if there are still questions.

--Nate