This issue has been spun out of a side conversation in Bug 6552. Nate Rini noted, as regards our "production" (Magnus) SLURM config versus that of our "test and dev system" (Chaos):

> The slurm.conf has this setting for Magnus:
>
> SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_CORE_Memory,other_cons_res
>
> But Chaos slurm.conf has this setting:
>
> SelectTypeParameters=CR_CORE_Memory,other_cons_res
>
> Slurm should only be configured to schedule by cores with
> CR_Core_Memory
> or
> CR_ONE_TASK_PER_CORE
> depending on whether you want users to be able to choose to schedule by threads.

Pawsey will use this issue ticket to follow up on Nate's observations. SchedMD already have the slurm.confs from Magnus and Chaos.

I have started it out as "Medium", as there is a suggestion that what Pawsey have in our "production" configuration may be "broken", for some value of broken.
Given Nate's

> Slurm should only be configured to schedule by cores with
> CR_Core_Memory
> or
> CR_ONE_TASK_PER_CORE

and the fact that a perusal of the slurm.conf man page suggests that, if one uses SelectType=select/cray, then

> By default SelectType=select/cons_res, SelectType=select/cray, and SelectType=select/serial use CR_CPU

the first thing I tried was to take the existing Chaos config, viz.

SelectTypeParameters=CR_CORE_Memory,other_cons_res

and drop the CR_CORE_Memory, so

SelectTypeParameters=other_cons_res

to see if we got the default. No. You get told:

slurmctld[17811]: fatal: Invalid SelectTypeParameters: OTHER_CONS_RES (32), You need at least CR_(CPU|CORE|SOCKET)*
The second thing I tried was the other arm of Nate's either/or, which saw the existing Chaos config, viz.

SelectTypeParameters=CR_CORE_Memory,other_cons_res

changed to

SelectTypeParameters=OTHER_CONS_RES,CR_ONE_TASK_PER_CORE

but this, too, fails with

slurmctld[18589]: fatal: Invalid SelectTypeParameters: OTHER_CONS_RES,CR_ONE_TASK_PER_CORE (288), You need at least CR_(CPU|CORE|SOCKET)*

so the suggestion seems to be that not only are CR_CORE_Memory and CR_ONE_TASK_PER_CORE orthogonal, in that you can't (SHOULDN'T) have both (as detailed in the man page), but also that you can't have the latter on its own.
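Reading that error message back, any valid setting apparently has to start from one of the CR_CPU*, CR_Core* or CR_Socket* bases, so the candidates I can see for us, as a sketch on my part rather than anything SchedMD have confirmed here, would be:

# Untested sketches on my part, not a SchedMD recommendation:
SelectTypeParameters=CR_Core,OTHER_CONS_RES
SelectTypeParameters=CR_Core_Memory,OTHER_CONS_RES
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE,OTHER_CONS_RES

the last of those being, modulo ordering and case, what Magnus currently runs with.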
Given the two observations above, and that the slurm.conf man page says:

> CR_ONE_TASK_PER_CORE
> Allocate one task per core by default. Without this option, by default one
> task will be allocated per thread on nodes with more than one ThreadsPerCore
> configured. NOTE: This option cannot be used with CR_CPU*.

implying that one can't have a CR_CPU* with CR_ONE_TASK_PER_CORE, what would SchedMD suggest we use to match the

You need at least CR_(CPU|CORE|SOCKET)*

requirement, instead of what we currently have?

Furthermore, from my reading of the CR_ONE_TASK_PER_CORE description, one should be able to defeat Cray's "never-to-be-turned-off-via-the-BIOS" hyper-threading by simply using CR_ONE_TASK_PER_CORE, even though SLURM's interrogation of the hardware will have returned ThreadsPerCore=2?

Is that enough info for SchedMD to make a recommendation as to what we should have?

Kevin M. Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
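P.S. In case it helps frame an answer: my assumption, which I have yet to test on Chaos, is that a user can already get one-task-per-core behaviour on a per-job basis, without any slurm.conf change, along the lines of

# Per-job, rather than cluster-wide; the task count and ./my_app are placeholders:
srun --ntasks=24 --ntasks-per-core=1 ./my_app
srun --ntasks=24 --hint=nomultithread ./my_app

but that would rely on users asking for it, whereas CR_ONE_TASK_PER_CORE would presumably make it the default.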
Kevin,

Can you please provide the first page of your config.log from when you built Slurm?

Thanks,
--Nate
> Can you please provide the first page of your config.log
> from when you built Slurm?

Do you want the 17.x one, from the production machines, or the 18.x one from the TDS?

As I am sure SchedMD will be aware, from all the work you have done with Cray, their build isn't a typical CMMI, well, not as it appears to the user anyway, so we'll have to go looking.
(In reply to Kevin Buckley from comment #8)
> > Can you please provide the first page of your config.log
> > from when you built Slurm?
>
> Do you want the 17.x one, from the production machines, or the 18.x
> one from the TDS?

Both.

> As I am sure SchedMD will be aware, from all the work you have done
> with Cray, their build isn't a typical CMMI, well, not as it appears
> to the user anyway, so we'll have to go looking.

I want to make sure my local test instance mirrors yours. I'm mainly looking to see if you're using ALPS or native Cray support.
On 2019/03/19 12:18, bugs@schedmd.com wrote:
>> Do you want the 17.x one, from the production machines, or the 18.x
>> one from the TDS?
> Both.

I am about to build 18.08.6 (50 bug fixes, apparently!) on the TDS, so I'll send them after that.

> I want to make sure my local test instance mirrors yours. I'm mainly
> looking to see if you're using ALPS or native Cray support.

Not ALPS.
(In reply to Kevin Buckley from comment #2)
> which saw the existing Chaos config, viz.
>
> SelectTypeParameters=CR_CORE_Memory,other_cons_res
>
> changed to
>
> SelectTypeParameters=OTHER_CONS_RES,CR_ONE_TASK_PER_CORE

The wording in Bug#6552 comment #25 was a little confusing. Setting CR_ONE_TASK_PER_CORE will set the default value of ntasks-per-core in a job to 1. It is not required for CR_CORE_Memory to work, but the opposite is true (and forced).

As Dominik noted, per-core allocations can be done a few ways:

> Slurm provides mechanisms for allocating jobs with only one task per core: CR_ONE_TASK_PER_CORE, or the task/affinity TaskPlugin option --hint/SLURM_HINT.

Bug#1328 comment #1 goes into the suggested way to configure scheduling by cores, by setting CPUS=$core_count.
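To make the Bug#1328 suggestion concrete, a node definition that schedules by cores on a 2-socket, 12-cores-per-socket, 2-threads-per-core machine would look something like this (node names and memory value here are illustrative, not taken from your configs):

# Illustrative only: CPUs set to the physical core count (2 x 12 = 24),
# rather than the 48 threads slurmd would otherwise report.
NodeName=nid00[001-004] CPUs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=64000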
> Bug#1328 comment #1 goes into the suggested way to configure scheduling
> by cores by setting CPUS=$core_count.

The comment in that issue ticket got us talking, so cheers for the pointer.

One thing we have noticed is that part of the Cray installation suggests doing this

# python $SLURM_DIR/contribs/cray/csm/slurmconfgen_smw.py \
    -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
    sdb p0

so as to generate a prototype config file; however, that only appears to populate the NodeName definitions with (example taken from our TDS):

NodeName=nid000[32-35] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[36-39] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[13-15] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[16-19] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[24-27] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu # RealMemory=32768

where you'll note that there are no CPUs=N values, which is why none of our NodeName= definitions have ever had such a value.

Picking a Magnus compute node, and looking at the slurmd log just as the daemon starts up:

[2019-03-21T10:52:09.761] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=64298 TmpDisk=32149 Uptime=126 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

shows that slurmd's interrogation of the hardware leads it to believe that we have 48 "CPUs". The suggestion in Bug#1328 comment #1, however, would be that we override that with CPUs=24, so as to help defeat the fact that Cray can't follow SchedMD's advice from the same comment and "turn off hyper-threads in the bios".

However, doing so might then clash with our PartitionName definitions, e.g.:

PartitionName=workq Priority=1 Default=YES MaxTime=24:00:00 MaxNodes=1366 MaxCPUsPerNode=48 DefMemPerNode=60000 MaxMemPerNode=60000 Nodes=nid0[0018-0063,...]

wherein we set MaxCPUsPerNode=48.

Were we to follow SchedMD's advice as regards overriding NodeName's CPUs, would we then have to explicitly change all of the other values we have put in, mainly so as to get rid of many of the "undefined" values that we used to see before we put things in explicitly?

As always, any pointers welcome,
Kevin
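PS: to make the question concrete, my guess at what the change would look like for the 12-core Magnus nodes is below; this is untested on our side, and the knock-on MaxCPUsPerNode change is exactly the bit I am unsure about:

# Untested guess: advertise the 24 physical cores, not the 48 hyper-threads,
# and (perhaps?) drop the partition's per-node CPU cap to match.
NodeName=nid0[0018-0063,...] CPUs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4
PartitionName=workq Priority=1 Default=YES MaxTime=24:00:00 MaxNodes=1366 MaxCPUsPerNode=24 DefMemPerNode=60000 MaxMemPerNode=60000 Nodes=nid0[0018-0063,...]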
(In reply to Kevin Buckley from comment #14)
> One thing we have noticed is that part of the Cray installation suggests
> doing this
>
> # python $SLURM_DIR/contribs/cray/csm/slurmconfgen_smw.py \
>     -t $SLURM_DIR/contribs/cray/csm/ -o $SLURM_CONF_DIR \
>     sdb p0
>
> so as to generate a prototype config file; however, that only appears to
> populate the NodeName definitions with (example taken from our TDS):
>
> NodeName=nid000[32-35] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
> [...]

You can call 'slurmd -C' to see what hardware Slurm sees on any given node.

> where you'll note that there are no CPUs=N values, which is why none of
> our NodeName= definitions have ever had such a value.

Is your cluster homogeneous? If it is, then you can specify the CPUs in the default node:

NodeName=DEFAULT Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=65536 State=UNKNOWN

I suggest joining the 2 lines into one.

> Picking a Magnus compute node, and looking at the slurmd log just as the
> daemon starts up:
>
> [2019-03-21T10:52:09.761] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=64298 TmpDisk=32149 Uptime=126 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> shows that slurmd's interrogation of the hardware leads it to believe
> that we have 48 "CPUs". The suggestion in Bug#1328 comment #1, however,
> would be that we override that with CPUs=24, so as to help defeat the
> fact that Cray can't follow SchedMD's advice from the same comment and
> "turn off hyper-threads in the bios".
>
> However, doing so might then clash with our PartitionName definitions, e.g.:
>
> PartitionName=workq Priority=1 Default=YES MaxTime=24:00:00 MaxNodes=1366 MaxCPUsPerNode=48 DefMemPerNode=60000 MaxMemPerNode=60000 Nodes=nid0[0018-0063,...]
>
> wherein we set MaxCPUsPerNode=48.
>
> Were we to follow SchedMD's advice as regards overriding NodeName's CPUs,
> would we then have to explicitly change all of the other values we have
> put in, mainly so as to get rid of many of the "undefined" values that we
> used to see before we put things in explicitly?

That is a max, and not really required, since Slurm will generally (except for gang scheduling and overloading) not schedule jobs above the CPU count. You should be able to set CPUs=24 on the nodes for testing, at the very least.
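Also, since your TDS is evidently not homogeneous (there are 10-core, 12-core and 8-core nodes in the generated file you quoted), the DEFAULT node can carry the common values, with per-group overrides carrying the rest. A sketch along those lines, with CPUs set to Sockets x CoresPerSocket in each case and the RealMemory values taken from the commented-out ones in your generated file, so a starting point rather than a tested config:

# Sketch only: common values in DEFAULT, per-group overrides below.
NodeName=DEFAULT ThreadsPerCore=2 Gres=craynetwork:4 State=UNKNOWN
NodeName=nid000[13-15,32-35] CPUs=20 Sockets=2 CoresPerSocket=10 RealMemory=65536
NodeName=nid000[16-19,36-39] CPUs=24 Sockets=2 CoresPerSocket=12 RealMemory=65536
NodeName=nid000[24-27] CPUs=8 Sockets=1 CoresPerSocket=8 Gres=craynetwork:4,gpu RealMemory=32768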
I'm going to close this ticket. We can continue after the training if there are still questions.

--Nate