Hi,

I'd like to better understand how Slurm allocates CPU resources within a node. The documentation says:

"When using a SelectType of select/cons_tres, the default allocation method across nodes is block allocation (allocate all available CPUs in a node before using another node). The default allocation method within a node is cyclic allocation (allocate available CPUs in a round-robin fashion across the sockets within a node)."

https://slurm.schedmd.com/cpu_management.html#Step2

This holds for --ntasks=<number>, but for --ntasks-per-node=<number> the CPU allocation on the node always appears to be blockwise.

We have dual-socket systems with 24 cores on each socket:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2100.000
CPU max MHz:         3700,0000
CPU min MHz:         1000,0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
$

This is the CPU allocation on a node for --ntasks=8:

$ sbatch --nodes=1 --ntasks=8 -t 01:00 --wrap "$HOME/bin/whereami"
Submitted batch job 12289332
$ cat slurm-12289332.out
0000 n1424 - Cpus_allowed: 00000f00,000f0000,0f00000f Cpus_allowed_list: 0-3,24-27,48-51,72-75

Apparently the CPUs are allocated cyclically here, i.e. 4 allocated CPUs (0-3 and hyperthreading siblings 48-51) on the first socket and 4 CPUs (24-27 and siblings 72-75) on the second socket.

However, this is the CPU allocation on the node for --ntasks-per-node=8:

$ sbatch --nodes=1 --ntasks-per-node=8 -t 01:00 --wrap "$HOME/bin/whereami"
Submitted batch job 12289335
$ cat slurm-12289335.out
0000 n1101 - Cpus_allowed: 00000000,00ff0000,000000ff Cpus_allowed_list: 0-7,48-55

So the CPUs are now allocated blockwise, with all 8 allocated CPUs on the first socket and none on the second socket.

Can you please help me understand the rationale for having different CPU allocations with --ntasks=<number> (resulting in cyclic allocation) and --ntasks-per-node=<number> (resulting in blockwise allocation)?

I'll also attach our current slurm.conf in case it matters.

Thank you in advance.

Best regards
Jürgen
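The `$HOME/bin/whereami´ helper used in the jobs above is a site-local script whose contents are not part of this ticket. A minimal sketch of an equivalent script - assuming it only prints the task rank, the short hostname and the CPU affinity of the calling process as reported by the kernel - could look like this:

#!/bin/bash
# Hypothetical stand-in for the site-local "whereami" helper:
# print the task rank (0 for the batch step), the short hostname and the
# Cpus_allowed / Cpus_allowed_list fields of this process from /proc/self/status.
printf '%04d %s - ' "${SLURM_PROCID:-0}" "$(hostname -s)"
grep -E '^Cpus_allowed(_list)?:' /proc/self/status | tr -s '\t\n' ' '
echo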
Created attachment 35146 [details] Output of grep -v '^#' slurm.conf (comment lines stripped)
Hi Jürgen,

I can reproduce the issue you're reporting. Just to confirm, the behavior I would expect is what you're seeing with "--ntasks", where it uses a cyclic distribution. This lines up with what our documentation says is the default in the multi-core documentation:

> The default distribution on multi-core/multi-threaded systems is equivalent to
> -m block:cyclic with --cpu-bind=thread.

https://slurm.schedmd.com/mc_support.html#srun_dist

I'm able to force my request using '--ntasks' to use a block distribution by adding '-mblock:block' to the job request, but I can't force the '--ntasks-per-node' job to use a cyclic distribution with '-mblock:cyclic'. I wonder if you see the same behavior on your side or if this works for you. Would you mind testing that?

I'm still looking into what's causing this, but wanted to let you know that I'm seeing the same behavior.

Thanks,
Ben
(In reply to Ben Roberts from comment #3)
> I'm able to force my request using '--ntasks' to use a block distribution by
> adding '-mblock:block' to the job request, but I can't force the
> '--ntasks-per-node' job to use cyclic distribution with '-mblock:cyclic'. I
> wonder if you see the same behavior on your side or if this works for you.
> Would you mind testing that?

Hi Ben,

thank you for confirming. And, yes, I see exactly the same behavior on my side:

$ sbatch --nodes=1 --ntasks=8 -t 01:00 -mblock:block --wrap "$HOME/bin/whereami"
Submitted batch job 12303885
$ cat slurm-12303885.out
0000 n1423 - Cpus_allowed: 00000000,00ff0000,000000ff Cpus_allowed_list: 0-7,48-55
$

$ sbatch --nodes=1 --ntasks-per-node=8 -mblock:cyclic -t 01:00 --wrap "$HOME/bin/whereami"
Submitted batch job 12303888
[ul_l_jsalk@login02 ~]$ cat slurm-12303888.out
0000 n1423 - Cpus_allowed: 00000000,00ff0000,000000ff Cpus_allowed_list: 0-7,48-55
$

I think I also never managed to get application hints (--hint=memory_bound) to behave as expected when used together with the --ntasks-per-node option. The only way I've ever found to distribute CPUs across both sockets with --ntasks-per-node was by adding the --ntasks-per-socket option:

$ sbatch --nodes=1 --ntasks-per-node=8 --ntasks-per-socket=4 -t 01:00 --wrap "$HOME/bin/whereami"
Submitted batch job 12303899
$ cat slurm-12303899.out
0000 n1423 - Cpus_allowed: 00000f00,000f0000,0f00000f Cpus_allowed_list: 0-3,24-27,48-51,72-75
$

> I'm still looking into what's causing this, but wanted to let you know that
> I'm seeing the same behavior.

Thanks again for taking care of this. I've been puzzled for quite some time about why --ntasks-per-node usually results in a blockwise distribution of the allocated CPUs.

Best regards
Jürgen
Hi,

is there any news on this? We have some use cases where CPU placement does matter.

Just by chance, I also noticed something that confuses me even more: the CPU allocation with --ntasks-per-node seems to depend on whether generic resources are requested for the job or not. We have local scratch space defined as a generic resource.

This is what we get for a job that does not request scratch space:

$ sbatch --nodes=1 --ntasks-per-node=8 -w n1811 -t 01:00 --wrap "$HOME/bin/whereami"
Submitted batch job 12373861
$ cat slurm-12373861.out
0000 n1811 - Cpus_allowed: 00000000,00ff0000,000000ff Cpus_allowed_list: 0-7,48-55
$

And this is what we get on the very same node with `--gres=scratch:nn´ added:

$ sbatch --nodes=1 --ntasks=8 -w n1811 --gres=scratch:10 -t 01:00 --wrap "$HOME/bin/whereami"
Submitted batch job 12373869
$ cat slurm-12373869.out
0000 n1811 - Cpus_allowed: 00000f00,000f0000,0f00000f Cpus_allowed_list: 0-3,24-27,48-51,72-75
$

Note that the distribution of the CPUs has changed from blockwise to cyclic just by adding `--gres=scratch:10´. This is behavior I personally no longer understand, and it's becoming increasingly difficult to explain to users.

I will attach our gres.conf and cgroup.conf as well in case it matters.

Best regards
Jürgen
Created attachment 35345 [details] gres.conf
Created attachment 35346 [details] cgroup.conf
Hi,

my apologies. In comment #7 I unfortunately copied the wrong lines from my terminal into the browser for the `--gres=scratch:10´ test case. Here are the correct lines that I actually intended to post:

$ sbatch --nodes=1 --ntasks-per-node=8 -w n1811 -t 01:00 --wrap "$HOME/bin/whereami"
Submitted batch job 12373994
$ cat slurm-12373994.out
0000 n1811 - Cpus_allowed: 00000000,00ff0000,000000ff Cpus_allowed_list: 0-7,48-55
$

-> blockwise CPU distribution

$ sbatch --nodes=1 --ntasks-per-node=8 --gres=scratch:10 -w n1811 -t 01:00 --wrap "$HOME/bin/whereami"
Submitted batch job 12373995
$ cat slurm-12373995.out
0000 n1811 - Cpus_allowed: 00000f00,000f0000,0f00000f Cpus_allowed_list: 0-3,24-27,48-51,72-75
$

-> cyclic CPU distribution, just by adding `--gres=scratch:10´.

So, the questions still remain:

1) Is there any rationale for having blockwise CPU distribution with `--ntasks-per-node=<number>´ by default but cyclic distribution with `--ntasks=<number>´?

2) If so, in the case of `--ntasks-per-node=<number>´, why does the CPU distribution change from blockwise to cyclic just by adding `--gres=scratch:10´?

Best regards
Jürgen
Juergen,

Sorry for the delay in replying. The topic is actually tangled and I'm still not sure how to proceed.

There are actually two things affected by the -m/--distribution option. However, the main one here is the placement of tasks on already allocated resources; in terms of the cpu_management web documentation this is "Step 3". The internals of cons_tres may also use the option as a last-resort eliminator in "Step 2", but that only happens if, while internally processing an allocation, the select plugin has not already removed the non-required resources.

To make a long story short, you can think of the process in "Step 2" as the select plugin attempting to remove idle resources (e.g. cores) from the idle bitmap based on the job spec. It tries to do this as quickly as possible, i.e. to remove as much as possible as early as possible. For instance, if --ntasks-per-node is specified, we can "shrink" the bitmap per node very quickly, and the "best effort" logic using -m at the end won't be able to influence the allocated cores anymore. If we don't have a per-node limit but only a per-job one, we have to leave all available resources to the job until the final phase while collecting resources. If you specify --gres, that triggers a lot of additional logic, which may affect the mechanism too.

The influence of -m on "Step 2" isn't actually enforced, so it primarily depends on which resources happen to be idle. For instance, if nodes are preallocated in such a way that the only two idle cores sit on two different sockets, a job with -m block:block requiring two cores will still start there.

If you look at man srun[1], this is in fact documented:

> [...] For job allocation, this sets environment variables that will be used by
> subsequent srun requests. Task distribution affects job allocation at the last
> stage of the evaluation of available resources by the cons_tres plugin.
> Consequently, other options (e.g. --ntasks-per-node, --cpus-per-task) may
> affect resource selection prior to task distribution.

To really have full control over task placement, you have to allocate the node in an --exclusive manner.

Having said that, I think we actually have a documentation issue in cpu_management.shtml. I'll keep you posted.

cheers,
Marcin

[1] https://slurm.schedmd.com/srun.html#OPT_distribution
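For completeness: one way to inspect the "Step 2" outcome (the allocation shape) independently of how tasks are later bound in "Step 3" is the detailed job view, for example:

$ scontrol show job -d <jobid>

For a running job this additionally prints the per-node CPU_IDs that were allocated. Note that these are Slurm's abstract CPU/core indices, so they will not necessarily match the kernel's Cpus_allowed numbering shown in the whereami output above.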
(In reply to Marcin Stolarek from comment #11)
> Sorry for delay in reply.

Dear Marcin,

no problem. And thank you for your help.

> The topic is actually tangled and I'm still not
> sure on how to proceed.

Okay, perhaps we should boil this issue down to what primarily matters to me here. I am concerned about sub-node jobs only, i.e. jobs that request a small (and even) number of CPUs, for example 8 CPUs from a node that has a total of 48 CPUs. Let's also set aside the `-m´ option for the moment, since it's not a standard option. I'm also not going into the `--exclusive´ option, as that would sacrifice our current node access policy, which allows multiple jobs of one and the same user to run on a node at the same time if possible.

That said, I think what matters to us is what happens in Step 2, as this imposes the constraints on how individual tasks can be distributed across sockets in Step 3.

Let's assume we have a node that is completely unallocated so far, i.e. all 48 CPUs previously idle. From the documentation ("The default allocation method within a node is cyclic allocation") one would expect that Step 2 will then allocate CPUs in a round-robin fashion, resulting in 4 CPUs allocated on the first socket and 4 CPUs on the second socket. Do you agree so far?

But this is obviously not (always) true, as shown in the test cases from the initial description and also the test case from comment #10. Instead, the CPU allocation in Step 2 actually depends on *how* the 8 CPUs were requested (`--ntasks=8´ versus `--ntasks-per-node=8´) and, even more surprising to me, also on the `--gres=scratch´ option, which is not necessarily expected to influence CPU allocation. Agreed?

From what I have observed and from what you've explained so far, I understand that the internal process for CPU allocation in Step 2 is not really straightforward, such that the resulting CPU allocation for sub-node jobs is not (or not always) comprehensible and, to some extent, can be seen as unpredictable, especially from the user's perspective, right? Admittedly, this even applies to my own perspective as well.

I've now tested several combinations in order to get a better idea myself of which option combination leads to which distribution of allocated CPUs on a node. I've also included some not-so-standard options in the option matrix.
This is what I finally got:

--nodes=1 --ntasks-per-node=8                                                            -> block
--nodes=1 --ntasks-per-node=8 --gres=scratch:10                                          -> cyclic
--nodes=1 --ntasks-per-node=1 --cpus-per-task=8                                          -> block
--nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --gres=scratch:10                        -> cyclic
--nodes=1 --ntasks=8                                                                     -> cyclic
--nodes=1 --ntasks=8 --gres=scratch:10                                                   -> cyclic
--nodes=1 --ntasks=1 --cpus-per-task=8                                                   -> block
--nodes=1 --ntasks=1 --cpus-per-task=8 --gres=scratch:10                                 -> cyclic

--nodes=1 --ntasks-per-node=8 --ntasks-per-socket=4                                      -> cyclic
--nodes=1 --ntasks-per-node=8 --gres=scratch:10 --ntasks-per-socket=4                    -> cyclic
--nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --ntasks-per-socket=4                    -> block
--nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --gres=scratch:10 --ntasks-per-socket=4  -> cyclic
--nodes=1 --ntasks=8 --ntasks-per-socket=4                                               -> cyclic
--nodes=1 --ntasks=8 --gres=scratch:10 --ntasks-per-socket=4                             -> cyclic
--nodes=1 --ntasks=1 --cpus-per-task=8 --ntasks-per-socket=4                             -> block
--nodes=1 --ntasks=1 --cpus-per-task=8 --gres=scratch:10 --ntasks-per-socket=4           -> cyclic

--nodes=1 --ntasks-per-node=8 -m block:cyclic                                            -> block
--nodes=1 --ntasks-per-node=8 --gres=scratch:10 -m block:cyclic                          -> cyclic
--nodes=1 --ntasks-per-node=1 --cpus-per-task=8 -m block:cyclic                          -> block
--nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --gres=scratch:10 -m block:cyclic        -> cyclic
--nodes=1 --ntasks=8 -m block:cyclic                                                     -> cyclic
--nodes=1 --ntasks=8 --gres=scratch:10 -m block:cyclic                                   -> cyclic
--nodes=1 --ntasks=1 --cpus-per-task=8 -m block:cyclic                                   -> block
--nodes=1 --ntasks=1 --cpus-per-task=8 --gres=scratch:10 -m block:cyclic                 -> cyclic

--nodes=1 --ntasks-per-node=8 -m block:block                                             -> block
--nodes=1 --ntasks-per-node=8 --gres=scratch:10 -m block:block                           -> cyclic
--nodes=1 --ntasks-per-node=1 --cpus-per-task=8 -m block:block                           -> block
--nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --gres=scratch:10 -m block:block         -> cyclic
--nodes=1 --ntasks=8 -m block:block                                                      -> block
--nodes=1 --ntasks=8 --gres=scratch:10 -m block:block                                    -> cyclic
--nodes=1 --ntasks=1 --cpus-per-task=8 -m block:block                                    -> block
--nodes=1 --ntasks=1 --cpus-per-task=8 --gres=scratch:10 -m block:block                  -> cyclic

Not sure if this is all expected or intended behaviour on your end. For example, the `--gres=scratch´ option does *always* enforce a cyclic CPU allocation (i.e. 4 CPUs allocated on the first socket and 4 CPUs on the other socket), regardless of how the 8 CPUs were requested and even with the `-m block:block´ option added. So there seems to be no obvious way to have all 8 CPUs allocated on the same socket as soon as the `--gres=scratch´ option is in place. Can you confirm this?

Maybe I have missed other options, constellations or hidden parameters that also have an impact on the CPU allocation in this scenario. For example, I did not check the influence of the moon phase yet ...

Best regards
Jürgen
Jürgen,

I can reproduce the behavior you're describing and I generally agree with your observations. Let me first answer your questions:

> From the documentation ("The default allocation method within a node is cyclic
> allocation") one would expect that step 2 will allocate CPUs in a round robin
> fashion then, resulting in 4 CPUs allocated on the first socket and 4 CPUs on
> the second socket.
>
> Do you agree so far?

I agree that your understanding of this statement is correct. The sentence comes from commit f4102fe7ac6 and dates back to 2011. The state of the code has changed a lot since then, and Slurm's selection logic is much more complicated today, so I now see this statement as an oversimplification.

> [...] internal processes for CPU allocation [...] not really straightforward
> [...] for sub-node jobs is not (or not always) [...] comprehensible and, to some
> extent, can be seen as unpredictable, especially from the user's perspective,
> right?

It's predictable in the sense that it's not stochastic, but given the number of parameters one can use to specify a job and the number of ways nodes can be preallocated, it's not straightforward.

> Not sure if this is all expected or intended behaviour on your end. [...]

I'm hesitant to call it wrong as long as it matches the spec - keep in mind that -m is not really a "Step 2" option (we're going to update the cpu_management web doc). I understand that its impact on the allocation shape may be confusing in certain cases, but that's the way it's implemented today and documented in our tools' man pages.

The way I'd recommend to enforce an allocation that results in a specific number of tasks running per socket is to add an --ntasks-per-socket specification.

I'll keep you posted on the documentation update.

cheers,
Marcin
Hi Marcin,

thank you for your efforts to shed some light on the matter. I do understand that it's challenging to document complex internal processes in an easily understandable way.

I really don't want to blow things out of proportion, and I also have to admit that the CPU allocation is mostly good by default in the vast majority of cases, or doesn't play such a crucial role. The background to my questions, which may seem like nitpicking, is simply that we've had some corner cases recently where the distribution of CPUs across the sockets seems to play a significant role in terms of memory locality, and the uncertainties regarding blockwise versus cyclic CPU allocation have stirred up a lot of confusion internally.

And, yes, we also found that using --ntasks-per-socket does help in specific scenarios. It's just that we haven't described this option in our own user documentation so far. Partly because we didn't want to overwhelm our users with too many options that they wouldn't typically need, but also, perhaps more importantly, because it's not always immediately apparent when their jobs would benefit from this particular option, since that depends on how CPUs are allocated across the sockets by default ... You get the point.

Things get even more complicated when hyper-threading comes into play and the Linux kernel tries to enforce its own ideas about process placement on the allocated CPUs, which eventually has to be counteracted with yet more options, such as --hint=nomultithread or --threads-per-core=1, at the same time.

Best regards
Jürgen
Hi Jürgen,

I'm happy to continue the discussion to find the best solution for you.

Unfortunately, what we're seeing is that cluster workloads are getting more and more heterogeneous, and depending on the application, different constraints (in the general sense, not the Slurm -C meaning) are required. That's in fact one of the reasons why new options like --ntasks-per-gpu are being developed, making resource selection even more complicated.

The usual approach to make end users' lives easier is the use of CliFilterPlugins[1] or JobSubmitPlugins[2] to set some defaults based on the already known job parameters. Both of those can be lua scripts, so they can implement even complicated logic that sets job spec options, for instance based on the partition used. However, user education is still unavoidable.

Based on my professional experience (I'm a former cluster admin), applications requiring a specific process distribution over the node are usually vulnerable (in terms of performance) to events like cache eviction caused by other apps running on the node. Because of that, the only way for those to perform optimally may be to use --exclusive.

Do you have some examples? Maybe I can help with some ideas for cli_filter or job_submit.

cheers,
Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_CliFilterPlugins
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_JobSubmitPlugins
Hi Marcin,

thank you. I would indeed welcome the opportunity to continue the discussion if possible. However, that will probably have to wait a bit, as we currently have another urgent issue to address (see: bug #19308). Can we please leave this ticket open for a while?

In fact, we already have JobSubmitPlugins=lua in place for queue/partition routing. But, in general, I am admittedly not such a big fan of introducing/enforcing too much homegrown magic into scheduling, because this sometimes requires prior knowledge of what the users actually want to achieve and, to some extent, also takes control over their own jobs away from the users.

Thanks again for now.

Best regards
Jürgen
OK, do you want me to get back to you when bug 19308 is closed?
Jürgen,

Is there anything I can help you with in this ticket?

cheers,
Marcin
Hi Marcin,

thank you for getting back to me. Unfortunately, I've lost track of the topic a bit, although some questions are still open for me. One question would be, for example, how to achieve a blockwise allocation of CPUs when `--gres=scratch:nn´ is requested at the same time?

Best regards
Jürgen
> One question would be, for example, how to achieve blockwise allocation

If you want to be sure that tasks are placed on CPUs with the specified distribution, you have to allocate the nodes exclusively (using --exclusive). Then `-m`, used with the `srun` that spawns tasks on the nodes (within the existing exclusive allocation), applies the selected task distribution over the sockets.

cheers,
Marcin
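A rough sketch of what that could look like in practice (the options are standard salloc/srun options; node, time limit and the resulting masks are illustrative only):

$ salloc --nodes=1 --exclusive -t 01:00
$ srun --ntasks=8 -m block:cyclic --cpu-bind=verbose,cores $HOME/bin/whereami
$ srun --ntasks=8 -m block:block --cpu-bind=verbose,cores $HOME/bin/whereami

Within the exclusive allocation, the first srun should spread the eight tasks across both sockets and the second should pack them onto one socket; --cpu-bind=verbose reports the binding mask that was actually applied for each task.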
Hi Marcin,

ok, understood. Actually, `--exclusive´ isn't exactly what we want for small jobs with just a few cores, because it would leave a large number of cores on the node unused. But it's good to have a clear confirmation that with non-exclusive node allocation in Slurm it is simply not possible to always achieve the desired CPU distribution for individual jobs. So it seems we'll have to live with this limitation for now, although it may become an even bigger issue if nodes continue to get more and more cores in the future ...

Anyway, thanks again for the clarification.

Best regards
Jürgen
Jurgen,

As a closing note I just want to summarize. Slurm has a number of options to shape the resource allocation, but `-m` isn't used for that. Its purpose is to distribute tasks over an existing allocation. For instance, your allocation may contain four cores on socket 1 and four cores on socket 2. Distributing tasks that request 2 CPUs per task for binding:

1) Blockwise: puts both CPUs of tasks one and two on socket 1 and tasks three and four on socket 2.
2) Cyclic: puts both CPUs of tasks one and three on socket 1 and tasks two and four on socket 2.
3) fcyclic: puts the first CPU of tasks 1,2,3,4 on socket 1 and the second CPU of tasks 1,2,3,4 on socket 2.

The final binding will still depend on the TaskPlugin and on --cpu-bind[1]/CpuBind[2]/TaskPluginParam[3].

I hope it's clearer now. Let me know if you're good with closing the ticket as "information given".

cheers,
Marcin

[1] https://slurm.schedmd.com/srun.html#OPT_cpu-bind
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_CpuBind
[3] https://slurm.schedmd.com/slurm.conf.html#OPT_TaskPluginParam
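To make the three cases concrete, a sketch of how they could be requested for the example above (four tasks with two CPUs each, run inside an already existing exclusive single-node allocation; `./app´ is just a placeholder):

$ srun --ntasks=4 --cpus-per-task=2 -m block:block --cpu-bind=verbose ./app
$ srun --ntasks=4 --cpus-per-task=2 -m block:cyclic --cpu-bind=verbose ./app
$ srun --ntasks=4 --cpus-per-task=2 -m block:fcyclic --cpu-bind=verbose ./app

The first (node-level) part of -m is irrelevant on a single node; the second part selects the block/cyclic/fcyclic socket distribution described above, and --cpu-bind=verbose reports the resulting binding for each task.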
I'll go ahead and close the ticket as information given. Should you have any questions please reopen.