| Summary: | Binding problem with hyperthreads | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Regine Gaudin <regine.gaudin> |
| Component: | Configuration | Assignee: | Oscar Hernández <oscar.hernandez> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | oscar.hernandez |
| Version: | 20.11.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | CEA | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
|
Description
Regine Gaudin
2022-06-02 05:41:59 MDT
Dear Regine,

I'll try to bring some context here. Slurm will never allocate a single thread to a job (on a multithreaded cluster), as it would be highly inefficient to have two independent jobs sharing the same physical core. See [1]: "The count of CPUs allocated to a job may be rounded up to account for every CPU on an allocated core."

When setting SelectTypeParameters you configure how Slurm handles accounting/scheduling (cores allocatable/used), but not the binding of tasks. The relevant SelectTypeParameters you have:

- CR_Core_Memory: count each thread as a CPU in Slurm, and also account for memory.
- CR_ONE_TASK_PER_CORE: limit the maximum number of tasks to the physical cores of the machine (as each task is accounted two threads). So, in your case, the maximum number of tasks per node is 128.

With this configuration you are not defining task binding, and srun gets all the allocated resources. Taking that into account, let me answer some of your questions in-line:

> $ srun -n 1 -c 2 -p a100-bxi cat /proc/self/status|grep Cpus_allowed_list
> Cpus_allowed_list: 0,128
>
> 0,1,128,129 expected according to result obtained with -c 1

The minimum allocation Slurm grants is a physical core, so -c 1 and -c 2 will get you the same: 1 physical core, 2 logical CPUs.

> $ srun -n 1 -c 3 -p a100-bxi cat /proc/self/status|grep Cpus_allowed
> Cpus_allowed: 00000000,00000000,00000000,00000003,00000000,00000000,00000000,00000003
> Cpus_allowed_list: 0-1,128-129
>
> 0-2,128-130 expected according to result obtained with -c 1

Here it gets 2 physical cores (4 logical CPUs), which is necessary to allocate 3 threads. Remember [1] that CPUs are increased in groups of 2.

The 2nd configuration behaves the same as the 1st one. task/affinity makes no difference here, as you are not specifying any binding affinity in srun.
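The round-up rule described above can be sketched in a few lines of shell: a -c request is rounded up to whole physical cores, so the granted logical-CPU count is always a multiple of ThreadsPerCore (2 on these nodes). This loop is only an illustration of the arithmetic, not part of Slurm itself:

```shell
# Illustration of Slurm's core round-up with ThreadsPerCore=2:
# a -c request is rounded up to a whole number of physical cores.
threads_per_core=2
for c in 1 2 3 6; do
  cores=$(( (c + threads_per_core - 1) / threads_per_core ))
  echo "-c $c -> $cores core(s), $(( cores * threads_per_core )) logical CPUs"
done
```

This matches the outputs quoted above: -c 1 and -c 2 both land on 1 core (2 logical CPUs), and -c 3 lands on 2 cores (4 logical CPUs).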
> $ srun -n 1 -c 2 -p a100-bxi cat /proc/self/status|grep Cpus_allowed_list
> Cpus_allowed_list: 0,128
>
> if -c 1 gives one physical core with 2 hyperthreads, -c 2 should give 0,1,128,129

Same situation as before: CPUs go in groups of 2, so -c 1 and -c 2 get you the same.

> $ srun -n 1 -c 6 -p a100-bxi cat /proc/self/status|grep Cpus_allowed_list
> slurmstepd-inti7800: error: task[0] unable to set taskset

I have not been able to reproduce this error. Was it a one-off?

> $ srun -n 1 -c 1 -p a100-bxi cat /proc/self/status|grep Cpus_allowed_list
> Cpus_allowed_list: 0-255

For the 3rd config, I see you are changing the node layout on purpose (so it no longer reflects the real hardware). This might be the reason for this strange cpulist.

> --hint=nomultithread is working but we want to offer the possibility to
> allocate the hyperthreads

Have you tried --hint=multithread? I think it should give you what you want. Apart from that, if you do not want to add the --hint to each job, you could also modify the node description in slurm.conf, adding [2] CpuBind=Thread. The line in your case should be something like:

NodeName=inti7800 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=240000 Gres=gpu:nvidia:4 State=UNKNOWN CpuBind=Thread

In addition, as you bind to Thread, if you want your node to allocate 256 tasks (one per thread) and avoid wasting half of the node, you should remove CR_ONE_TASK_PER_CORE, which limits it to 128. You can test it by running:

srun -n 256 -N 1 -c 1 cat /proc/self/status|grep Cpus_allowed_list

I guess that with CR_ONE_TASK_PER_CORE Slurm won't be able to satisfy the allocation. Let me know if I missed something or you have any doubts about my comments.

Regards,
Oscar

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_CR_Core_Memory
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_CpuBind

Dear Regine,

Hope you managed to configure it the way you intended. I will be closing this bug for now.
Do not hesitate to re-open if any follow-up question arises.

Kind regards,
Oscar

Hi, why do we have this?

$ srun -n 2 -c 3 --exclusive -p a100-bxi cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 0-1,128
Cpus_allowed_list: 0,2,130

Hi Regine,

Let me apologize, I missed that one. It does not look like the expected behavior, as it gives CPU 0 to 2 different tasks. I can reproduce this by setting CR_ONE_TASK_PER_CORE (it does not happen when it is removed), so it is related to that parameter.

With SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE:

oscar@saborito:~/Projects$ srun -n 2 -c 3 cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 0-1,4
Cpus_allowed_list: 0,2,6

With SelectTypeParameters=CR_Core_Memory:

oscar@saborito:~/Projects$ srun -n 2 -c 3 cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 0-1,4
Cpus_allowed_list: 2,5-6

I will take a look into that and come back to you when I have something.

Regards,
Oscar

Dear Regine,

Some feedback on this issue. The bug is related to the option --ntasks-per-core=1, the same one CR_ONE_TASK_PER_CORE implicitly sets. As documented in [1], srun does not recognize this option, and in some particular situations, like this one, it can break core binding. We are currently working on a patch to address the issue.

On the other hand, with regard to the binding needs you had: were you able to correctly bind tasks to threads by setting CpuBind=Thread?
Kind regards,
Oscar

[1] https://slurm.schedmd.com/srun.html#OPT_ntasks-per-core

Hi,

Using hyperthreading: removing CR_ONE_TASK_PER_CORE and adding CpuBind=Thread to the partition conf is OK:

srun -n 2 -c 3 --exclusive -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 2,129-130
Cpus_allowed_list: 0-1,128

[gaudinr@inti6006 gaudinr] $ srun -n 2 -c 3 -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 2,129-130
Cpus_allowed_list: 0-1,128

[gaudinr@inti6006 gaudinr] $ srun -n 2 -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 128
Cpus_allowed_list: 0

For those who do not want to use hyperthreads, it seems OK as well:

[gaudinr@inti6006 gaudinr] $ srun -n 2 --hint=nomultithreda -p rome cat /proc/self/status|grep Cpus_allowed_list
srun: error: unrecognized --hint argument "nomultithreda", see --hint=help

[gaudinr@inti6006 gaudinr] $ srun -n 2 --hint=nomultithread -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 1
Cpus_allowed_list: 0

srun -n 2 -c 3 --hint=nomultithread -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 67-69
Cpus_allowed_list: 64-66

[gaudinr@inti6006 gaudinr] $ srun -n 2 -c 3 --exclusive --hint=nomultithread -p rome
cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 3-5
Cpus_allowed_list: 0-2

Some users have asked me about this convenient request: is there a way (parameter, option) on the hyperthreaded nodes to get back the full-core allocation (-c 1, but with both hyperthreads)?

srun -n 2 -c 1 ( --hint=nomultithread ) -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 0,128
Cpus_allowed_list: 1,129

The aim is to have double-hyperthread binding: one user task per core on one hyperthread, and the other hyperthread for the MPI helper thread, for instance.

Hi Regine,

I am not sure I understood correctly. Do you mean a parameter that gives a functionality similar to "--hint=nomultithread" when thread binding is configured? If that is the case, adding --ntasks-per-core=1 should give the expected output.

#default behavior with CpuBind=Thread
oscar@comp:/TESTS$ srun -n 2 -c 1 whereami
0 c1 - Cpus_allowed: 01 Cpus_allowed_list: 0
1 c1 - Cpus_allowed: 10 Cpus_allowed_list: 4

#with the ntasks option
oscar@comp:/TESTS$ srun -n 2 -c 1 --ntasks-per-core=1 whereami
0 c1 - Cpus_allowed: 11 Cpus_allowed_list: 0,4
1 c1 - Cpus_allowed: 22 Cpus_allowed_list: 1,5

Let me know if I misunderstood the question.

Hi,

In fact, users would like to avoid doubling the -c parameter, as they are asking for physical cores and we would like to avoid changing all the accounting processes (since -c now means a thread, 1 full physical core requires -c 2, which is disruptive). --ntasks-per-core=1 seems to offer this possibility for -c 1, but it does not do it for -c 2...
srun -n 1 -c 1 -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 0

[gaudinr@inti6006 gaudinr] $ srun -n 1 -c 1 --ntasks-per-core=1 -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 0,128

[gaudinr@inti6006 gaudinr] $ srun -n 1 -c 2 -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 0,128

[gaudinr@inti6006 gaudinr] $ srun -n 1 -c 2 --ntasks-per-core=1 -p rome cat /proc/self/status|grep Cpus_allowed_list
Cpus_allowed_list: 0,128

We would like Cpus_allowed_list: 0,1,128,129 for -c 2.

Hi Regine,

The idea of having hyperthreading and thread binding is to treat each thread as an individual CPU, for task allocation but also for accounting. To be clear, -c 1 will grant you one thread and also account you for 1 CPU. What you propose, automatically granting 2 CPUs when -c 1 is requested, would break that logic and would not make much sense from a general perspective.

With your last comment, though, I think I get what your initial intention was (I believe I misunderstood it at first); please let me know if I am wrong. Your node has 128 cores and 128*2 threads. You want the node to allocate a maximum of 128 tasks/cores, and also to account for a maximum of 128 CPUs (not 256): a Slurm behavior similar to a node with no hyperthreading, but at the same time you want the allowed CPUs to show the threads:

oscar@comp:/TESTS$ srun -n 2 -c 1 whereami
0 c1 - Cpus_allowed: 11 Cpus_allowed_list: 0,4
1 c1 - Cpus_allowed: 22 Cpus_allowed_list: 1,5

oscar@comp:~/TESTS$ srun -n 1 -c 2 whereami
0 c1 - Cpus_allowed: 33 Cpus_allowed_list: 0-1,4-5

If that is the case, you could try to modify the node definition in slurm.conf, setting [1] CPUs=128 and removing the CpuBind suggested earlier.
For example, in the initial config you sent me:

NodeName=inti7800 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=240000 Gres=gpu:nvidia:4 State=UNKNOWN CPUs=128

This will cause Slurm to only allocate full cores, though, so the previous per-thread binding won't be possible:

oscar@comp:/TESTS$ srun -n 2 -c 1 whereami
0 c1 - Cpus_allowed: 01 Cpus_allowed_list: 0
1 c1 - Cpus_allowed: 10 Cpus_allowed_list: 4

Also, take into account that with this configuration Slurm will only account for 128 CPUs per node. I suppose that is what you want in order to be consistent with the other non-hyperthreading nodes.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_CPUs

Hi Regine,

Have the suggestions provided in the previous comment been useful to match your needs?

Regards,
Oscar

NodeName=inti[6006,6042] Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=240000 State=UNKNOWN CPUs=128

Yes, it seems it does answer the request, thanks. Users ask for cores with -c as they are used to, and can either use the hyperthreads or not:

[root@inti6006 slurm] # ccc_mprun -n 4 -c 32 -p rome cat /proc/self/status |grep Cpus_allowed_list
Cpus_allowed_list: 64-95,192-223
Cpus_allowed_list: 96-127,224-255
Cpus_allowed_list: 32-63,160-191
Cpus_allowed_list: 0-31,128-159

Great! Indeed, it looks like we finally got the right configuration. I am closing this bug. Don't hesitate to re-open if any related question arises.

Regards,
Oscar