According to the documentation of srun (https://slurm.schedmd.com/srun.html#OPT_distribution), the default behaviour of --distribution is "block:cyclic", which should place tasks in blocks across nodes and cyclically across sockets. However, we do not observe this behaviour on our cluster with Slurm 21.08.8. A simple srun with 4 tasks on a single node always sets the same CPU bind masks, allocating the first cores on the first available CPU socket regardless of the value of `--distribution`. In the following cases, node250 was completely empty:

$ srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=2 --distribution=block:cyclic --hint=compute_bound --cpu-bind=verbose sleep 1
cpu-bind=MASK - node250, task  0  0 [2105]: mask 0x3 set
cpu-bind=MASK - node250, task  1  1 [2106]: mask 0xc set
cpu-bind=MASK - node250, task  2  2 [2107]: mask 0x30 set
cpu-bind=MASK - node250, task  3  3 [2108]: mask 0xc0 set

$ srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=2 --distribution=block:fcyclic --hint=compute_bound --cpu-bind=verbose sleep 1
cpu-bind=MASK - node250, task  0  0 [2105]: mask 0x3 set
cpu-bind=MASK - node250, task  1  1 [2106]: mask 0xc set
cpu-bind=MASK - node250, task  2  2 [2107]: mask 0x30 set
cpu-bind=MASK - node250, task  3  3 [2108]: mask 0xc0 set

$ srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=2 --distribution=block:block --hint=compute_bound --cpu-bind=verbose sleep 1
cpu-bind=MASK - node250, task  0  0 [2105]: mask 0x3 set
cpu-bind=MASK - node250, task  1  1 [2106]: mask 0xc set
cpu-bind=MASK - node250, task  2  2 [2107]: mask 0x30 set
cpu-bind=MASK - node250, task  3  3 [2108]: mask 0xc0 set

We also tried changing the value of `--hint` to memory_bound, as explained in https://slurm.schedmd.com/mc_support.html#srun_hints.
But the CPU bind masks are still the same as in the previous cases:

$ srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=2 --distribution=block:cyclic --hint=memory_bound --cpu-bind=verbose sleep 1
cpu-bind=MASK - node250, task  0  0 [2105]: mask 0x3 set
cpu-bind=MASK - node250, task  1  1 [2106]: mask 0xc set
cpu-bind=MASK - node250, task  2  2 [2107]: mask 0x30 set
cpu-bind=MASK - node250, task  3  3 [2108]: mask 0xc0 set

Our slurm.conf is configured to use task/cgroup and task/affinity, following the example in the documentation: https://slurm.schedmd.com/cgroup.conf.html#SECTION_EXAMPLE

This issue is impacting the allocation of jobs requesting GPUs because we also enforce binding of GRES resources. For instance, Slurm fails to allocate a 2-task job requesting 2 GPUs because it cannot bind the requested number of GPUs when all of the job's CPU cores are bound to the same CPU socket.

Thanks for the help,
Alex
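The mismatch described above can be illustrated with a toy model. This is purely illustrative Python, not Slurm's selection code; the 2-socket, 12-cores-per-socket topology is assumed from the node250 hardware mentioned later in the thread. A cyclic socket distribution would alternate masks between the two sockets, whereas the masks actually observed (0x3, 0xc, 0x30, 0xc0) are exactly what a block-on-socket placement produces:

```python
# Toy model (not Slurm's code) of how CPU-bind masks would differ between
# block and cyclic socket distribution on an assumed 2-socket node with
# 12 cores per socket: cores 0-11 on socket 0, cores 12-23 on socket 1.
# Each task binds to --cpus-per-task consecutive cores on its socket.

def bind_masks(ntasks, cpus_per_task, sockets=2, cores_per_socket=12,
               socket_dist="cyclic"):
    """Return one hex CPU mask per task for a simplified distribution."""
    masks = []
    next_core = [s * cores_per_socket for s in range(sockets)]  # per-socket cursor
    for task in range(ntasks):
        if socket_dist == "cyclic":
            s = task % sockets          # alternate sockets task by task
        else:  # "block": fill socket 0 first
            s = (task * cpus_per_task) // cores_per_socket
        mask = 0
        for _ in range(cpus_per_task):
            mask |= 1 << next_core[s]
            next_core[s] += 1
        masks.append(hex(mask))
    return masks

# What a cyclic socket distribution would look like: masks alternate sockets.
print(bind_masks(4, 2, socket_dist="cyclic"))  # ['0x3', '0x3000', '0xc', '0xc000']
# What was actually observed: all four tasks packed onto socket 0.
print(bind_masks(4, 2, socket_dist="block"))   # ['0x3', '0xc', '0x30', '0xc0']
```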
Hi Alex,

I am looking into this.

> This issue is impacting allocation of jobs requesting GPUs because we are enforcing binding
> of GRES resources as well. For instance, Slurm fails to allocate a 2 task job requesting 2
> GPUs because it cannot bind the requested amount of GPUs if the CPU cores of the job are
> all bound to the same CPU socket.

Can you post an example of this as well?
This is the issue with GPUs I was referring to:

$ srun --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --distribution=block:cyclic --cpu-bind=verbose sleep 1
srun: Force Terminated job 6726859
srun: error: Unable to allocate resources: Requested node configuration is not available

We do have nodes with 2 GPUs per node that should be candidates for this job:

$ sinfo -N --partition pascal_gpu
CLUSTER: hydra
HOSTNAMES PARTITION  STATE CPUS(A/I/O/T) CPU_LOAD MEMORY MB FREE_MEM MB GRES              GRES_USED
node250   pascal_gpu mix   2/22/0/24     2.38     257726 MB 213594 MB   gpu:p100:2(S:0-1) gpu:p100:2(IDX:0-1)

Adding the option "--cpus-per-task=12" to srun, which corresponds to the number of cores per socket in those nodes, does make the job go through:

$ srun --nodes=1 --ntasks-per-node=2 --cpus-per-task=12 --gpus-per-task=1 --distribution=block:cyclic --cpu-bind=verbose sleep 1
srun: job 6726865 queued and waiting for resources

Let me know if I can provide any other information.

Alex
Hi Alex,

I can reproduce this. I have also determined that it is only the job allocation that ignores the socket distribution: if the job is allocated a whole node, it is easy to verify that the step allocation works as expected (cyclic distribution across sockets by default). For now I'm focusing on this; then I'll take a look at --hint and the GPU issues.
I believe this is working as intended. --distribution affects the distribution (or ordering) of tasks only after the job allocation happens; it does not affect which cores are selected. In the following examples, note the order and placement of the tasks: with cyclic distribution, tasks 0 and 1 are on different sockets, but with block distribution, tasks 0 and 1 are on the same socket.

Hardware topology:
2 sockets
8 cores per socket
2 threads per core
Socket 0:
  Core 0: P0,16
  Core 1: P1,17
  ...
  Core 7: P7,23
Socket 1:
  Core 8: P8,24
  ...
  Core 15: P15,31

marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ salloc -N1 -n12 -c2 -m block:block
salloc: Granted job allocation 20
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ srun whereami|sort
0000 n1-1 - Cpus_allowed: 00010001 Cpus_allowed_list: 0,16
0001 n1-1 - Cpus_allowed: 00020002 Cpus_allowed_list: 1,17
0002 n1-1 - Cpus_allowed: 00040004 Cpus_allowed_list: 2,18
0003 n1-1 - Cpus_allowed: 00080008 Cpus_allowed_list: 3,19
0004 n1-1 - Cpus_allowed: 00100010 Cpus_allowed_list: 4,20
0005 n1-1 - Cpus_allowed: 00200020 Cpus_allowed_list: 5,21
0006 n1-1 - Cpus_allowed: 00400040 Cpus_allowed_list: 6,22
0007 n1-1 - Cpus_allowed: 00800080 Cpus_allowed_list: 7,23
0008 n1-1 - Cpus_allowed: 01000100 Cpus_allowed_list: 8,24
0009 n1-1 - Cpus_allowed: 02000200 Cpus_allowed_list: 9,25
0010 n1-1 - Cpus_allowed: 04000400 Cpus_allowed_list: 10,26
0011 n1-1 - Cpus_allowed: 08000800 Cpus_allowed_list: 11,27
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ exit
salloc: Relinquishing job allocation 20
salloc: Job allocation 20 has been revoked.
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ salloc -N1 -n12 -c2 -m block:cyclic
salloc: Granted job allocation 21
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ srun whereami|sort
0000 n1-1 - Cpus_allowed: 00010001 Cpus_allowed_list: 0,16
0001 n1-1 - Cpus_allowed: 01000100 Cpus_allowed_list: 8,24
0002 n1-1 - Cpus_allowed: 00020002 Cpus_allowed_list: 1,17
0003 n1-1 - Cpus_allowed: 02000200 Cpus_allowed_list: 9,25
0004 n1-1 - Cpus_allowed: 00040004 Cpus_allowed_list: 2,18
0005 n1-1 - Cpus_allowed: 04000400 Cpus_allowed_list: 10,26
0006 n1-1 - Cpus_allowed: 00080008 Cpus_allowed_list: 3,19
0007 n1-1 - Cpus_allowed: 08000800 Cpus_allowed_list: 11,27
0008 n1-1 - Cpus_allowed: 00100010 Cpus_allowed_list: 4,20
0009 n1-1 - Cpus_allowed: 00200020 Cpus_allowed_list: 5,21
0010 n1-1 - Cpus_allowed: 00400040 Cpus_allowed_list: 6,22
0011 n1-1 - Cpus_allowed: 00800080 Cpus_allowed_list: 7,23

Will --ntasks-per-socket work for you? Example:

marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ salloc --ntasks-per-socket=2 -N1 -n4 -c2 -m block:cyclic
salloc: Granted job allocation 25
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ srun whereami |sort
0000 n1-1 - Cpus_allowed: 00010001 Cpus_allowed_list: 0,16
0001 n1-1 - Cpus_allowed: 01000100 Cpus_allowed_list: 8,24
0002 n1-1 - Cpus_allowed: 00020002 Cpus_allowed_list: 1,17
0003 n1-1 - Cpus_allowed: 02000200 Cpus_allowed_list: 9,25
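The reordering demonstrated above can be sketched in a few lines of Python. This is a simplified model of the 2-socket, 8-cores-per-socket, 2-threads-per-core test box, not Slurm's implementation; it assumes the job was allocated cores 0-11, and the greedy fallback once a socket is exhausted is my own guess at the observed behaviour. The point it illustrates: block and cyclic hand out the same fixed core set, only in a different task order.

```python
# Toy model (assumed, not Slurm's code): given an already-fixed allocation
# of cores, block vs cyclic socket distribution only changes which task
# gets which cores. Topology: cores 0-7 on socket 0, 8-15 on socket 1;
# core c carries hardware threads c and c+16.

def assign_tasks(alloc_cores, ntasks, socket_dist, cores_per_socket=8):
    """Return (task, [cpu ids]) pairs for a simplified task distribution."""
    per_socket = {0: sorted(c for c in alloc_cores if c < cores_per_socket),
                  1: sorted(c for c in alloc_cores if c >= cores_per_socket)}
    out = []
    for task in range(ntasks):
        if socket_dist == "cyclic":
            s = task % 2
            if not per_socket[s]:      # fall back once a socket is exhausted
                s = 1 - s
        else:  # "block": drain socket 0 first
            s = 0 if per_socket[0] else 1
        core = per_socket[s].pop(0)
        out.append((task, [core, core + 16]))  # -c2 binds both hw threads
    return out

alloc = list(range(12))  # the 12 cores the -n12 -c2 job was allocated
for task, cpus in assign_tasks(alloc, 12, "cyclic"):
    print(task, cpus)  # task 0 -> [0, 16], task 1 -> [8, 24], ...
```

With `"block"` instead of `"cyclic"`, the same call yields tasks 0-7 on socket 0 and tasks 8-11 on socket 1, matching the block:block transcript above.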
(In reply to VUB HPC from comment #3)
> This is the issue with GPUs I was referring to:
>
> $ srun --nodes=1 --ntasks-per-node=2 --gpus-per-task=1
> --distribution=block:cyclic --cpu-bind=verbose sleep 1
> srun: Force Terminated job 6726859
> srun: error: Unable to allocate resources: Requested node configuration is
> not available

With my configuration this job runs (2 GPUs per node; I'm using fake GPUs pointing to tty devices, but that shouldn't affect job allocation at all):

# gres.conf
NodeName=n1-[1-10] Name=gpu Type=tty File=/dev/tty[0-1]

# slurm.conf
NodeName=DEFAULT RealMemory=8000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
NodeName=n1-[1-10] NodeAddr=localhost Port=12101-12110 Gres=gpu:tty:2

marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ srun -N1 --ntasks-per-node=2 --gpus-per-task=1 -m block:cyclic whereami
0001 n1-1 - Cpus_allowed: 00010000 Cpus_allowed_list: 16
0000 n1-1 - Cpus_allowed: 00000001 Cpus_allowed_list: 0

(Notice each task got just one core.)

> We do have nodes with 2 GPUs per node that should be candidates for this job:
>
> $ sinfo -N --partition pascal_gpu
> CLUSTER: hydra
> HOSTNAMES PARTITION  STATE CPUS(A/I/O/T) CPU_LOAD MEMORY MB FREE_MEM MB GRES              GRES_USED
> node250   pascal_gpu mix   2/22/0/24     2.38     257726 MB 213594 MB   gpu:p100:2(S:0-1) gpu:p100:2(IDX:0-1)
>
> Adding the option "--cpus-per-task=12" to srun, which corresponds to the
> amount of cores per socket in those nodes, does make the job go through
>
> $ srun --nodes=1 --ntasks-per-node=2 --cpus-per-task=12 --gpus-per-task=1
> --distribution=block:cyclic --cpu-bind=verbose sleep 1
> srun: job 6726865 queued and waiting for resources

That's interesting, but I don't know why it isn't being allocated. I think this is a separate issue from comment 0. Could you create a new bug for this?
* In the new bug, could you turn on DebugFlags=SelectType and SlurmctldDebug=debug, run the reproducer again, then upload the relevant portion of the slurmctld log file (from job submission to job rejection)?
* Also in the new bug, could you upload the slurm.conf (including the NodeName definitions) and gres.conf for those 2-GPU nodes?
Hi Marshall,

Thanks a lot for all the information. It's good to know that distribution happens after allocation; that explains a lot of what we see with the issue at hand.

As I said in my first message, the non-allocation of GPU jobs occurs for jobs that enforce binding of GRES resources. This is what I think is happening:

1. A job is submitted requesting (1 core + 1 GPU) x 2 with enforced binding of GRES: "sbatch --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --gres-flags=enforce-binding"
2. The resource allocator picks 1 node with 2 GPUs and the first 2 available cores in the node, which are probably on the same CPU socket
3. Task distribution does not have any freedom to apply the "block:cyclic" distribution
4. Enforcement of GRES binding kicks in and fails because one of the cores is not local to one of the GPUs (maybe this happens earlier?)

Can you confirm this behaviour?

Finally, I confirm that setting "--ntasks-per-socket" as you suggested does get these jobs allocated. However, is it possible to make Slurm do the right thing for jobs not setting "--ntasks-per-socket" at all? If the only constraint is "--gres-flags=enforce-binding", Slurm should pick resources that fulfill it on its own.
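Step 4 of the hypothesis above can be sketched as a feasibility check. This is hypothetical Python; the function name, the greedy task-to-GPU pairing, and the socket-local core sets are invented for illustration and do not reflect Slurm's internal logic:

```python
# Illustrative check (not Slurm's code): with --gres-flags=enforce-binding,
# an allocation is only feasible if each task's GPU can be paired with
# enough cores that are local to that GPU's socket.

def enforce_binding_ok(selected_cores, gpu_local_cores, ntasks,
                       cores_per_task=1):
    """gpu_local_cores: one set of socket-local core ids per GPU."""
    remaining = set(selected_cores)
    for gpu in range(ntasks):                    # one GPU per task
        local = remaining & gpu_local_cores[gpu]
        if len(local) < cores_per_task:
            return False                         # no local core left: reject
        for c in sorted(local)[:cores_per_task]:
            remaining.discard(c)
    return True

# A node like node250: one p100 local to each 12-core socket.
gpus = [set(range(0, 12)), set(range(12, 24))]

# Allocator picked the first two free cores, both on socket 0: rejected.
print(enforce_binding_ok({0, 1}, gpus, ntasks=2))   # False
# One core per socket would satisfy binding for both GPUs.
print(enforce_binding_ok({0, 12}, gpus, ntasks=2))  # True
```

This also shows why --cpus-per-task=12 made the job go through: requesting a whole socket per task forces the allocator to spread the cores across both sockets.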
(In reply to VUB HPC from comment #10)
> Hi Marshall,
>
> Thanks a lot for all the information. It's good to know that distribution
> comes after allocation, that explains a lot of what is happening with the
> issue at hand.
>
> As I said in my first message, the non-allocation of GPU jobs occurs for
> jobs that enforce binding of GRES resources. So, this is what I think is
> happening:
>
> 1. Job is submitted requesting (1 core + 1 GPU) x 2 and with enforce binding
> of GRES: "sbatch --nodes=1 --ntasks-per-node=2 --gpus-per-task=1
> --gres-flags=enforce-binding"
> 2. The resource allocator picks 1 node with 2 GPUs and the first 2 cores
> available in the node, which probably are in the same CPU socket
> 3. Task distribution does not have any freedom to apply the "block:cyclic"
> distribution
> 4. Enforce binding of GRES resources kicks in and fails because one of the
> cores is not local to one of the GPUs (maybe this happens earlier?)
>
> Can you confirm this behavior?

Yes. Thanks for the clarification on that (--gres-flags=enforce-binding). Here I've reproduced the issue:

# gres.conf
NodeName=n1-[1-10] Name=gpu Type=tty File=/dev/tty0 Cores=0-7
NodeName=n1-[1-10] Name=gpu Type=tty File=/dev/tty1 Cores=8-15

Without --gres-flags=enforce-binding, this job runs:

marshall@smd-server:~/slurm/smd-server/install/c1$ sbatch -Dtmp --ntasks-per-node=2 -N1 --gpus-per-task=1 --wrap='srun whereami 60'
Submitted batch job 33

With --gres-flags=enforce-binding, this job is rejected:
marshall@smd-server:~/slurm/smd-server/install/c1$ sbatch -Dtmp --ntasks-per-node=2 -N1 --gpus-per-task=1 --gres-flags=enforce-binding --wrap='srun whereami 60'
sbatch: error: Batch job submission failed: Requested node configuration is not available

But using --gpus-per-socket or --ntasks-per-socket makes the job run:

marshall@smd-server:~/slurm/smd-server/install/c1$ sbatch -Dtmp -N1 --gpus-per-task=1 --ntasks-per-socket=1 --gres-flags=enforce-binding --wrap='srun whereami 60'
Submitted batch job 36
marshall@smd-server:~/slurm/smd-server/install/c1$ sbatch -Dtmp -N1 --gpus-per-socket=1 --sockets-per-node=2 --ntasks-per-node=2 --gres-flags=enforce-binding --wrap='srun whereami 60'
Submitted batch job 37

> Finally, I confirm that setting "--ntasks-per-socket" as you suggested does
> make these jobs to be allocated.

Good! I also want to mention --gpus-per-socket as a potential solution (it must be paired with --sockets-per-node per the salloc/sbatch/srun man pages):
https://slurm.schedmd.com/salloc.html#OPT_gpus-per-socket

> However, is it possible to make Slurm do
> the right thing for jobs not setting any "--ntasks-per-socket" at all? If
> the only constrain is "--gres-flags=enforce-binding", Slurm should pick the
> resources that fulfill it on its own.

I will look into this, but I don't have an answer right now beyond "it's tricky" and "I don't know yet". For now I wanted to confirm that I see the concern, and also to recommend --gpus-per-socket with --sockets-per-node as an additional solution.

(In reply to VUB HPC from comment #3)
> This is the issue with GPUs I was referring to:
>
> $ srun --nodes=1 --ntasks-per-node=2 --gpus-per-task=1
> --distribution=block:cyclic --cpu-bind=verbose sleep 1
> srun: Force Terminated job 6726859
> srun: error: Unable to allocate resources: Requested node configuration is
> not available

When I ran this same job, the job ran.
Are you inserting --gres-flags=enforce-binding automatically (with a job_submit plugin, a cli_filter plugin, or the environment)? I was also testing on Slurm 22.05.0, so it's possible that this was fixed in 22.05 but not in 21.08. (Of course, without --gres-flags=enforce-binding the selected CPUs are on the same socket even though the GPUs are on different sockets, and I showed above that adding --gres-flags=enforce-binding makes this job submission fail.)
> When I ran this same job, the job ran. Are you inserting
> --gres-flags=enforce-binding automatically (with a job_submit
> plugin or cli-filter plugin or environment)? I was also testing
> on Slurm 22.05.0, so it's possible that this was fixed in 22.05
> but not 21.08. (Of course, without --gres-flags=enforce-binding
> the selected cpus are on the same socket even though the GPUs
> are on different sockets, and I showed above that adding
> --gres-flags=enforce-binding makes this job submission fail.)

Yes indeed, that's my fault: the srun commands in that comment are missing --gres-flags=enforce-binding. Sorry for the confusion.

Thanks a lot for following up on this. Now it is clear what is happening, and we will use the proposed solutions.

Alex
Hi Alex,

I have submitted a set of patches to our review queue that will fix your case and other cases so that jobs are not wrongly rejected. I'll keep you updated on our progress.

- Marshall
Hi Alex,

We're still working on fixes. There are a few bugs here. I think we're close.

- Marshall
Hi Alex,

We have fixed --gres-flags=enforce-binding in the following commits:

1711dff8d3 Only enforce a minimum required core count if enforce-binding
57ee9849d7 Fix handling minimum gres core requirement
1f74a228d3 With enforce_binding, enforce a node's avail_cpus to meet required cores
1848f74798 Move variable into an outer scope
4699704f94 Ensure the job gets enough cores to satisfy GRES enforce-binding
28997ef02c Select at least as many cores as required sockets for GRES
8e1231028d Correctly determine if sufficient GRES and sockets have been picked
dfb07974d6 Do not limit avail_cpus before we have picked cores

These are in the slurm-22.05 branch, ahead of 22.05.7. Please let me know if you have any questions. I'm closing this ticket as resolved/fixed.

- Marshall
*** Ticket 15451 has been marked as a duplicate of this ticket. ***
*** Ticket 9994 has been marked as a duplicate of this ticket. ***
*** Ticket 16655 has been marked as a duplicate of this ticket. ***