Ticket 14153 - srun options --distribution and --hint being ignored
Summary: srun options --distribution and --hint being ignored
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 21.08.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Duplicates: 9994 15451 16655
Depends on:
Blocks: 14852 14854 15403
Reported: 2022-05-24 07:57 MDT by VUB HPC
Modified: 2023-07-11 04:02 MDT (History)
4 users

See Also:
Site: VUB
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 22.05.7 23.02.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description VUB HPC 2022-05-24 07:57:14 MDT
According to the documentation of srun (https://slurm.schedmd.com/srun.html#OPT_distribution), the default behaviour of --distribution is "block:cyclic", which places tasks across nodes in block fashion and across sockets cyclically. However, we do not observe this behaviour in our cluster with Slurm 21.08.8.

A simple srun with 4 tasks in a single node always sets the same CPU bind mask and allocates the first cores on the first available CPU socket, regardless of the value of `--distribution`. In the following cases, node250 was fully empty:

$ srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=2 --distribution=block:cyclic --hint=compute_bound --cpu-bind=verbose sleep 1
cpu-bind=MASK - node250, task  0  0 [2105]: mask 0x3 set
cpu-bind=MASK - node250, task  1  1 [2106]: mask 0xc set
cpu-bind=MASK - node250, task  2  2 [2107]: mask 0x30 set
cpu-bind=MASK - node250, task  3  3 [2108]: mask 0xc0 set

$ srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=2 --distribution=block:fcyclic --hint=compute_bound --cpu-bind=verbose sleep 1
cpu-bind=MASK - node250, task  0  0 [2105]: mask 0x3 set
cpu-bind=MASK - node250, task  1  1 [2106]: mask 0xc set
cpu-bind=MASK - node250, task  2  2 [2107]: mask 0x30 set
cpu-bind=MASK - node250, task  3  3 [2108]: mask 0xc0 set

$ srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=2 --distribution=block:block --hint=compute_bound --cpu-bind=verbose sleep 1
cpu-bind=MASK - node250, task  0  0 [2105]: mask 0x3 set
cpu-bind=MASK - node250, task  1  1 [2106]: mask 0xc set
cpu-bind=MASK - node250, task  2  2 [2107]: mask 0x30 set
cpu-bind=MASK - node250, task  3  3 [2108]: mask 0xc0 set

We also tried changing the value of `--hint` to memory_bound, as explained in https://slurm.schedmd.com/mc_support.html#srun_hints, but the CPU bind mask is still the same as in the previous cases:

$ srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=2 --distribution=block:cyclic --hint=memory_bound --cpu-bind=verbose sleep 1
cpu-bind=MASK - node250, task  0  0 [2105]: mask 0x3 set
cpu-bind=MASK - node250, task  1  1 [2106]: mask 0xc set
cpu-bind=MASK - node250, task  2  2 [2107]: mask 0x30 set
cpu-bind=MASK - node250, task  3  3 [2108]: mask 0xc0 set
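All four runs print the same masks regardless of --distribution or --hint. Decoding them makes the problem easy to see; the following sketch assumes node250's layout (2 sockets, 12 cores per socket, with CPU ids 0-11 on socket 0 and 12-23 on socket 1):

```python
# Decode the hexadecimal cpu-bind masks printed by srun --cpu-bind=verbose
# into CPU-id lists, then map each CPU onto its socket.
# Assumption: 12 cores per socket and a linear CPU-id-to-socket mapping.

CORES_PER_SOCKET = 12

def mask_to_cpus(mask_hex):
    """Return the CPU ids set in a cpu-bind mask such as '0x30'."""
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]

for task, mask in enumerate(["0x3", "0xc", "0x30", "0xc0"]):
    cpus = mask_to_cpus(mask)
    sockets = sorted({cpu // CORES_PER_SOCKET for cpu in cpus})
    print(f"task {task}: mask {mask} -> cpus {cpus}, socket(s) {sockets}")
# Every task lands on socket 0, i.e. the masks show a block (not cyclic)
# placement across sockets.
```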

Our slurm.conf is configured to use the task/cgroup and task/affinity plugins, following the example in the documentation: https://slurm.schedmd.com/cgroup.conf.html#SECTION_EXAMPLE
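For reference, a minimal sketch of the relevant settings, following the documented example (our actual files contain more options than shown here):

```conf
# slurm.conf (excerpt)
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf, per the example in the cgroup.conf documentation
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
```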

This issue impacts allocation of jobs requesting GPUs because we also enforce binding of GRES resources. For instance, Slurm fails to allocate a 2-task job requesting 2 GPUs because it cannot bind the requested number of GPUs when all of the job's CPU cores are bound to the same CPU socket.

Thanks for the help,

Alex
Comment 2 Marshall Garey 2022-05-24 16:24:38 MDT
Hi Alex,

I am looking into this.

> This issue is impacting allocation of jobs requesting GPUs because we are enforcing binding
> of GRES resources as well. For instance, Slurm fails to allocate a 2 task job requesting 2
> Gpus because it cannot bind the requested amount of GPUs if the CPU cores of the job are
> all bound to the same CPU socket.


Can you post an example of this as well?
Comment 3 VUB HPC 2022-06-01 15:06:13 MDT
This is the issue with GPUs I was referring to:

$ srun --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --distribution=block:cyclic --cpu-bind=verbose sleep 1
srun: Force Terminated job 6726859
srun: error: Unable to allocate resources: Requested node configuration is not available

We do have nodes with 2 GPUs per node that should be candidates for this job:

$ sinfo -N --partition pascal_gpu
CLUSTER: hydra
HOSTNAM PARTITION       STATE  CPUS(A/I/O/T) CPU_LOAD   MEMORY MB FREE_MEM MB  GRES                GRES_USED           
node250 pascal_gpu      mix        2/22/0/24     2.38   257726 MB   213594 MB  gpu:p100:2(S:0-1)   gpu:p100:2(IDX:0-1)

Adding the option "--cpus-per-task=12" to srun, which corresponds to the number of cores per socket in those nodes, does make the job go through:

$ srun --nodes=1 --ntasks-per-node=2 --cpus-per-task=12 --gpus-per-task=1 --distribution=block:cyclic --cpu-bind=verbose sleep 1
srun: job 6726865 queued and waiting for resources

Let me know if I can provide any other information.

Alex
Comment 5 Marshall Garey 2022-06-03 16:26:10 MDT
Hi Alex,

I can reproduce this. I have also determined that it is the job allocation that ignores the socket distribution. If the job is allocated a whole node, it's easy to verify that the step allocation works as expected (cyclic distribution across sockets by default).

For now I'm just focusing on this. Then I'll take a look at --hint and the GPU issues.
Comment 8 Marshall Garey 2022-06-03 16:47:56 MDT
I believe this is working as intended. --distribution affects the distribution (or ordering) of tasks only after the job allocation happens; it doesn't affect which cores are selected.

In the following examples, note the order and placement of the tasks (with cyclic distribution, tasks 0 and 1 are on different sockets; with block distribution, tasks 0 and 1 are on the same socket).

Hardware topology:

2 sockets
8 cores per socket
2 threads per core

Socket 0:
Core 0:
P0,16
Core 1:
P1,17
...
Core 7:
P7,23

Socket 1:
Core 8:
P8,24
...
Core 15:
P15,31



marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ salloc -N1 -n12 -c2 -m block:block 
salloc: Granted job allocation 20
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ srun whereami|sort
0000 n1-1 - Cpus_allowed:       00010001        Cpus_allowed_list:      0,16
0001 n1-1 - Cpus_allowed:       00020002        Cpus_allowed_list:      1,17
0002 n1-1 - Cpus_allowed:       00040004        Cpus_allowed_list:      2,18
0003 n1-1 - Cpus_allowed:       00080008        Cpus_allowed_list:      3,19
0004 n1-1 - Cpus_allowed:       00100010        Cpus_allowed_list:      4,20
0005 n1-1 - Cpus_allowed:       00200020        Cpus_allowed_list:      5,21
0006 n1-1 - Cpus_allowed:       00400040        Cpus_allowed_list:      6,22
0007 n1-1 - Cpus_allowed:       00800080        Cpus_allowed_list:      7,23
0008 n1-1 - Cpus_allowed:       01000100        Cpus_allowed_list:      8,24
0009 n1-1 - Cpus_allowed:       02000200        Cpus_allowed_list:      9,25
0010 n1-1 - Cpus_allowed:       04000400        Cpus_allowed_list:      10,26
0011 n1-1 - Cpus_allowed:       08000800        Cpus_allowed_list:      11,27
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ 
exit
salloc: Relinquishing job allocation 20
salloc: Job allocation 20 has been revoked.
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ salloc -N1 -n12 -c2 -m block:cyclic
salloc: Granted job allocation 21
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ srun whereami|sort
0000 n1-1 - Cpus_allowed:       00010001        Cpus_allowed_list:      0,16
0001 n1-1 - Cpus_allowed:       01000100        Cpus_allowed_list:      8,24
0002 n1-1 - Cpus_allowed:       00020002        Cpus_allowed_list:      1,17
0003 n1-1 - Cpus_allowed:       02000200        Cpus_allowed_list:      9,25
0004 n1-1 - Cpus_allowed:       00040004        Cpus_allowed_list:      2,18
0005 n1-1 - Cpus_allowed:       04000400        Cpus_allowed_list:      10,26
0006 n1-1 - Cpus_allowed:       00080008        Cpus_allowed_list:      3,19
0007 n1-1 - Cpus_allowed:       08000800        Cpus_allowed_list:      11,27
0008 n1-1 - Cpus_allowed:       00100010        Cpus_allowed_list:      4,20
0009 n1-1 - Cpus_allowed:       00200020        Cpus_allowed_list:      5,21
0010 n1-1 - Cpus_allowed:       00400040        Cpus_allowed_list:      6,22
0011 n1-1 - Cpus_allowed:       00800080        Cpus_allowed_list:      7,23
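The two orderings above can be reproduced with a small sketch of the point being made: --distribution reorders tasks over a core set that was already selected, it doesn't change which cores were picked. The sketch assumes the -n12 -c2 allocation selected cores 0-7 on socket 0 and cores 8-11 on socket 1 (block core selection), as in the runs above:

```python
# Model task ordering over a fixed, already-allocated set of cores.
# ALLOCATED mirrors the runs above: 8 cores on socket 0, 4 on socket 1.

ALLOCATED = [list(range(0, 8)), list(range(8, 12))]  # cores per socket

def block(per_socket):
    """block: consume all of socket 0's cores, then socket 1's."""
    return [core for sock in per_socket for core in sock]

def cyclic(per_socket):
    """cyclic: round-robin over sockets, skipping exhausted ones."""
    order, queues = [], [list(s) for s in per_socket]
    while any(queues):
        for q in queues:
            if q:
                order.append(q.pop(0))
    return order

# Task i is bound to the i-th core in the returned order.
print("block:block  ->", block(ALLOCATED))
print("block:cyclic ->", cyclic(ALLOCATED))
```

The cyclic order (0, 8, 1, 9, ..., then 4-7 once socket 1's four allocated cores are exhausted) matches the salloc -m block:cyclic output above, and the block order matches the -m block:block output.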




Will --ntasks-per-socket work for you?

Example:

marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ salloc --ntasks-per-socket=2 -N1 -n4 -c2 -m block:cyclic
salloc: Granted job allocation 25
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ srun whereami |sort
0000 n1-1 - Cpus_allowed:       00010001        Cpus_allowed_list:      0,16
0001 n1-1 - Cpus_allowed:       01000100        Cpus_allowed_list:      8,24
0002 n1-1 - Cpus_allowed:       00020002        Cpus_allowed_list:      1,17
0003 n1-1 - Cpus_allowed:       02000200        Cpus_allowed_list:      9,25
Comment 9 Marshall Garey 2022-06-03 16:59:14 MDT
(In reply to VUB HPC from comment #3)
> This is the issue with GPUs I was referring to:
> 
> $ srun --nodes=1 --ntasks-per-node=2 --gpus-per-task=1
> --distribution=block:cyclic --cpu-bind=verbose sleep 1
> srun: Force Terminated job 6726859
> srun: error: Unable to allocate resources: Requested node configuration is
> not available

With my configuration this job runs.

* (2 GPUs per node; I'm using fake GPUs pointing at tty devices, but that shouldn't affect job allocation at all)

# gres.conf
NodeName=n1-[1-10] Name=gpu Type=tty File=/dev/tty[0-1]
# slurm.conf
NodeName=DEFAULT RealMemory=8000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
NodeName=n1-[1-10] NodeAddr=localhost Port=12101-12110 Gres=gpu:tty:2

marshall@smd-server:/mnt/marshall/voyager/slurm/smd-server/install/c1$ srun -N1 --ntasks-per-node=2 --gpus-per-task=1 -m block:cyclic whereami
0001 n1-1 - Cpus_allowed:       00010000        Cpus_allowed_list:      16
0000 n1-1 - Cpus_allowed:       00000001        Cpus_allowed_list:      0

(Notice the job got just one core.)

> We do have nodes with 2 GPUs per node that should be candidates for this job:
> 
> $ sinfo -N --partition pascal_gpu
> CLUSTER: hydra
> HOSTNAM PARTITION       STATE  CPUS(A/I/O/T) CPU_LOAD   MEMORY MB FREE_MEM
> MB  GRES                GRES_USED           
> node250 pascal_gpu      mix        2/22/0/24     2.38   257726 MB   213594
> MB  gpu:p100:2(S:0-1)   gpu:p100:2(IDX:0-1)
>
> Adding the option "--cpus-per-task=12" to srun, which corresponds to the
> amount of cores per socket in those nodes, does make the job go through
>
> $ srun --nodes=1 --ntasks-per-node=2 --cpus-per-task=12 --gpus-per-task=1
> --distribution=block:cyclic --cpu-bind=verbose sleep 1
> srun: job 6726865 queued and waiting for resources

That's interesting, but I don't know why it isn't being allocated.

I think this is a separate issue from comment 0.
* Could you create a new bug for this?
* In the new bug, could you turn on DebugFlags=SelectType and SlurmctldDebug=debug, run the reproducer again, then upload the relevant portion of the slurmctld log file (from job submission to job rejection)?
* Also in the new bug can you upload the slurm.conf (including the NodeName definition) and gres.conf for those 2-GPU nodes?
Comment 10 VUB HPC 2022-06-08 03:46:00 MDT
Hi Marshall,

Thanks a lot for all the information. It's good to know that distribution comes after allocation, that explains a lot of what is happening with the issue at hand. 

As I said in my first message, the non-allocation of GPU jobs occurs for jobs that enforce binding of GRES resources. So, this is what I think is happening:

1. Job is submitted requesting (1core + 1GPU) x 2 and with enforce binding of GRES: "sbatch --nodes=1 --ntasks-per-node=2 --gpus-per-task=1 --gres-flags=enforce-binding"
2. The resource allocator picks 1 node with 2 GPUs and the first 2 cores available in the node, which probably are in the same CPU socket
3. Task distribution does not have any freedom to apply the "block:cyclic" distribution
4. Enforce binding of GRES resources kicks in and fails because one of the cores is not local to one of the GPUs (maybe this happens earlier?)
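The failure in steps 1-4 can be sketched as follows. This is a deliberate simplification for illustration, not Slurm's actual algorithm: each task needs one GPU, and with --gres-flags=enforce-binding a task's cores must be local to the GPU it is bound to. The GPU-to-core locality mimics our node250 layout (2 sockets, 12 cores each, one GPU per socket):

```python
# Hypothetical simplification of the enforce-binding check: greedily pair
# each task with a distinct GPU that is local to all of the task's cores.

GPU_LOCAL_CORES = {0: set(range(0, 12)), 1: set(range(12, 24))}

def can_enforce_binding(task_cores):
    """Return True if every task can get a distinct GPU local to its cores."""
    free_gpus = set(GPU_LOCAL_CORES)
    for cores in task_cores:
        local = [g for g in free_gpus if cores <= GPU_LOCAL_CORES[g]]
        if not local:
            return False  # no unused GPU is local to this task's cores
        free_gpus.discard(local[0])
    return True

# Allocator picked the first two free cores, both on socket 0: binding fails.
print(can_enforce_binding([{0}, {1}]))   # False
# One core per socket (e.g. via --ntasks-per-socket=1): binding succeeds.
print(can_enforce_binding([{0}, {12}]))  # True
```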

Can you confirm this behavior?

Finally, I confirm that setting "--ntasks-per-socket" as you suggested does get these jobs allocated. However, is it possible to make Slurm do the right thing for jobs that don't set "--ntasks-per-socket" at all? If the only constraint is "--gres-flags=enforce-binding", Slurm should pick resources that fulfill it on its own.
Comment 11 Marshall Garey 2022-06-14 16:15:14 MDT
(In reply to VUB HPC from comment #10)
> Hi Marshall,
>
> Thanks a lot for all the information. It's good to know that distribution
> comes after allocation, that explains a lot of what is happening with the
> issue at hand.
>
> As I said in my first message, the non-allocation of GPU jobs occurs for
> jobs that enforce binding of GRES resources. So, this is what I think is
> happening:
>
> 1. Job is submitted requesting (1core + 1GPU) x 2 and with enforce binding
> of GRES: "sbatch --nodes=1 --ntasks-per-node=2 --gpus-per-task=1
> --gres-flags=enforce-binding"
> 2. The resource allocator picks 1 node with 2 GPUs and the first 2 cores
> available in the node, which probably are in the same CPU socket
> 3. Task distribution does not have any freedom to apply the "block:cyclic"
> distribution
> 4. Enforce binding of GRES resources kicks in and fails because one of the
> cores is not local to one of the GPUs (maybe this happens earlier?)
>
> Can you confirm this behavior?

Yes. Thanks for the clarification on that (--gres-flags=enforce-binding).

Here I've reproduced the issue:

# gres.conf                                                                      
NodeName=n1-[1-10] Name=gpu Type=tty File=/dev/tty0 Cores=0-7
NodeName=n1-[1-10] Name=gpu Type=tty File=/dev/tty1 Cores=8-15



Without --gres-flags=enforce-binding, this job runs.

marshall@smd-server:~/slurm/smd-server/install/c1$ sbatch -Dtmp --ntasks-per-node=2 -N1 --gpus-per-task=1 --wrap='srun whereami 60'
Submitted batch job 33

With --gres-flags=enforce-binding, this job is rejected.

marshall@smd-server:~/slurm/smd-server/install/c1$ sbatch -Dtmp --ntasks-per-node=2 -N1 --gpus-per-task=1 --gres-flags=enforce-binding --wrap='srun whereami 60'
sbatch: error: Batch job submission failed: Requested node configuration is not available

But using --gpus-per-socket or --ntasks-per-socket makes the job run:

marshall@smd-server:~/slurm/smd-server/install/c1$ sbatch -Dtmp -N1 --gpus-per-task=1 --ntasks-per-socket=1 --gres-flags=enforce-binding --wrap='srun whereami 60'
Submitted batch job 36

marshall@smd-server:~/slurm/smd-server/install/c1$ sbatch -Dtmp -N1 --gpus-per-socket=1 --sockets-per-node=2 --ntasks-per-node=2 --gres-flags=enforce-binding --wrap='srun whereami 60'
Submitted batch job 37


> Finally, I confirm that setting "--ntasks-per-socket" as you suggested does
> make these jobs to be allocated.

Good! I also want to mention --gpus-per-socket as a potential solution (it must be paired with --sockets-per-node, per the salloc/sbatch/srun man pages):
https://slurm.schedmd.com/salloc.html#OPT_gpus-per-socket


> However, is it possible to make Slurm do
> the right thing for jobs not setting any "--ntasks-per-socket" at all? If
> the only constrain is "--gres-flags=enforce-binding", Slurm should pick the
> resources that fulfill it on its own.

I will look into this, but right now I don't have an answer beyond "it's tricky" and "I don't know yet".
For now I wanted to confirm that I see the concern, and to recommend --gpus-per-socket with --sockets-per-node as an additional solution.


(In reply to VUB HPC from comment #3)
> This is the issue with GPUs I was referring to:
> 
> $ srun --nodes=1 --ntasks-per-node=2 --gpus-per-task=1
> --distribution=block:cyclic --cpu-bind=verbose sleep 1
> srun: Force Terminated job 6726859
> srun: error: Unable to allocate resources: Requested node configuration is
> not available

When I ran this same job, the job ran. Are you inserting --gres-flags=enforce-binding automatically (with a job_submit plugin or cli-filter plugin or environment)? I was also testing on Slurm 22.05.0, so it's possible that this was fixed in 22.05 but not 21.08. (Of course, without --gres-flags=enforce-binding the selected cpus are on the same socket even though the GPUs are on different sockets, and I showed above that adding --gres-flags=enforce-binding makes this job submission fail.)
Comment 15 VUB HPC 2022-06-18 03:31:21 MDT
> When I ran this same job, the job ran. Are you inserting
> --gres-flags=enforce-binding automatically (with a job_submit
> plugin or cli-filter plugin or environment)? I was also testing
> on Slurm 22.05.0, so it's possible that this was fixed in 22.05
> but not 21.08. (Of course, without --gres-flags=enforce-binding
> the selected cpus are on the same socket even though the GPUs
> are on different sockets, and I showed above that adding
> --gres-flags=enforce-binding makes this job submission fail.)

Yes indeed, that's my fault: the srun commands in that comment are missing --gres-flags=enforce-binding. Sorry for the confusion.

Thanks a lot for following up on this. Now it is clear what is happening, and we will use the proposed solutions.

Alex
Comment 22 Marshall Garey 2022-08-26 17:24:47 MDT
Hi Alex,

I have submitted a set of patches to our review queue that will fix your case and other cases so that jobs are not wrongly rejected. I'll keep you updated on our progress.

- Marshall
Comment 34 Marshall Garey 2022-11-11 15:56:53 MST
Hi Alex,

We're still working on fixes. There are a few bugs here. I think we're close.

- Marshall
Comment 51 Marshall Garey 2022-11-28 16:31:46 MST
Hi Alex,

We have fixed --gres-flags=enforce-binding in the following commits:

1711dff8d3 Only enforce a minimum required core count if enforce-binding
57ee9849d7 Fix handling minimum gres core requirement
1f74a228d3 With enforce_binding, enforce a node's avail_cpus to meet required cores
1848f74798 Move variable into an outer scope
4699704f94 Ensure the job gets enough cores to satisfy GRES enforce-binding
28997ef02c Select at least as many cores as required sockets for GRES
8e1231028d Correctly determine if sufficient GRES and sockets have been picked
dfb07974d6 Do not limit avail_cpus before we have picked cores


These are in the slurm-22.05 branch ahead of 22.05.7. Please let me know if you have any questions.

I'm closing this ticket as resolved/fixed.

- Marshall
Comment 52 Marshall Garey 2022-11-28 16:32:36 MST
*** Ticket 15451 has been marked as a duplicate of this ticket. ***
Comment 53 Marcin Stolarek 2023-02-28 08:07:37 MST
*** Ticket 9994 has been marked as a duplicate of this ticket. ***
Comment 54 Marcin Stolarek 2023-05-30 00:09:33 MDT
*** Ticket 16655 has been marked as a duplicate of this ticket. ***