Ticket 11310

Summary: Improve srun documentation about number of CPUs that will be allocated to steps
Product: Slurm Reporter: Marshall Garey <marshall>
Component: Documentation Assignee: Documentation <docs>
Status: OPEN --- QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: albert.gil, ben, ezellma, fullop, kilian, lena, lyeager, mcoyne, ndobson, sts
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=11303
https://bugs.schedmd.com/show_bug.cgi?id=10892
https://bugs.schedmd.com/show_bug.cgi?id=11824
https://bugs.schedmd.com/show_bug.cgi?id=11852
https://bugs.schedmd.com/show_bug.cgi?id=12462
https://bugs.schedmd.com/show_bug.cgi?id=12912
https://bugs.schedmd.com/show_bug.cgi?id=13041
https://bugs.schedmd.com/show_bug.cgi?id=13472
Site: SchedMD Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Marshall Garey 2021-04-06 16:45:10 MDT
Coming from bug 11303:

There are situations where the number of CPUs that will be allocated to a *step* can change with just a subtle difference in options. Consider this from bug 11303 comment 0:


> $ srun --exclusive     numactl --show | grep ^physcpubind
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11
> $ srun --exclusive -n1 numactl --show | grep ^physcpubind
> physcpubind: 0 6

I explained how and why this works in bug 11303 comment 7, with some additional examples. I think it would be nice to document in more detail how CPUs are allocated to steps in different situations:

* srun as a job and step allocation
* srun only as a step allocation within a job
* How different options affect the number of CPUs allocated to the step, especially --ntasks, --cpus-per-task, and --exclusive

Much of this is already documented, but I wonder if we could improve it, especially for the example I just gave (how srun as a job is special when no options are given alongside --exclusive). Maybe expanding the --exclusive entry in the srun man page is the way to go.
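For reference, the first two bullets can be sketched as terminal commands (a hypothetical session; the bound CPU sets follow the example above and will vary by cluster):

```shell
# srun run outside any allocation creates both a job and a step; with
# --exclusive and no task count, the step inherits every CPU of the job:
srun --exclusive numactl --show | grep ^physcpubind

# With -n1 the step size is explicit (one task), so far fewer CPUs are bound:
srun --exclusive -n1 numactl --show | grep ^physcpubind

# srun inside an existing allocation (salloc/sbatch) creates only a step
# within that job:
salloc -n4 bash -c 'srun hostname'
```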
Comment 2 Marshall Garey 2021-06-17 10:17:46 MDT
Bug 11824 has another example of how this can be confusing: exclusive allocation of CPUs to steps is the default behavior, but --exact is not the default. However, if you use --exclusive, then --exact is implied. We should document this.

=================================================================================
 Bug 11824 comment 1:
=================================================================================

I believe I can answer your question. The confusion here is that the --exclusive option does more than just grant exclusive access to resources: it also implies the --exact flag, which means srun is allocated exactly the number of CPUs it requested.

Looking at your examples:

(1) Without --exclusive:

```
$ ## start a step requesting a subset of the job's resources, without `--exclusive`, in the background:
$
$ srun -l -n 1 -c 2 sleep 1000 &
[1] 32509

$ ## check the allocated resources: it shows 20 CPUs, everything that was allocated to the job:
$
$ sacct -j $SLURM_JOBID --format user,jobid,start,end,ntasks,reqcpus,ncpus,reqmem     
     User JobID                      Start                 End   NTasks  ReqCPUS      NCPUS     ReqMem
--------- ------------ ------------------- ------------------- -------- -------- ---------- ----------
   kilian 26302313     2021-06-14T13:21:25             Unknown                20         20     4000Mc
          26302313.in+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.ex+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.0   2021-06-14T13:23:48 2021-06-14T13:23:49        1       20         20     4000Mc
          26302313.1   2021-06-14T13:23:58             Unknown        1       20         20     4000Mc
```


Here, srun is given all of the CPUs in the allocation because it did not use --exact (or --exclusive, which implies --exact). However, srun is also given exclusive access to these CPUs: if you tried to run srun --overlap in the allocation, those steps would not start until this step completed. (Well, they would also not run because there is no memory available, but you can either not enforce memory or use --mem to ensure there is enough memory for all of the sruns you want to run.)
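The workaround in that parenthetical can be sketched as follows (a hypothetical session inside an existing job allocation; the --mem values are made up for illustration):

```shell
# First step: runs in the background and caps its own memory with --mem
# instead of claiming the job's whole memory allocation:
srun -n1 --mem=100M sleep 1000 &

# Second step: --overlap asks to share resources with the running step, and
# its own --mem cap keeps both steps' memory requests within the allocation:
srun -n1 --overlap --mem=100M hostname
```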


(2) With --exclusive:

```
$ ## start a new step with the same resource requirements as before, but with `--exclusive`:
$
$ srun -l -n 1 -c 2 --exclusive sleep 1000 &
[1] 311

$ ## check the allocated resources:
$
$ sacct -j $SLURM_JOBID --format user,jobid,start,end,ntasks,reqcpus,ncpus,reqmem
     User JobID                      Start                 End   NTasks  ReqCPUS      NCPUS     ReqMem
--------- ------------ ------------------- ------------------- -------- -------- ---------- ----------
   kilian 26302313     2021-06-14T13:21:25             Unknown                20         20     4000Mc
          26302313.in+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.ex+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.0   2021-06-14T13:23:48 2021-06-14T13:23:49        1       20         20     4000Mc
          26302313.1   2021-06-14T13:23:58 2021-06-14T13:25:11        1       20         20     4000Mc
          26302313.2   2021-06-14T13:25:21             Unknown        1        2          2     4000Mc
```

That one shows that it only allocated the requested resources for the step (2 CPUs).


Here, because you used --exclusive, --exact was implied, and therefore srun was only given 2 CPUs.

A couple of thoughts:
(1) This is confusing: we say exclusive allocation is the default, yet the default does not imply --exact, while explicitly specifying --exclusive does imply --exact and therefore gives different behavior. I'm going to research what we actually want. We probably need to update the documentation at least.

(2) As of bug 11275, specifying --cpus-per-task implies --exact. However, because this was a change in behavior, we only pushed it to 21.08. This means that on 21.08 your first example would show the behavior you expect: srun would only get 2 CPUs. However, if you used neither --cpus-per-task nor --exclusive, srun would still get all of the CPUs in the allocation.
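A hedged summary of those rules as a terminal sketch (assuming 21.08 or later and a hypothetical 20-CPU job allocation; actual counts depend on the node and configuration):

```shell
# Inside a 20-CPU job allocation (e.g. salloc -n 20):
srun -n1 sleep 1 &                  # no -c/--exact/--exclusive: step gets all 20 CPUs
srun -n1 -c2 sleep 1 &              # -c implies --exact on 21.08+: step gets 2 CPUs
srun -n1 -c2 --exclusive sleep 1 &  # --exclusive implies --exact: step gets 2 CPUs
```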


Does this answer your question? Would updating the documentation be sufficient?


=================================================================================
Bug 11824 comment 2:
=================================================================================


> Would updating the documentation be sufficient?

Yes, I don't think the actual behavior needs to change, but I strongly believe a documentation update (well, more like a brand-new section, maybe?) is in order. Given the number of recent bug reports in this area since 20.11, it would likely benefit many Slurm sysadmins and end users. Ideally, a general explanation of the options plus a list of simple examples would go a very long way.

Because right now, it's hard to guess the behavior you'll get from the option names only. :)
Comment 3 Marshall Garey 2021-06-17 10:26:05 MDT
*** Ticket 11824 has been marked as a duplicate of this ticket. ***
Comment 10 Marshall Garey 2021-11-15 14:51:53 MST
*** Ticket 12850 has been marked as a duplicate of this ticket. ***
Comment 15 Marshall Garey 2021-12-02 10:59:47 MST
Hi all,

We've pushed a small improvement to the srun man page. There is more we still need to do, like adding some examples and maybe a whole new section.


commit 934f3b543b6bc9f3335d1cc6813b8d95cb2c49b4
Author: Marshall Garey <marshall@schedmd.com>
Date:   Wed Nov 24 11:28:30 2021 -0700

    Docs - Clarify default behavior of srun --exclusive
    
    Bug 11310
Comment 19 Marshall Garey 2022-01-20 15:42:21 MST
I was going to make this note private, but I'll just make it public since it's good information.

The following commit has some good examples, and also shows how --mem-per-cpu is affected. I think they'd be good to incorporate into the documentation:

https://github.com/schedMD/slurm/commit/9c7d36b44f

I've copied the examples here for convenience.

Some expectations on a 16-core node with 2 threads per core:

```
$ salloc --exclusive --mem-per-cpu=5
```

We expect 32 CPUs and 160M of memory to be allocated.

```
$ srun shostname
```

Here we expect all CPUs and memory from the job.

```
$ srun -c2 --exact -n1 whereami
```

Here we expect 2 CPUs and 10M of memory.

```
$ srun -c1 --exact -n1 whereami
```

Here we expect 1 CPU and 5M of memory (though we actually have access to the other thread on the core).

```
$ srun -c1 --exact -n1 --threads-per-core=1 whereami
```

Here we expect 2 CPUs, since we don't want something else starting on the other thread of the core, and 5M of memory.

```
$ srun -c2 --exact -n1 --threads-per-core=1 whereami
```

Here we expect 4 CPUs, for the same reason as above, and 10M of memory.

```
$ sacct -j $SLURM_JOBID -o jobid,alloctres -p
JobID|AllocTRES|
152422|billing=32,cpu=32,gres/gpu:k80=4,gres/gpu:tesla=4,gres/gpu=8,mem=160M,node=1|
152422.interactive|cpu=32,gres/gpu:k80=4,gres/gpu:tesla=4,gres/gpu=8,mem=160M,node=1|
152422.0|cpu=32,mem=160M,node=1|
152422.1|cpu=2,mem=10M,node=1|
152422.2|cpu=1,mem=5M,node=1|
152422.3|cpu=2,mem=5M,node=1|
152422.4|cpu=4,mem=10M,node=1|
```