Ticket 11824 - Exclusive allocation of CPUs is not the default for job steps
Summary: Exclusive allocation of CPUs is not the default for job steps
Status: RESOLVED DUPLICATE of ticket 11310
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-06-14 14:34 MDT by Kilian Cavalotti
Modified: 2021-06-17 10:29 MDT

See Also:
Site: Stanford
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Kilian Cavalotti 2021-06-14 14:34:26 MDT
Hi!

According to the `srun` documentation:
https://slurm.schedmd.com/srun.html#OPT_exclusive[=user|mcs]
"""
The exclusive allocation of CPUs applies to job steps by default. In order to share the resources use the --overlap option.
"""

Unfortunately, that doesn't seem to be the case: without `--exclusive`, each job step appears to allocate all of the job's resources.


To reproduce:

$ ## allocate a multi-CPU job:
$
$ salloc -n 10 -c 2 -p test
salloc: Pending job allocation 26302313
salloc: job 26302313 queued and waiting for resources
salloc: job 26302313 has been allocated resources
salloc: Granted job allocation 26302313
salloc: Waiting for resource configuration
salloc: Nodes sh03-01n71 are ready for job


$ ## start a step requesting a subset of the job's resources, without `--exclusive`, in the background:
$
$ srun -l -n 1 -c 2 sleep 1000 &
[1] 32509

$ ## check the allocated resources: it shows 20 CPUs, everything that was allocated to the job:
$
$ sacct -j $SLURM_JOBID --format user,jobid,start,end,ntasks,reqcpus,ncpus,reqmem     
     User JobID                      Start                 End   NTasks  ReqCPUS      NCPUS     ReqMem
--------- ------------ ------------------- ------------------- -------- -------- ---------- ----------
   kilian 26302313     2021-06-14T13:21:25             Unknown                20         20     4000Mc
          26302313.in+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.ex+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.0   2021-06-14T13:23:48 2021-06-14T13:23:49        1       20         20     4000Mc
          26302313.1   2021-06-14T13:23:58             Unknown        1       20         20     4000Mc

$ ## kill the step
$
$ kill %1
srun: forcing job termination
0: slurmstepd: error: *** STEP 26302313.1 ON sh03-01n71 CANCELLED AT 2021-06-14T13:25:11 ***
srun: error: sh03-01n71: task 0: Killed
srun: launch/slurm: _step_signal: Terminating StepId=26302313.1
[1]+  Exit 137                srun -l -n 1 -c 2 sleep 1000

$ ## start a new step with the same resource requirements as before, but with `--exclusive`:
$
$ srun -l -n 1 -c 2 --exclusive sleep 1000 &
[1] 311

$ ## check the allocated resources:
$
$ sacct -j $SLURM_JOBID --format user,jobid,start,end,ntasks,reqcpus,ncpus,reqmem
     User JobID                      Start                 End   NTasks  ReqCPUS      NCPUS     ReqMem
--------- ------------ ------------------- ------------------- -------- -------- ---------- ----------
   kilian 26302313     2021-06-14T13:21:25             Unknown                20         20     4000Mc
          26302313.in+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.ex+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.0   2021-06-14T13:23:48 2021-06-14T13:23:49        1       20         20     4000Mc
          26302313.1   2021-06-14T13:23:58 2021-06-14T13:25:11        1       20         20     4000Mc
          26302313.2   2021-06-14T13:25:21             Unknown        1        2          2     4000Mc

That one shows that it only allocated the requested resources for the step (2 CPUs).


So it appears that `--exclusive` is *NOT* applied to job steps by default, contrary to the documentation.

With all the changes to step behavior in the early 20.11 versions, I'm not sure whether the documentation should be fixed, or whether the current step behavior is not the intended one. But they definitely don't match. :)

Thanks!
--
Kilian
Comment 1 Marshall Garey 2021-06-17 08:55:02 MDT
Hi Kilian,

I believe I can answer your question. The confusion here is that the --exclusive option does more than just grant exclusive allocation of resources: it also implies the --exact flag, which means srun is allocated exactly the number of CPUs it requested.
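
To make that distinction concrete, here's a minimal sketch within your 10-task/2-CPU allocation (behavior as described above; treat the standalone --exact invocation as an assumption if your 20.11 build doesn't expose that flag directly):

```
$ ## default: the step is handed all 20 CPUs of the allocation (exclusively):
$ srun -n 1 -c 2 sleep 1000 &

$ ## --exact: the step gets exactly the 2 CPUs it asked for:
$ srun -n 1 -c 2 --exact sleep 1000 &

$ ## --exclusive: same CPU count here, since it implies --exact and
$ ## exclusive access is already the default for steps:
$ srun -n 1 -c 2 --exclusive sleep 1000 &
```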

Looking at your examples:

(1) Without --exclusive:

```
$ ## start a step requesting a subset of the job's resources, without `--exclusive`, in the background:
$
$ srun -l -n 1 -c 2 sleep 1000 &
[1] 32509

$ ## check the allocated resources: it shows 20 CPUs, everything that was allocated to the job:
$
$ sacct -j $SLURM_JOBID --format user,jobid,start,end,ntasks,reqcpus,ncpus,reqmem     
     User JobID                      Start                 End   NTasks  ReqCPUS      NCPUS     ReqMem
--------- ------------ ------------------- ------------------- -------- -------- ---------- ----------
   kilian 26302313     2021-06-14T13:21:25             Unknown                20         20     4000Mc
          26302313.in+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.ex+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.0   2021-06-14T13:23:48 2021-06-14T13:23:49        1       20         20     4000Mc
          26302313.1   2021-06-14T13:23:58             Unknown        1       20         20     4000Mc
```


Here, srun is given all of the CPUs in the allocation because it did not use --exact (or --exclusive, which implies --exact). However, srun is also given exclusive access to these CPUs: if you tried to run srun --overlap in the allocation, those steps would not start until this step completed. (Well, they would also not run because there's no memory available, but you can either not enforce memory or just use --mem to ensure there's enough memory for all the sruns you want.)
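
For example, once no step is holding every CPU, something like this should let several small steps run side by side (a sketch only; the --mem values are arbitrary, just small enough that both steps fit within the job's memory):

```
$ ## cap each step's memory so both fit within the job's allocation:
$ srun -n 1 -c 2 --overlap --mem=1000M sleep 1000 &
$ srun -n 1 -c 2 --overlap --mem=1000M sleep 1000 &
```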


(2) With --exclusive:

```
$ ## start a new step with the same resource requirements as before, but with `--exclusive`:
$
$ srun -l -n 1 -c 2 --exclusive sleep 1000 &
[1] 311

$ ## check the allocated resources:
$
$ sacct -j $SLURM_JOBID --format user,jobid,start,end,ntasks,reqcpus,ncpus,reqmem
     User JobID                      Start                 End   NTasks  ReqCPUS      NCPUS     ReqMem
--------- ------------ ------------------- ------------------- -------- -------- ---------- ----------
   kilian 26302313     2021-06-14T13:21:25             Unknown                20         20     4000Mc
          26302313.in+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.ex+ 2021-06-14T13:21:25             Unknown        1       20         20     4000Mc
          26302313.0   2021-06-14T13:23:48 2021-06-14T13:23:49        1       20         20     4000Mc
          26302313.1   2021-06-14T13:23:58 2021-06-14T13:25:11        1       20         20     4000Mc
          26302313.2   2021-06-14T13:25:21             Unknown        1        2          2     4000Mc

That one shows that it only allocated the requested resources for the step (2 CPUs).
```


Here, because you used --exclusive, it implied --exact, so srun was only given 2 CPUs.

A couple of thoughts:
(1) This is confusing: we say exclusive allocation is the default, but the default doesn't imply --exact, while specifying --exclusive does imply --exact, which gives you different behavior. I'm going to research what we actually want. We probably need to update the documentation at the very least.

(2) As of bug 11275, specifying --cpus-per-task implies --exact. However, because this was a change in behavior, we only pushed it to 21.08. This means that in 21.08 your first example would behave the way you expect: srun would only get 2 CPUs. However, if you used neither --cpus-per-task nor --exclusive, srun would still get all the CPUs in the allocation, as sketched below.
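
In other words, on 21.08 these two invocations would differ (a sketch of the expected behavior, not output from an actual 21.08 system):

```
$ ## -c/--cpus-per-task implies --exact in 21.08, so this step gets 2 CPUs:
$ srun -n 1 -c 2 sleep 1000 &

$ ## with no --cpus-per-task and no --exclusive, the step still gets all
$ ## 20 CPUs of the allocation:
$ srun -n 1 sleep 1000 &
```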


Does this answer your question? Would updating the documentation be sufficient?
Comment 2 Kilian Cavalotti 2021-06-17 09:11:32 MDT
Hi Marshall, 

Thank you very much for the explanation, that definitely clarifies things.

Now, yes, totally agree with (1), this is extremely confusing. 

> we say exclusive allocation is the default, but the default doesn't imply --exact, but specifying --exclusive does imply --exact which gives you different behavior.

Yes! And not only that, but the very fact that the exact same option (--exclusive) has completely different meanings for sbatch and srun has already been confusing for years. The added `--exact` switch makes it combinatorially more perplexing. :)

> Would updating the documentation be sufficient?

Yes, I don't think the actual behavior needs to change, but I strongly believe a documentation update (well, more like a brand-new section, maybe?) is in order. Given the number of recent bug reports in this area since 20.11, it would likely benefit many Slurm sysadmins and end users. Ideally, a general explanation of the options plus a list of simple examples would go a very long way.

Because right now, it's hard to guess the behavior you'll get from the option names only. :)
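
For instance, even a short cheat sheet like this (my own attempt based on your explanation above, so take it as a sketch) would already help:

```
$ ## within a 20-CPU job allocation (20.11 behavior, per this ticket):
$ srun -n 1 -c 2 sleep 10               # all 20 CPUs, exclusive (default)
$ srun -n 1 -c 2 --exact sleep 10       # exactly 2 CPUs
$ srun -n 1 -c 2 --exclusive sleep 10   # exactly 2 CPUs (implies --exact)
$ srun -n 1 -c 2 --overlap sleep 10     # may share CPUs with other steps
```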

Thanks!
--
Kilian
Comment 3 Marshall Garey 2021-06-17 10:26:05 MDT
Kilian, I already have bug 11310 open about improving this documentation, and that bug links to yet another bug with questions about the number of CPUs that would be allocated to steps. So I'm making this bug a duplicate of bug 11310. Feel free to add yourself to the CC list on 11310 and to comment on that one as well.

*** This ticket has been marked as a duplicate of ticket 11310 ***
Comment 4 Kilian Cavalotti 2021-06-17 10:29:10 MDT
On Thu, Jun 17, 2021 at 9:26 AM <bugs@schedmd.com> wrote:
> Kilian, I already have bug 11310 open about improving this documentation, and
> that bug links to yet another bug where there were questions about the number
> of CPUs that would be allocated to steps. So I'm making this bug a duplicate of
> bug 11310. Feel free to add yourself to CC on 11310 and feel free to comment on
> that one as well.

Sounds perfect, thank you!

Cheers,
--
Kilian