Ticket 11303

Summary:	Weird behavior in 20.11 with srun --exclusive -n1
Product:	Slurm	Reporter:	Luke Yeager <lyeager>
Component:	slurmstepd	Assignee:	Marshall Garey <marshall>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	fabecassis
Version:	20.11.5
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=10914 https://bugs.schedmd.com/show_bug.cgi?id=11310 https://bugs.schedmd.com/show_bug.cgi?id=12912
Site:	NVIDIA (PSLA)	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Luke Yeager 2021-04-06 09:45:49 MDT

When using srun to create the allocation (so, not within an sbatch or salloc):

> $ srun --exclusive     numactl --show | grep ^physcpubind
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11
> $ srun --exclusive -n1 numactl --show | grep ^physcpubind
> physcpubind: 0 6
That's weird. '-n1' is the default - I'm surprised to see that specifying it changes the behavior of the command.

I checked, and this does not happen with Slurm 20.02. Which makes me think this is related to bug#10383.

Relevant config:
> TaskPlugin = affinity
> PartitionName=batch OverSubscribe=EXCLUSIVE ...

Comment 7 Marshall Garey 2021-04-06 14:38:33 MDT

You're right, the changes to steps in 20.11 is what is causing this behavior.

First I need to explain a couple of things, some of which you probably already know but it’s important for context:

(1) srun can be used for both a job and step allocation, or it can be used for only a step allocation.

* Job and step allocation:

srun program.bash

* Only step allocation:

sbatch --wrap=’srun program.bash’


(2) It’s important to distinguish between what the step *requests* and what the step is *allocated*.

(2a) The number of CPUs that the step requests depends on several things:

* If a job was launched by srun or not
* If the step requested --ntasks
* If the step requested --cpus-per-task
* If the step requests --overcommit

Here is the source code:

In launch_common_create_job_step():

    if (opt_local->overcommit) {
        if (use_all_cpus)    /* job allocation created by srun */
            job->ctx_params.cpu_count = job->cpu_count;
        else
            job->ctx_params.cpu_count = job->ctx_params.min_nodes;
    } else if (opt_local->cpus_set) {
        job->ctx_params.cpu_count = opt_local->ntasks *
                        opt_local->cpus_per_task;
    } else if (opt_local->ntasks_set) {
        job->ctx_params.cpu_count = opt_local->ntasks;
    } else if (use_all_cpus) {    /* job allocation created by srun */
        job->ctx_params.cpu_count = job->cpu_count;
    } else {
        job->ctx_params.cpu_count = opt_local->ntasks;
    }

This logic hasn’t changed in 20.11. You can see what the step request sent to slurmctld by using srun -vv. Look for these debug messages:

$srun -vv --exclusive whereami
...
srun: debug:  requesting job 35, user 1017, nodes 1 including ((null))
srun: debug:  cpus 16, tasks 1, name whereami, relative 65534
...

$ srun -vv --exclusive -n1 whereami
...
srun: debug:  requesting job 36, user 1017, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name whereami, relative 65534
...

So the first actually requested 16 CPUs and the second job actually requested 1 CPU.


(2b) The number of CPUs that the step is *allocated* also depends on several things:

* The number of CPUs or tasks the step requests
* The number of nodes the step requests
* Number of threads per core
* If the step requested --exclusive or not. *** This is where you can see the changes in 20.11***
* Probably some other factors that I’m missing

(3) Take this job:
srun --exclusive program.bash

In 20.02, --exclusive is *not* propagated to the step request. But in 20.11, --exclusive *is* propagated to the step.

(4) In 20.11, --exclusive implies --exact, which means “Only use the resources that the step requests.”


Now, here are some examples: (I’m using a program called “whereami” that displays cpu binding - basically the same thing as numactl --show |grep ^physcpubind but a lot less typing.)

========================================================================
20.02:

$ srun --exclusive whereami
0000 n1-1 - Cpus_allowed:       ffff    Cpus_allowed_list:      0-15
$ srun --exclusive -n1 whereami
0000 n1-1 - Cpus_allowed:       ffff    Cpus_allowed_list:      0-15

The step in the first job requests 16 CPUs. The step in the second job requests the number of CPUs equal to the number of tasks, which is 1 CPU. However, because --exclusive doesn’t propagate to the step, slurmctld allocates all the CPUs in the job to the steps.

If the step does specify --exclusive, it is given both exclusive access to those resources and is only given the CPUs explicitly requested:

$ salloc --exclusive
salloc: Granted job allocation 38
<Inside the job allocation>
$ srun --exclusive -n1 whereami
0000 n1-1 - Cpus_allowed:       0101    Cpus_allowed_list:      0,8

(I’m given 2 CPUs instead of 1 because I have 2 threads per core, and the step is given the entire core.)

Another important thing to show in 20.02 is that with --exclusive, --ntasks must be specified.

<still inside the job allocation>
$ srun --exclusive whereami
srun: error: --ntasks must be set with --exclusive

So in 20.02, --exclusive in the job step *must* specify --ntasks. This is not the case in 20.11.

========================================================================
20.11:


$ srun -vv --exclusive -n1 whereami
…
srun: debug:  requesting job 261, user 1017, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name whereami, relative 65534
…
0000 n1-1 - Cpus_allowed:       0101    Cpus_allowed_list:      0,8

In 20.11, --exclusive propagates to the step here. So the step requests --exclusive which implies --exact. The step requests 1 CPU because it requested 1 task. The step is allocated just what it asked for (except I have hyperthreading so I’m given the whole core == 2 CPUs).


$ srun -vv --exclusive whereami
srun: debug:  requesting job 262, user 1017, nodes 1 including ((null))
srun: debug:  cpus 16, tasks 1, name whereami, relative 65534
0000 n1-1 - Cpus_allowed:       ffff    Cpus_allowed_list:      0-15

Again, --exclusive propagates to the step. ***However, because I didn’t request --ntasks or --cpus-per-task *and* because the job was started by srun, the step requests all the CPUs in the job allocation.*** slurmctld still gives the step exclusive access to these CPUs. 

This is different when srun is run from inside a job allocation that was not started by srun (I’ll show this.)



$ salloc --exclusive
salloc: Granted job allocation 263

<inside the job allocation>
$ srun -vv whereami
srun: debug:  requesting job 263, user 1017, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name whereami, relative 65534
0000 n1-1 - Cpus_allowed:       ffff    Cpus_allowed_list:      0-15

This step doesn’t request --ntasks or --cpus. However, because the job was not allocated by srun, the step doesn’t request all the CPUs in the job, it only requests 1 task and 1 CPU (the default). However, because the step doesn’t request --exact (or --exclusive which implies --exact), slurmctld uses the default behavior: the step is allocated all the CPUs in the job and is allowed to overlap CPUs with other steps.

<still inside the job allocation>
$ srun --exclusive -vv whereami
srun: debug:  requesting job 263, user 1017, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name whereami, relative 65534
0000 n1-1 - Cpus_allowed:       0101    Cpus_allowed_list:      0,8

This step requests the same number of CPUs as the last step. But because it requests --exclusive which implies --exact, it is given only the resources it requested.

Note that in 20.11, job steps don’t need to specify --ntasks while using --exclusive, but in 20.02 we had this error:
srun: error: --ntasks must be set with --exclusive


<still inside the job allocation>
$ srun --exact -vv -n1 whereami
srun: debug:  requesting job 263, user 1017, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name whereami, relative 65534
0000 n1-1 - Cpus_allowed:       0101    Cpus_allowed_list:      0,8

Here I requested -n1. It’s the same behavior as the last step.

So why did the step request 1 CPU, not all the CPUs? Because this step was inside a job allocation that wasn’t started by srun.




Do you have questions about any of this? We welcome any documentation suggestions that could make that srun man page clearer.

Comment 8 Luke Yeager 2021-04-06 15:18:02 MDT

Thanks for the lengthy response!

(In reply to Marshall Garey from comment #7)
> In 20.02, --exclusive is *not* propagated to the step request. But in 20.11,
> --exclusive *is* propagated to the step.
Oh. This seems like the crux of the confusion. I didn't realize this.

> (4) In 20.11, --exclusive implies --exact, which means “Only use the
> resources that the step requests.”
I wasn't aware of the '--exact' flag. Looks like that's new to 20.11. It doesn't appear in NEWS at all.

> Do you have questions about any of this? We welcome any documentation
> suggestions that could make that srun man page clearer.
I've previously complained (here: https://bugs.schedmd.com/show_bug.cgi?id=10383#c33) that the NEWS items related to these changes weren't clear or loud enough.

And I'll quote my suggestion from another comment in that bug. Finding a way to communicate loudly and clearly about breaking changes would be a welcome improvement to the release process. Perhaps something like a guide called "How to Upgrade Your User Scripts for Slurm 20.11."

> Yes, I feel this warrants a blog post or something. Unless you can find a
> way to make it loud and clear enough in the manpages, NEWS, slurm-users
> updates, and/or any of the other existing channels.
> https://bugs.schedmd.com/show_bug.cgi?id=10383#c45
I'll close this as INFOGIVEN - the actual "bug" reported here is not major, we were just confused.

Comment 9 Marshall Garey 2021-04-06 15:45:57 MDT

(In reply to Luke Yeager from comment #8)
> Thanks for the lengthy response!

You're welcome, I'm glad it helped!

> (In reply to Marshall Garey from comment #7)
> > In 20.02, --exclusive is *not* propagated to the step request. But in 20.11,
> > --exclusive *is* propagated to the step.
> Oh. This seems like the crux of the confusion. I didn't realize this.
>
> > (4) In 20.11, --exclusive implies --exact, which means “Only use the
> > resources that the step requests.”
> I wasn't aware of the '--exact' flag. Looks like that's new to 20.11. It
> doesn't appear in NEWS at all.

It does appear in RELEASE_NOTES. Sadly I guess we missed adding it to NEWS. It was also added in 20.11.3 (as opposed to 20.11.0) in response to the breaking changes to MPI. Actually in 20.11.0 --exact didn't exist and the parameter to use was --whole and --overlap.

Here's RELEASE_NOTES:

 -- By default, a step started with srun will be granted exclusive (or non-
    overlapping) access to the resources assigned to that step. No other
    parallel step will be allowed to run on the same resources at the same
    time. This replaces one facet of the '--exclusive' option's behavior, but
    does not imply the '--exact' option described below. To get the previous
    default behavior - which allowed parallel steps to share all resources -
    use the new srun '--overlap' option.
 -- In conjunction to this non-overlapping step allocation behavior being the
    new default, there is an additional new option for step management
    '--exact', which will allow a step access to only those resources requested
    by the step. This is the second half of the '--exclusive' behavior.
    Otherwise, by default all non-gres resources on each node in the allocation
    will be used by the step, making it so no other parallel step will have
    access to those resources unless both steps have specified '--overlap'.


 
> > Do you have questions about any of this? We welcome any documentation
> > suggestions that could make that srun man page clearer.
> I've previously complained (here:
> https://bugs.schedmd.com/show_bug.cgi?id=10383#c33) that the NEWS items
> related to these changes weren't clear or loud enough.
> 
> And I'll quote my suggestion from another comment in that bug. Finding a way
> to communicate loudly and clearly about breaking changes would be a welcome
> improvement to the release process. Perhaps something like a guide called
> "How to Upgrade Your User Scripts for Slurm 20.11."
> 
> > Yes, I feel this warrants a blog post or something. Unless you can find a
> > way to make it loud and clear enough in the manpages, NEWS, slurm-users
> > updates, and/or any of the other existing channels.
> > https://bugs.schedmd.com/show_bug.cgi?id=10383#c45
> I'll close this as INFOGIVEN - the actual "bug" reported here is not major,
> we were just confused.

Yes, as you know and as is discussed at length in bug 10383 we definitely missed the mark on this one.

We try to do this with RELEASE_NOTES and NEWS, but especially RELEASE_NOTES - that is supposed to be the major place to look for breaking changes in newer versions. 


Thank you for your feedback.

I'll still look into modifying the srun man page to be more clear about what the behavior is.

Comment 10 Luke Yeager 2021-04-06 15:58:54 MDT

Oh wow, I was totally unaware of RELEASE_NOTES - that one's on me! Thanks, yes, that's the most clearly written summary of the changes I've seen so far.

Comment 11 Marshall Garey 2021-04-06 16:46:38 MDT

I'm glad I pointed out RELEASE_NOTES then. It's distributed with Slurm, found in the same directory as NEWS.

I've opened bug 11310 to look at improving the documentation for srun to clarify how CPUs will be allocated to steps.

Also, I found out we already have a bug open to add --exact to NEWS (though I realize it's late at this point). It's hung up in our review queue right now, so I don't know if we'll push it through or not, but at least somebody here noticed the lack of --exact in NEWS. Bug 10914 (though it is private so you can't see it).

Comment 12 Marshall Garey 2021-04-06 16:58:07 MDT

Looks like bugzilla automatically re-opened this bug. Re-closing as resolved/infogiven.