Ticket 11863 - Add a srun option to always overlap to externally debug into a running job
Summary: Add a srun option to always overlap to externally debug into a running job
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 20.11.7
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Marshall Garey
QA Contact:
URL:
Duplicates: 12462 13683
Depends on:
Blocks: 9961
 
Reported: 2021-06-18 08:43 MDT by Jonas Stare
Modified: 2022-04-08 07:56 MDT
5 users

See Also:
Site: SNIC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: NSC
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name: tetralith.nsc.liu.se
CLE Version:
Version Fixed: 22.05.0pre1
Target Release: 22.05
DevPrio: 1 - Paid
Emory-Cloud Sites: ---


Attachments
patches for current Slurm master (will be 22.05) (40.75 KB, patch)
2022-03-08 16:52 MST, Marshall Garey

Description Jonas Stare 2021-06-18 08:43:30 MDT
After upgrading from Slurm 20.02 to 20.11.7, we have noticed several regressions. It’s possible some are intentional changes by SchedMD, but in that case we would like to know why these changes were made and how to work around them.

We have tried to understand what is happening and to some extent work around the changes (I believe some applications were made to work by setting $SLURM_OVERLAP), but we have not been able to make all applications work as they did with 20.02.

For example, we have third-party applications that launch tasks by running srun with hard-coded options that no longer work in 20.11.7. As command-line options take precedence over environment variables, we cannot work around such issues.


There seem to be at least two distinct types of issues in 20.11.7:

Type 1: Job steps don’t start because Slurm believes that not enough resources are available. The affected jobs typically have job steps for the main workload that allocate exclusive resources (e.g. all CPUs on all allocated nodes), but also job steps used for e.g. monitoring that should not be allocated exclusive resources.

These problems are typically 100% reproducible.

Simplified example that works in 20.02 but breaks in 20.11.7:

#!/bin/bash
#SBATCH -n16 -t 00:02:00
module load buildenv-intel/2018a-eb
if [ "$1" == "monitor" ]; then
    echo "Starting simulated background monitor task that should not consume resources"
    srun -n1 -N1 --mem-per-cpu=0 --overlap sleep 365d &
else
    echo "Monitoring OFF"
fi
# Start main app
date; echo "Starting app"
time mpiexec.hydra -bootstrap slurm ./mdrbench_intel_2018a-eb
echo "mpiexec exit: $?"
date

When we run this with “sbatch script.sh monitor” the main app will not start. When run as “sbatch script.sh” the main app will start.

We have tried various combinations of --overlap and other srun options without finding one that allocates zero resources for the step. The only possible exception is --interactive, which seems to work, but has the serious limitation that you can only have one such job step per job.

Is there a way to start srun with zero resources?

Type 2: Intermittent problems in jobs that run multiple parallel MPI applications serially.

E.g

#!/bin/bash
#SBATCH -n64
mpiexec.hydra -bootstrap slurm ./myapp
mpiexec.hydra -bootstrap slurm ./myapp 
mpiexec.hydra -bootstrap slurm ./myapp
[...]


Sample script to reproduce this problem (it just runs a trivial MPI application in a loop):

#!/bin/bash 
#SBATCH -N2 -t 10:00 -A nsc
SLEEP_BETWEEN_STEPS=${1:-0.1}
# Loop running a simple MPI app 100 times, expected runtime <300s                                                      
module load buildenv-intel/2018a-eb
for run in $(seq 1 100); do
    t1=$(date +%s)
    OUTPUT="output-${SLURM_JOBID}.${run}"
    expected_tasks=$(hostlist -e --repeat-slurm-tasks=$SLURM_TASKS_PER_NODE "$SLURM_JOB_NODELIST" | wc -l)
    mpiexec.hydra -bootstrap slurm ./hello-2021-05-10-buildenv-intel_2018b-eb > $OUTPUT 2>&1
    exit=$?
    output_lines=$(wc -l $OUTPUT | awk '{print $1}')
    t2=$(date +%s)
    elapsed=$(( t2 - t1 ))
    echo "# $(date): run $run, exit $exit, elapsed $elapsed s, now sleep $SLEEP_BETWEEN_STEPS s"
    if [ $output_lines -ne $expected_tasks ]; then
        echo "WARNING: unexpected output:"
        cat $OUTPUT
    fi
    sleep $SLEEP_BETWEEN_STEPS
done


When the problem occurs, the srun that mpiexec.hydra uses to launch pmi_proxy (one per node in the job) will print this and exit:

srun: error: Unable to create step for job 14005836: Memory required by task is not available

This is then handled somewhat badly by mpiexec.hydra: it does not wait() on srun, so it never detects that the launch failed and hangs forever. But even if mpiexec.hydra were bug free, the job launch would still have failed, as srun exited rather than starting pmi_proxy.

On our main system, the script above failed 75-100% of the time. It is a rather extreme example; most jobs only launch one or a handful of MPI tasks, so the chance that any of them will fail is low. But we do have some production jobs that run thousands of MPI tasks in a few hours.

While writing this ticket I found bug https://bugs.schedmd.com/show_bug.cgi?id=11857 and I think it might be similar to our problem. If there is a patch, would it be possible for us to have a look at it?
Comment 3 Albert Gil 2021-06-21 11:34:48 MDT
Hi Jonas,

> After upgrading from Slurm 20.02 to 20.11.7, we have noticed several
> regressions. It’s possible some are intentional changes by SchedMD, but in
> that case we would like to know why these changes were made and how to work
> around them.
> 
> We have tried to understand what is happening and to some extent work around
> the changes (I believe some applications were made to work by setting
> $SLURM_OVERLAP), but we have not been able to make all applications work as
> they did with 20.02.

Yes, in 20.11 we changed some behavior and related options of srun to improve how resources are managed within a job.
We restored some of the previous default behavior in 20.11.3 to make the transition easier, though.
You can see some of the rationale for the changes, and for the partial restoration, in bug 10383 comment 63.

In the RELEASE_NOTES you can also read about it:
- https://github.com/SchedMD/slurm/blob/slurm-20.11/RELEASE_NOTES#L57
- https://github.com/SchedMD/slurm/blob/slurm-20.11/RELEASE_NOTES#L64

A short summary (see the sketch just below):
- Steps within a job no longer share resources by default as they used to; use --overlap to allow that.
- By default a step still uses the whole job allocation, but you can use --exact to limit the step to what it requested.
- The --exclusive option still avoids sharing nodes with other jobs, but it now also implies --exact.
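
A minimal sketch of these 20.11 semantics (my own illustration, not from the original report; it assumes memory is not a limiting factor):

#!/bin/bash
#SBATCH -n2
# Both CPUs end up used by the two --exact steps, each limited to the single CPU it requested.
srun -n1 --exact sleep 60 &
srun -n1 --exact sleep 60 &
# An --overlap step may share CPUs already in use by other steps, so it can start anyway.
srun -n1 --overlap sleep 60 &
wait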

We are also working to improve the documentation to avoid confusion (bug 11310).
Actually, the source of the confusion is probably that "exclusive" was previously used for two different purposes (as bug 10383 comment 63 explains).

> For example, we have third-party applications that launch tasks by running
> srun with hard-coded options that are no longer working in 20.11.7. As
> command-line options have precedence over environment variables we cannot
> work such issues.

This is unfortunate.
What third party apps are those?
We tried our best to keep backward compatibility, but incompatibilities can always happen between major releases.

> Type 1: Job steps don’t start because Slurm believes that not enough
> resources are available. The affected jobs typically have some job steps for
> the main workload that allocates exclusive resources (e.g all CPUs on all
> allocated nodes) but also some job steps that are used for e.g monitoring
> and should not be allocated exclusive resources.
>
> We have tried various combinations of --overlap and other srun options
> without finding one that allocates zero resources for the step. The only
> possible exception is --interactive, which seems to work, but has the
> serious limitation that you can only have one such job step per job.
> 
> Is there a way to start srun with zero resources?

There is no way to use srun with zero resources.
Please don't use --interactive; it is an internal option used by salloc and is not meant to be used manually.

> These problems are typically 100% reproducible.
> 
> Simplified example that works in 20.02 but breaks in 20.11.7:
> 
> #!/bin/bash
> #SBATCH -n16 -t 00:02:00
> module load buildenv-intel/2018a-eb
> if [ "$1" == "monitor" ]; then
>     echo "Starting simulated background monitor task that should not consume
> resources"
>     srun -n1 -N1 --mem-per-cpu=0 --overlap sleep 365d &
> else
>     echo "Monitoring OFF"
> fi
> # Start main app
> date; echo "Starting app"
> time mpiexec.hydra -bootstrap slurm ./mdrbench_intel_2018a-eb
> echo "mpiexec exit: $?"
> date
> 
> When we run this with “sbatch script.sh monitor” the main app will not
> start. When run as “sbatch script.sh” the main app will start.

Here the --overlap is well placed.
The problem is that mpiexec, unlike your srun, is not requesting --overlap.
A workaround could be to export SLURM_OVERLAP before calling mpiexec, so that when mpiexec calls srun internally, that variable allows the step to overlap with the monitor step started earlier.
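
For example, a minimal sketch of that workaround in your script (SLURM_OVERLAP is the environment equivalent of --overlap; setting it to 1 is my assumption of a suitable value):

# Monitor step, explicitly overlapping:
srun -n1 -N1 --mem-per-cpu=0 --overlap sleep 365d &
# Export SLURM_OVERLAP so the srun calls made internally by mpiexec.hydra
# also request overlapping:
export SLURM_OVERLAP=1
time mpiexec.hydra -bootstrap slurm ./mdrbench_intel_2018a-eb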


> Type 2: Intermittent problems in jobs that run multiple parallel MPI
> applications serially.
> 
> When the problem occurs, the srun that mpiexec.hydra uses to launch
> pmi_proxy (one per node in the job) will print this and exit:
> 
> srun: error: Unable to create step for job 14005836: Memory required by task
> is not available
> 
> This is then handled somewhat bad by mpiexec.hydra. It does not wait() on
> srun so it will never detect that the job launch failed, so it hangs
> forever. But even if mpiexec.hydra was bug free, the job launch would still
> have failed as srun exited rather than start pmi_proxy.
> 
> On our main system, the script above failed 75-100% of the time. But it is a
> rather extreme example, most jobs will only launch one or a handful of MPI
> tasks, so the chance that any of them will fail is low. But we do have some
> production jobs that run thousands of MPI tasks in a few hours.
> 
> While writing this ticket I found bug
> https://bugs.schedmd.com/show_bug.cgi?id=11857 and I think it might be
> similar to our problem. If there is a patch would it be possible for us to
> have a look at it?

Yes, this could be a duplicate of bug 11857.
The fix is already pushed (see the commits on bug 11857 comment 17) and will be released as part of the upcoming 20.11.8.

Regards,
Albert
Comment 4 Albert Gil 2021-06-28 02:43:12 MDT
Hi Jonas,

If this is ok for you I'm closing this ticket as infogiven assuming that comment 3 solved your questions.
But please, don't hesitate to reopen it if you need further support.

Regards,
Albert
Comment 5 Jonas Stare 2021-06-28 06:40:11 MDT
Hi, sorry I didn't get back to you earlier. 

The problem is partly solved (I think) in 20.11.8.

But the issue of being able to create a "zero resource" step is still unresolved. Apart from using it to get an interactive shell, we (and our users) use it quite a lot to inspect running jobs.

In this case --overlap on the "monitor step" won't help, since the steps that try to start afterwards will hang.

Will this work with pam_slurm_adopt? Or will jobs that "adopted" a step hang?

Even if that works, it would only be a partial fix; what we really _need_ in the end is to be able to create multiple "--interactive"-like steps, as we could before.
Comment 9 Albert Gil 2021-06-29 09:29:37 MDT
Hi Jonas,

> But the issue with being able to create a "zero resource" step is still
> unresolved. Apart from using it to get an interactive shell we (and our
> users) use it quite a lot to check a running job.
> 
> In this case --overlap on the "monitor step" won't help since the steps
> trying to start after will hang.

I'm not certain why you think that --overlap won't help.
Let me show an example of running parallel sruns from sbatch using --overlap:

$ cat test.sh 
#!/bin/bash
srun --overlap --mem 10M sleep 360 &
srun --overlap --mem 10M sleep 360 &
srun --overlap --mem 10M sleep 360 &

sleep 2
sacct -j $SLURM_JOB_ID

$ sbatch -N1 -c2 --mem 1G test.sh                                                                      
Submitted batch job 64

$ cat slurm-64.out           
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
64              test.sh      debug       acct          2    RUNNING      0:0 
64.batch          batch                  acct          2    RUNNING      0:0 
64.0              sleep                  acct          2    RUNNING      0:0 
64.1              sleep                  acct          2    RUNNING      0:0 
64.2              sleep                  acct          2    RUNNING      0:0 

As you can see, with --overlap we can run parallel steps.
If I'm not wrong, this is what you are looking for, right?

One possible issue is that memory cannot be overlapped.
So, if I submit the same job as before without enough memory to run all steps in parallel, only some of the steps will run in parallel:

$ sbatch -N1 -c2 --mem 20M test.sh 
Submitted batch job 65

$ cat slurm-65.out 
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
65              test.sh      debug       acct          2    RUNNING      0:0 
65.batch          batch                  acct          2    RUNNING      0:0 
65.0              sleep                  acct          2    RUNNING      0:0 
65.1              sleep                  acct          2    RUNNING      0:0 

Note that step .2 was never started (due to the memory limitation):

$ sacct -j 65
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
65              test.sh      debug       acct          2  COMPLETED      0:0 
65.batch          batch                  acct          2  COMPLETED      0:0 
65.0              sleep                  acct          2  CANCELLED     0:15 
65.1              sleep                  acct          2  CANCELLED     0:15 

If my batch script waited at the end, step .2 would start after the previous ones finished.
In your script it seems that you are using --mem-per-cpu=0, so this shouldn't be a problem in your case, though.

> Will this work with pam_slurm_adopt? Or will jobs that "adopted" a step hang?

Yes, using pam_slurm_adopt is an alternative way to get an interactive shell on a node where one of your jobs is running.
Note that the adopted shell is placed in a special step named ".extern", which, like the ".batch" step, does not actually consume any of the resources it has access to.
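
For illustration (a hedged sketch, not from this ticket; job/step names are placeholders), a shell adopted via pam_slurm_adopt shows up under that extern step:

$ sacct -j <jobid> --format=JobID,JobName,State
       JobID    JobName      State
------------ ---------- ----------
<jobid>           myjob    RUNNING
<jobid>.extern   extern    RUNNING
<jobid>.0         myapp    RUNNING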

> Even if that works, it would also be a partial fix, what we really _need_ in
> the end is to be able to create multiple "--interactive" steps like we were
> able to do before.

I'll ignore some details, but in general I would say that the --overlap flag provides the same feature that srun (as a step allocator) provided by default before.
Please see my example above and let me know whether that is what you are looking for, or if I'm missing something.

Regards,
Albert
Comment 10 Jonas Stare 2021-07-01 09:22:39 MDT
Hi,

Yes, --overlap would work in that case. But it is not the only case where we need it. For example, we have a script to help our users debug/monitor their jobs that, in essence, runs "srun --jobid=<their jobid>" and gives them a shell inside the job. Using --overlap in that case kind of works, but if the job has multiple consecutive sruns, the user would be forced to always use --overlap in case they ever want to debug it. pam_slurm_adopt would also, kind of, work. But it is still not a perfect solution.

I guess what I'm trying to say is that srun used to have a way to get a zero-resource step. Maybe it was an unintentional feature, but it was very useful in some cases.

I don't know how much work it would be, but it would be extremely valuable to have some kind of "--extern" flag that would give a job step a resource allocation similar to one started with pam_slurm_adopt or --interactive.
Comment 11 Albert Gil 2021-07-02 03:46:14 MDT
Hi Jonas,

> Yes, --overlap would work in that case. But it is not the only case where we
> would need it. For example, we have a script to help our users debug/monitor
> their jobs that in essence will run "srun --jobid=<their jobid>" and give
> them a shell inside the job. Using --overlap in that case kind of works but
> if your job have multiple consecutive sruns, you would be forced to always
> use --overlap in case you would want to debug it. And pam_slurm_adopt would
> also, kind of, work.

Ok, now I think that we are on the same page.

> But it is still not a perfect solution.
> 
> I guess what I'm trying to say is that srun used to have a way to get a
> zero-resource step. Maybe it was an unintentional feature, but it was very
> useful in some cases.
> 
> I don't know how much work it would be. but it would be extremely valuable
> to have some kind of "--extern" flag, that would give a jobstep similar
> resource allocation as one started with pam_slurm_adopt or --interactive.

Yes, I would say that the old "zero-resource step" was actually a workaround to allow an interactive workflow on an allocated node.
But because it had some issues, led to confusion, and was mainly used by SallocDefaultCommand, we replaced it with the new interactive step for salloc.

In your use case I agree that you need to specify --overlap, or use pam_slurm_adopt. Maybe sattach can also help you a bit.

Anyway, I would say that we are not interested in that feature right now, but let me discuss this a bit more internally to see if we can convert this into an RFE.

I'll keep you posted,
Albert
Comment 12 Albert Gil 2021-07-02 04:20:22 MDT
Hi Jonas,

> > For example, we have a script to help our users debug/monitor
> > their jobs that in essence will run "srun --jobid=<their jobid>" and give
> > them a shell inside the job. Using --overlap in that case kind of works but
> > if your job have multiple consecutive sruns, you would be forced to always
> > use --overlap in case you would want to debug it.

Just to clarify: --overlap is needed only in one of the steps, so you should only need to change your script in this use case.

For example, we can run a normal job with 1 step without overlap:

$ sbatch -N1 -c2 --mem 1G --wrap "srun sleep 300"
Submitted batch job 72

And later on, we can add an overlapping step with an interactive bash:

$ srun   -N1 -c2 --mem 0 --overlap --pty --jobid=72 /bin/bash
$ echo $SLURM_JOB_ID 
72

As you can see, all can run in parallel:

$ sacct -j 72
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
72                 wrap      debug       acct          2    RUNNING      0:0 
72.batch          batch                  acct          2    RUNNING      0:0 
72.0              sleep                  acct          2    RUNNING      0:0 
72.1               bash                  acct          2    RUNNING      0:0 

$ exit

Isn't that good enough for your use case?

Regards,
Albert
Comment 13 Albert Gil 2021-07-09 08:20:44 MDT
Hi Jonas,

After my last comment 12, do you still think that --overlap is not a solution for your use case?
I think it is, but maybe there are some details that I'm missing.

Regards,
Albert
Comment 14 Albert Gil 2021-07-16 11:18:19 MDT
Hi Jonas,

If this is ok for you I'm closing this ticket as infogiven, but please don't hesitate to reopen it if you need further support.

Regards,
Albert
Comment 15 Jonas Stare 2021-08-11 04:13:20 MDT
Hi, sorry for the late reply. I've been on vacation.

The issue is partly solved. We have figured out how to use the new --overlap, --exact and --exclusive flags, but there is still the issue with "zero allocation" steps.

We (mostly our Application Experts) use it when monitoring or debugging jobs, or helping users.

Using --overlap works in some cases, but not all, depending on how the user started their jobs. This has already bitten us a couple of times when we've tried to debug a job by starting an srun with --overlap to get access to the job allocation, and it then blocked the actual job steps from starting. 

Using pam_slurm_adopt would also work in some cases. But then we have the problem with nodes running multiple jobs from the same user (not uncommon) and we currently have a spank plugin that doesn't play well with pam_slurm_adopt.

So, at the moment we do have jobs running but our ability to provide support to our users, the way we used to, is very very limited.

I don't like to use the word "demand", but we really need a way to create those kinds of job steps. The functionality seems to (partially) already be there via pam_slurm_adopt, but we need to be able to use it from srun. It really is an invaluable feature for us.

So, it is resolved as far as getting jobs to run properly. But I would like the issue with "zero allocation" steps, or extern-like steps, to become an RFE.
Comment 16 Albert Gil 2021-08-18 08:55:39 MDT
Hi Jonas!

> sorry for the late reply. I've been on vacation.

My turn to say sorry now, I've been on vacation too! ;-)
I hope yours went as well as mine, or better! 

> The issue is partly solved. We have figured out how to use the new
> --overlap, --exact and --exclusive flags,
> So, it is resolved as far as getting jobs to run properly.

Good.

> But I would like
> the issue with "zero allocation" steps, or extern-steps to become an RFE.

I understand.

> Using --overlap works in some cases, but not all, depending on how the user
> started their jobs. This has already bitten us a couple of times when we've
> tried to debug a job by starting an srun with --overlap to get access to the
> job allocation, and it then blocked the actual job steps from starting. 

I think we'll need more details to understand in which cases using --overlap doesn't work as you need.
Depending on that, maybe this is not an RFE but an actual bug in --overlap & Co?
And even if it is not a bug, my guess is that, if this becomes an RFE, the implementation of the feature may end up being closely related to --overlap anyway.

So, please try to provide an example or explanation that helps us better understand why --overlap & Co are not a solution in some cases, so we can work on it, either by fixing a bug or by developing the right improvement.

Regards,
Albert
Comment 17 Jonas Stare 2021-08-24 03:19:49 MDT
Hi Albert,

One example that doesn't work is this:

The user has a script that runs multiple consecutive sruns:


#!/bin/sh
#SBATCH -n1
for i in $(seq 1 10); do
  echo $i
  srun sleep 10
done


If something is wrong and we need to debug or monitor the job, we would use something like this:

srun --jobid=<id of the job> --pty --mem=0 /bin/bash -l


That gives you a shell inside the job, with all its allocations and environment variables, which is very helpful during debugging.

This used to work, but it fails now; you need to add --overlap to the srun with the shell, but then the job script halts as soon as it tries to run the next srun in the loop:

srun --jobid=<id of the job> --pty --mem=0 --overlap /bin/bash -l


So, to sum things up, we used to be able to run these "zero resource" steps, but there is no (good) way to create them now. Being able to create a "zero resource" or extern-like step from srun was a big part of how we provided support to users and let users monitor their jobs while developing scripts.

regards,
Jonas
Comment 20 Albert Gil 2021-08-24 05:14:33 MDT
Hi Jonas,

Thanks for the example; now I see the case.

> This used to work, but it fails now and you need to add --overlap to the
> srun with the shell,

Yes, the --overlap is the way to make this work on 20.11.

> but then the script will halt as soon as it is trying
> to run the next srun in the loop.

Ok, that's the key part.
You can access the job and overlap with a running step (that's all I thought you needed), but no further steps will be started (unless they are also overlapping ones, which is not the usual case).
I see that this may be desired behavior in some cases, but I also see why it is not in yours.
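
To spell out the blocking with the loop from comment 17 (a hedged illustration):

# Job script: a loop of non-overlapping steps, run one after another
for i in $(seq 1 10); do
  srun sleep 10
done

# From outside, this attaches fine:
srun --jobid=<id of the job> --pty --mem=0 --overlap /bin/bash -l
# ...but while the shell step holds CPUs, the next plain "srun sleep 10"
# in the loop waits for resources instead of starting.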

Let me discuss this use case internally and come back to you with some decision about how to handle it.

Thanks,
Albert
Comment 22 Albert Gil 2021-09-09 11:49:40 MDT
Hi Jonas,

We have been discussing this internally and we agreed that such an enhancement does make sense, so I'm converting this into an RFE.
The initial idea is to improve the --overlap option following a pattern similar to the one we already use for OverSubscribe: add a "force" option that makes the step always overlap, regardless of whether the other steps request overlapping or not (in your words, "a zero-resource step").
But this is just an initial idea; it can change.

As you know, we cannot commit to when this feature will be added (unless it's sponsored).

In the meantime, --overlap and --interactive are the (limited) workarounds for your use case.
Just note that --interactive may change in the future, as it was created for internal usage only, and that it is always allocated on the batch node.

Regards,
Albert
Comment 29 Marshall Garey 2022-01-13 10:47:26 MST
Hi Jonas,

I'm just letting you know that we're in the reviewing stage for this bug. I don't know exactly when we'll get it all finished and pushed out, but just wanted to give you an update saying that we've made progress.
Comment 30 Jonas Stare 2022-03-02 03:47:26 MST
Hi,

I was just curious if there has been any more progress with this issue and if there is any code/patches that we can test on our test-cluster?

Regards,
Jonas
Comment 31 Marshall Garey 2022-03-07 17:38:30 MST
Hi Jonas,

I hope to get you a patch soon. We have a patch that is in our review process. So, it may still change, but we've made progress.
Comment 32 Jonas Stare 2022-03-08 01:14:46 MST
Would it be possible to get the current patch so we can do some testing here? It wouldn't matter much if things change later, we just want to try out the new feature and figure out the best way to use it when it is available.
Comment 33 Marshall Garey 2022-03-08 16:52:40 MST
Created attachment 23777 [details]
patches for current Slurm master (will be 22.05)

Hi Jonas,

The attached file has a series of patches that apply cleanly on top of the current Slurm master branch. I just applied it on top of commit 966746a7ea. Please let us know if you have any issues applying or testing these patches. We hope to push this to github soon, but there are some things with the first behavioral change that we need to look into first.

(Also, the filename has a different bug ID (bug12462); that is an internal bug which we created to track the work for this bug as well as other fixes related to --overlap.)


This patchset has two important functional changes:

(1) Steps that specify --overlap cannot overlap with steps that do not specify --overlap. Here is the motivation behind this change:

Currently (in Slurm 21.08), the following two srun steps will run in parallel:

> $ sbatch -N1 -c2 --mem 1G --wrap "srun sleep 300"
> Submitted batch job 72
> $ srun   -N1 -c2 --mem 0 --overlap --pty --jobid=72 /bin/bash


However, the following two steps will *not* run in parallel:

> $ sbatch -N1 -c2 --mem 1G --wrap "srun --overlap sleep 300"
> Submitted batch job 72
> $ srun   -N1 -c2 --mem 0 --pty --jobid=72 /bin/bash # Not started in parallel


Why does it work this way in 21.08? The steps that don't have --overlap (and therefore get exclusive access to resources) won't use CPUs that are already in use; they don't distinguish whether those CPUs are being used by steps with --overlap or not.

This is confusing and inconsistent. Therefore, we decided to change --overlap such that steps using it may only overlap with other steps that also specify --overlap.
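
A hedged sketch of the resulting behavior with this patchset applied:

> $ sbatch -N1 -c2 --mem 1G --wrap "srun --overlap sleep 300"
> $ srun -N1 -c2 --mem 0 --overlap --pty --jobid=<jobid> /bin/bash  # runs in parallel
>
> $ sbatch -N1 -c2 --mem 1G --wrap "srun sleep 300"
> $ srun -N1 -c2 --mem 0 --overlap --pty --jobid=<jobid> /bin/bash  # waits: the running step did not request --overlap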

This will break some existing uses of --overlap, and will break some current debugging tools such as Arm Forge DDT! We would like to make this change in 22.05, but we don't want to break tools unexpectedly. So we are attempting to reach out to Arm, at minimum, to discuss this change with them. I'm not sure how that will go, or whether we'll need to communicate with other entities.



(2) The behavior change which SNIC has sponsored: Add --overlap=force as an option to srun. This is a "zero allocation" step - resources allocated to it are not counted against the job allocation. Therefore, srun --overlap=force will create a step that overlaps all other steps on all resources (CPUs, memory, GRES). Without the "force" option, --overlap will only overlap on CPUs as it always has.

A step that requests --overlap=force will always be given CPUs starting from the lowest CPU. Because --overlap=force does not count against the job allocation, we do not track which CPUs have already been allocated to --overlap=force steps. Let's say you have a job that has CPUs 0-7 on a node. If you run the following:

srun --overlap=force -n1 -c1 --exact sleep 60 &
srun --overlap=force -n1 -c1 --exact sleep 60 &
wait

Each of these steps will be allocated the same CPU: CPU 0. 
If you don't want overlapping CPUs on your steps, simply don't use --overlap or --overlap=force.

We feel that this behavior makes sense for the --overlap=force step. The --overlap=force step can of course request all the resources within the job.
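
With that, the debugging use case from comment 17 would look roughly like this (a hedged sketch using the option syntax from this patchset, which is renamed later in the ticket):

srun --jobid=<id of the job> --overlap=force --pty /bin/bash -l
# The shell step's CPUs, memory, and GRES are not counted against the job
# allocation, so the job's own non-overlapping steps keep starting normally.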


Let me know if you have any questions or concerns about these patches. We'd certainly like to address any concerns before 22.05.

- Marshall
Comment 34 Marshall Garey 2022-03-09 09:46:27 MST
Jonas,

Quick update. We ended up pushing this to the master branch, so you can just check out the current master branch and test it.

Commits:

fe9f416ec2 Add --overlap=force option to srun
751b1b4288 Steps may only overlap with steps that also used --overlap
84d602dd7f Pack/unpack cpus_overlap
cfbd78601b Add a way to track overlapped cpus in a job (--overlap)


We will of course fix anything if we find any issues with it. If we do find issues with the fix to --overlap we will address those as well.

Would you like us to keep this bug open until you get a chance to test it and give us feedback, or should I just close this bug?

- Marshall
Comment 37 Marshall Garey 2022-03-10 14:18:02 MST
Jonas,

We have a proposal:

What if we swap the behavior of --overlap and --overlap=force?

* --overlap=force becomes --overlap. Therefore, using --overlap will get this new overlap behavior of overlapping on all resources (by not being counted against the job's allocation).
* --overlap becomes --overlap=mutual. This would be here to opt into the 20.11/21.08 behavior but with the fixes to ensure that these steps only overlap with other steps that specify --overlap=mutual.

This way we wouldn't break existing use of --overlap with debug tools or other possible use cases. We also suspect that people who use --overlap really just want the steps to overlap on all resources.

What do you think?
Comment 38 Jonas Stare 2022-03-11 06:37:54 MST
I've asked around and the people here seem to agree with me that letting --overlap=force become --overlap, and --overlap become --overlap=mutual, is a good idea. That would fit with how we thought it worked in the beginning. :)

I think you can close the ticket for now. I haven't had time to do a lot of testing yet or letting our application experts try it, if we run into any huge problems I guess we could always open a new ticket?
Comment 39 Marshall Garey 2022-03-11 13:40:45 MST
(In reply to Jonas Stare from comment #38)
> I've asked around and the people here seem to agree with me that letting
> --overlap=force become --overlap and --overlap become --overlap=mutual, is a
> good idea. That would fit with how we thought it worked in the beginning. :)

Great, sounds good. We'll make that change and let you know when it's done.


> I think you can close the ticket for now. I haven't had time to do a lot of
> testing yet or letting our application experts try it, if we run into any
> huge problems I guess we could always open a new ticket?

I'm going to leave this open at least until we've finished making the above change.
Comment 40 Marshall Garey 2022-03-11 14:24:23 MST
*** Ticket 12462 has been marked as a duplicate of this ticket. ***
Comment 56 Marshall Garey 2022-03-30 09:57:20 MDT
*** Ticket 13683 has been marked as a duplicate of this ticket. ***
Comment 57 Marshall Garey 2022-03-30 10:07:49 MDT
Hi Jonas,

I have an update. We decided to throw out --overlap=mutual for the following reasons:

We wanted to have --overlap=mutual to preserve the old (20.11 and 21.08) behavior of --overlap; however, that behavior was buggy and did unexpected things, as you found. Implementing --overlap=mutual properly (fixing these bugs) turned out to be more difficult than we anticipated and the implementation I came up with had poor performance. Also, it wasn't part of the enhancement request.

So, we felt comfortable just ditching the idea of --overlap=mutual.

So now we just ended up with your original request: change the behavior of --overlap to overlap all resources (not just CPUs) and to not subtract its allocated resources from the job's available resources.
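
Concretely, the debug-shell use case from comment 17 should now work in 22.05 with plain --overlap (a hedged sketch):

srun --jobid=<id of the job> --overlap --pty /bin/bash -l
# The shell overlaps all resources and is not counted against the job's
# allocation, so the job's subsequent non-overlapping steps still start.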

This has been pushed to the master branch in the following commit:

5e446730c8 Make --overlap=force the only behavior for --overlap


I'm keeping this bug open while we finish QA for a new regression test for --overlap.

- Marshall
Comment 65 Marshall Garey 2022-04-08 07:56:23 MDT
Jonas,

We finished the work for this ticket - making --overlap have the same behavior as --overlap=force and removing --overlap=force, fixing a bug, and adding a regression test:

5e446730c8 Make --overlap=force the only behavior for --overlap
8b00476873 Fix --overlap=force step not allocated GRES if another step had GRES
4f1c9ef9ce Testsuite - Add test1.12 for srun --overlap


All of these will be part of the 22.05 Slurm release. I'm closing this ticket as resolved/fixed.

If you have any concerns about the new behavior of --overlap, please test it and let us know on this ticket *this month (April)*. The 22.05 release is scheduled for May (2022/05), so we need any negative feedback ASAP.

At this point, if you find any bugs with the behavior, please open a new ticket.

Thanks!

- Marshall