Hello,

Our colleague David is having some difficulties running his workflows. He gets the following errors on both Ibex (20.11.2) and our small test cluster (20.11.5):

"""
I am developing some best practices for using Tensorboard on Ibex and am having a bit of trouble. Here is my job script.

#!/bin/bash --login

#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=v100:1
#SBATCH --cpus-per-gpu=6
#SBATCH --mem-per-gpu=64G
#SBATCH --constraint=intel
#SBATCH --partition=debug
#SBATCH --job-name=launch-jupyter-server
#SBATCH --mail-type=ALL
#SBATCH --output=bin/%x-%j-slurm.out
#SBATCH --error=bin/%x-%j-slurm.err

# job fails if any line in the script fails
set -e

# script should be run from the project root directory
PROJECT_DIR="$PWD"

# setup the environment
module purge
ENV_PREFIX="$PROJECT_DIR"/env
conda activate "$ENV_PREFIX"

# create the logging directory for tensorboard (if necessary)
TENSORBOARD_LOGDIR="$PROJECT_DIR"/results/"$SLURM_JOB_NAME"/"$SLURM_JOB_ID"/tensorboard
mkdir -p "$TENSORBOARD_LOGDIR"

# jupyterlab_tensorboard plugins are brittle so for now just run separate server
srun --resv-ports=1 "$PROJECT_DIR"/bin/launch-tensorboard-server.srun "$TENSORBOARD_LOGDIR" &
TENSORBOARD_PID=$!

# use srun to launch Jupyter server in order to reserve a port
srun --resv-ports=1 "$PROJECT_DIR"/bin/launch-jupyter-server.srun

# kill off the Tensorboard server
kill $TENSORBOARD_PID

What I expected would happen is that the first srun command would launch the Tensorboard server (reserving an unused port to prevent contention). Since the Tensorboard server runs for the duration of the session, I don't want this srun command to block, so I add the & operator to run the task in the background. I then expected the second srun command to run immediately and launch the Jupyter server (also reserving a port to avoid contention). My expectation was that both srun commands would simply share the same underlying pool of resources allocated to the job.
What happens in practice is that one or the other of the srun commands gets hold of the entire resource allocation, and then errors like the following are generated:

srun: Job 629 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 629 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 629 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 629 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 629 step creation still disabled, retrying (Requested nodes are busy)

What is the best way to fix this?
"""

Thanks
Ahmed
This happens because in 20.11 steps have exclusive access to their resources by default. You can override that default and get the pre-20.11 behavior with the --overlap flag for srun (steps can overlap resources). This change is detailed in our RELEASE_NOTES file:

 -- By default, a step started with srun will be granted exclusive (or non-overlapping) access to the resources assigned to that step. No other parallel step will be allowed to run on the same resources at the same time. This replaces one facet of the '--exclusive' option's behavior, but does not imply the '--exact' option described below. To get the previous default behavior - which allowed parallel steps to share all resources - use the new srun '--overlap' option.

 -- In conjunction to this non-overlapping step allocation behavior being the new default, there is an additional new option for step management '--exact', which will allow a step access to only those resources requested by the step. This is the second half of the '--exclusive' behavior. Otherwise, by default all non-gres resources on each node in the allocation will be used by the step, making it so no other parallel step will have access to those resources unless both steps have specified '--overlap'.

Another note: 20.11.0 through 20.11.2 have a change that breaks MPI; that is fixed in 20.11.3, so I strongly recommend upgrading your production cluster. (The RELEASE_NOTES excerpt above is also from 20.11.3.)

This is all discussed at length in bug 10383, and Tim Wickberg explained the changes and fixes in bug 10383 comment 63: https://bugs.schedmd.com/show_bug.cgi?id=10383#c63

Does this answer your question?
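Concretely, applied to a job script like the one above, a minimal sketch (script paths are illustrative) would add --overlap to both parallel steps:

```shell
# Sketch: both parallel steps opt in to sharing the job's resources with
# --overlap, restoring the pre-20.11 default. Paths are illustrative.

# long-running background step; with --overlap it does not claim
# exclusive access to the job's CPUs
srun --overlap --resv-ports=1 ./bin/launch-tensorboard-server.srun &
TENSORBOARD_PID=$!

# second step can start immediately instead of retrying with
# "Requested nodes are busy", because both steps allow overlap
srun --overlap --resv-ports=1 ./bin/launch-jupyter-server.srun

# clean up the background step
kill "$TENSORBOARD_PID"
```

Note that, per the RELEASE_NOTES text quoted above, a step's resources are only shared when the other parallel steps have also specified '--overlap' - requesting it on a single step is not enough.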
Dear Marshall,

Thanks for your reply! We've already had SLURM_WHOLE=1 set in users' environments since last December, but the errors mentioned above still appear. Do we still need to use "--overlap" even when "SLURM_WHOLE" is set to 1?

Thanks
Ahmed
(In reply to Ahmed Essam ElMazaty from comment #3)
> Dear Marshall,
> Thanks for your reply!
> We've already set SLURM_WHOLE=1 in users' environment since last December.
> But the mentioned errors still appear. Do we still need to use "--overlap"
> even when "SLURM_WHOLE" is set to 1?
> Thanks
> Ahmed

Yes, you still need to use --overlap. SLURM_WHOLE doesn't imply SLURM_OVERLAP. Quoting Tim from bug 10383 comment 63:

"As further background behind this change: there was a customer request that the "--exclusive" srun option be made the default in 20.11, and this was done ahead of 20.11.0. Unfortunately some aspects of this had unforeseen impacts as have been discussed extensively on this ticket, most especially with external MPI stacks, and half of the functional changes described here have been reverted ahead of 20.11.3 to address this.

The --exclusive option (when used for step layout; no changes were made in respect to how that option works on job allocations) has had two orthogonal pieces:

- Controlling whether the job step is permitted to overlap on the assigned resources with other job steps. (The --overlap flag was introduced to opt-in to this, and the default behavior for 20.11 was changed and remains changed to providing non-overlapping allocations.)

- Restricting the job allocation to the minimum resources required, rather than permitting access to all resources assigned to the job on each node. (Which was made available through the --whole flag.)

The first change to non-overlapping behavior is what I believe was originally intended by that request, and that aspect remains the new default behavior going forward. That can be overridden by all steps in the job requesting --overlap, but we believe workflows that would intentionally desire such behavior to be rare in practice."

And he goes on to explain how MPI was broken and how it is fixed by making --whole the default.

I hope that helps clear things up!
If you want --overlap to still be the default, there are different ways you can do that. I recommend using a cli_filter plugin and setting --overlap in the function cli_filter_p_setup_defaults(). That's the best place to set any defaults you want for users, and users can override those defaults in their job request.
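A lighter-weight alternative to a cli_filter plugin, if it fits your setup, is to export srun's SLURM_OVERLAP input environment variable site-wide (the same mechanism you already use for SLURM_WHOLE). A sketch, with an illustrative profile.d path:

```shell
# Sketch: make --overlap the default for srun via its input environment
# variable, analogous to the existing SLURM_WHOLE=1 setting.
# The profile.d path below is illustrative.

# e.g. in /etc/profile.d/slurm_defaults.sh
export SLURM_OVERLAP=1
```

Users can still override this on a per-command basis, since explicit srun options take precedence over input environment variables.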
Dear Marshall,

Adding "--overlap" didn't help; we still see the same errors, even on our test cluster running 20.11.5. Here's the updated batch script we use:

#!/bin/bash --login

#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=6
#SBATCH --mem-per-gpu=64G
##SBATCH --constraint=intel
#SBATCH --partition=batch
#SBATCH --job-name=launch-jupyter-server
#SBATCH --mail-type=ALL
#SBATCH --output=bin/%x-%j-slurm.out
#SBATCH --error=bin/%x-%j-slurm.err

# job fails if any line in the script fails
set -e

# script should be run from the project root directory
PROJECT_DIR="$PWD"

# setup the environment
module purge
ENV_PREFIX="$PROJECT_DIR"/env
conda activate "$ENV_PREFIX"

# create the logging directory for tensorboard (if necessary)
TENSORBOARD_LOGDIR="$PROJECT_DIR"/results/"$SLURM_JOB_NAME"/"$SLURM_JOB_ID"/tensorboard
mkdir -p "$TENSORBOARD_LOGDIR"

# jupyterlab_tensorboard plugins are brittle so for now just run separate server
srun --overlap --resv-ports=1 "$PROJECT_DIR"/bin/launch-tensorboard-server.srun "$TENSORBOARD_LOGDIR" &
TENSORBOARD_PID=$!

# use srun to launch Jupyter server in order to reserve a port
srun --overlap --resv-ports=1 "$PROJECT_DIR"/bin/launch-jupyter-server.srun

# kill off the Tensorboard server
kill $TENSORBOARD_PID

Thanks
Ahmed
Ahmed,

--overlap doesn't allow sharing GRES (which includes GPUs). It only allows sharing non-GRES resources (CPUs, memory). This is the pre-20.11 (e.g. 20.02) behavior anyway - GRES aren't shared by steps. I see this isn't documented, so I will work on a doc patch to the srun man page.

On an unrelated note: --mem-per-gpu is broken, and you should not use it right now. I'm working on fixing it, but the changes are extensive, so I'm targeting the fix for 21.08. See bug 9229.
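In the meantime, one possible workaround (a sketch - the 64G value simply mirrors the script above, and for a single-node, single-GPU job a per-node request is equivalent) is to request memory per node instead of per GPU:

```shell
# Sketch: replace the broken --mem-per-gpu request with a per-node
# memory request. For one GPU on one node the amounts are equivalent;
# adjust --mem if you request more GPUs per node.
#SBATCH --gpus-per-node=1
#SBATCH --mem=64G
```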
(In reply to Marshall Garey from comment #6)
> Ahmed,
>
> --overlap doesn't allow sharing GRES (which includes GPUs). It only allows
> sharing non-GRES resources (CPUs, memory). This is the behavior in pre-20.11
> (like 20.02) anyway - GRES aren't shared by steps. I see this isn't
> documented, so I will work on a doc patch to the srun man page.

Correcting myself: --overlap only allows sharing CPUs. It does *not* allow steps to share memory (or other tres/gres). This seems in line with the 20.02 behavior.
Hi Marshall,

Thanks for your detailed reply. Is there a plan to make such a feature available for GRES as well?

Ahmed
(In reply to Ahmed Essam ElMazaty from comment #11)
> Thanks for your detailed reply.

You're welcome.

> Is there a plan so such a feature can be also available for GRES?

I'm not aware of any plans to do this. The closest thing we have is CUDA MPS, which is available on some NVIDIA cards, but it only allows sharing GPUs between jobs. Here's the documentation for MPS: https://slurm.schedmd.com/gres.html#MPS_Management

There is one way to allow GPUs to be shared between job steps: setting ConstrainDevices=no in cgroup.conf (or simply not using cgroups). However, this is generally not recommended, because it means any job on the node can use any GPU (or other device) on the node.

If you'd like job steps to be able to share GRES, feel free to submit a new ticket requesting it.

By the way, we've clarified the documentation in commit 3dad7012d7b. This will be live on our website when 20.11.6 is released (hopefully in the next two weeks).

Is there anything else I can help you with on this bug?
Thanks for your help, Marshall. I have no other questions.

Regards,
Ahmed
Sounds good! I'm closing this as infogiven.
Dear Marshall,

I have another question related to the same script in this ticket. If I have multiple srun commands and one of them doesn't need any GRES, is there a parameter I can add so that the first srun allocates CPUs only and doesn't block the second srun, which needs GPUs? We currently use 20.11.8.

Thanks,
Ahmed
--gres=none
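Applied to the script from earlier in this ticket, a sketch (assuming only the Jupyter server actually needs the GPU):

```shell
# Sketch: the Tensorboard step passes --gres=none so it allocates no
# GRES (no GPU) and therefore doesn't block the GPU-needing step.

# CPU-only background step: no GRES allocated
srun --overlap --gres=none --resv-ports=1 \
    "$PROJECT_DIR"/bin/launch-tensorboard-server.srun "$TENSORBOARD_LOGDIR" &
TENSORBOARD_PID=$!

# this step gets the job's GPU
srun --overlap --resv-ports=1 "$PROJECT_DIR"/bin/launch-jupyter-server.srun

# kill off the Tensorboard server
kill $TENSORBOARD_PID
```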