Hello,

Our colleague David is having some difficulties running his workflows. He gets the following errors on both Ibex (20.11.2) and our small test cluster (20.11.5):

"""
I am developing some best practices for using Tensorboard on Ibex and am having a bit of trouble. Here is my job script.

#!/bin/bash --login

#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=v100:1
#SBATCH --cpus-per-gpu=6
#SBATCH --mem-per-gpu=64G
#SBATCH --constraint=intel
#SBATCH --partition=debug
#SBATCH --job-name=launch-jupyter-server
#SBATCH --mail-type=ALL
#SBATCH --output=bin/%x-%j-slurm.out
#SBATCH --error=bin/%x-%j-slurm.err

# job fails if any line in the script fails
set -e

# script should be run from the project root directory
PROJECT_DIR="$PWD"

# setup the environment
module purge
ENV_PREFIX="$PROJECT_DIR"/env
conda activate "$ENV_PREFIX"

# create the logging directory for tensorboard (if necessary)
TENSORBOARD_LOGDIR="$PROJECT_DIR"/results/"$SLURM_JOB_NAME"/"$SLURM_JOB_ID"/tensorboard
mkdir -p "$TENSORBOARD_LOGDIR"

# jupyterlab_tensorboard plugins are brittle so for now just run separate server
srun --resv-ports=1 "$PROJECT_DIR"/bin/launch-tensorboard-server.srun "$TENSORBOARD_LOGDIR" &
TENSORBOARD_PID=$!

# use srun to launch Jupyter server in order to reserve a port
srun --resv-ports=1 "$PROJECT_DIR"/bin/launch-jupyter-server.srun

# kill off the Tensorboard server
kill $TENSORBOARD_PID

What I expected would happen is that the first srun command would launch the Tensorboard server (reserving an unused port to prevent contention). Since the Tensorboard server runs for the duration of the session, I don't want this srun command to block, so I add the & operator to run the task in the background. I then expected the second srun command to run immediately and launch the Jupyter server (also reserving a port to avoid contention). My expectation was that both srun commands would simply share the same underlying pool of resources allocated to the job.
What happens in practice is that one or the other of the srun commands gets hold of the entire resource allocation, and then errors like the following are generated:

srun: Job 629 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 629 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 629 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 629 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 629 step creation still disabled, retrying (Requested nodes are busy)

What is the best way to fix this?
"""

Thanks
Ahmed
This happens because in 20.11 steps have exclusive access to their resources by default. You can override that default and get the pre-20.11 behavior with the --overlap flag for srun (steps can overlap resources). This change is detailed in our RELEASE_NOTES file:

 -- By default, a step started with srun will be granted exclusive (or non-overlapping) access to the resources assigned to that step. No other parallel step will be allowed to run on the same resources at the same time. This replaces one facet of the '--exclusive' option's behavior, but does not imply the '--exact' option described below. To get the previous default behavior - which allowed parallel steps to share all resources - use the new srun '--overlap' option.

 -- In conjunction to this non-overlapping step allocation behavior being the new default, there is an additional new option for step management '--exact', which will allow a step access to only those resources requested by the step. This is the second half of the '--exclusive' behavior. Otherwise, by default all non-gres resources on each node in the allocation will be used by the step, making it so no other parallel step will have access to those resources unless both steps have specified '--overlap'.

Another note: 20.11.0 through 20.11.2 have a change that breaks MPI; that is fixed in 20.11.3, so I strongly recommend upgrading your production cluster. (The RELEASE_NOTES excerpt above is also from 20.11.3.)

This is all discussed at length in bug 10383, and Tim Wickberg explained the changes and fixes in bug 10383 comment 63: https://bugs.schedmd.com/show_bug.cgi?id=10383#c63

Does this answer your question?
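Concretely, applied to a job script like the one above, a minimal sketch (script paths are illustrative) would add --overlap to both parallel steps:

```shell
# Sketch: both parallel steps opt in to sharing the job's resources with
# --overlap, restoring the pre-20.11 default. Paths are illustrative.

# long-running background step; with --overlap it does not claim
# exclusive access to the job's CPUs
srun --overlap --resv-ports=1 ./bin/launch-tensorboard-server.srun &
TENSORBOARD_PID=$!

# second step can start immediately instead of retrying with
# "Requested nodes are busy", because both steps allow overlap
srun --overlap --resv-ports=1 ./bin/launch-jupyter-server.srun

# clean up the background step
kill "$TENSORBOARD_PID"
```

Note that, per the RELEASE_NOTES text quoted above, a step's resources are only shared when the other parallel steps have also specified '--overlap' - requesting it on a single step is not enough.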
Dear Marshall,

Thanks for your reply! We've already had SLURM_WHOLE=1 set in users' environments since last December, but the errors mentioned above still appear. Do we still need to use "--overlap" even when "SLURM_WHOLE" is set to 1?

Thanks
Ahmed
(In reply to Ahmed Essam ElMazaty from comment #3)
> Dear Marshall,
> Thanks for your reply!
> We've already set SLURM_WHOLE=1 in users' environment since last December.
> But the mentioned errors still appear. Do we still need to use "--overlap"
> even when "SLURM_WHOLE" is set to 1?
> Thanks
> Ahmed

Yes, you still need to use --overlap. SLURM_WHOLE doesn't imply SLURM_OVERLAP. Quoting Tim from bug 10383 comment 63:

"As further background behind this change: there was a customer request that the "--exclusive" srun option be made the default in 20.11, and this was done ahead of 20.11.0. Unfortunately some aspects of this had unforeseen impacts as have been discussed extensively on this ticket, most especially with external MPI stacks, and half of the functional changes described here have been reverted ahead of 20.11.3 to address this.

The --exclusive option (when used for step layout; no changes were made in respect to how that option works on job allocations) has had two orthogonal pieces:

- Controlling whether the job step is permitted to overlap on the assigned resources with other job steps. (The --overlap flag was introduced to opt-in to this, and the default behavior for 20.11 was changed and remains changed to providing non-overlapping allocations.)

- Restricting the job allocation to the minimum resources required, rather than permitting access to all resources assigned to the job on each node. (Which was made available through the --whole flag.)

The first change to non-overlapping behavior is what I believe was originally intended by that request, and that aspect remains the new default behavior going forward. That can be overridden by all steps in the job requesting --overlap, but we believe workflows that would intentionally desire such behavior to be rare in practice."

And he goes on to explain how MPI was broken and how it is fixed by making --whole the default.

I hope that helps clear things up!
If you want --overlap to still be the default, there are different ways you can do that. I recommend using a cli_filter plugin and setting --overlap in the function cli_filter_p_setup_defaults(). That's the best place to set any defaults you want for users, and users can override those defaults in their job request.
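A lighter-weight alternative to a cli_filter plugin, if it fits your setup, is to export srun's SLURM_OVERLAP input environment variable site-wide (the same mechanism you already use for SLURM_WHOLE). A sketch, with an illustrative profile.d path:

```shell
# Sketch: make --overlap the default for srun via its input environment
# variable, analogous to the existing SLURM_WHOLE=1 setting.
# The profile.d path below is illustrative.

# e.g. in /etc/profile.d/slurm_defaults.sh
export SLURM_OVERLAP=1
```

Users can still override this on a per-command basis, since explicit srun options take precedence over input environment variables.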
Dear Marshall,

Adding "--overlap" didn't help; we still see the same errors, even on our test cluster running 20.11.5. Here's the updated batch script we use:

#!/bin/bash --login

#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=6
#SBATCH --mem-per-gpu=64G
##SBATCH --constraint=intel
#SBATCH --partition=batch
#SBATCH --job-name=launch-jupyter-server
#SBATCH --mail-type=ALL
#SBATCH --output=bin/%x-%j-slurm.out
#SBATCH --error=bin/%x-%j-slurm.err

# job fails if any line in the script fails
set -e

# script should be run from the project root directory
PROJECT_DIR="$PWD"

# setup the environment
module purge
ENV_PREFIX="$PROJECT_DIR"/env
conda activate "$ENV_PREFIX"

# create the logging directory for tensorboard (if necessary)
TENSORBOARD_LOGDIR="$PROJECT_DIR"/results/"$SLURM_JOB_NAME"/"$SLURM_JOB_ID"/tensorboard
mkdir -p "$TENSORBOARD_LOGDIR"

# jupyterlab_tensorboard plugins are brittle so for now just run separate server
srun --overlap --resv-ports=1 "$PROJECT_DIR"/bin/launch-tensorboard-server.srun "$TENSORBOARD_LOGDIR" &
TENSORBOARD_PID=$!

# use srun to launch Jupyter server in order to reserve a port
srun --overlap --resv-ports=1 "$PROJECT_DIR"/bin/launch-jupyter-server.srun

# kill off the Tensorboard server
kill $TENSORBOARD_PID

Thanks
Ahmed
Ahmed,

--overlap doesn't allow sharing GRES (which includes GPUs). It only allows sharing non-GRES resources (CPUs, memory). This is the pre-20.11 (e.g. 20.02) behavior anyway - GRES aren't shared by steps. I see this isn't documented, so I will work on a doc patch to the srun man page.

On an unrelated note: --mem-per-gpu is broken, and you should not use it right now. I'm working on fixing it, but the changes are extensive, so I'm targeting the fix for 21.08. See bug 9229.
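In the meantime, one possible workaround (a sketch - the 64G value simply mirrors the script above, and for a single-node, single-GPU job a per-node request is equivalent) is to request memory per node instead of per GPU:

```shell
# Sketch: replace the broken --mem-per-gpu request with a per-node
# memory request. For one GPU on one node the amounts are equivalent;
# adjust --mem if you request more GPUs per node.
#SBATCH --gpus-per-node=1
#SBATCH --mem=64G
```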
(In reply to Marshall Garey from comment #6)
> Ahmed,
>
> --overlap doesn't allow sharing GRES (which includes GPUs). It only allows
> sharing non-GRES resources (CPUs, memory). This is the behavior in pre-20.11
> (like 20.02) anyway - GRES aren't shared by steps. I see this isn't
> documented, so I will work on a doc patch to the srun man page.

Correcting myself: --overlap only allows sharing CPUs. It does *not* allow steps to share memory (or other tres/gres). This seems in line with the 20.02 behavior.
Hi Marshall,

Thanks for your detailed reply. Is there a plan to make such a feature available for GRES as well?

Ahmed
(In reply to Ahmed Essam ElMazaty from comment #11)
> Thanks for your detailed reply.

You're welcome.

> Is there a plan so such a feature can be also available for GRES?

I'm not aware of any plans to do this. The closest thing we have is CUDA MPS, which is available on some NVIDIA cards, but it only allows sharing GPUs between jobs. Here's the documentation for MPS: https://slurm.schedmd.com/gres.html#MPS_Management

There is one way to allow GPUs to be shared between job steps: setting ConstrainDevices=no in cgroup.conf (or simply not using cgroups). However, this is generally not recommended, because it means any job on the node can use any GPU (or other device) on the node.

If you'd like job steps to be able to share GRES, feel free to submit a new ticket requesting it.

By the way, we've clarified the documentation in commit 3dad7012d7b. This will be live on our website when 20.11.6 is released (hopefully in the next two weeks).

Is there anything else I can help you with on this bug?
Thanks for your help, Marshall. I have no other questions.

Regards,
Ahmed
Sounds good! I'm closing this as infogiven.
Dear Marshall,

I have another question related to the same script in this ticket. If I have multiple srun commands and one of them doesn't need any GRES, is there a parameter I can add so that the first srun allocates CPUs only and doesn't block the second srun, which needs GPUs? We currently use 20.11.8.

Thanks,
Ahmed
--gres=none
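Applied to the script from earlier in this ticket, a sketch (assuming only the Jupyter server actually needs the GPU):

```shell
# Sketch: the Tensorboard step passes --gres=none so it allocates no
# GRES (no GPU) and therefore doesn't block the GPU-needing step.

# CPU-only background step: no GRES allocated
srun --overlap --gres=none --resv-ports=1 \
    "$PROJECT_DIR"/bin/launch-tensorboard-server.srun "$TENSORBOARD_LOGDIR" &
TENSORBOARD_PID=$!

# this step gets the job's GPU
srun --overlap --resv-ports=1 "$PROJECT_DIR"/bin/launch-jupyter-server.srun

# kill off the Tensorboard server
kill $TENSORBOARD_PID
```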